Submitted by fedegarzar t3_z9vbw7 in MachineLearning


https://preview.redd.it/c59sra8nwb3a1.png?width=1190&format=png&auto=webp&s=80b3f1a83d190ac0349ec97908aa806aaa03abc3

Machine learning progress is plagued by the conflict between competing ideas, with no shortage of failed reviews, underdelivering models, and failed investments in expensive over-engineered solutions.

We don't subscribe to the deep learning hype for time series, and we present a fully reproducible experiment that shows that:

  1. A simple statistical ensemble outperforms most individual deep-learning models.
  2. A simple statistical ensemble is 25,000 times faster and only slightly less accurate than an ensemble of deep learning models.

In other words, the deep-learning ensemble outperforms the statistical ensemble by just 0.36 points of SMAPE. However, the DL ensemble takes more than 14 days to run and costs around USD 11,000, while the statistical ensemble takes 6 minutes to run and costs about USD 0.50.

For the 3,003 series of M3, these are the results.

https://preview.redd.it/89bhlcg9wb3a1.png?width=1678&format=png&auto=webp&s=e5471331b081142ba201b81ba3346a890d474c50

In conclusion: in terms of speed, costs, simplicity and interpretability, deep learning is far behind the simple statistical ensemble. In terms of accuracy, they are rather close.

You can read the full report and reproduce the experiments in this GitHub repo: https://github.com/Nixtla/statsforecast/tree/main/experiments/m3
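To give a flavor of the statistical side, here is a minimal sketch of the ensemble using the statsforecast library. The exact, reproducible scripts are in the repo; the model classes, column handling, and file name below are assumptions based on the library docs, not the experiment code itself:

```python
# Minimal sketch of the statistical ensemble: average four automatic models.
# Assumes the statsforecast API; the repo contains the exact experiment scripts.
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, AutoCES, DynamicOptimizedTheta

# Long-format dataframe with columns: unique_id, ds (timestamp), y (value).
df = pd.read_csv("m3_monthly.csv")  # hypothetical file name

sf = StatsForecast(
    models=[
        AutoARIMA(season_length=12),
        AutoETS(season_length=12),
        AutoCES(season_length=12),
        DynamicOptimizedTheta(season_length=12),
    ],
    freq="M",
    n_jobs=-1,  # parallelize across series
)

fcst = sf.forecast(df=df, h=18)  # 18-month horizon for M3 monthly data

# The ensemble forecast is just the row-wise mean of the four model columns.
model_cols = [c for c in fcst.columns if c not in ("unique_id", "ds")]
fcst["StatisticalEnsemble"] = fcst[model_cols].mean(axis=1)
```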

304

Comments


picardythird t1_iyj138u wrote

IIRC there was a recent paper that demonstrated how almost all deep learning approaches for time series forecasting use flawed evaluation procedures, resulting in misleading claims about performance and terrible out-of-distribution performance.

113

whatsafrigger t1_iyjaub9 wrote

It's so so so important to set up good experiments with solid baselines and comparisons to other methods.

37

notdelet t1_iyjhvqd wrote

If you use a flawed evaluation procedure, does a solid baseline do you any good?

17

Ulfgardleo t1_iylr164 wrote

The "and" in the post you replied to was a logical "and". The best evaluation procedure does not help if you use poor, underperforming baselines.

2

csreid t1_iykq7xn wrote

And it's sometimes kinda hard to realize you're doing a bad job, especially if your bunk experiments give good results

I didn't have a ton of guidance when I was writing my thesis (so, my first actual research work) and was so disheartened when I realized my excellent groundbreaking results were actually just from bad experimental setup.

Still published tho! ^^jk

13

Pikalima t1_iylah8s wrote

Sometimes I consider retracting my very first paper because of this.

6

maxToTheJ t1_iyjw8b8 wrote

A lot of people are effectively doing hyperparameter optimization on their experimental setup to get better results so that they get into prestigious conferences.

6

kraegarthegreat t1_iyolbmo wrote

This PLAGUES my research.

The amount of detail that most papers provide about the statistical methods they use as baselines is not enough to replicate them. "We outperformed ARIMA." No orders or parameter values provided, etc.
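For contrast, a baseline description that someone could actually replicate would pin down every choice, something like this (hypothetical values, statsmodels API):

```python
# Hypothetical example of a fully specified ARIMA baseline -- the level of detail
# that would make "we outperformed ARIMA" replicable. Values are made up.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Stand-in series; a paper would name the dataset, the split, and the horizon.
y_train = pd.Series(
    np.random.default_rng(0).normal(size=120),
    index=pd.date_range("2010-01-01", periods=120, freq="M"),
)

fit = ARIMA(
    y_train,
    order=(2, 1, 1),               # (p, d, q): report these
    seasonal_order=(0, 1, 1, 12),  # (P, D, Q, s): and these
    trend="n",                     # and any trend / exogenous terms
).fit()

forecast = fit.forecast(steps=18)  # and the evaluation horizon
```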

1

uoftsuxalot t1_iyjrkll wrote

I would say forecasting in general is bs.

−6

ragamufin t1_iyl761r wrote

I’ve been doing it for a decade+ and I’m inclined to agree but it pays well and there’s no shortage of buyers. Even straight up named a model GIPSy once with a crystal ball logo, had a pretty good run.

8

uoftsuxalot t1_iyshw3h wrote

Lol, I'm minus 7 and you're positive 7 karma yet agreeing 😂. Reddit is so stupid sometimes

5

No-Yogurtcloset-6838 t1_iyjcar2 wrote

I will stick to my Exponential Smoothing good old Boomer technology.

The obvious implication of publish or perish mentality is that you cannot trust papers anymore, given all the hastily produced and broken Deep Learning conference methods.

69

StefaniaLVS t1_iyjtxji wrote

Hahaha hundreds of days burning GPUs. One can only start to suspect that the purpose of the conferences and deep learning literature is to promote GPU usage rather than improve the forecasting methods knowledge.

💵💵🤖💵💵

39

obsquire t1_iyjgih0 wrote

But those conference papers are Peer Reviewed (TM), the gold standard of those who Believe Science, and hence beyond reproach. You are hereby cancelled.

24

CyberPun-K t1_iyj4snj wrote

The M3 dataset consists of only 3,003 series, so a minimal improvement from DL is not a surprise. Everybody knows that neural networks require large datasets to show substantial improvements over statistical baselines.

What is truly surprising is the time it takes to train the networks: 13 days for a few thousand series

=> there must be something broken with the experiments

44

HateRedditCantQuitit t1_iyj6yb6 wrote

14 days is 20k minutes, so it’s about 6.7 minutes per time series. I don’t know how many models are in the ensemble, but let’s assume it’s 13 models for even math, making an average deep model take 30s to train on an average time series.
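Back-of-envelope, with that assumed 13-model ensemble:

```python
# Back-of-envelope check of the numbers above (13 models is just an assumption).
total_minutes = 14 * 24 * 60            # 14 days ~= 20,160 minutes
n_series, n_models = 3_003, 13

minutes_per_series = total_minutes / n_series             # ~6.7 minutes
seconds_per_model = minutes_per_series * 60 / n_models    # ~31 seconds
print(round(minutes_per_series, 1), round(seconds_per_model, 1))
```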

Is that so crazy?

13

CyberPun-K t1_iyj7gb9 wrote

All the models are global models trained using cross-learning, not single models per series. Unless the experiments were done that way.

19

I_LOVE_SOURCES t1_iykxl0a wrote

…. am i failing to detect humour/sarcasm? those words don’t appear to say anything

−1

BrisklyBrusque t1_iyj6bja wrote

13 days to tune multiple deep neural networks is not at all unrealistic depending on the number of gpus.

7

CyberPun-K t1_iyj6q2r wrote

N-BEATS hyperparameters are minimally explored in the original paper, and the ensemble was not tuned. There is something broken with the reported times.

17

Historical_Ad2338 t1_iylgux6 wrote

I was thinking the same thing when I looked into this. I'm not sure if the experiments are necessarily 'broken' (there may be at least reasonable justification for why it took 13 days to train), but the first point about dataset size is a smoking gun.

4

__mantissa__ t1_iylhzuj wrote

I have not read the paper yet, but the time DL ensemble takes may be due to some kind of hyperparameter search

4

cristianic18 t1_iyixd70 wrote

Also, how would someone know beforehand that this particular combination of stats methods in the ensemble would produce good results?

43

SherbertTiny2366 t1_iyj00d2 wrote

>This ensemble is formed by averaging four statistical models: AutoARIMA, ETS, CES and DynamicOptimizedTheta. This combination won sixth place and was the simplest ensemble among the top 10 performers in the M4 competition.

63

TheBrain85 t1_iymw874 wrote

Pretty biased selection method: the best ensemble in the M4 competition, evaluated on the M3 competition. Although I'm not familiar with these datasets, they're from the same author, so presumably they have significant overlap and similarity. The real question is how hard is it to find such an ensemble without overfitting to the dataset.

3

SherbertTiny2366 t1_iynjxon wrote

How is it biased to try good-performing ensembles in another data set?

And how is that overfitting?

Furthermore, just because the data sets begin with "M" it does not mean that they "have significant overlap and similarity. "

0

TheBrain85 t1_iyp1qrz wrote

Because if there's overlap in the datasets, or they contain similar data, the exact ensemble you use is essentially an optimized hyperparameter specific to the dataset. This is exactly why any hyperparameter optimization uses cross-validation on a set separate from the test set. So using the results on the M4 dataset is akin to optimizing hyperparameters on the test set, which is a form of overfitting.
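For instance, choosing among candidate ensembles on rolling validation windows carved off before the test period would avoid that. A minimal sketch (hypothetical helpers, nothing from the paper):

```python
# Sketch of rolling-origin validation: candidate models/ensembles are compared
# only on validation windows, and the final test window is used exactly once.
import numpy as np

def smape(y_true, y_pred):
    return 200 * np.mean(np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

def rolling_origin_score(y, fit_predict, horizon=18, n_folds=3):
    """Average SMAPE over successive windows carved off the end of the series."""
    scores = []
    for k in range(n_folds, 0, -1):
        cutoff = len(y) - k * horizon
        train, valid = y[:cutoff], y[cutoff:cutoff + horizon]
        scores.append(smape(valid, fit_predict(train, horizon)))
    return float(np.mean(scores))

# Toy usage: score a naive "repeat the last value" forecaster.
y = np.sin(np.arange(120) / 6) + 0.01 * np.arange(120)
naive = lambda train, h: np.repeat(train[-1], h)
print(rolling_origin_score(y, naive))
```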

The datasets are from the same author, same series of competitions: https://en.wikipedia.org/wiki/Makridakis_Competitions#Fourth_competition,_started_on_January_1,_2018,_ended_on_May_31,_2018

"The M4 extended and replicated the results of the previous three competitions"

3

WikiSummarizerBot t1_iyp1s69 wrote

Makridakis Competitions

Fourth competition, started on January 1, 2018, ended on May 31, 2018

>The fourth competition, M4, was announced in November 2017. The competition started on January 1, 2018 and ended on May 31, 2018. Initial results were published in the International Journal of Forecasting on June 21, 2018. The M4 extended and replicated the results of the previous three competitions, using an extended and diverse set of time series to identify the most accurate forecasting method(s) for different types of predictions.


2

SherbertTiny2366 t1_iyq6prw wrote

There is no overlap at all. It's a completely new dataset. There might be similarities in the sense that both contain time series of certain frequencies, but in no way can you talk about "training on the test set."

1

Puzzleheaded_Pin_379 t1_iylb596 wrote

In practice you don’t, but combiniation forecast still works. This is like saying, “how did someone know that the Total Stock Market Index would outperform bitcoin beforehand”. Combining forecast has been studied in the literature and in practice. It is effective.

6

dataslacker t1_iyjblnp wrote

I’m going to read this paper in detail but I’m wondering if there’s any insight into why DL methods underperform in TS prediction?

26

marr75 t1_iyjvtdc wrote

Just guessing here, but: overfitting.

32

Internal-Diet-514 t1_iykhg3s wrote

I think so too. I'm confused why they would need to train for 14 days; from skimming the paper, the dataset itself doesn't seem that large. I bet a DL solution that was parameterized correctly for the problem would outperform the traditional statistical approaches.

19

marr75 t1_iykwulm wrote

While I agree with your general statement, my gut says a well parameterized/regularized deep learning solution would perform as well as an ensemble of statistical approaches (without the expertise needed to select the statistical approaches) but would be harder to explain/interpret.

15

TheDrownedKraken t1_iyko6jf wrote

I’m just curious, why do you think that?

3

Internal-Diet-514 t1_iymjci2 wrote

If a model has more parameters than data points in the training set, it can quickly just memorize the training set, resulting in an overfit model. You don't always need 16+ attention heads to have the best model for a given dataset. A single self-attention layer with one head can still model more complex relationships among the inputs than something like ARIMA would.
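To put a rough number on "a single self-attention layer with one head": a PyTorch sketch of such a model (an illustration only, not the paper's architecture) comes in well under 2k parameters.

```python
# Tiny single-head self-attention forecaster for a univariate series --
# an illustration of the "small but still expressive" point, not the paper's model.
import torch
import torch.nn as nn

class OneHeadForecaster(nn.Module):
    def __init__(self, d_model=16, horizon=18):
        super().__init__()
        self.embed = nn.Linear(1, d_model)            # scalar observation -> d_model
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.head = nn.Linear(d_model, horizon)       # last position -> forecast

    def forward(self, x):                             # x: (batch, timesteps, 1)
        h = self.embed(x)
        h, _ = self.attn(h, h, h)                     # single-head self-attention
        return self.head(h[:, -1, :])                 # (batch, horizon)

model = OneHeadForecaster()
print(sum(p.numel() for p in model.parameters()))     # ~1.4k parameters
```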

2

kraegarthegreat t1_iyor5g6 wrote

This is something I have found in my research. I keep seeing people making models with millions of parameters when I am able to achieve 99% of the performance with roughly 1k.

2

TropicalAudio t1_iylsprn wrote

Little need to speculate in this case: they're trying to fit giant models on a dataset that's a fraction of a megabyte, without any targeted pretraining or prior. That's like trying to prove trains are slower than running humans by having the two compete in a 100m race from standstill. The biggest set (monthly observations) is around 105kB of data. If anyone is surprised your average 10GB+ network doesn't perform very well there, well... I suppose now you know.

7

marr75 t1_iymo8k3 wrote

Yeah

> Just guessing here, but

is a common US English idiom that typically means, "Obviously".

You're absolutely right, though. Just by comparing the training data to the training process and serialized weights, you can see how clearly this should overfit. Once your model is noticeably bigger than a dictionary of X, Y pairs of all of your training data, it's very hard to avoid overfitting.

I volunteer with a group that develops interest and skills in science and tech for kids from historically excluded groups. I was teaching a lab on CV last month and my best student was like, "What if I train for 20 epochs, tho? What about 30?" and the performance improved (but didn't generalize as well). He didn't understand generalization yet so instead, he looked at the improvement trend and had a lightbulb moment and was like, "What if I train for 10,000 epochs???" I should check to see if his name is on the list of collaborators for the paper 😂

3

psyyduck t1_iykfb3f wrote

My guess is it’s the same reason we don’t have self-driving cars: bad out of distribution performance. Teslas get confused when they see new leaves where they’ve never been seen before. In the real world, distributions change a lot over time.

8

TrueBirch t1_iymehou wrote

In addition to what other people have said, I'll add this: classical methods work really well. In fields like text and image generation, we didn't have great approaches 20 years ago, and DL models represented a massive improvement.

1

ThePhantomPhoton t1_iyj7b4a wrote

Depends on the problem. For physical phenomena, statistical techniques are very effective. For more abstract applications, like language and vision, I just don’t know how the purely statistical methods could compete.

22

TotallyNotGunnar t1_iyjpbzs wrote

Even then. I dabble in image processing at work and haven't found a need for deep learning yet. Every time, there's some trick I can pull with a rule based classifier to address the business need. It's like Duck Hunt: why recognize ducks when you can scan for white vs. black pixels?

20

ThePhantomPhoton t1_iyk2c66 wrote

Upvoted because I agree with you -- for many simple image problems you can even just grayscale the images and use the Frobenius-norm distance to each class as input to a logistic regression, and nail many of the cases.
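Roughly like this, reading "Frobenius-norm distance to each class" as distance to each class's mean image (that reading is my assumption):

```python
# Rough sketch: grayscale images, Frobenius-norm distance to each class-mean image
# as features for a logistic regression. One reading of the recipe above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def frobenius_features(images, class_means):
    """For each (H, W) grayscale image, its distance to each class-mean image."""
    return np.array([[np.linalg.norm(img - mu, ord="fro") for mu in class_means]
                     for img in images])

# Toy data: two "classes" of 32x32 grayscale images with different brightness.
rng = np.random.default_rng(0)
dark = rng.normal(0.2, 0.05, size=(50, 32, 32))
bright = rng.normal(0.8, 0.05, size=(50, 32, 32))
images = np.concatenate([dark, bright])
labels = np.array([0] * 50 + [1] * 50)

class_means = [dark.mean(axis=0), bright.mean(axis=0)]  # in practice, compute on the train split only
clf = LogisticRegression().fit(frobenius_features(images, class_means), labels)
```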

8

TrueBirch t1_iymfk2r wrote

When I first read your comment, I thought you were still talking about Duck Hunt. I'd read the heck out of that whitepaper.

2

ragamufin t1_iyl7qwv wrote

Amen. We've been doing satellite image time-series analytics, and deep learning keeps getting pushed off in favor of classification models based on complex features.

4

bushrod t1_iyjxns1 wrote

The analysis relates to time series prediction problems. Isn't it fair to say vision and language do not fall under that umbrella?

17

mtocrat t1_iyk1n65 wrote

Consider spoken language, and you're back in the realm of time-series. Obviously simple statistical methods can't deal with those though.

13

bushrod t1_iyk33jc wrote

Right, even though language is a form of time series, in practice it doesn't use TSP methods. Transformers are, unsurprisingly, being applied to TSP problems though.

6

Warhouse512 t1_iykw25k wrote

Eh, predicting where pedestrians are going, or predicting next frames in general. Even images have temporal forecasting use cases

3

ThePhantomPhoton t1_iyk2wnq wrote

I think you have a good argument for images, but language is more challenging because we rely on positional encodings (a kind of "time") to provide contextual clues, which beat out the following form of statistical language model: Pr(x_{t+1} | x_0, x_1, ..., x_t) (Edit -- that is, predicting the next word given all preceding words in the sequence)

2

eeaxoe t1_iyn2zwu wrote

Tabular data is another problem setting where DL has a tough time stacking up to simpler statistical or even ML methods.

2

cristianic18 t1_iyixblu wrote

The results are interesting, but you should include more recent deep learning approaches (not only from GluonTS).

13

GreatBigBagOfNope t1_iyk3y0g wrote

Yes, DL is a sophisticated tool for the most intractable of tasks, and for most problems it's like using the Death Star to crack a nut. This is well known and should be front of mind for any analyst of any flavour: if you're using DL, especially on anything that isn't really big or isn't natural language or image related, it should be for a good reason, because a random forest, a GAM, or an auto-fitted ARIMA will get you 80+% of the way there 80+% of the time on tabular data. Not everything needs to start with the biggest guns.

13

TrueBirch t1_iymerps wrote

>like using the Death Star to crack a nut

Or a sledgehammer.

I completely agree with you. I instruct the juniors where I work to start with the most basic possible statistical tests and add complexity only when necessary. A good-enough linear regression is easier to implement, replicate, and understand than a slightly-improved DL model.

2

michelin_chalupa t1_iykvrqx wrote

Isn’t it common knowledge that deep learning is usually not the best solution for modeling time series?

10

TrueBirch t1_iymfbbz wrote

Yes, but I've seen many proposals to apply DL to everyday problems where it's not well suited. Heck, even I briefly went down that rabbit hole with a graph theory problem at work. Tried out a basic greedy algorithm first and it worked well enough that I didn't see the need to get any more complicated.

1

SrPinko t1_iyjsraw wrote

I agree, for univariate time series a statistical model should be enough in most cases; however, I still think DL models would outperform statistical models on multivariate time series with a big set of variables, like the MIMIC-III database. Am I wrong in this belief?

9

mtocrat t1_iyk1se1 wrote

Even for univariate time series, when you have the data & complexity, DL will obviously outperform simple methods. Show me the simple statistical method that can generate speech, a univariate time-series.

9

TrueBirch t1_iymf42w wrote

Wouldn't a DL model trained on a waveform just assume you were going to keep repeating the same words over and over?

1

mtocrat t1_iymi8i7 wrote

You could already tape together a deep learning solution consisting of neural speech recognition, an LLM, and WaveNet. Counts as a deep learning solution in my book. I'm not sure if anyone has built an end-to-end solution, and I expect it would be worse, but I'm sure if someone put their mind and money to it you'd get decent results.

2

kraegarthegreat t1_iyorlrr wrote

From my personal experience:

- Univariate with a few timesteps: XGBoost or statistical methods.

- Multivariate with many timesteps: NN-based models.

4

TrueBirch t1_iymf0eo wrote

Depends how much data you have and how much signal there is. Separating signal from noise in a high-dimensional time series is always a challenge.

1

AceOfSpades0711 t1_iykea0o wrote

The current, rather excessive, employment of deep learning methods is majorly motivated by the desire to understand them better through the experience gained in applying them.

A good paper that puts this into perspective is Leo Breiman's "Statistical Modeling: The Two Cultures". He argues in the paper that data-based statistical models are preventing statisticians from making new and exciting discoveries with algorithmic models. Coincidentally, the author is the creator of the ensemble idea that you are using here as explanation. Now take into account that this was written in 2001, when ensembles were what deep learning is in 2022.

Basically, deep learning is preferred in order to improve it to a point where it will by far outperform all other methods, which it is believed to have the potential for. For it may one day lead us to new and exciting discoveries.

3

bigb0w t1_iykxhz9 wrote

Essentially, simple statistical models are much more eco-friendly to the planet!

2

Sallao t1_iymcgzn wrote

Lost 9 months of this shit

2

TrueBirch t1_iymdxy3 wrote

Great writeup! Reminds me of the excellently named "Cracking nuts with a sledgehammer: when modern graph neural networks do worse than classical greedy algorithms" (https://arxiv.org/abs/2206.13211).

2

serge_cell t1_j05qcrj wrote

DL does not work well on low-dimensional sample data, or data with low correlation between sample elements, and it is especially bad for time series prediction, which is both. Many people put that kind of senseless project (DL for time series) on their CV, and that is an instant black mark for a candidate, at least for me. They say, "But that approach did work!" I ask, "Did you try anything else?" "No."

1