Internal-Diet-514 t1_iymjci2 wrote

If a model has more parameters than datapoints in the training set, it can quickly memorize the training set, resulting in an overfit model. You don't always need 16+ attention heads to get the best model for a given dataset. A single self-attention layer with one head can still capture more complex relationships among the inputs than something like ARIMA.

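To give a concrete sense of scale, here is a minimal NumPy sketch of a single-head self-attention layer. All names (`d_model`, `W_q`, etc.) and the choice of `d_model = 16` are illustrative, not from the comment; with four 16x16 projection matrices, the layer lands at roughly the 1k-parameter scale discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # small embedding size keeps the parameter count tiny

# Query, key, value projections plus an output projection (biases omitted).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
W_o = rng.normal(size=(d_model, d_model))

def self_attention(x):
    """Single-head self-attention: x of shape (seq_len, d_model) -> same shape."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)              # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return (weights @ v) @ W_o

x = rng.normal(size=(10, d_model))   # a sequence of 10 timesteps
out = self_attention(x)
n_params = sum(w.size for w in (W_q, W_k, W_v, W_o))
print(out.shape, n_params)  # (10, 16) 1024
```

Even this tiny layer mixes information across all timesteps with data-dependent weights, which a linear model like ARIMA cannot do.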
kraegarthegreat t1_iyor5g6 wrote

This is something I have found in my research. I keep seeing people build models with millions of parameters when I can achieve 99% of the performance with roughly 1k.
