Submitted by fedegarzar t3_z9vbw7 in MachineLearning
Internal-Diet-514 t1_iymjci2 wrote
Reply to comment by TheDrownedKraken in [R] Statistical vs Deep Learning forecasting methods by fedegarzar
If a model has more parameters than data points in the training set, it can quickly memorize the training set, resulting in an overfit model. You don't always need 16+ attention heads to have the best model for a given dataset. A single self-attention layer with one head can still model more complex relationships among the inputs than something like ARIMA.
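For a sense of scale, here's a minimal sketch of what a single-head self-attention forecaster could look like in PyTorch (the embedding size and window length are arbitrary illustrative choices, not anything from the benchmark in the post):

```python
import torch
import torch.nn as nn

class TinyAttentionForecaster(nn.Module):
    def __init__(self, d_model: int = 16):
        super().__init__()
        self.embed = nn.Linear(1, d_model)   # project each scalar time step
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.head = nn.Linear(d_model, 1)    # one-step-ahead forecast

    def forward(self, x):                    # x: (batch, window, 1)
        h = self.embed(x)
        h, _ = self.attn(h, h, h)            # single self-attention layer, one head
        return self.head(h[:, -1])           # predict the next value

model = TinyAttentionForecaster()
x = torch.randn(8, 24, 1)                    # batch of 8 length-24 windows
print(model(x).shape)                        # torch.Size([8, 1])
```

Even something this small can attend over the whole input window and learn interactions between time steps that a linear AR model can't express.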
kraegarthegreat t1_iyor5g6 wrote
This matches what I have found in my research. I keep seeing people build models with millions of parameters when I can achieve 99% of the performance with roughly 1k.
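As a quick illustration of that kind of sanity check (the toy MLP below is a hypothetical stand-in, not the commenter's actual architecture), counting trainable parameters in PyTorch is a one-liner:

```python
import torch.nn as nn

# Hypothetical tiny model, just to show the bookkeeping.
model = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Linear(32, 1))

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # (24*32 + 32) + (32*1 + 1) = 833 -- roughly the "1k" regime
```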