Historical_Ad2338 t1_iujw7fv wrote
Reply to [D] When the GPU is NOT the bottleneck...? by alexnasla
LSTMs are quite slow in practice (since their computation is sequential over time steps and can't be parallelized), which is one of the main reasons Transformers have taken off (besides improved performance). In an NLP setting with sequence lengths of ~1024 and models in the 100 million parameter range, a Transformer can get through an epoch about 10x faster in my experience (though it does need more memory). I'd recommend a Transformer, and if recurrence is really important, you can always use SRU++, which gives parallelizable recurrence.
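To make the parallelism point concrete, here's a minimal PyTorch sketch (my own illustration; the layer count, batch size, and model width are assumptions, not numbers from the thread) timing one forward/backward step for an LSTM versus a comparable Transformer encoder at sequence length 1024:

```python
# Illustrative timing sketch, not a rigorous benchmark. Model sizes are
# assumed for demonstration and roughly matched between the two models.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, batch, d_model = 1024, 8, 512

# Recurrent baseline: computation proceeds sequentially across 1024 time steps.
lstm = nn.LSTM(d_model, d_model, num_layers=4, batch_first=True).to(device)

# Transformer: all 1024 positions are processed in parallel within each layer.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=4).to(device)

x = torch.randn(batch, seq_len, d_model, device=device)

def timed_step(model):
    """Run one forward + backward pass and return wall-clock seconds."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    out = model(x)
    if isinstance(out, tuple):  # nn.LSTM returns (output, (h_n, c_n))
        out = out[0]
    out.sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print(f"LSTM step:        {timed_step(lstm):.3f}s")
print(f"Transformer step: {timed_step(transformer):.3f}s")
```

On a GPU the Transformer step should come out much faster at this sequence length, since the LSTM has to step through the 1024 positions one at a time while the Transformer handles them all at once per layer.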
Historical_Ad2338 t1_is8ru9y wrote
Reply to comment by [deleted] in [D] Manually creating the target data is considered as data leakage. by [deleted]
Yeah... I see no reason to use ML for this unless you have a very good reason to. It doesn't really seem like cheating, though.
Historical_Ad2338 t1_is3636c wrote
Genuinely shocking. "Scaling Laws for Neural Language Models" (Figure 6) found that single-layer models weren't supposed to scale this well (at the same parameter count), though of course the finer details of this new paper differ.
Historical_Ad2338 t1_iylgux6 wrote
Reply to comment by CyberPun-K in [R] Statistical vs Deep Learning forecasting methods by fedegarzar
I was thinking the same thing when I looked into this. I'm not sure the experiments are necessarily 'broken' (there may at least be a reasonable justification for why training took 13 days), but the first point about dataset size is a smoking gun.