Historical_Ad2338 t1_iujw7fv wrote
Reply to [D] When the GPU is NOT the bottleneck...? by alexnasla
LSTMs are quite slow in practice (since their computation is sequential over time steps and can't be parallelized), which is one of the main reasons Transformers have taken off (besides improved performance). In an NLP setting with sequence lengths of ~1024 and models in the 100 million parameter range, a Transformer can get through an epoch about 10x faster in my experience (though it does need more memory). I'd recommend a Transformer, and if recurrence is really important, you can always use SRU++, which gives parallelizable recurrence.
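To make the parallelism point concrete, here's a minimal PyTorch sketch (my own illustration; the layer count, batch size, and model width are assumptions, not numbers from the thread) timing one forward/backward step for an LSTM versus a comparable Transformer encoder at sequence length 1024:

```python
# Illustrative timing sketch, not a rigorous benchmark. Model sizes are
# assumed for demonstration and roughly matched between the two models.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, batch, d_model = 1024, 8, 512

# Recurrent baseline: computation proceeds sequentially across 1024 time steps.
lstm = nn.LSTM(d_model, d_model, num_layers=4, batch_first=True).to(device)

# Transformer: all 1024 positions are processed in parallel within each layer.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=4).to(device)

x = torch.randn(batch, seq_len, d_model, device=device)

def timed_step(model):
    """Run one forward + backward pass and return wall-clock seconds."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    out = model(x)
    if isinstance(out, tuple):  # nn.LSTM returns (output, (h_n, c_n))
        out = out[0]
    out.sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print(f"LSTM step:        {timed_step(lstm):.3f}s")
print(f"Transformer step: {timed_step(transformer):.3f}s")
```

On a GPU the Transformer step should come out much faster at this sequence length, since the LSTM has to step through the 1024 positions one at a time while the Transformer handles them all at once per layer.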
Historical_Ad2338 t1_is8ru9y wrote
Reply to comment by [deleted] in [D] Manually creating the target data is considered as data leakage. by [deleted]
Yeah... I see no reason to use ML for this unless you have a very good reason to. It doesn't really seem like cheating, though.
Historical_Ad2338 t1_is3636c wrote
Genuinely shocking. "Scaling Laws for Neural Language Models" (Figure 6) found that single-layer models weren't supposed to scale this well (at the same parameter count), though of course the finer details of this new paper differ.
Historical_Ad2338 t1_iylgux6 wrote
Reply to comment by CyberPun-K in [R] Statistical vs Deep Learning forecasting methods by fedegarzar
I was thinking the same thing when I looked into this. I'm not sure the experiments are necessarily 'broken' (there may at least be a reasonable justification for why training took 13 days), but the first point about dataset size is a smoking gun.