Historical_Ad2338 t1_iujw7fv wrote

LSTMs are quite slow in practice (the recurrence can't be parallelized across the time dimension), which is one of the main reasons Transformers have taken off (besides improved performance). In an NLP setting with sequence lengths around 1024 and models in the 100-million-parameter range, a Transformer can get through an epoch roughly 10x faster in my experience (though it does need more memory). I'd recommend a Transformer, and if recurrence is really important, you can always use SRU++, which gives you parallelizable recurrence.
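
If you want to see the throughput gap for yourself, here's a rough sketch of a timing comparison using stock PyTorch modules. The layer counts, batch size, and model width are arbitrary assumptions for illustration (not the ~100M-parameter setup I mentioned), and it only times the forward pass on CPU, so treat the numbers as directional rather than a proper benchmark.

```python
import time

import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
BATCH, SEQ_LEN, D_MODEL, N_LAYERS = 8, 1024, 512, 4

# Both nn.LSTM and nn.TransformerEncoder default to (seq, batch, feature) inputs.
x = torch.randn(SEQ_LEN, BATCH, D_MODEL)

lstm = nn.LSTM(input_size=D_MODEL, hidden_size=D_MODEL, num_layers=N_LAYERS)
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8),
    num_layers=N_LAYERS,
)

def timed_forward(model, inputs, n_iters=5):
    """Average wall-clock time of a forward pass over n_iters runs."""
    with torch.no_grad():
        model(inputs)  # warm-up run
        start = time.perf_counter()
        for _ in range(n_iters):
            model(inputs)
        return (time.perf_counter() - start) / n_iters

print(f"LSTM:        {timed_forward(lstm, x):.3f} s per forward pass")
print(f"Transformer: {timed_forward(transformer, x):.3f} s per forward pass")
```

The gap gets bigger on GPU, since the Transformer's attention and feed-forward layers process all 1024 positions in parallel while the LSTM still has to step through them one at a time.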

5