Submitted by AutoModerator t3_122oxap in MachineLearning
masterofn1 t1_jdu8jug wrote
How does a Transformer architecture handle inputs of different lengths? Is the sequence length limit inherent to the model architecture or more because of resource issues like memory?
Matthew2229 t1_jduyi8o wrote
It's mostly a memory issue. The attention score matrix is N × N for a sequence of length N, so memory (and compute) scale quadratically with sequence length, and we simply run out of memory for long sequences. Within that limit, inputs of different lengths are just padded to a common length and masked. Much of the development around transformers/attention has been targeting this specific problem.
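A minimal sketch of the quadratic blow-up (assuming PyTorch; `d_model` and the sequence lengths here are made-up, illustrative values):

```python
import torch

# The attention score matrix alone is N x N, so its memory
# grows quadratically with sequence length N.
d_model = 64
for n in (1_000, 10_000):
    q = torch.randn(n, d_model)
    k = torch.randn(n, d_model)
    scores = q @ k.T / d_model ** 0.5          # shape (n, n)
    mb = scores.numel() * scores.element_size() / 1e6
    print(f"N={n}: score matrix {tuple(scores.shape)}, ~{mb:.0f} MB")
```

Going from 1k to 10k tokens multiplies that single matrix's memory by 100 (roughly 4 MB to 400 MB in fp32), and that's per head, per layer, before storing anything for the backward pass. Techniques like FlashAttention get around this by never materializing the full N × N matrix.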