Viewing a single comment thread. View all comments

Competitive-Rub-1958 t1_irwn68x wrote

I definitely agree with you there, but I wouldn't take the LEGO paper results on face value until other analyses confirm it. Basically, LEGO does show (appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737)

As the authors show, the pre-trained model also learns an Associative and manipulation head (if you add those at initialization to a randomly-init model, you obtain same perf as pre-trained one) So the model effectively discovers a prior - just not general enough for OOD generalization.

You're definitely right in that the equivariance it learns it a shortcut. The difference is, from the model's POV its not. It performs well w.r.t the loss function which is evaluated only on the training set.
But once you start giving it longer and longer sequences, it's pre-existing priors act towards more evolving more general representations and priors.

And ofc, as the paper said that its OOD due to positional encodings - so if they'd used some other positional encodings it might've been showing better results. Right now, its hard to judge because there were no ablations for encodings (despite the paper mentioning them like 5 times)

2