Viewing a single comment thread. View all comments

yldedly t1_iraslo4 wrote

Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data the pre-trained BERT was trained almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.

2

Competitive-Rub-1958 t1_iret615 wrote

It goes into the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That coupled with BERT's smaller pre-training dataset lends more credence to the idea of pre-training introducing priors to chop through the hypothesis space rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyways)

2

yldedly t1_irhr4lg wrote

If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.

2

Competitive-Rub-1958 t1_irj4p6e wrote

It's rather a trend they're trying to study and explain. It appears, as you scale models and bootstrap from pre-trained variations, you learn plenty of useful priors. this is quite crucial for LLMs which are able to solve many tasks which may not be explictly in their distribution, but are able to muddle their way along much better rather than being pre-trained from scratch. In that sense, transfer learning is much more about transferring priors than knowledge.

LLMs like Chinchilla and PaLM best demonstrate that, I suppose. PaLM was trained with 95% of that data being Social Media (which is 50% alone) and miscellaneous topics, only 5% being the GitHub subset. Yet with 50X less code in its dataset, its able to pull up to Codex.

This may hint towards larger models learning more general priors applicable on a variety of tasks, and this trend being highly correlated with scale. So, I think the hope is that as you scale up the priors these models learn the underlying function better rather than just shortcut learning their way. A good demonstration would've been fine-tuning GPT3 with a sizeable chunk of the LEGO dataset and checking if it has higher generalizability on those tasks.

2

yldedly t1_irrrqyi wrote

You've shifted my view closer to yours. What you say about pretraining and priors makes a lot of sense. But I still think shortcut learning is a fundamental problem irrespective of scale - it becomes less of a problem with scale, but not quickly enough. For modern ml engineering, pretraining is a boon, but for engineering general intelligence, I think we need stronger generalization than is possible without code as representations and causal inference.

2

Competitive-Rub-1958 t1_irs8vxl wrote

Even in the context of AGI, humans also carry many priors - most of them embedded in the DNA pertaining to the fundamental "blueprint" of a cortical column.
It appears that instead of evolution, natural selection and mutation if we can learn those same priors faster and more efficiently that natural selection with gradient based methods.

https://twitter.com/gruver_nate/status/1578386103417069569 is a twitter summary describing how the transformer learns positional equivariance in the scope of their dataset. This is quite a complex prior, and is present in convolutions implicitly.

It makes sense to collate all our findings, and think that with scale those priors simply become more general - hence why we obtain such massive performance boosts which are also predictable and haven't yet stopped progress (530B is a number thrown around everywhere, but people don't realize the insane amount of compute and work which went into it. It's absolutely humongous for any system to scale to that size, let alone still be able to beat benchmarks)

I feel there are still more general priors we could embed in these models to make them more parameter efficient. But it is clear that DL is still currently the most viable route towards AGI as of now.

2

yldedly t1_irvfafm wrote

There's a lot to unpack here. I agree that a large part of creating AGI is building in the right priors ("learning priors" is a bit of an oxymoron imo, since a prior is exactly the part you don't learn, but it makes sense that a posterior for a pre-trained model is a prior for a fine-tuned model).

Invariance and equivariance are a great example. Expressed mathematically, using symbols, it makes no sense to say a model is more or less equivariant - it either is or it isn't. If you explicitly build equivariance into a model (and apparently it's not as straightforward as e.g. just using convolutions), then this is really what you get. For example, the handwriting model from my blogpost has real translational equivariance (because the location of a character is sampled).

If you instead learn the equivariance, you will only ever learn a shortcut - something that works on training and test data, but not universally, as the paper from the twitter thread shows. Just like the networks that can solve the LEGO task for 6 variables don't generalize to any number of variables, learning "equivariance" on one dataset (even if it's a huge one) doesn't guarantee equivariance on another. A neural network can't represent an algorithm like "for all variables, do x", or constraints like "f(g(x)) = g(f(x)), for all x" - you can't represent universal quantifiers using finite dimensional vectors.

That being said, you can definitely learn some useful priors by training very large networks on very large data. An architecture like the Transformer allows for some very general-purpose priors, like "do something for pairs of tokens 4 tokens apart".

2

Competitive-Rub-1958 t1_irwn68x wrote

I definitely agree with you there, but I wouldn't take the LEGO paper results on face value until other analyses confirm it. Basically, LEGO does show (appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737)

As the authors show, the pre-trained model also learns an Associative and manipulation head (if you add those at initialization to a randomly-init model, you obtain same perf as pre-trained one) So the model effectively discovers a prior - just not general enough for OOD generalization.

You're definitely right in that the equivariance it learns it a shortcut. The difference is, from the model's POV its not. It performs well w.r.t the loss function which is evaluated only on the training set.
But once you start giving it longer and longer sequences, it's pre-existing priors act towards more evolving more general representations and priors.

And ofc, as the paper said that its OOD due to positional encodings - so if they'd used some other positional encodings it might've been showing better results. Right now, its hard to judge because there were no ablations for encodings (despite the paper mentioning them like 5 times)

2