
yldedly t1_ir5n8r8 wrote

>we get better results^(*)

^(*)results on out-of-distribution data sold separately

8

Optional_Joystick t1_ir5oz6d wrote

I'm not sure what \) means, but totally agree data is also a bottleneck. Imagine if the system could also seek out data on its own that isn't totally random noise, and yet isn't fully understood by the model.

2

yldedly t1_ir6reto wrote

Doesn't render right on mobile, it's supposed to be an asterisk. My point is that no matter how much data you get, in practice there'll always be data the model doesn't understand, because it's statistically too different from training data. I have a blog post about it, but it's a well known issue.
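
A minimal sketch of the out-of-distribution problem described above, assuming a toy setup that is not from the thread: a straight line is fit by least squares to y = x^2 on inputs in [0, 1], then evaluated at x = 3, far outside the training range. The data, the linear model, and all names are illustrative assumptions.

```python
# Toy illustration (assumed setup): fit a line to y = x^2 on [0, 1],
# then query it far outside that range.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Training" data: 11 evenly spaced points in [0, 1].
train_x = [i / 10 for i in range(11)]
train_y = [x ** 2 for x in train_x]

slope, intercept = fit_line(train_x, train_y)

def predict(x):
    return slope * x + intercept

# Worst error on the training range versus error at an unseen, distant input.
in_dist_error = max(abs(predict(x) - x ** 2) for x in train_x)
ood_error = abs(predict(3.0) - 3.0 ** 2)

print(f"max in-distribution error: {in_dist_error:.3f}")
print(f"error at x = 3:            {ood_error:.3f}")
```

The fit looks acceptable on the range it was trained on and is badly wrong at x = 3. Adding more points inside [0, 1] would not change that.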

3

Optional_Joystick t1_ir72t5g wrote

Really appreciate this. I was excited enough about learning knowledge distillation was a thing. I felt we had the method of extracting the useful single rule from the larger model.

On the interpolation/extrapolation piece: For certain functions like x^2, wouldn't running the result of a function through the function again let you achieve a result that "extrapolates" a new result outside the existing data set? This is kind of my position on why I feel feeding LLM data generated from an LLM can result in something new.
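
A minimal sketch of that x^2 idea, under the assumption that the function is known exactly (a learned model would only approximate it, so errors could compound on each pass). The input values are made up for illustration.

```python
def f(x):
    return x ** 2

# Hypothetical "data set": some inputs and the outputs observed for them.
inputs = [1.5, 2.0, 3.0]
outputs = [f(x) for x in inputs]

# Feed the outputs back through the same function.
second_pass = [f(y) for y in outputs]

# Anything larger than every value seen so far lies outside the original data.
max_seen = max(inputs + outputs)
novel = [y for y in second_pass if y > max_seen]

print("outputs:    ", outputs)
print("second pass:", second_pass)
print("novel:      ", novel)
```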

It's still not clear to me how we can verify a model's performance if we don't have data to test it on. I'll have to read more about DreamCoder. As much as I wish I could work in the field, it looks like I've still got a lot to learn.

2

Competitive-Rub-1958 t1_ir7f6mk wrote

Well, scaling alleviates OOD generalization while cleverly pre-training induces priors into the model, shrinking the hypothesis space and pushing the model towards being able to generalize more and more OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).

The LEGO paper demonstrates that quite well - even demonstrating pre-trained networks being able to generalize a little on unseen seqlen before diving down to 0 - presumably because we still need to find the ideal positional encodings...

2

yldedly t1_ir8ya0q wrote

LEGO paper?

2

Competitive-Rub-1958 t1_irai5ot wrote

2

yldedly t1_iraslo4 wrote

Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data the pre-trained BERT was trained on almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.

2

Competitive-Rub-1958 t1_iret615 wrote

It goes into the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That coupled with BERT's smaller pre-training dataset lends more credence to the idea of pre-training introducing priors to chop through the hypothesis space rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyways)

2

yldedly t1_irhr4lg wrote

If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.

2

Competitive-Rub-1958 t1_irj4p6e wrote

It's rather a trend they're trying to study and explain. It appears that as you scale models and bootstrap from pre-trained variations, you learn plenty of useful priors. This is quite crucial for LLMs, which are able to solve many tasks that may not be explicitly in their distribution, and which muddle their way along much better than models trained from scratch. In that sense, transfer learning is much more about transferring priors than knowledge.

LLMs like Chinchilla and PaLM best demonstrate that, I suppose. PaLM was trained with 95% of its data being social media (which alone is 50%) and miscellaneous topics, with only 5% being the GitHub subset. Yet with 50x less code in its dataset, it's able to pull up to Codex.

This may hint towards larger models learning more general priors applicable to a variety of tasks, with this trend being highly correlated with scale. So, I think the hope is that as you scale up, the priors help these models learn the underlying function rather than just shortcut-learning their way through. A good demonstration would've been fine-tuning GPT-3 on a sizeable chunk of the LEGO dataset and checking whether it generalizes better on those tasks.

2

yldedly t1_irrrqyi wrote

You've shifted my view closer to yours. What you say about pretraining and priors makes a lot of sense. But I still think shortcut learning is a fundamental problem irrespective of scale - it becomes less of a problem with scale, but not quickly enough. For modern ML engineering, pretraining is a boon, but for engineering general intelligence, I think we need stronger generalization than is possible without code as representations and causal inference.

2

Competitive-Rub-1958 t1_irs8vxl wrote

Even in the context of AGI, humans also carry many priors - most of them embedded in the DNA pertaining to the fundamental "blueprint" of a cortical column.
It appears that, instead of relying on evolution, natural selection and mutation, we can learn those same priors faster and more efficiently with gradient-based methods.

https://twitter.com/gruver_nate/status/1578386103417069569 is a twitter summary describing how the transformer learns positional equivariance in the scope of their dataset. This is quite a complex prior, and is present in convolutions implicitly.

It makes sense to collate all our findings, and think that with scale those priors simply become more general - hence why we obtain such massive performance boosts which are also predictable and haven't yet stopped progress (530B is a number thrown around everywhere, but people don't realize the insane amount of compute and work which went into it. It's absolutely humongous for any system to scale to that size, let alone still be able to beat benchmarks)

I feel there are still more general priors we could embed in these models to make them more parameter efficient. But it is clear that DL is currently the most viable route towards AGI.

2

yldedly t1_irvfafm wrote

There's a lot to unpack here. I agree that a large part of creating AGI is building in the right priors ("learning priors" is a bit of an oxymoron imo, since a prior is exactly the part you don't learn, but it makes sense that a posterior for a pre-trained model is a prior for a fine-tuned model).

Invariance and equivariance are a great example. Expressed mathematically, using symbols, it makes no sense to say a model is more or less equivariant - it either is or it isn't. If you explicitly build equivariance into a model (and apparently it's not as straightforward as e.g. just using convolutions), then this is really what you get. For example, the handwriting model from my blogpost has real translational equivariance (because the location of a character is sampled).

If you instead learn the equivariance, you will only ever learn a shortcut - something that works on training and test data, but not universally, as the paper from the twitter thread shows. Just like the networks that can solve the LEGO task for 6 variables don't generalize to any number of variables, learning "equivariance" on one dataset (even if it's a huge one) doesn't guarantee equivariance on another. A neural network can't represent an algorithm like "for all variables, do x", or constraints like "f(g(x)) = g(f(x)), for all x" - you can't represent universal quantifiers using finite dimensional vectors.
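
A rough sketch of the "f(g(x)) = g(f(x)), for all x" constraint for the toy case of circular shifts, which is an assumption made here for illustration. A circular convolution commutes with every shift by construction, while a map with position-specific weights does not. The function names and values are hypothetical.

```python
def shift(xs, k):
    """Circularly shift a list right by k positions (this plays the role of g)."""
    k %= len(xs)
    return xs[-k:] + xs[:-k] if k else list(xs)

def circ_conv(xs, kernel):
    """Circular convolution: the same kernel is applied at every position."""
    n = len(xs)
    return [sum(kernel[j] * xs[(i - j) % n] for j in range(len(kernel)))
            for i in range(n)]

def positional(xs, weights):
    """Position-specific weights: each index is treated differently."""
    return [w * x for w, x in zip(weights, xs)]

def commutes_with_shifts(f, xs):
    """Check f(shift(x)) == shift(f(x)) for every shift of this one input."""
    return all(f(shift(xs, k)) == shift(f(xs), k) for k in range(len(xs)))

x = [1, 0, 0, 0, 2]
conv_ok = commutes_with_shifts(lambda v: circ_conv(v, [1, -2, 3]), x)
pos_ok = commutes_with_shifts(lambda v: positional(v, [1, 2, 3, 4, 5]), x)

print("circular convolution commutes with shifts:", conv_ok)
print("positional weighting commutes with shifts:", pos_ok)
```

The check itself only covers the inputs it is given. For the convolution, the guarantee for all inputs comes from its structure, not from the test, which is the distinction the comment draws between built-in and learned equivariance.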

That being said, you can definitely learn some useful priors by training very large networks on very large data. An architecture like the Transformer allows for some very general-purpose priors, like "do something for pairs of tokens 4 tokens apart".

2

Competitive-Rub-1958 t1_irwn68x wrote

I definitely agree with you there, but I wouldn't take the LEGO paper results at face value until other analyses confirm them. Basically, LEGO does show (appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737)

As the authors show, the pre-trained model also learns an association head and a manipulation head (if you add those at initialization to a randomly initialized model, you obtain the same performance as the pre-trained one). So the model effectively discovers a prior - just not one general enough for OOD generalization.

You're definitely right in that the equivariance it learns is a shortcut. The difference is, from the model's POV it's not. It performs well w.r.t. the loss function, which is evaluated only on the training set.
But once you start giving it longer and longer sequences, its pre-existing priors push it towards evolving more general representations and priors.

And ofc, the paper said that the OOD failure is due to positional encodings - so if they'd used some other positional encodings, it might have shown better results. Right now, it's hard to judge because there were no ablations for encodings (despite the paper mentioning them like 5 times).

2