
andreichiffa t1_j6c9xf1 wrote

That’s a very bold claim that flies in the face of pretty much all the research on the subject to date.

Surely you have extraordinary evidence to support such extraordinary claims?


visarga t1_j6n5mgc wrote

Oh, yes, gladly. This "open"AI paper says it:

> Larger models are significantly more sample efficient, such that optimally compute efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

https://arxiv.org/abs/2001.08361

You can improve outcomes from small datasets by making the model larger.
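For intuition, here is a minimal sketch of the joint loss form L(N, D) that the paper fits, where N is parameter count and D is dataset size in tokens. The constants below are ballpark illustrations in the spirit of the published fits, not the exact values:

```python
# Sketch of a Kaplan-style joint scaling law L(N, D).
# ALPHA_N, ALPHA_D, N_C, D_C are illustrative ballpark constants,
# not the exact numbers fitted in the paper.
ALPHA_N = 0.076   # how fast loss falls with parameter count N
ALPHA_D = 0.095   # how fast loss falls with dataset size D (in tokens)
N_C = 8.8e13      # normalization constant for N
D_C = 5.4e13      # normalization constant for D

def loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Same modest dataset (1e9 tokens), two model sizes:
# the larger model is predicted to reach a lower loss.
for n in (1e8, 1e10):
    print(f"N={n:.0e}, D=1e9 tokens -> predicted loss ~ {loss(n, 1e9):.2f}")
```

The point is just that, under this kind of fit, holding D fixed and growing N keeps pushing the predicted loss down until the D_c/D term dominates.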


andreichiffa t1_j6n9lg6 wrote

A lot of the conclusions from that paper have been called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805

About a year after that, Anthropic came out with a paper suggesting scaling laws under which undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf

Finally, more recent results from DeepMind did an additional pass on the topic and seem to suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained on 4x the data would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
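As a rough illustration of how tight that relationship is, here is a back-of-the-envelope sketch of a Chinchilla-style allocation. It assumes the usual C ≈ 6·N·D approximation for training FLOPs and the roughly-20-tokens-per-parameter rule of thumb; both are stated assumptions, not exact prescriptions from the paper:

```python
# Back-of-the-envelope Chinchilla-style compute-optimal allocation.
# Assumptions: training FLOPs C ~ 6 * N * D, and the compute-optimal
# regime puts roughly 20 tokens per parameter (D ~ 20 * N).
def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) splitting c_flops compute-optimally."""
    n_params = (c_flops / (6.0 * tokens_per_param)) ** 0.5  # from C = 6 * N * (k * N)
    return n_params, tokens_per_param * n_params

for c in (1e21, 1e23):  # two hypothetical compute budgets
    n, d = compute_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n:.2e} params, ~{d:.2e} tokens")
```

Because N and D both scale like the square root of compute here, scaling the model up without scaling the data leaves it undertrained rather than sample-efficient.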

Basically, the original OpenAI paper did contradict a lot of prior research on overfitting and generalization, and that seems to be due to an instance of Simpson's paradox in some of the batching they were doing.
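For anyone unfamiliar with the term, here is a toy illustration of Simpson's paradox with made-up numbers. It has nothing to do with OpenAI's actual batches; it just shows how a per-group trend can reverse once the groups are pooled:

```python
# Toy Simpson's paradox: within each subgroup "treated" beats "control",
# but pooled across subgroups the comparison flips.
# All numbers are invented purely for illustration.
groups = {
    "group_A": {"treated": (8, 10), "control": (15, 20)},   # (successes, trials)
    "group_B": {"treated": (20, 90), "control": (2, 10)},
}

def rate(successes: int, trials: int) -> float:
    return successes / trials

# Per-group comparison: "treated" wins in both groups.
for name, g in groups.items():
    print(name, round(rate(*g["treated"]), 3), ">", round(rate(*g["control"]), 3))

# Pooled comparison: "control" wins overall.
pooled = {
    arm: (sum(g[arm][0] for g in groups.values()),
          sum(g[arm][1] for g in groups.values()))
    for arm in ("treated", "control")
}
print("pooled:", {arm: round(rate(*sn), 3) for arm, sn in pooled.items()})
```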
