
harharveryfunny t1_jdm3bm4 wrote

It seems most current models have more parameters than their training data can make use of. DeepMind did a study on model size vs. number of training tokens and concluded that for each doubling of the parameter count, the number of training tokens also needs to double, and that a model like GPT-3, trained on 300B tokens, would really need around 3.7T tokens (more than a 10x increase) to take advantage of its size.

To validate their scaling law, DeepMind built the 70B-parameter Chinchilla model, trained it on the predicted optimal 1.4T (!) tokens, and found it to outperform GPT-3.

https://arxiv.org/abs/2203.15556
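
For a rough sense of the numbers, here's a back-of-the-envelope sketch using the often-quoted ~20 tokens-per-parameter ratio (my approximation of the paper's result, not an exact figure from it):

```python
# Rough sketch of the Chinchilla rule of thumb: compute-optimal training
# uses roughly ~20 tokens per parameter (an approximation, not an exact law).

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

for name, n_params, actual_tokens in [
    ("GPT-3",      175e9, 300e9),   # trained on ~300B tokens
    ("Chinchilla",  70e9, 1.4e12),  # trained on ~1.4T tokens
]:
    opt = optimal_tokens(n_params)
    print(f"{name}: {n_params / 1e9:.0f}B params -> "
          f"~{opt / 1e12:.1f}T tokens optimal, actually trained on "
          f"{actual_tokens / 1e12:.2f}T")
```

which lines up with GPT-3 being badly under-trained for its size and Chinchilla sitting right at the predicted optimum.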


alrunan t1_jdmbv4k wrote

The Chinchilla scaling laws are just used to calculate the optimal dataset and model size for a particular training compute budget.
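
Roughly, you can back out both sizes from a FLOP budget like this (a minimal sketch, assuming the usual C ≈ 6·N·D FLOP estimate and the ~20 tokens/param ratio, both approximations on my part rather than exact numbers from the paper):

```python
import math

# Given a training compute budget C (FLOPs), pick model size N and token count D.
# Assumptions: C ~= 6 * N * D, and the compute-optimal point sits near D ~= 20 * N.

TOKENS_PER_PARAM = 20

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly spend the budget compute-optimally."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# e.g. roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```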

You should read the LLaMA paper.


harharveryfunny t1_jdmd38s wrote

>You should read the LLaMA paper.

OK - will do. What specifically did you find interesting (related to scaling or not)?


alrunan t1_jdmm3lw wrote

The 7B model is trained on 1T tokens and performs really well for its number of parameters.
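
It's trained way past the Chinchilla-optimal point for its size. A quick ratio check (same approximate ~20 tokens/param figure as above, which comes from Chinchilla, not the LLaMA paper):

```python
# Tokens-per-parameter for LLaMA-7B vs. the approximate Chinchilla-optimal ratio.
llama_7b_params = 7e9
llama_7b_tokens = 1e12   # LLaMA-7B's ~1T training tokens

ratio = llama_7b_tokens / llama_7b_params
print(f"LLaMA-7B: ~{ratio:.0f} tokens per parameter, "
      f"vs ~20 for a Chinchilla-compute-optimal run")
# ~143 tokens/param: well past compute-optimal, spending extra training
# compute to get a smaller model that's cheaper at inference time.
```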
