Submitted by LesleyFair t3_10fw22o in deeplearning


[Graphic: Number of parameters, GPT-3 vs. GPT-4]

The rumor mill is buzzing around the release of GPT-4.

People are predicting the model will have 100 trillion parameters. That’s a trillion with a “t”.

The often-used graphic above makes GPT-3 look like a cute little breadcrumb that is about to have a life-ending encounter with a bowling ball.

Sure, OpenAI’s new brainchild will certainly be mind-bending and language models have been getting bigger — fast!

But this time might be different, and it makes for a good opportunity to look at the research on scaling large language models (LLMs).

Let’s go!

Training 100 Trillion Parameters

The creation of GPT-3 was a marvelous feat of engineering. Training a model of its size takes an estimated 1024 A100 GPUs running for 34 days and costs $4.6M in compute alone [1].

Training a 100T-parameter model on the same data, using 10,000 GPUs, would take 53 years. To avoid overfitting such a huge model, the dataset would also need to be much(!) larger.
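As a rough sanity check, here is a minimal back-of-the-envelope sketch in Python of how such estimates are made, based on the approximation in [1]: end-to-end training time is roughly 8 * T * P / (n * X), with T training tokens, P parameters, n GPUs, and X achieved FLOP/s per GPU. The helper name is my own, and the ~140 TFLOP/s per-GPU throughput is the ballpark reported in [1]; the 53-year figure above follows from applying the same kind of estimate to a 100T-parameter model.

```python
# Back-of-the-envelope training time, following the approximation in [1]:
#   end-to-end time ~= 8 * T * P / (n * X)
# where T = training tokens, P = parameters, n = number of GPUs, and
# X = achieved FLOP/s per GPU (the factor 8 covers forward, backward, and
# activation-recomputation FLOPs per parameter per token).

def training_days(tokens: float, params: float, n_gpus: int, flops_per_gpu: float) -> float:
    seconds = 8 * tokens * params / (n_gpus * flops_per_gpu)
    return seconds / 86_400  # convert seconds to days

# GPT-3-scale run: 300B tokens, 175B parameters, 1024 A100s at ~140 TFLOP/s
# achieved per GPU (the setup assumed in [1]).
print(f"{training_days(300e9, 175e9, 1024, 140e12):.0f} days")  # ~34 days
```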

So, where is this rumor coming from?

The Source Of The Rumor:

It turns out OpenAI itself might be the source of it.

In August 2021, the CEO of Cerebras told Wired: “From talking to OpenAI, GPT-4 will be about 100 trillion parameters”.

At the time, that was most likely what they believed. But that was in 2021, which is basically forever ago where machine learning research is concerned.

Things have changed a lot since then!

To understand what happened we first need to look at how people decide the number of parameters in a model.

Deciding The Number Of Parameters:

The enormous hunger for resources typically makes it feasible to train an LLM only once.

In practice, the available compute budget (how much money will be spent, available GPUs, etc.) is known in advance. Before the training is started, researchers need to accurately predict which hyperparameters will result in the best model.

But there’s a catch!

Most research on neural networks is empirical. People typically run hundreds or even thousands of training experiments until they find a good model with the right hyperparameters.

With LLMs we cannot do that. Training 200 GPT-3 models would set you back roughly a billion dollars. Not even the deep-pocketed tech giants can spend this sort of money.

Therefore, researchers need to work with what they have. Either they investigate the few big models that have been trained or they train smaller models in the hope of learning something about how to scale the big ones.

This process can be very noisy, and the community’s understanding has evolved a lot over the last few years.

What People Used To Think About Scaling LLMs

In 2020, a team of researchers from OpenAI released a paper called “Scaling Laws for Neural Language Models”.

They observed a predictable decrease in training loss when increasing the model size over multiple orders of magnitude.

So far so good. But they made two other observations, which resulted in the model size ballooning rapidly.

  1. To scale models optimally, the parameters should scale quicker than the dataset size. To be exact, their analysis showed that when the model size is increased 8x, the dataset only needs to be increased 5x.
  2. Full model convergence is not compute-efficient. Given a fixed compute budget, it is better to train a large model for a shorter time than to train a smaller model for longer.

Hence, it seemed as if the way to improve performance was to scale models faster than the dataset size [2].
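To make that concrete, here is a minimal sketch (Python). The helper name and the 175B-to-100T example are my own illustration, and treating the 8x-model/5x-data observation as a power law is an extrapolation, not the paper’s exact fit:

```python
import math

# Kaplan et al. [2], as summarized above: an 8x larger model only needs
# about 5x more data. Treated as a power law, data scales roughly as
# model_scale ** (log 5 / log 8) ~= model_scale ** 0.77.
KAPLAN_DATA_EXPONENT = math.log(5) / math.log(8)  # ~0.774

def kaplan_data_scale(model_scale: float) -> float:
    """Dataset scale-up implied by a given model scale-up under this reading."""
    return model_scale ** KAPLAN_DATA_EXPONENT

print(f"{kaplan_data_scale(8):.1f}x data for an 8x larger model")        # ~5.0x
print(f"{kaplan_data_scale(100e12 / 175e9):.0f}x data for 175B -> 100T") # ~136x
```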

And that is what people did. The models got larger and larger, with GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B), just to name a few.

But the bigger models failed to deliver on the promise.

Read on to learn why!

What We Know About Scaling Models Today

It turns out you need to scale training sets and models in equal proportions. So, every time the model size doubles, the number of training tokens should double as well.

This was published in DeepMind’s 2022 paper “Training Compute-Optimal Large Language Models” [3].

The researchers trained over 400 language models ranging from 70M to over 16B parameters. To assess the impact of dataset size, they also varied the number of training tokens from 5B to 500B.

The findings allowed them to estimate that a compute-optimal version of GPT-3 (175B) should be trained on roughly 3.7T tokens. That is more than 10x the data that the original model was trained on.

To verify their results they trained a fairly small model on vastly more data. Their model, called Chinchilla, has 70B parameters and is trained on 1.4T tokens. Hence it is 2.5x smaller than GPT-3 but trained on almost 5x the data.
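As a rough cross-check, a commonly quoted approximation of the Chinchilla result is about 20 training tokens per parameter (a rule of thumb, not the paper’s exact fitted law). A minimal sketch, with an illustrative helper name:

```python
# ~20 training tokens per parameter: a common approximation of the
# compute-optimal ratio implied by the Chinchilla analysis [3].
TOKENS_PER_PARAM = 20

def approx_optimal_tokens(params: float) -> float:
    return TOKENS_PER_PARAM * params

print(f"{approx_optimal_tokens(175e9) / 1e12:.1f}T tokens for 175B params")  # ~3.5T (paper: ~3.7T)
print(f"{approx_optimal_tokens(70e9) / 1e12:.1f}T tokens for 70B params")    # ~1.4T (Chinchilla's budget)
```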

Chinchilla outperforms GPT-3 and other much larger models by a fair margin [3].

This was a great breakthrough! The model is not just better; its smaller size also makes inference cheaper and fine-tuning easier.

So What Will Happen?

What GPT-4 Might Look Like:

To properly fit a model with 100T parameters, OpenAI would need a dataset of roughly 700T tokens. Given 1M GPUs and using the calculation from above, it would still take roughly 2650 years to train the model [1].

So, here is what GPT-4 could look like:

  • Similar in size to GPT-3, but trained optimally on 10x more data
  • Multi-modal, outputting text, images, and sound
  • Output conditioned on document chunks from a memory bank that the model has access to during prediction [4]
  • Doubled context size, allowing longer outputs before the model starts going off the rails

Regardless of the exact design, it will be a solid step forward. However, it will not be the 100T-parameter, human-brain-like AGI that people make it out to be.

Whatever it looks like, I am sure it will be amazing, and we can all be excited about the release.

Such exciting times to be alive!

If you got down here, thank you! It was a privilege to make this for you. At TheDecoding ⭕, I send out a thoughtful newsletter about ML research and the data economy once a week. No Spam. No Nonsense. Click here to sign up!

References:

[1] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, M. Zaharia, Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021), SC21

[2] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, et al., Scaling Laws for Neural Language Models (2020), arXiv preprint arXiv:2001.08361

[3] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. Casas, L. Hendricks, J. Welbl, A. Clark, T. Hennigan, Training Compute-Optimal Large Language Models (2022), arXiv preprint arXiv:2203.15556

[4] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. Driessche, J. Lespiau, B. Damoc, A. Clark, D. Casas, Improving Language Models by Retrieving from Trillions of Tokens (2021), arXiv preprint arXiv:2112.04426


Comments


--dany-- t1_j4zx6lf wrote

Very good write-up! Thanks for sharing your thoughts and observations. Some questions many other folks may have as well:

  1. How do you arrive at the number that it’s 500x smaller, or 200 million parameters?
  2. Your estimate of 53 years for training a 100T model: can you elaborate on how you got 53?

LesleyFair OP t1_j501xt6 wrote

First, thanks a lot for reading and thank you for the good questions:

A1) Current GPT-3 is 175B parameters. If GPT-4 were 100T parameters, it would be a scale-up of roughly 500x.

A2) I got the calculation from the paper for the Turing NLG model. The total training time in seconds is roughly the number of tokens multiplied by the number of model parameters, times a constant of about 6-8 FLOPs per parameter per token (depending on whether activations are recomputed), divided by the number of GPUs times each GPU's achieved FLOPs per second.


sEi_ t1_j50efga wrote

About GPT-4, from the horse's mouth:

Interview with Sam Altman (CEO of OpenAI) from 2 days ago (17 Jan).

Article in The Verge:

>"OpenAI CEO Sam Altman on GPT-4: ‘people are begging to be disappointed and they will be’"

https://www.theverge.com/23560328/openai-gpt-4-rumor-release-date-sam-altman-interview

Video with the interview in 2 parts:

>StrictlyVC in conversation with Sam Altman

https://www.youtube.com/watch?v=57OU18cogJI&ab_channel=ConnieLoizos


JohnFatherJohn t1_j5161hj wrote

People will be disappointed because they don't understand the relationship between model complexity and performance. There are so many irresponsible and/or uneducated articles suggesting that an orders-of-magnitude increase in the number of parameters will translate to orders-of-magnitude performance gains, which is obviously wrong.


tehbuss_ t1_j516emc wrote

This was a really good discussion!


Blacky372 t1_j521lej wrote

I like your article, thank you for sharing.

But writing "no spam, no nonsense" is a little weird to me if I get this when trying to subscribe.

Don't get me wrong, it's fine to monetize your content and to use your followers' data to present them personalized ads. But acting like you're just enthusiastic about sharing info at the same time doesn't really fit.


lol-its-funny t1_j55ge3m wrote

From 6 months back, also very useful for the future of scaling, time and (traditional) data limits.

https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications

Basically, even the largest model to date, PaLM, is very suboptimal, leaning towards WAY more parameters than its training data size. In fact, there might not be enough data in the world today. Even with infinite data and infinite model sizes, there are limits.

Check it out, very interesting compared to recent “more is more” trends.
