Submitted by LesleyFair t3_10fw22o in deeplearning
--dany-- t1_j4zx6lf wrote
Very good write-up! Thanks for sharing your thoughts and observations. Here are some questions many other folks may have as well:
- How do you arrive at the number that it's 500x smaller, i.e. roughly 200 billion parameters?
- Your estimate of 53 years for training a 100T model: can you elaborate on how you arrived at 53?
LesleyFair OP t1_j501xt6 wrote
First, thanks a lot for reading, and thank you for the good questions:
A1) The current GPT-3 has 175B parameters. If GPT-4 were 100T parameters, that would be a scale-up of roughly 500x.
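The ratio itself is simple arithmetic (a quick check, not taken from the post):

```python
# Scale-up factor from GPT-3 to a hypothetical 100T-parameter GPT-4
gpt3_params = 175e9                 # 175 billion
hypothetical_gpt4_params = 100e12   # 100 trillion
print(hypothetical_gpt4_params / gpt3_params)  # ≈ 571, which the post rounds to roughly 500x
```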
A2) The calculation comes from the paper on the Turing NLG model. Multiplying the number of training tokens by the number of model parameters (times a small constant factor for the forward and backward passes) gives the total training compute in FLOPs. Dividing that by the cluster's aggregate throughput, i.e. the number of GPUs times each GPU's sustained FLOPs per second, gives the total training time in seconds.
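To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The token count, GPU count, and per-GPU throughput are illustrative assumptions rather than figures stated above, and the factor of 8 FLOPs per parameter per token is a common rule of thumb (forward plus backward pass with activation recomputation):

```python
# Rough training-time estimate for a hypothetical 100T-parameter model.
# All inputs below are illustrative assumptions, not figures from the post.

tokens = 300e9              # assumed number of training tokens (GPT-3-scale dataset)
params = 100e12             # hypothetical 100T-parameter model
flops_per_token_param = 8   # rule-of-thumb FLOPs per parameter per token
num_gpus = 1024             # assumed cluster size
flops_per_gpu = 140e12      # assumed sustained throughput per GPU (~140 TFLOP/s)

total_flops = flops_per_token_param * tokens * params
seconds = total_flops / (num_gpus * flops_per_gpu)
years = seconds / (365 * 24 * 3600)
print(f"~{years:.0f} years")  # ~53 years with these assumptions
```

With these particular assumptions the estimate comes out to roughly 53 years, which is in the same ballpark as the figure quoted above.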
adubowski t1_j549298 wrote
- Is your assumption that GPT-4 will stay the same size as GPT-3?