
antonivs t1_je7ws1v wrote

I was referring to what the OpenAI GPT models are trained on. For GPT-3, that involved about 45 TB of compressed text data, the bulk of it drawn from Common Crawl, a multi-petabyte corpus accumulated over eight years of web crawling.

On top of that, books made up about 16% of its training mix, totaling roughly 67 billion tokens.
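
For concreteness, here's a rough sanity check of those numbers against the dataset mix reported in the GPT-3 paper (Brown et al., 2020). This is just a back-of-the-envelope sketch: the token counts and sampling weights below are the paper's reported figures, and the ~67 billion book tokens are Books1 plus Books2, with 16% being their combined sampling weight during training.

```python
# Back-of-the-envelope check of the GPT-3 training mix,
# using the figures reported in Brown et al. (2020).
# Values: (tokens in billions, sampling weight during training)
gpt3_mix = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2":                (19,  0.22),
    "Books1":                  (12,  0.08),
    "Books2":                  (55,  0.08),
    "Wikipedia":               (3,   0.03),
}

# Sum the two book datasets
book_tokens = sum(t for name, (t, w) in gpt3_mix.items() if name.startswith("Books"))
book_weight = sum(w for name, (t, w) in gpt3_mix.items() if name.startswith("Books"))

print(f"Book tokens: ~{book_tokens}B")             # ~67B tokens
print(f"Book sampling weight: {book_weight:.0%}")  # 16%
```

Note that the 16% is the proportion of training examples sampled from books, not the raw share of the corpus by size.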


SlowThePath t1_je7xmaz wrote

Definitely not denying that it was trained on a massive amount of data, because it was, but calling it internet-sized isn't accurate. I guess you were speaking in hyperbole and I just didn't read it that way. I know what you mean.
