
antonivs t1_je7ws1v wrote

I was referring to what the OpenAI GPT models are trained on. For GPT-3, that involved about 45 TB of compressed text data, the bulk of it drawn from Common Crawl, a multi-petabyte corpus accumulated over eight years of web crawling.

On top of that, books made up about 16% of its training mix, totaling roughly 67 billion tokens.
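
For concreteness, here's a rough sanity check of those numbers against the dataset mix reported in the GPT-3 paper (Brown et al., 2020). This is just a back-of-the-envelope sketch: the token counts and sampling weights below are the paper's reported figures, and the ~67 billion book tokens are Books1 plus Books2, with 16% being their combined sampling weight during training.

```python
# Back-of-the-envelope check of the GPT-3 training mix,
# using the figures reported in Brown et al. (2020).
# Values: (tokens in billions, sampling weight during training)
gpt3_mix = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2":                (19,  0.22),
    "Books1":                  (12,  0.08),
    "Books2":                  (55,  0.08),
    "Wikipedia":               (3,   0.03),
}

# Sum the two book datasets
book_tokens = sum(t for name, (t, w) in gpt3_mix.items() if name.startswith("Books"))
book_weight = sum(w for name, (t, w) in gpt3_mix.items() if name.startswith("Books"))

print(f"Book tokens: ~{book_tokens}B")             # ~67B tokens
print(f"Book sampling weight: {book_weight:.0%}")  # 16%
```

Note that the 16% is the proportion of training examples sampled from books, not the raw share of the corpus by size.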


SlowThePath t1_je7xmaz wrote

Definitely not denying that it was trained on a massive amount of data, because it was, but calling it internet-sized isn't accurate. I guess you were speaking in hyperbole and I just didn't read it that way. I know what you mean.
