visarga OP t1_j18n1i1 wrote on December 22, 2022 at 2:04 PM

An interesting fact: the current dataset size is 1T words. All the skills of language models come from this one TeraWord. We can get 10 TWords after we finish scraping everything; after that, it depends on finding other sources. Speech data is 10,000 TWords, though.
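
To put those figures side by side, here is a quick back-of-the-envelope sketch in Python. The variable names are made up for illustration, and the numbers are only the rough estimates quoted above, not measured values.

```python
# Rough estimates from the comment, not measurements.
TWORD = 10**12  # one TeraWord = 10^12 words

current_text = 1 * TWORD        # current dataset size
all_scraped_text = 10 * TWORD   # after scraping everything
speech = 10_000 * TWORD         # estimated speech data

# How many times larger each source is than another.
scraped_vs_current = all_scraped_text // current_text
speech_vs_current = speech // current_text
speech_vs_scraped = speech // all_scraped_text

print(f"All scraped text vs. current: {scraped_vs_current}x")
print(f"Speech vs. current:           {speech_vs_current}x")
print(f"Speech vs. all scraped text:  {speech_vs_scraped}x")
```

So, if these estimates hold, finishing the scrape buys one order of magnitude, while speech would be three orders of magnitude beyond even that.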