Submitted by visarga t3_zsmjhe in singularity
The stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later, between 2030 and 2050.
Paper: "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning". It is pretty easy to read, with no math. This post supports the one from two days ago about generating datasets for language models.
visarga OP t1_j18n1i1 wrote
An interesting fact: the current dataset size is 1T words. All the skills of language models come from this one teraword. We can get to 10T words after we finish scraping everything; after that it depends on finding other sources. Speech data, though, is 10,000T words.
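The exhaustion claim above is just compounding arithmetic: a fixed stock of words runs out once cumulative training-data usage, growing every year, overtakes it. Here is a minimal sketch of that calculation. The stock, annual usage, and growth rate below are illustrative assumptions for demonstration, not figures taken from the paper.

```python
def exhaustion_year(stock_words, usage_words, growth_rate, start_year=2023):
    """Return the first year in which cumulative dataset usage exceeds the
    available stock, assuming usage grows by `growth_rate` per year."""
    year, used = start_year, 0.0
    while used < stock_words:
        used += usage_words          # consume this year's data
        usage_words *= 1 + growth_rate  # usage compounds annually
        year += 1
    return year

# Hypothetical inputs: a 10T-word stock, 1T words consumed in the first
# year, usage growing 50% per year.
print(exhaustion_year(1e13, 1e12, 0.50))  # → 2028
```

Even modest growth rates exhaust a fixed stock within a handful of years, which is why the projected dates in the paper are so near-term for high-quality text.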