Submitted by visarga t3_zsmjhe in singularity

The stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later, between 2030 and 2050.

Paper: "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning". It is pretty easy to read, no math. This post supports the one from 2 days ago about generating datasets for language models.

6

Comments


visarga OP t1_j18n1i1 wrote

An interesting fact: the current dataset size is 1T words. All the skills of language models come from this one TeraWord. We can get 10 TWords after we finish scraping everything; after that it depends on finding other sources. Speech data, though, is 10,000 TWords.

4
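A back-of-envelope sketch of the arithmetic behind those numbers (not from the paper): given a fixed stock of words and an assumed yearly growth in training-data demand, how many years until the stock runs out. The stock figures echo the comment above; the 50% annual growth rate is purely an illustrative assumption.

```python
# Illustrative only: how long a fixed text stock lasts under growing demand.
def years_until_exhausted(stock_words, demand_words, annual_growth):
    """Count years until cumulative demand exceeds the available stock."""
    years = 0
    consumed = 0.0
    while consumed < stock_words:
        consumed += demand_words
        demand_words *= 1 + annual_growth
        years += 1
    return years

TRILLION = 1e12
print(years_until_exhausted(stock_words=10 * TRILLION,  # ~10 TWords after scraping everything
                            demand_words=1 * TRILLION,  # ~1 TWord used by today's models
                            annual_growth=0.5))         # assumed 50% yearly growth in demand
```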

Imaginary_Ad307 t1_j196irp wrote

LLMs are inefficient; better algorithms are going to be developed.

2

turnip_burrito t1_j1ae3cw wrote

We can always record more video data and extract text from the audio and images if needed. In that case we would need algorithms which require less data, or better hardware to process it.

3
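For the speech-to-text part, one concrete option (my example, not something the commenter named) is an open-source recognizer such as OpenAI's Whisper, which can pull a transcript out of an audio or video file in a few lines; the input file name here is hypothetical.

```python
# Sketch: extracting training text from recorded audio/video with Whisper.
# Requires `pip install openai-whisper` and ffmpeg on the system path.
import whisper

model = whisper.load_model("base")        # small multilingual model
result = model.transcribe("lecture.mp4")  # hypothetical input file
print(result["text"])                     # transcript to add to a text corpus
```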

Ne_Nel t1_j1a753h wrote

Isn't AI supposed to be a big data generator? How could you run out of data with something that grows exponentially at making just that?

2

visarga OP t1_j1ad4ku wrote

We're just beginning the journey of generating language data; until 2 years ago it was unthinkable. Today we have a bunch of generated datasets for math, code, multi-task tuning, and rule-based behaviour. The trick is to validate whatever is generated.

2
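A minimal sketch of that "validate whatever is generated" step, using arithmetic problems because the answer can be checked automatically; `model_answer` is a hypothetical stand-in for a language model's output.

```python
import random

def make_problem():
    """Generate an arithmetic question together with its ground-truth answer."""
    a, b = random.randint(1, 99), random.randint(1, 99)
    return f"{a} + {b}", a + b

def model_answer(question):
    """Stand-in for a model-generated answer; occasionally wrong on purpose."""
    a, b = (int(x) for x in question.split(" + "))
    return a + b + random.choice([0, 0, 0, 1])  # ~25% of answers are wrong

# Keep only generated examples whose answer passes the automatic check.
dataset = []
for _ in range(1000):
    question, truth = make_problem()
    answer = model_answer(question)
    if answer == truth:  # validation step: discard junk
        dataset.append((question, answer))

print(f"kept {len(dataset)} of 1000 generated examples")
```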

visarga OP t1_j1dh3br wrote

You can generate junk data, but it is hard to generate quality data. Human text is diverse and interesting. But in the last 1-2 years many teams have been generating data: math, code, diverse prompted tasks, and not just solutions but sometimes also new problems and tests.

For example, it used to be necessary to label thousands of responses to tasks in order to train the human-feedback model used to fine-tune GPT-3. So only OpenAI had a very good dataset, developed in-house, and for that reason GPT-3 ruled.

But more recently, "Constitutional AI" takes a list of behavioural rules, the so-called constitution, and uses them to generate its own feedback data, reaching almost the same effect as a human-labeled feedback model. So it is automating AI alignment.

1
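A much-simplified sketch of that idea (not Anthropic's exact recipe): the model itself judges pairs of responses against the constitution and emits preference labels, so no human labeling is needed to build the feedback dataset. `ask_model` is a hypothetical stand-in for whatever LLM you call, and the rules shown are invented examples.

```python
# Simplified sketch of constitution-guided self-labeling.
CONSTITUTION = [
    "Prefer the response that is more helpful and honest.",
    "Prefer the response that avoids harmful or toxic content.",
]

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API. Stubbed to return 'A'."""
    return "A"

def ai_preference(question: str, response_a: str, response_b: str) -> str:
    """Ask the model which response better follows the constitution; return 'A' or 'B'."""
    rules = "\n".join(f"- {r}" for r in CONSTITUTION)
    prompt = (
        f"Rules:\n{rules}\n\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the rules? Answer with a single letter, A or B."
    )
    return ask_model(prompt).strip()[:1].upper()

# Each (question, response_a, response_b, label) row can train a feedback/reward
# model, with the constitution standing in for human preference labels.
label = ai_preference("How do I stay safe online?",
                      "Use strong, unique passwords.",
                      "Just share everything publicly.")
print(label)
```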

OkMarionberry2531 t1_j19qb5x wrote

Listen to people all around the world talking and train the algorithm.

1

Shelfrock77 t1_j1bouln wrote

“You create data by thinking about it with neuralink” -Elon Musk

1