Submitted by visarga t3_zsmjhe in singularity

The stock of high-quality language data will be exhausted soon, likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later, between 2030 and 2050.

Paper: "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning". It is pretty easy to read, no math. This post supports the one from 2 days ago about generating datasets for language models.

6

Comments


visarga OP t1_j18n1i1 wrote

An interesting fact: the current dataset size is 1T words. All the skills of language models come from this one TeraWord. We can get 10 TWords after we finish scraping everything; after that it depends on finding other sources. Speech data, though, is 10,000 TWords.

4
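A back-of-envelope sketch of the arithmetic behind those numbers (not from the paper): given a fixed stock of words and an assumed yearly growth in training-data demand, how many years until the stock runs out. The stock figures echo the comment above; the 50% annual growth rate is purely an illustrative assumption.

```python
# Illustrative only: how long a fixed text stock lasts under growing demand.
def years_until_exhausted(stock_words, demand_words, annual_growth):
    """Count years until cumulative demand exceeds the available stock."""
    years = 0
    consumed = 0.0
    while consumed < stock_words:
        consumed += demand_words
        demand_words *= 1 + annual_growth
        years += 1
    return years

TRILLION = 1e12
print(years_until_exhausted(stock_words=10 * TRILLION,  # ~10 TWords after scraping everything
                            demand_words=1 * TRILLION,  # ~1 TWord used by today's models
                            annual_growth=0.5))         # assumed 50% yearly growth in demand
```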

Imaginary_Ad307 t1_j196irp wrote

LLMs are inefficient; better algorithms are going to be developed.

2

turnip_burrito t1_j1ae3cw wrote

We can always record more video data and extract text from the audio and images if needed. In that case we would need algorithms which require less data, or better hardware to process it.

3
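For the speech-to-text part, one concrete option (my example, not something the commenter named) is an open-source recognizer such as OpenAI's Whisper, which can pull a transcript out of an audio or video file in a few lines; the input file name here is hypothetical.

```python
# Sketch: extracting training text from recorded audio/video with Whisper.
# Requires `pip install openai-whisper` and ffmpeg on the system path.
import whisper

model = whisper.load_model("base")        # small multilingual model
result = model.transcribe("lecture.mp4")  # hypothetical input file
print(result["text"])                     # transcript to add to a text corpus
```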

Ne_Nel t1_j1a753h wrote

Isn't AI supposed to be a big data generator? How could you run out of data with something that grows exponentially at making just that?

2

visarga OP t1_j1ad4ku wrote

We're just beginning the journey of generating language data; until 2 years ago it was unthinkable. Today we have a bunch of generated datasets for math, code, multi-task tuning, and rule-based behaviour. The trick is to validate whatever is generated.

2
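A minimal sketch of that "validate whatever is generated" step, using arithmetic problems because the answer can be checked automatically; `model_answer` is a hypothetical stand-in for a language model's output.

```python
import random

def make_problem():
    """Generate an arithmetic question together with its ground-truth answer."""
    a, b = random.randint(1, 99), random.randint(1, 99)
    return f"{a} + {b}", a + b

def model_answer(question):
    """Stand-in for a model-generated answer; occasionally wrong on purpose."""
    a, b = (int(x) for x in question.split(" + "))
    return a + b + random.choice([0, 0, 0, 1])  # ~25% of answers are wrong

# Keep only generated examples whose answer passes the automatic check.
dataset = []
for _ in range(1000):
    question, truth = make_problem()
    answer = model_answer(question)
    if answer == truth:  # validation step: discard junk
        dataset.append((question, answer))

print(f"kept {len(dataset)} of 1000 generated examples")
```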

visarga OP t1_j1dh3br wrote

You can generate junk data, but it is hard to generate quality data. Human text is diverse and interesting. But in the last 1-2 years many teams have been generating data: math, code, diverse prompted tasks, and not just solutions but sometimes also new problems and tests.

For example, it used to be necessary to label thousands of responses to tasks in order to train the human-feedback model used to fine-tune GPT-3. So only OpenAI had a very good dataset, developed in-house, and for that reason GPT-3 ruled.

But more recently, "Constitutional AI" takes a list of behavioural rules, the so-called constitution, and uses them to generate its own feedback data, reaching almost the same effect as a human-labeled feedback model. So it is automating AI alignment.

1
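A much-simplified sketch of that idea (not Anthropic's exact recipe): the model itself judges pairs of responses against the constitution and emits preference labels, so no human labeling is needed to build the feedback dataset. `ask_model` is a hypothetical stand-in for whatever LLM you call, and the rules shown are invented examples.

```python
# Simplified sketch of constitution-guided self-labeling.
CONSTITUTION = [
    "Prefer the response that is more helpful and honest.",
    "Prefer the response that avoids harmful or toxic content.",
]

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API. Stubbed to return 'A'."""
    return "A"

def ai_preference(question: str, response_a: str, response_b: str) -> str:
    """Ask the model which response better follows the constitution; return 'A' or 'B'."""
    rules = "\n".join(f"- {r}" for r in CONSTITUTION)
    prompt = (
        f"Rules:\n{rules}\n\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the rules? Answer with a single letter, A or B."
    )
    return ask_model(prompt).strip()[:1].upper()

# Each (question, response_a, response_b, label) row can train a feedback/reward
# model, with the constitution standing in for human preference labels.
label = ai_preference("How do I stay safe online?",
                      "Use strong, unique passwords.",
                      "Just share everything publicly.")
print(label)
```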

OkMarionberry2531 t1_j19qb5x wrote

Listen to people all around the world talking and train the algorithm.

1

Shelfrock77 t1_j1bouln wrote

“You create data by thinking about it with neuralink” -Elon Musk

1