Submitted by Singularian2501 t3_yx7zft in MachineLearning

Paper: https://arxiv.org/abs/2211.04325

Blog: https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

Abstract:

>We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
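To make the extrapolation concrete, here is a toy back-of-the-envelope version of the "historical growth rate" method in Python. Every number below is an illustrative placeholder, not an estimate from the paper; you would have to plug in Epoch's published stock and growth figures to reproduce their dates.

```python
# Illustrative projection: in what year does an exponentially growing training
# dataset overtake the (slowly growing) stock of available data?
# All parameter values are placeholders, NOT the paper's estimates.

def exhaustion_year(stock_tokens: float, dataset_tokens: float,
                    dataset_growth: float, stock_growth: float,
                    start_year: int = 2022) -> int:
    """First year in which the largest training dataset exceeds the data stock."""
    year = start_year
    while dataset_tokens < stock_tokens:
        dataset_tokens *= dataset_growth   # datasets grow fast
        stock_tokens *= stock_growth       # the stock grows slowly (new content)
        year += 1
    return year

print(exhaustion_year(stock_tokens=1e14,    # assumed total stock of tokens
                      dataset_tokens=1e12,  # assumed largest dataset today
                      dataset_growth=1.5,   # assumed 50% yearly dataset growth
                      stock_growth=1.07))   # assumed 7% yearly stock growth
```

With these made-up values the crossover lands in the mid-2030s; the paper's actual dates come from its fitted growth rates and stock estimates, not from this sketch.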

Possible solutions based on the following papers:

https://arxiv.org/abs/2112.04426 , https://arxiv.org/abs/2111.00210 and https://openreview.net/forum?id=NiEtU7blzN / Retrieval mechanisms, EfficientZero and synthetic data can be seen as possible solutions, though all of them still need to be improved on.
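For readers unfamiliar with the retrieval idea behind RETRO (the first link), here is a minimal, self-contained sketch: relevant text is fetched from an external corpus and prepended to the model's input, so knowledge does not have to be stored in the weights. The bag-of-words similarity below is purely illustrative; RETRO itself uses frozen BERT embeddings and chunked cross-attention.

```python
# Toy sketch of retrieval augmentation: look up the most similar corpus
# document and prepend it to the prompt that conditions the language model.
from collections import Counter
import math

corpus = [
    "EfficientZero reaches strong Atari scores with very little data.",
    "Retrieval-augmented models fetch text from a database at inference time.",
    "Synthetic data can supplement scarce human-written text.",
]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = bow(query)
    return sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)[:k]

query = "How do retrieval models use an external database?"
prompt = "\n".join(retrieve(query)) + "\n\nQuestion: " + query
print(prompt)  # the retrieved passage now conditions the model's answer
```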

https://preview.redd.it/5tji6jd60e0a1.jpg?width=1559&format=pjpg&auto=webp&s=d7b5e5dbe6836fc0a59a17281cb7e2ea20e56727

https://preview.redd.it/qgsmjod60e0a1.jpg?width=1544&format=pjpg&auto=webp&s=d949c561f4a006791fecaf56bd155265b4580389

https://preview.redd.it/0zwq9ld60e0a1.jpg?width=1200&format=pjpg&auto=webp&s=808d578f3ac19ca4556830c21646d90132687918

53

Comments


lostmsu t1_iwnoxf0 wrote

Have they mentioned Efficient Zero?

I think the author is severely behind the current SOTA.

2

Singularian2501 OP t1_iwnpy8m wrote

Yes, they mention it at the end of their blog article, but I think it was only meant as an example of how better sample efficiency could be achieved, not as a SOTA comparison.

1

ktpr t1_iwode1v wrote

What’s wrong with self-supervision? It enables combinatorial expansion of dataset sizes if the task is specified well.

10
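As a minimal illustration of the combinatorial-expansion point above: a single unlabeled sentence already yields many (input, target) pairs under a masked-prediction objective, with no human labeling. Real masked language models sample random token subsets at scale; this toy version only enumerates single-token masks.

```python
# One raw sentence -> many self-supervised training pairs via masking.
sentence = "the stock of high quality language data is limited".split()

examples = []
for i, target in enumerate(sentence):
    masked = sentence.copy()
    masked[i] = "[MASK]"                       # hide one token
    examples.append((" ".join(masked), target))  # model must predict it

print(len(examples), "training pairs from one sentence")
for inp, tgt in examples[:3]:
    print(inp, "->", tgt)
```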

londons_explorer t1_iwp5r0a wrote

There is a lot more data that could be used in the form of private communications (for example all iMessage chats), if only the ethical and legal side could be sorted out.

2

Singularian2501 OP t1_iwq1iph wrote

https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works

A LessWrong article I found that explains how EfficientZero works.

In my opinion, the author wants to say that systems like EfficientZero are more data-efficient, and that similar ideas could also be applied to LLMs to increase their sample efficiency.

To be honest, I hope this post gets enough attention that the authors of the paper can answer our questions.

3

leondz t1_ix96ivb wrote

We already have for most languages that aren't English. Data efficiency is the only way for them to catch up.

2

leondz t1_ix96sfz wrote

Yeah, this gives you an idea of how little of the data is actually worth going through: most of it repeats structures found elsewhere in the data and isn't very diverse. Processing huge low-curation datasets is inefficient; the data diversity just isn't there.

1
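One rough way to quantify the repetition described above is exact-duplicate counting over fixed-size n-gram shingles, sketched below with made-up example documents. Production data pipelines typically rely on fuzzier near-duplicate detection such as MinHash/LSH rather than this exact-match version.

```python
# Fraction of n-gram occurrences in a corpus that are repeats of an n-gram
# already seen elsewhere (a crude proxy for low diversity).
def ngram_repeat_ratio(docs: list[str], n: int = 5) -> float:
    seen, total, repeats = set(), 0, 0
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            shingle = tuple(tokens[i:i + n])
            total += 1
            if shingle in seen:
                repeats += 1
            else:
                seen.add(shingle)
    return repeats / total if total else 0.0

docs = [
    "click here to read more about our privacy policy and cookie settings",
    "click here to read more about our terms of service and cookie settings",
    "a genuinely novel sentence adds new structure to the training data",
]
print(f"{ngram_repeat_ratio(docs):.0%} of 5-gram occurrences are repeats")
```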

bloc97 t1_ixjuivv wrote

This can be considered good news. If all the data is exhausted, people will actually be forced to research better data-efficient algorithms. We humans don't ingest 100 GB of arXiv papers to do research, and we don't need billions of images to paint a cat sitting on a sofa. Until we figure out how to run GPT-3 on smartphones (maybe using neuromorphic computing?), we shouldn't be too worried about the trend toward bigger and bigger datasets, because small(er) networks can be trained successfully without that much data.

3