
Gohoyo t1_j57azvh wrote

> It's not like the internet is suddenly going to 10x in size over the next couple of years. Especially as the global population is shrinking and most people are already connected online so not a lot of new data is made.

I don't get this. Can't AI generate more data for itself in, like, a year than all human communications since the dawn of the internet? Why would the internet need to 10x in size if the population gets hold of AI that increases the amount of content generated by 1000x? Seems like you just need an AI that generates a fuck ton of content and then another one that determines what in that content is "quality". I am totally ignorant here, I just find the 'running out of data' thing quite strange.

2

genshiryoku t1_j57bbc9 wrote

You can't use AI-generated data to train AI, as it is essentially already derived from the model's own dataset. Training on synthetic data like that has an effect similar to overfitting and reduces the performance and effectiveness of the AI.

3

Gohoyo t1_j57c8yb wrote

Does this mean it only learns from novel information it takes in? As in, it can never learn anything about cat conversations after the 10th conversation it reads about a cat? I mean, what's the difference between it reading something it made versus reading something a person wrote that says something similar? I just can't figure out how you can't get around this by using AI somehow.

Like: AI A makes a billion terabytes of content.

AI B takes in content and makes it 'unique/new/special' somehow.

Give it back to AI A or even a new AI C.

1

genshiryoku t1_j57dtsz wrote

Without going too deep into it, this is a limitation of transformer models. My argument was about why transformer models like GPT can't keep scaling up.

It has to do with the mathematics behind training AI. Essentially, the AI refines itself with every piece of data, but with copies of data it overcorrects, which results in inefficiency or worse performance. Synthetic data acts much like duplicate data: the model overcorrects on it and worsens its own performance.
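As a rough illustration of the duplicate-data point (a toy sketch, not how a transformer is actually trained): assume the "model" is a single constant fitted by least squares, which is just the sample mean. The function name `fit_constant` and the numbers are made up for the example. Repeating one example several times drags the fit toward it, even though no new information was added.

```python
from statistics import mean


def fit_constant(samples):
    """Least-squares fit of a single constant to the samples (the sample mean)."""
    return mean(samples)


# Hypothetical toy dataset: five distinct observations.
data = [1.0, 2.0, 3.0, 4.0, 10.0]

# Same dataset, but the last example appears five extra times.
with_copies = data + [10.0] * 5

clean_fit = fit_constant(data)
skewed_fit = fit_constant(with_copies)

print(clean_fit)   # 4.0
print(skewed_fit)  # 7.0
```

The copies carry no new facts, yet the fitted value moves from 4.0 to 7.0 because the repeated example is counted six times.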

If you are truly interested you can see for yourself here.

And yes, AI researchers are looking for models that can detect which data on the internet is synthetic, because it's inevitable that new data will be machine generated and can't be used for training. If we fail at that task we might even enter an "AI dark age" where models get worse and worse over time because the internet is filled with AI-generated garbage data that can't be trained on. That is the worst-case scenario.

4

Gohoyo t1_j57fu2a wrote

Thanks for trying to help me btw.

I watched the video. I can understand why reading its own data wouldn't work, but I can't understand why having it create a bunch of data, then altering that data and giving it back to the AI wouldn't. The key here is that we have machines that can create data at superhuman speeds. There has to be some way to do something with that data to make it useful to the AI again, right?

1

genshiryoku t1_j57h1fb wrote

The "created data" is merely the AI mixing the training data in such a way that it "creates" something new. If the dataset is big enough, this looks amazing, as if the AI is actually creative and creating new things, but from a mathematical perspective the output is still just statistically somewhere in between the data it has already been trained on.

Therefore it would be the same as feeding it its own data. To us it looks like completely new and actually usable data, which is why ChatGPT is so exciting. But for AI training purposes it's useless.
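A toy sketch of the "somewhere in between" claim, under a strong simplifying assumption: the generator here only ever blends two training points. Real language models are far more complicated, so treat this as an illustration of the argument rather than evidence for it. The `generate` function and the numbers are hypothetical.

```python
import random


def generate(training_data, n, rng):
    """Toy 'generator': each output is a random blend of two training points."""
    outputs = []
    for _ in range(n):
        a, b = rng.sample(training_data, 2)  # pick two distinct training points
        t = rng.random()                     # blend weight in [0, 1)
        outputs.append(a + t * (b - a))      # a point on the segment from a to b
    return outputs


rng = random.Random(0)  # fixed seed so the run is repeatable
real = [1.0, 2.0, 5.0, 9.0]
synthetic = generate(real, 1000, rng)

# Every generated value stays inside the range the real data already covered.
inside = all(min(real) <= x <= max(real) for x in synthetic)
print(inside)  # True
```

However many samples this generator produces, none of them lands outside what the training data already spanned, which is the sense in which retraining on them adds nothing new.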

1

Gohoyo t1_j57hihv wrote

If ChatGPT creates a paragraph, and I then take that paragraph and alter it significantly, how is that new, never-before-seen paragraph not new data for the AI?

2

genshiryoku t1_j57j6s1 wrote

It would be lower-quality data, but still usable if significantly altered. The question is: Why would you do this instead of just generating real data?

GPT is trained on human language; it needs real interactions to learn from, like the one we're having right now.

I'm also not saying that this isn't possible. We are AGI-level intelligences, and we absolutely consumed less data than GPT-3 did over our lifetimes, so we know it's possible to reach AGI with relatively little data.

My original argument was merely that it's impossible with current transformer models like GPT and that we need another breakthrough in AI architecture to solve problems like this, not merely scale up current transformer models, because the training data is going to run out over the next couple of years as all of the internet will be used up.

0

Gohoyo t1_j57jyq4 wrote

> Why would you do this instead of just generating real data?

The idea would be that harnessing the AI's ability to create massive amounts of regurgitated old data quickly and then transmuting it into 'new data' somehow is faster than acquiring real data.

I mean I believe you, I'm not in this field nor a genius, so if the top AI people are seeing it as a problem then I have to assume it really is, I just don't understand it fully.

1