Submitted by Angry_Grandpa_ t3_y92cl1 in singularity

https://preview.redd.it/qq7cl89iizu91.jpg?width=1478&format=pjpg&auto=webp&s=51c121595ef50f0ce6dcfc8f70d16e8f3ae27437

If we want to build an AI model based on everything ever said on YouTube, then according to my calculations a model with 770 billion parameters trained on 15.7 trillion tokens would be sufficient.

This assumes all of the audio is converted to text and an average speaking rate of 100 words per minute. About 500 hours of content are uploaded to YouTube every minute. The estimate is probably larger than what we'd need, since I assumed 10 years at 500 hours per minute and YouTube didn't hit that rate until 2019.

A token is roughly 0.75 to 0.8 words (1,000 tokens ≈ 750 to 800 words), so the token count probably needs to be a little higher (roughly 19.6 to 21 trillion).
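As a sanity check, here's a minimal back-of-envelope sketch of that estimate in Python (the 500 hours per minute, 10 years, and 100 words per minute figures come from the post; the 365-day year and the roughly 1.33 tokens-per-word ratio are illustrative assumptions):

```python
# Back-of-envelope token estimate for "everything said on YouTube".
# Assumptions: 500 hours of video uploaded per minute, sustained for 10 years,
# 100 spoken words per minute, ~1.33 tokens per word (1,000 tokens ≈ 750 words).

MINUTES_PER_YEAR = 365 * 24 * 60        # ~525,600 minutes
UPLOAD_HOURS_PER_MINUTE = 500           # hours of content uploaded to YouTube per minute
YEARS = 10
WORDS_PER_MINUTE = 100                  # assumed average speaking rate
TOKENS_PER_WORD = 1000 / 750            # ~1.33 tokens per word

content_hours = UPLOAD_HOURS_PER_MINUTE * MINUTES_PER_YEAR * YEARS
words = content_hours * 60 * WORDS_PER_MINUTE
tokens = words * TOKENS_PER_WORD

print(f"content hours: {content_hours:.2e}")  # ~2.6e9 hours of video
print(f"words:         {words:.2e}")          # ~1.6e13, i.e. ~15.8 trillion words
print(f"tokens:        {tokens:.2e}")         # ~2.1e13, i.e. ~21 trillion tokens
```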

This is based on the compute-optimal parameter-to-token ratio from the Chinchilla paper.

Source: https://arxiv.org/abs/2203.15556
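For a rough sense of where the 770 billion parameter figure comes from, here is a tiny sizing sketch assuming the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter (the exact ratio in the paper varies with the compute budget):

```python
# Chinchilla-style sizing sketch: compute-optimal training uses roughly
# 20 tokens per parameter (an approximation of Hoffmann et al., 2022).

TOKENS_PER_PARAM = 20

def optimal_params(num_tokens: float) -> float:
    """Rough compute-optimal parameter count for a given token budget."""
    return num_tokens / TOKENS_PER_PARAM

print(f"{optimal_params(15.7e12):.2e}")  # ~7.9e11 parameters (~785B, near the 770B above)
print(f"{optimal_params(19.6e12):.2e}")  # ~9.8e11 parameters if the corpus is ~19.6T tokens
```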

Based on MosaicML's public cloud pricing, the current cost is about $2.5 million for a 1.4 trillion token training run. Scaling that linearly with tokens, the worst-case scenario would be a mere $35 million for a YouTube large language model.

Source: https://www.mosaicml.com/blog/gpt-3-quality-for-500k
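A sketch of the linear extrapolation behind that $35 million figure, assuming, as the post does, that cost scales in proportion to training tokens from the quoted $2.5M for 1.4T tokens (a commenter below disputes this assumption):

```python
# Naive token-proportional cost extrapolation from a quoted reference price.
# Assumption: cost scales linearly with training tokens only, ignoring the
# larger parameter count a bigger token budget would call for.

REFERENCE_COST_USD = 2.5e6   # quoted price for a 1.4-trillion-token run
REFERENCE_TOKENS = 1.4e12

def extrapolated_cost(num_tokens: float) -> float:
    """Token-proportional training cost estimate in USD."""
    return REFERENCE_COST_USD * (num_tokens / REFERENCE_TOKENS)

print(f"${extrapolated_cost(15.7e12) / 1e6:.0f}M")  # ~$28M at 15.7T tokens
print(f"${extrapolated_cost(19.6e12) / 1e6:.0f}M")  # ~$35M at 19.6T tokens
```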

Is it worth it?

Has Google already done it?

56

Comments


manOnPavementWaving t1_it393xz wrote

My man, you can't just scale cost with the number of tokens and not the number of parameters.

Way too many mostly false assumptions in these calculations.

38

Nmanga90 t1_it3brwl wrote

In addition to what the other guy said, it's very bold to assume Google has not been actively doing this for years.

9

ReasonablyBadass t1_it3dvxo wrote

I mean, why? We already have large text corpora. The whole point of YouTube is visual data, no?

13

newDeckardCain t1_it3ihks wrote

This is interesting, something that stability.ai should do. A further interesting iteration would be to associate an image, i.e. the current frame of the video, with the tokens, which might prompt the model to also develop a world model.

Like what Yann LeCun has been advocating for.

4

ScionoftheToad t1_it3ofwm wrote

An AI trained off of YouTube comments would be one of the most toxic things imaginable.

10

visarga t1_it4lygh wrote

Visual data can be described in text, and maybe it's better to do so in order to avoid overfitting to irrelevant details. We have great captioning models for image and video, so we can use them together with speech recognition models. Just imagine a model trained on YT videos playing the sports commentator role - wouldn't it be great to have a virtual commentator for your vids?

But I am excited about training on massive amounts of video because it is special: it contains a trove of procedural knowledge, how to do things step by step. That means you can fine-tune it later to automate anything you want. Your clumsy robot just got GPT-3-level smarts in practical tasks rarely described in words anywhere.

There was a recent paper where, with just 5 hours of robot video and proprioception, they trained a transformer to manipulate a toy kitchen and complete tasks. Pretty amazing, considering the Wozniak threshold of AI: a robot enters a random kitchen and has to make a cup of coffee. There are millions of kitchens on YT, millions of everything in fact.

Looks like "learning to act" is going to be very successful, just like learning to generate text and images. Maybe the handymen won't be the last to be automated.

5

visarga t1_it4q1uj wrote

Train a comment filter. Some comments are great; it depends very much on the topic. In fact, scrap that! Do a GPT-4chan and train on the real YT comments. Then instruction-tune the model to be polite. Better to have a polite model that still knows all the shitty stuff, to get the jokes.

4

visarga t1_it4qoq7 wrote

After text, image and video (+ audio), I think we've got all the bases covered. Nobody can claim AI is not grounded anymore. And with this grounding comes a nuanced, semantic understanding of the world. It's like an upload, but not of a person: the whole culture gets uploaded at once.

3

manOnPavementWaving t1_it4r8la wrote

I have read the paper, which is how I know that they scale data and parameters equally, meaning a 10x in data comes with a 10x in parameters, which results in a 100x in compute required and hence a 100x in cost.

Assumptions-wise, I'm looking more at the number of words on YouTube; your estimate is likely wildly off.

You're also ignoring that the training time could very well be long enough that it would be a better strategy to wait for better GPUs to come out.
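A rough illustration of that scaling argument, assuming the common approximation of about 6 FLOPs per parameter per token for training compute (not stated in the comment itself):

```python
# Training compute approximation: C ≈ 6 * N * D FLOPs, for N parameters and
# D training tokens. Chinchilla-optimal scaling grows N and D together, so
# 10x the data implies 10x the parameters and therefore ~100x the compute.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * params * tokens

base = train_flops(70e9, 1.4e12)                # a Chinchilla-sized 70B / 1.4T-token run
scaled = train_flops(10 * 70e9, 10 * 1.4e12)    # 10x tokens with matching 10x parameters

print(f"{scaled / base:.0f}x compute")          # 100x compute, hence roughly 100x the cost
```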

12

jotmoney t1_it5lqvb wrote

What's the estimated error rate on the transcription?

2

Dreason8 t1_it5xes0 wrote

90% of the data: "don't forget to like, subscribe, and smash that notification button"

6

LeroyJanky80 t1_it6g9ub wrote

The data Google/Alphabet has is obviously its most powerful asset. My guess is they've already done this; they have the means, brain trust, wealth, and capacity for it. They can easily cover this in all domains where people, infrastructure, and content are concerned. It's a massive endeavour, but so was what they did with the entire internet many, many years ago, and at the time that was groundbreaking.

2

BinyaminDelta t1_it9ci16 wrote

Pretty sure all or most of the audio on YouTube has ALREADY been transcribed.

1