
buyIdris666 t1_j93m0ol wrote

Video will remain unsolved for a while.

LLMs came first because the bit rate is lowest. A sentence of text is only a few hundred bits of information.

Now image generation is getting good. It's still not perfect. The models are larger because there's maybe 100x as much information in a high-res image as in a paragraph of text.

Video is even harder: 30 high-res images a second. Making long, coherent, believable videos takes an enormous amount of data and processing power.
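
Rough back-of-the-envelope numbers (the word count, resolution, and fps here are just assumed for illustration):

```python
# Rough back-of-the-envelope comparison of raw sizes.
# All concrete numbers are assumptions for illustration, not measurements.

BITS_PER_WORD = 12                        # rough order of magnitude for English text
sentence = 20 * BITS_PER_WORD             # a 20-word sentence: a few hundred bits
paragraph = 100 * BITS_PER_WORD           # a 100-word paragraph

image = 1024 * 1024 * 3 * 8               # raw 1024x1024 RGB image, 8 bits per channel
one_second_of_video = image * 30          # 30 frames per second

print(f"sentence:  {sentence:>12,} bits")
print(f"paragraph: {paragraph:>12,} bits")
print(f"image:     {image:>12,} bits (raw)")
print(f"1s video:  {one_second_of_video:>12,} bits (raw)")
```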

5

currentscurrents t1_j96pkvw wrote

> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

That's actually not true. Today's LLMs are around 175B parameters; Stable Diffusion is about 890 million.

Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.

In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.
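
If you want to sanity-check those parameter counts yourself, here's a rough sketch using Hugging Face diffusers (the model ID and the commented-out LLM are just examples; exact counts vary by checkpoint):

```python
# Sketch: count parameters of the Stable Diffusion UNet with Hugging Face diffusers.
# The model ID is an assumption; exact counts vary by checkpoint and version.
from diffusers import UNet2DConditionModel

def n_params(model) -> int:
    return sum(p.numel() for p in model.parameters())

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(f"SD UNet: {n_params(unet) / 1e6:.0f}M parameters")

# An LLM can be counted the same way, e.g. with transformers:
# from transformers import AutoModelForCausalLM
# llm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
# print(f"LLM: {n_params(llm) / 1e9:.1f}B parameters")
```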

7

buyIdris666 t1_j97eom6 wrote

Interesting! I didn't realize that

1

currentscurrents t1_j99iq9v wrote

Video has even less information density, since frames are similar to each other! Video codecs can get crazy compression rates like 99% on slow-moving video.

But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
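
Some rough numbers to illustrate both points (the resolution, fps, and encoded bitrate are assumptions, not measurements):

```python
# Sketch: raw vs. compressed video bitrate, and the pixel count a model would
# still have to process. All concrete numbers are assumptions for illustration.

width, height, fps = 1920, 1080, 30
raw_bitrate = width * height * 3 * 8 * fps   # raw RGB, bits per second (~1.5 Gbit/s)
h264_bitrate = 8_000_000                     # a typical 1080p30 encode, ~8 Mbit/s

compression = 1 - h264_bitrate / raw_bitrate
print(f"raw:     {raw_bitrate / 1e6:,.0f} Mbit/s")
print(f"encoded: {h264_bitrate / 1e6:,.0f} Mbit/s")
print(f"saved:   {compression:.1%}")         # roughly 99.5%

# But a generator still has to handle the pixels themselves:
frames = 4 * fps                             # a 4-second clip
pixels = frames * width * height
print(f"{frames} frames = {pixels / 1e6:,.0f} million pixel positions to model")
```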

2