currentscurrents t1_j96pkvw wrote

> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

That's actually not true. Today's largest LLMs have around 175B parameters (GPT-3, for example), while Stable Diffusion has only about 890 million.

Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.

In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.

7

buyIdris666 t1_j97eom6 wrote

Interesting! I didn't realize that

1

currentscurrents t1_j99iq9v wrote

Video has even less information density, since frames are similar to each other! Video codecs can get crazy compression rates like 99% on slow-moving video.
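A rough way to see this for yourself, as a toy sketch rather than how a real codec works: build some synthetic "frames" that barely change, then compare compressing them as-is against compressing only the frame-to-frame differences. Everything here (`make_frames`, `delta_encode`, the frame size, the number of changed bytes) is made up for illustration, and zlib stands in for an actual video codec.

```python
import random
import zlib


def make_frames(n_frames=10, frame_size=64 * 1024, changes_per_frame=50, seed=0):
    """Build synthetic 'frames': a noisy first frame, then small edits per frame."""
    rng = random.Random(seed)
    frame = bytearray(rng.getrandbits(8) for _ in range(frame_size))
    frames = [bytes(frame)]
    for _ in range(n_frames - 1):
        # Only a handful of bytes change between consecutive frames.
        for _ in range(changes_per_frame):
            frame[rng.randrange(frame_size)] = rng.getrandbits(8)
        frames.append(bytes(frame))
    return frames


def delta_encode(frames):
    """Keep the first frame, then store each frame as XOR against the previous one."""
    encoded = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append(bytes(a ^ b for a, b in zip(prev, cur)))
    return encoded


def compressed_size(chunks):
    """Size in bytes after zlib-compressing the concatenated chunks."""
    return len(zlib.compress(b"".join(chunks), 9))


frames = make_frames()
raw_size = sum(len(f) for f in frames)
plain = compressed_size(frames)                # frames compressed independently of motion
delta = compressed_size(delta_encode(frames))  # only the differences are "new" data

print(f"raw: {raw_size} bytes, plain zlib: {plain} bytes, delta + zlib: {delta} bytes")
```

The first frame is pure noise, so it barely compresses either way; the savings come entirely from the later frames being almost all zeros once you subtract the previous frame. Real codecs use motion estimation and lossy transforms on top of this idea.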

But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
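Some back-of-the-envelope arithmetic on the raw pixel count, assuming a short 16-frame clip at 512x512 RGB stored as float32. These numbers are picked for illustration only, and many generators work in a smaller latent space rather than on raw pixels, so treat this as an upper-bound sketch of the input size alone.

```python
# Hypothetical clip dimensions, chosen only to illustrate the scale.
frames = 16
height = 512
width = 512
channels = 3          # RGB
bytes_per_value = 4   # float32

values = frames * height * width * channels
raw_bytes = values * bytes_per_value

print(f"{values:,} values, {raw_bytes / 2**20:.0f} MiB as float32")
```

That is just one copy of the input; intermediate activations during generation multiply it many times over, which is where the memory pressure comes from.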

2