currentscurrents t1_j96pkvw wrote on February 19, 2023 at 5:54 PM

Reply to comment by buyIdris666 in [D] what are some open problems in computer vision currently? by Fabulous-Let-822

> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

That's actually not true. Today's largest LLMs have around 175B parameters (GPT-3), while Stable Diffusion has about 890 million, roughly 200 times fewer.

Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.
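A rough way to see the "easy to predict" point is to guess each pixel from the one to its left and look at how little is left over. The sketch below is only an illustration under loud assumptions: the image is a synthetic smooth gradient (real photos are noisier and less predictable), and the names `gradient_image` and `left_neighbor_residuals` are made up for this example.

```python
import zlib

SIZE = 256  # hypothetical image side length, chosen for illustration


def gradient_image(size=SIZE):
    """Synthetic smooth grayscale image, one bytes object per scanline."""
    return [bytes((x + y) // 2 for x in range(size)) for y in range(size)]


def left_neighbor_residuals(rows):
    """Replace each pixel with its difference from the pixel to its left (mod 256)."""
    out = bytearray()
    for row in rows:
        prev = 0  # the first pixel of each row is predicted from 0
        for p in row:
            out.append((p - prev) % 256)
            prev = p
    return bytes(out)


rows = gradient_image()
raw = b"".join(rows)
residuals = left_neighbor_residuals(rows)

# Fraction of pixels the trivial predictor gets within +/-1 gray level
# (255 is -1 modulo 256).
near = sum(1 for r in residuals if r in (0, 1, 255)) / len(residuals)

print(f"raw bytes:            {len(raw)}")
print(f"predicted within 1:   {near:.3f}")
print(f"compressed residuals: {len(zlib.compress(residuals))}")
```

On this toy image almost every pixel is predicted to within one gray level, so the residuals compress to a small fraction of the raw size. This only measures low-level redundancy; it says nothing directly about how many abstract ideas an image carries.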

In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.

buyIdris666 t1_j97eom6 wrote on February 19, 2023 at 8:48 PM

Interesting! I didn't realize that

currentscurrents t1_j99iq9v wrote on February 20, 2023 at 7:39 AM

Video has even lower information density, since consecutive frames are similar to each other! Video codecs can shrink slow-moving video by 99% or more.

But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
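The frame-similarity point can be sketched with a toy example: compress each frame on its own, then compress only the difference between consecutive frames. Everything here is an assumption for illustration: the "frames" are random bytes with a few pixels changed per step, and XOR plus `zlib` stands in for the motion-compensated prediction that real codecs use.

```python
import random
import zlib


def make_frames(n_frames=10, frame_bytes=64 * 64, changed_per_frame=40, seed=0):
    """Toy slow-moving video: each frame alters a handful of bytes of the previous one."""
    rng = random.Random(seed)
    frame = bytearray(rng.getrandbits(8) for _ in range(frame_bytes))
    frames = [bytes(frame)]
    for _ in range(n_frames - 1):
        for _ in range(changed_per_frame):
            frame[rng.randrange(frame_bytes)] = rng.getrandbits(8)
        frames.append(bytes(frame))
    return frames


def independent_size(frames):
    """Total size when every frame is compressed on its own."""
    return sum(len(zlib.compress(f)) for f in frames)


def delta_size(frames):
    """Total size when only the first frame and per-frame XOR deltas are compressed."""
    total = len(zlib.compress(frames[0]))
    for prev, cur in zip(frames, frames[1:]):
        delta = bytes(a ^ b for a, b in zip(prev, cur))  # zero wherever nothing changed
        total += len(zlib.compress(delta))
    return total


frames = make_frames()
raw_size = sum(len(f) for f in frames)
print("raw:", raw_size)
print("independent:", independent_size(frames))
print("delta-coded:", delta_size(frames))
```

Because each delta is almost entirely zeros, the delta-coded total comes out far smaller than compressing the frames separately. A generator working on raw frames does not get that saving, since it still has to hold and process every pixel of every frame.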