Submitted by Fabulous-Let-822 t3_115btl3 in MachineLearning
currentscurrents t1_j96pkvw wrote
Reply to comment by buyIdris666 in [D] what are some open problems in computer vision currently? by Fabulous-Let-822
> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.
That's actually not true. LLMs like GPT-3 have 175B parameters, while Stable Diffusion has only about 890 million.
Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.
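To make the "easy to predict" point concrete, here's a rough sketch using a general-purpose compressor as a stand-in for predictability. The smooth gradient "image" and the `compressed_fraction` helper are made up for illustration; real photos fall somewhere between the two extremes shown.

```python
import os
import zlib

def compressed_fraction(data: bytes) -> float:
    """Return compressed size divided by original size (lower = more redundant)."""
    return len(zlib.compress(data, 9)) / len(data)

WIDTH, HEIGHT = 256, 256

# Hypothetical "image": a smooth horizontal gradient, one byte per pixel.
smooth_image = bytes(x for _ in range(HEIGHT) for x in range(WIDTH))

# Baseline with no redundancy at all: random bytes of the same size.
noise = os.urandom(WIDTH * HEIGHT)

smooth_fraction = compressed_fraction(smooth_image)
noise_fraction = compressed_fraction(noise)

print(f"smooth image compresses to {smooth_fraction:.3f} of its size")
print(f"random noise compresses to {noise_fraction:.3f} of its size")
```

The smooth image shrinks to a tiny fraction of its raw size, while the noise barely shrinks at all. Same pixel count, very different amounts of actual information.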
In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.
buyIdris666 t1_j97eom6 wrote
Interesting! I didn't realize that.
currentscurrents t1_j99iq9v wrote
Video has even less information density, since frames are similar to each other! Video codecs can get crazy compression rates like 99% on slow-moving video.
But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
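A toy sketch of why similar frames compress so well, assuming two made-up frames where only a small patch changes. This is just frame differencing plus zlib, not how real codecs work (they use motion compensation and transforms), but the idea is the same: you store the change, not the whole frame.

```python
import os
import zlib

WIDTH, HEIGHT = 128, 128

# Hypothetical frame 1: noisy texture, so compressing within the frame gains little.
frame1 = bytearray(os.urandom(WIDTH * HEIGHT))

# Hypothetical frame 2: identical to frame 1 except for a small 8x8 patch.
frame2 = bytearray(frame1)
for y in range(8):
    for x in range(8):
        frame2[y * WIDTH + x] = (frame2[y * WIDTH + x] + 1) % 256

# Per-pixel difference between consecutive frames (mod 256 so it fits in a byte).
delta = bytes((b - a) % 256 for a, b in zip(frame1, frame2))

standalone_size = len(zlib.compress(bytes(frame2), 9))
delta_size = len(zlib.compress(delta, 9))
savings = 1 - delta_size / standalone_size

print(f"frame 2 compressed on its own: {standalone_size} bytes")
print(f"frame 2 stored as a delta:     {delta_size} bytes")
print(f"savings from using the delta:  {savings:.1%}")
```

Note that the delta is still a full 128x128 array before compression, which is the memory problem in miniature: the stored data is small, but anything that processes the frames still touches every pixel.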