Submitted by Jordan117 t3_yw1uxc in singularity

When it comes to digital media file size, generally speaking text < images < audio < video. This seems to reflect the typical "information density" of each medium (alphanumeric characters vs. still images vs. waveforms vs. moving images). Processing large amounts of text is lightning-fast, while video usually takes much longer because there's just more there there. Etc.
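
(A rough back-of-the-envelope sketch in Python to make that concrete; every rate below is an assumption about "typical" files, so treat the ordering, not the exact numbers, as the point:)

```python
# Very rough bytes-per-second (or bytes-per-item) for each medium.
# All rates are assumptions: plain UTF-8 prose, CD-quality stereo
# PCM, a mid-size JPEG, and a ~5 Mbps H.264 video stream.

KB, MB = 1024, 1024 ** 2

text_bps  = 15                 # ~3 words/sec read aloud, ~5 bytes/word
audio_bps = 44_100 * 2 * 2     # 44.1 kHz * 16-bit * stereo
video_bps = 5 * MB // 8        # ~5 Mbps stream
image_b   = 500 * KB           # one mid-size JPEG still

print(f"text:  ~{text_bps:,} bytes/sec")
print(f"audio: ~{audio_bps:,} bytes/sec")
print(f"video: ~{video_bps:,} bytes/sec")
print(f"image: ~{image_b:,} bytes per still")
```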

But in terms of AI media synthesis, the compute times seem really out of whack. A desktop PC with an older consumer graphics card can generate a high-quality Stable Diffusion image in under a minute, but generating a 30-second OpenAI Jukebox clip takes many hours on the best Colab-powered GPUs, while decent text-based LLMs are difficult to impossible to run locally. What factors explain the disparity? Can we expect the relative difficulty of generating text/audio/images/video to hew closer to what you'd expect as the systems are refined?

12

Comments


SuperSpaceEye t1_iwhjfk1 wrote

Well, if you want to generate coherent text you need quite a large model, because smaller models produce artifacts (logical and writing errors) that ruin the quality of the output, and readers spot those easily. The same goes for music, since we're quite perceptive of small inaccuracies. Images, on the other hand, can have "large" errors and still be beautiful to look at. Images also tolerate large variations in textures, backgrounds, etc., which makes it easier for a model to produce a "good enough" picture in a way that wouldn't work for text or audio. That's what allows image models to be much smaller.

6

Jordan117 OP t1_iwhs5cw wrote

Is there a reason the language model part of image diffusion requires a lot less horsepower than running a language model by itself? I'm still amazed SD works quickly on my 2016-era PC, but apparently something like GPT-J requires dozens or hundreds of GB of memory to even store. Is it the difference between generating new text vs. working with existing text?

2

SuperSpaceEye t1_iwht6hf wrote

Two different tasks. The language model in SD just encodes the text into an abstract representation that the diffusion part of the model then uses. A text-to-text model like GPT-J performs a different task, which is much harder. Also, GPT-J is 6B parameters, which will only take around 12GB of VRAM, not hundreds.
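
(The 12GB figure is just parameter count times bytes per parameter; a quick sketch, assuming fp16 weights and ignoring activation and framework overhead memory, so real usage runs a bit higher:)

```python
# Rough VRAM needed just to hold a model's weights:
# parameter count * bytes per parameter. Ignores activations,
# KV caches, and framework overhead.

def weight_vram_gib(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 2 ** 30

print(f"GPT-J 6B @ fp16: ~{weight_vram_gib(6e9):.1f} GiB")     # ~11.2
print(f"GPT-J 6B @ fp32: ~{weight_vram_gib(6e9, 4):.1f} GiB")  # ~22.4
```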

3

Jordan117 OP t1_iwhtnxu wrote

Thanks for the clarification; I must have misread an older post talking about CPU memory requirements instead of GPU.

2

Zermelane t1_iwvefy4 wrote

This question really deserves a great answer, and I've tried to write an okay one over two days now, but there's an intuition here that I don't really know how to express. Or one that might just be wrong; I'm not an ML researcher. But even if it's just disjointed chunks of an argument, here goes anyway:

You can run GPT-2 on a weak GPU and it'll do a great job dealing with language and text as such. On the one hand, that's a non-obvious accomplishment in its own right (see nostalgebraist on GPT-2 being able to write for more on that); but on the other hand, well, when was the last time you actually used GPT-2 for anything?

And the reason you don't is... text models long ago stopped being about text. By far most of what they model is just, well, everything else: stories, logic, physical intuition, theory of mind, etc. GPT-2 can do language, and language is pretty straightforward, but all that other stuff is general intelligence, and general intelligence is very, very hard.

But if you're going to do general intelligence, text is a really great modality. It comes pre-processed by language evolution to have a nice, even, and high rate of communicated information, so that if you just compress it a tiny bit, you get a lot of structure and meaning in a tiny number of input bits. Which in turn means you can process those bits with a model that can focus right away on the hard parts, spend a uniform amount of computation on everything, and still not leave much performance on the table.
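
(One crude way to see the "already dense" point: a generic compressor only squeezes English prose a little, since most of its bytes already carry structure and meaning. A tiny sketch, using this paragraph's own wording as an arbitrary sample; a short sample understates the ratio a large corpus would get:)

```python
import zlib

# English prose only compresses a few-fold with a generic
# compressor, unlike raw waveforms or pixel data.
sample = (
    "It comes pre-processed by language evolution to have a nice, "
    "even, and high rate of communicated information, so that if "
    "you just compress it a tiny bit, you get a lot of structure "
    "and meaning in a tiny number of input bits."
).encode()
packed = zlib.compress(sample, 9)
print(f"{len(sample)} bytes -> {len(packed)} bytes "
      f"({len(sample) / len(packed):.2f}x)")
```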

Image models, on the other hand, model far less (just the visual universe of pictures on the internet, no big deal), and you probably aren't trying to get them to pull off anything like the feats of reasoning you expect from language models. Hence, they can seemingly do a lot with little. I've seen someone have Stable Diffusion outpaint the right side of a blackboard with "1+1=" written on the left side, and I think it did pull off putting in a 2, but that's probably just about the extent of the reasoning people expect from image models right now.

Audio I don't really have much of a handle on. One issue with audio models is that if you really want to represent most of the audio you find online well, you kind of need to be a great language model as well, considering how much audio is speech or song. But at the same time, audio is a far heavier way to represent language than text is, so it's far harder to learn all of language from audio.
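
(Some rough arithmetic on "far heavier", with the speaking rate and audio format both assumptions:)

```python
# How much heavier is speech audio than the text it carries?
# Assumed: ~150 spoken words/min, ~6 bytes per word of transcript,
# and 16 kHz 16-bit mono audio (a common speech-model format).

words_per_sec = 150 / 60
text_bps  = words_per_sec * 6        # ~15 bytes/sec of transcript
audio_bps = 16_000 * 2               # 32,000 bytes/sec of waveform

print(f"text:  ~{text_bps:.0f} bytes/sec")
print(f"audio: ~{audio_bps:,} bytes/sec")
print(f"audio is roughly {audio_bps / text_bps:,.0f}x heavier")
```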

2

Jordan117 OP t1_ix2r4p3 wrote

I'm no expert either, but this definitely felt like the sort of question that sounds basic but hits on some fundamental/abstract "theory of information" sort of complexity. It's why I find it so fascinating: there's something really mysterious and compelling going on in these models that even the researchers themselves are struggling to unravel. Thanks for taking the time!

2