Submitted by dahdarknite t3_10r5gku in MachineLearning

Stable Diffusion seems to be a departure from the trend of building larger and larger models.

It has roughly 4x fewer parameters than other image-generation models like DALL-E 2.

“Incredibly, compared with DALL-E 2 and Imagen, the Stable Diffusion model is a lot smaller. While DALL-E 2 has around 3.5 Billion parameters, and Imagen has 4.6 Billion, the first Stable Diffusion model has just 890 million parameters, which means it uses a lot less VRAM and can actually be run on consumer-grade graphics cards.”

What allows Stable Diffusion to work so well with far fewer parameters? Are there any drawbacks to this, such as Stable Diffusion needing more fine-tuning than DALL-E 2?
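A back-of-envelope sketch of why the parameter gap matters for consumer GPUs, using the counts quoted above and assuming fp16 weights (2 bytes per parameter); this counts weights only and ignores activations and any auxiliary text encoder or VAE memory:

```python
# Rough weight-memory estimate at fp16 (2 bytes/parameter), weights only.
def fp16_weight_gb(n_params: float) -> float:
    """Approximate weight memory in GB, assuming 2 bytes per parameter."""
    return n_params * 2 / 1024**3

# Parameter counts as quoted in the post.
models = {
    "Stable Diffusion (first release)": 890e6,
    "DALL-E 2": 3.5e9,
    "Imagen": 4.6e9,
}

for name, n in models.items():
    print(f"{name}: {n / 1e9:.2f}B params ~ {fp16_weight_gb(n):.1f} GB fp16")
```

At ~1.7 GB for the weights, the first Stable Diffusion release fits comfortably in the 8-12 GB of VRAM typical of consumer cards, with headroom left for activations.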

25

Comments


LetterRip t1_j6v57y5 wrote

Mostly the language model. Imagen uses T5-XXL (the 4.6 billion parameters), and DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger ones used for ChatGPT). SD just uses CLIP's text encoder, nothing else. The more sophisticated the language model, the better the image generator can understand what you want; CLIP is close to a bag-of-words model.
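A toy illustration (not CLIP itself) of what "close to bag of words" costs you: once word order is discarded, two prompts with opposite meanings collapse to the same representation.

```python
from collections import Counter

def bag_of_words(prompt: str) -> Counter:
    """Toy bag-of-words representation: token counts, order discarded."""
    return Counter(prompt.lower().split())

a = bag_of_words("a dog chasing a cat")
b = bag_of_words("a cat chasing a dog")

# The two prompts are indistinguishable once order is gone.
print(a == b)
```

A sequence model like T5-XXL keeps the order information that separates these prompts, which is one reason Imagen handles compositional prompts better.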

18

londons_explorer t1_j6wa910 wrote

It's a much smaller model, but IMO, the results are much lower quality too.

However, the fact that you can run it on your own PC means you can tweak all the settings and take many attempts at getting a better result, which partially offsets that.

2

uhules t1_j6wrx63 wrote

Except DALL-E 2 also applies diffusion in latent space, and Imagen performs diffusion in low-res pixel space. My initial hunch was the upscaling diffusion models, but they account for a relatively small portion of the total parameter count and matter more for speed. The unsatisfying answer is simply that "SD does latent diffusion better", since you'd need an extensive ablation study to compare such different architectures.

4

Mefaso t1_j6z6zgt wrote

>DALL-E 2 also applies diffusion in latent space

Not really in the part that matters. DALL-E 2 uses diffusion in CLIP "latent" space and then conditions the pixel-diffusion model on the result.

However, they still do a full diffusion pass in pixel space, which is far more computationally expensive than doing it in latent space, as LDMs do.
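A rough sketch of the cost gap being described: compare how many tensor elements the denoiser touches per step in pixel space versus SD's latent space. The shapes below use SD's published VAE setup (512x512x3 RGB compressed 8x spatially to a 64x64x4 latent); the pure pixel-space figure is illustrative, since pixel-space systems in practice run a low-res base model plus upsamplers rather than diffusing at full resolution directly.

```python
# Elements per denoising step, pixel space vs SD's latent space.
pixel_elems = 512 * 512 * 3   # full-resolution RGB image
latent_elems = 64 * 64 * 4    # SD latent: 8x spatial downsampling, 4 channels

ratio = pixel_elems / latent_elems
print(ratio)  # ~48x fewer elements per step in latent space
```

Since that saving applies at every one of the dozens of denoising steps, it is the main reason LDMs run on consumer hardware.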

1