Submitted by dahdarknite t3_10r5gku in MachineLearning

Stable Diffusion seems to be a departure from the trend of building larger and larger models.

It has several times fewer parameters than other image generation models like DALL-E 2 (roughly 4x to 5x, going by the figures quoted below).

“Incredibly, compared with DALL-E 2 and Imagen, the Stable Diffusion model is a lot smaller. While DALL-E 2 has around 3.5 Billion parameters, and Imagen has 4.6 Billion, the first Stable Diffusion model has just 890 million parameters, which means it uses a lot less VRAM and can actually be run on consumer-grade graphics cards.”
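To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the parameter counts from the quote above (not independently verified) and assumes 2 bytes per parameter for fp16 and 4 bytes for fp32. It counts the weights alone, so real VRAM use during inference will be higher. The function names are made up for illustration.

```python
# Parameter counts exactly as quoted in the post (not independently verified).
PARAMS = {
    "Stable Diffusion v1": 890e6,
    "DALL-E 2": 3.5e9,
    "Imagen": 4.6e9,
}


def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB (activations are extra)."""
    return n_params * bytes_per_param / 2**30


def size_ratio(model: str, baseline: str = "Stable Diffusion v1") -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS[model] / PARAMS[baseline]


if __name__ == "__main__":
    for name, n in PARAMS.items():
        print(
            f"{name}: {size_ratio(name):.1f}x the size of SD, "
            f"fp16 weights {weight_memory_gib(n, 2):.2f} GiB, "
            f"fp32 weights {weight_memory_gib(n, 4):.2f} GiB"
        )
```

By these numbers the gap is closer to 4x to 5x than 10x, and the fp16 weights of the smallest model come to under 2 GiB, which is consistent with the claim that it fits on consumer-grade cards.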

What allows Stable Diffusion to work so well with far fewer parameters? Are there any drawbacks to this, such as Stable Diffusion needing more fine-tuning than DALL-E 2?

25

Comments


LetterRip t1_j6v57y5 wrote

Mostly the language model. Imagen uses T5-XXL (that is the 4.6 billion parameters), and DALL-E 2 uses GPT-3 (presumably the 2.7B version, not the much larger variants used for ChatGPT). SD just uses CLIP, with nothing else. The more sophisticated the language model, the better the image generator can understand what you want. CLIP is close to a bag-of-words model.
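For anyone unfamiliar with the term, here is a minimal sketch of what "bag of words" means. It does not run or measure CLIP. It only shows that a pure bag-of-words representation throws away word order, so two prompts with different meanings can look identical. The prompts and the `bag_of_words` helper are made up for illustration.

```python
from collections import Counter


def bag_of_words(prompt: str) -> Counter:
    """Count each word in the prompt, ignoring the order the words appear in."""
    return Counter(prompt.lower().split())


# Two hypothetical prompts that describe different scenes.
prompt_a = "a red cube on top of a blue sphere"
prompt_b = "a blue cube on top of a red sphere"

# The word counts are identical even though the prompts differ.
same_bag = bag_of_words(prompt_a) == bag_of_words(prompt_b)
print(same_bag)  # True
```

A text encoder that behaves this way would struggle to bind attributes to the right objects, which is the kind of weakness the comment is pointing at.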

18

Ne_Nel t1_j6va0z6 wrote

Pixel vs Latent.

17

Mefaso t1_j6vdzji wrote

Exactly, the entire point of Latent Diffusion Models was to make diffusion models smaller and faster.

8

uhules t1_j6wrx63 wrote

Except DALL-E 2 also applies diffusion in latent space and Imagen performs diffusion in low-res pixel space. My initial hunch was the upscaling diffusion models, but they account for a relatively small portion of the total number of parameters and are more relevant speed-wise. The lackluster explanation is simply "SD does latent better", since you'd need to do an extensive ablation study to compare rather different architectures.

4

Mefaso t1_j6z6zgt wrote

>DALL-E 2 also applies diffusion in latent space

Not really in the important part. DALL-E 2 uses diffusion in CLIP-"latent"-space and then conditions the pixel-diffusion model on the result.

However, they still do a full diffusion pass in pixel space, which is more complex than doing it in latent space, as LDMs do.
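A quick sketch of why the pixel vs. latent distinction matters: it counts how many values the denoiser has to work on at each step. The 8x spatial downsampling and 4 latent channels are assumptions taken from the Stable Diffusion v1 autoencoder, not from this thread, and the helper names are made up. This only compares tensor sizes. Actual compute does not scale exactly linearly with it.

```python
def num_elements(height: int, width: int, channels: int) -> int:
    """Number of values in an image-like tensor of the given shape."""
    return height * width * channels


def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4):
    """Shape of the latent for an image, assuming an SD-v1-style autoencoder."""
    return (height // downsample, width // downsample, latent_channels)


# A 512x512 RGB image in pixel space.
pixel_values = num_elements(512, 512, 3)

# The same image in latent space (assumed 8x downsampling, 4 channels).
latent_values = num_elements(*latent_shape(512, 512))

print(pixel_values, latent_values, pixel_values / latent_values)
# 786432 16384 48.0
```

Under these assumptions each denoising step in latent space touches about 48x fewer values than a full-resolution pixel-space pass.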

1

londons_explorer t1_j6wa910 wrote

It's a much smaller model, but IMO, the results are much lower quality too.

However, the fact that you can run it on your own PC means you can tweak all the settings and make many attempts at getting better results, which partially offsets that.

2

i_wayyy_over_think t1_j6wup4m wrote

Also, being able to easily fine-tune a model makes generations of your particular subject higher quality than what you can get from anything that’s not fine-tuned.

2