https://preview.redd.it/m2pg8yhahr4a1.png?width=2117&format=png&auto=webp&s=c6ef4cbef10f5d04045fb606e5123fb7a64f2ed5

Paper: What the DAAM: Interpreting Stable Diffusion Using Cross Attention (arXiv paper, codebase)

Abstract:

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.

Authors: Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture

Comments

CatalyzeX_code_bot t1_izgpngr wrote on December 8, 2022 at 11:50 PM

#889,737

Found relevant code at https://github.com/castorini/daam + all code implementations here

To opt out from receiving code links, DM me

Parzival_007 t1_izhqde5 wrote on December 9, 2022 at 4:42 AM

#891,334

Hi, I checked your work before you posted, and daam it's good. Well done !

tetrisdaemon OP t1_izhrg1k wrote on December 9, 2022 at 4:52 AM

#891,369

Replying to Parzival_007 (#891,334)

Thanks! We're actively improving it.

moschles t1_izhydos wrote on December 9, 2022 at 6:02 AM

#891,628

> To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research.

It seems like the lines of research here would be automated photo captioning.

tetrisdaemon OP t1_izi47x8 wrote on December 9, 2022 at 7:12 AM

#891,834

Replying to moschles (#891,628)

For sure, and how linguistics can guide Stable Diffusion to produce better images. For example, if we already understand how objects should relate on the language side (e.g., "a giraffe and a zebra" should probably produce two distinct animals, unlike that observed in the paper), we can twiddle the attention maps so that the giraffe and the zebra are separate.

calciumcitrate t1_izigomm wrote on December 9, 2022 at 10:07 AM

#892,144

/u/tetrisdaemon Any idea what part of the diffusion process might be causing the failure modes? (the latent representations, CLIP embeddings, or cross attention conditioning etc.)

My initial guess was that maybe the CLIP embeddings aren't fine grained enough to represent some relationships between entities in a sentence, but if I understand correctly, the cross-attention conditioning adds some additional text supervision (I'm assuming X in eq 4 and 5 is some transformer representation of the prompt) - and it does seem like some dependency relationship are being captured.

Purplekeyboard t1_izih5hd wrote on December 9, 2022 at 10:14 AM

#892,157

>descriptive adjectives attend too broadly.

If this means that words in a prompt modify the whole prompt and not just the phrase the word is part of, everyone who uses Stable Diffusion knows this. If your prompt is "girl, chair, sitting, computer, library, earrings, necklace, blonde hair, hat", and you modify that to specify "red chair", you're likely to also get a red hat, or now the girl will be wearing a red shirt, or various other parts of the image may turn red.

If you change the prompt from library to outdoors, and add the word snow, it will likely be snowing, but also the earrings or a pendant on the necklace may now be in the shape of a snowflake.

This is how stable diffusion works.

JClub t1_izij5x5 wrote on December 9, 2022 at 10:42 AM

#892,229

Hey! I'm the author of https://github.com/JoaoLages/diffusers-interpret

I have also tried to collect attentions in the diffusion process but the matrices with (text size, image size) were too big to keep in RAM/VRAM, how did you solve that problem?

tetrisdaemon OP t1_izjm0ov wrote on December 9, 2022 at 4:19 PM

#893,784

Replying to JClub (#892,229)

Cool, nicely done repository. Are you referring to the [16, 4096-ish, 77] cross-attention matrices? I maintained a streaming sum over matrices of the same size on a 64GB (though it does work with 32GB) RAM and 24GB VRAM machine.

tetrisdaemon OP t1_izjmb5s wrote on December 9, 2022 at 4:21 PM

#893,791

Replying to Purplekeyboard (#892,157)

This is a good observation. Actually, in the paper we try out "{rusty, wooden, metallic} shovel in a clean shed," and it still made the shed rusty. Moving forward, we do plan to do the same thing to the other ball prompt.

JClub t1_izjnf35 wrote on December 9, 2022 at 4:28 PM

#893,827

Replying to tetrisdaemon (#893,784)

Damn then this method can only run on such hardware, the attention weights are very heavy!

tetrisdaemon OP t1_izjp9nc wrote on December 9, 2022 at 4:40 PM

#893,898

Replying to calciumcitrate (#892,144)

I'm looking into it, but I'm guessing it's the CLIP embeddings, so disentanglement might need to happen at that level. Some supporting evidence is that even if we set the cross attention to zero (for some words), it'll still reflect in the final image, indicating that the word representations are mixed in CLIP.

tetrisdaemon OP t1_izk7fk0 wrote on December 9, 2022 at 6:33 PM

#894,640

Replying to JClub (#893,827)

Yeah, moving forward it might help to have a disk caching mode.

[R] What the DAAM: Interpreting Stable Diffusion and Uncovering Generation Entanglement