
calciumcitrate t1_izigomm wrote

/u/tetrisdaemon Any idea which part of the diffusion process might be causing the failure modes (the latent representations, the CLIP embeddings, the cross-attention conditioning, etc.)?

My initial guess was that the CLIP embeddings aren't fine-grained enough to represent some relationships between entities in a sentence. But if I understand correctly, the cross-attention conditioning adds some additional text supervision (I'm assuming X in eqs. 4 and 5 is some transformer representation of the prompt), and it does seem like some dependency relationships are being captured.
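For anyone following along, here's a toy numpy sketch of what I mean by cross-attention conditioning, assuming X is the (n_tokens, d) transformer output for the prompt (Stable Diffusion style: queries come from the image latents, keys/values from the text). The weight matrices are random placeholders, not the model's actual parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, x, Wq, Wk, Wv):
    """Condition image latents z (n_latents, d_img) on text features x (n_tokens, d_txt)."""
    Q = z @ Wq                                    # queries from image latents
    K = x @ Wk                                    # keys from the text representation X
    V = x @ Wv                                    # values from the text representation X
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_latents, n_tokens) attention map
    return A @ V                                  # text-conditioned latent update

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))       # 4 latent positions, dim 8
x = rng.normal(size=(5, 6))       # 5 prompt tokens, dim 6
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(6, 8))
Wv = rng.normal(size=(6, 8))
out = cross_attention(z, x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Each latent position gets a weighted mix of token values, so per-word supervision enters through the attention map A.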


tetrisdaemon OP t1_izjp9nc wrote

I'm looking into it, but I'm guessing it's the CLIP embeddings, so disentanglement might need to happen at that level. Some supporting evidence: even if we zero out the cross-attention for some words, those words still show up in the final image, which suggests the word representations are already mixed together inside CLIP.
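A toy numpy sketch of why zeroing a word's cross-attention wouldn't be enough (this is my illustration, not the paper's experiment): in a CLIP-like text encoder, self-attention mixes every token into every other token's output, so a word's content survives in the *other* tokens' embeddings even if you later mask its own column.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W):
    """One toy self-attention layer with a shared random projection W."""
    Q, K, V = x @ W, x @ W, x @ W
    A = softmax(Q @ K.T / np.sqrt(x.shape[-1]))
    return A @ V

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))    # 5 token embeddings, dim 8
W = rng.normal(size=(8, 8))

y1 = self_attention(x, W)
x2 = x.copy()
x2[2] += 1.0                   # perturb token 2 only
y2 = self_attention(x2, W)

# Every token's output changes, not just token 2's: token 2's content has
# leaked into all positions, so masking only token 2 downstream won't erase it.
changed = np.abs(y2 - y1).sum(axis=1)
print((changed > 1e-6).all())  # True
```

That leakage is why disentanglement would have to happen inside (or before) the text encoder, rather than at the cross-attention stage.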
