
actualsnek t1_j4931de wrote

Text2image generation models do anecdotally appear to be better than image-text matching models at compositional tasks, but if you look closely at some generated images, you'll notice compositional failures. They often apply properties to entities other than the ones the text described them as applying to, or swap the described relation between entities for a more common relation between those entities.

Try a prompt like "man with dog ears running in the park", and it'll generate images of a man with a dog (sometimes with amplified ears) running in the park. Why? Because these models don't have an underlying ability to build compositional representations; they simply approximate their training data distribution.

Examples like "a raccoon in a spacesuit playing poker" often do well because spacesuits are only ever worn and poker is only ever played (i.e. relations that are common in the training distribution). Try a prompt like "a raccoon sitting on a poker chip and holding a spacesuit" and you'll see pretty drastic failures.

All this being said, generative models *still* appear better than discriminative models on vision-language compositionality tasks, and our current work is exploring approaches to impart this ability to discriminative models to solve tasks like Winoground.
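For context on what "solving" Winoground means: each example pairs two images with two captions that use the same words in a different order, and a model's matching score has to pair them up correctly. Here's a minimal sketch of that scoring scheme, with a toy similarity function standing in for a real image-text matcher (the score values are made up for illustration):

```python
# Winoground-style scoring: each example has two captions (c0, c1) and two
# images (i0, i1). `sim(c, i)` is any image-text matching score; here it's a
# stub lookup table with hypothetical values in place of real model scores.

def text_correct(sim, c0, c1, i0, i1):
    # For each image, the matching caption must score higher than the foil.
    return sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1)

def image_correct(sim, c0, c1, i0, i1):
    # For each caption, the matching image must score higher than the foil.
    return sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0)

def group_correct(sim, c0, c1, i0, i1):
    # The example only counts as solved if both directions are correct.
    return (text_correct(sim, c0, c1, i0, i1)
            and image_correct(sim, c0, c1, i0, i1))

# Toy score table (hypothetical numbers, not real model outputs):
scores = {("c0", "i0"): 0.9, ("c1", "i0"): 0.4,
          ("c0", "i1"): 0.3, ("c1", "i1"): 0.8}
sim = lambda c, i: scores[(c, i)]

print(group_correct(sim, "c0", "c1", "i0", "i1"))  # True for this table
```

The group score is what makes the benchmark hard: a discriminative model that relies on bag-of-words cues will score the reordered caption about the same for both images and fail both directions at once.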
