
giga-chad99 t1_j45pa30 wrote

Regarding that Winoground paper: Isn't compositionality what DALL-E 2, Imagen, Parti, etc. are famous for? Like the avocado chair, or very specific images like "a raccoon in a spacesuit playing poker". SOTA vision-language models are the only models that actually show convincing compositionality, or am I wrong?

4

chaosmosis t1_j45vdll wrote

With enough scale we get crude compositionality, yes. That trend will probably continue, but I don't think it'll take us to the moon.

3

yldedly t1_j45ycm8 wrote

>With enough scale we get crude compositionality, yes.

Depends on exactly what we mean. To take a simple example, if you have cos(x) and x^2, you can compose these to produce cos(x)^2 (or cos(x^2)). You can approximate the composition using a neural network if you have enough data on some interval x ∈ [a, b]. It will work well even for x that weren't part of the training set, as long as they lie in the interval; outside the interval the approximation will be bad. But if you take cos(x), x^2 and compose(f, g) as building blocks, and search for a combination of these that approximates the data, the approximation will be good for all real numbers.
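Here's a minimal sketch of that point (the interval, network size, and use of scikit-learn's MLPRegressor are just illustrative assumptions, not from any paper): an MLP fit to cos(x)^2 on [a, b] matches it well inside the interval but breaks down outside, while the explicit composition of the building blocks is exact everywhere.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

a, b = -3.0, 3.0
rng = np.random.default_rng(0)

# Training data only covers the interval [a, b]
x_train = rng.uniform(a, b, size=(2000, 1))
y_train = np.cos(x_train).ravel() ** 2  # the target composition cos(x)^2

net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="tanh",
                   max_iter=2000, random_state=0)
net.fit(x_train, y_train)

# Evaluate inside vs. outside the training interval
x_in = np.linspace(a, b, 200).reshape(-1, 1)
x_out = np.linspace(b, b + 6, 200).reshape(-1, 1)

for name, xs in [("inside [a, b]", x_in), ("outside [a, b]", x_out)]:
    true = np.cos(xs).ravel() ** 2        # composing the building blocks directly
    pred = net.predict(xs)                # the learned approximation
    print(name, "mean abs error:", np.abs(true - pred).mean())
```

The learned approximation typically shows a small error on the first line and a much larger one on the second, whereas composing cos and squaring symbolically has zero error everywhere.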

In the same way, you can learn a concept like "subject, preposition, object A, transitive verb, object B", where e.g. subject = "raccoon", preposition = "in a", object A = "spacesuit", transitive verb = "playing" and object B = "poker", by approximating it with a neural network, and it will work well if you have enough data covering some high-dimensional subspace. But it won't work for arbitrary substitutions. Is it fair to call that crude compositionality?

4

actualsnek t1_j4931de wrote

Text-to-image generation models do anecdotally appear to be better than image-text matching models at compositional tasks, but if you look closely at some generated images, you'll notice compositional failures. They often attach properties to entities the text didn't ascribe them to, or swap the described relation between entities for a more common relation between those entities.

Try a prompt like "man with dog ears running in the park", and it'll generate images of a man with a dog (sometimes with amplified ears) running in the park. Why? Because these models don't have the underlying ability to build compositional representations; they simply approximate their training data distribution.

Examples like "a raccoon in a spacesuit playing poker" often do well because spacesuits are only ever worn and poker is only ever played (i.e. relations that are common in the training distribution). Try a prompt like "a raccoon sitting on a poker chip and holding a spacesuit" and you'll see pretty drastic failures.

All this being said, generative models *still* appear better than discriminative models on vision-language compositionality tasks, and our current work is exploring approaches to impart this ability to discriminative models to solve tasks like Winoground.

3