actualsnek t1_j44m1z9 wrote

Compositionality is an increasingly significant concern across many subfields of deep learning. Winoground recently showed that all state-of-the-art vision-language models drastically fail to comprehend compositional structure, a feature many linguists would argue is fundamental to the expressive power of language.

Smolensky is also a great guy and was affiliated with the PDP group that developed backprop in the 80's. The best path to neurosymbolic computing & compositional reasoning remains unclear, but Smolensky and his student Tom McCoy have done some great work over the last few years exploring how symbolic structures are implicitly represented in neural nets.

12

giga-chad99 t1_j45pa30 wrote

Regarding that Winoground paper: Isn't compositionality what DALL-E 2, Imagen, Parti, etc. are famous for? Like the avocado chair, or very specific images like "a raccoon in a spacesuit playing poker". SOTA vision-language models are the only models that actually show convincing compositionality, or am I wrong?

4

chaosmosis t1_j45vdll wrote

With enough scale we get crude compositionality, yes. That trend will probably continue, but I don't think it'll take us to the moon.

3

yldedly t1_j45ycm8 wrote

>With enough scale we get crude compositionality, yes.

Depends on exactly what we mean. To take a simple example, if you have cos(x) and x^2, you can compose these to produce cos(x)^2 (or cos(x^2)). You can approximate the composition with a neural network if you have enough data on some interval x in [a, b]. It will work well even for x that weren't in the training set, as long as they lie in the interval; outside the interval the approximation will be bad. But if you take cos(x), x^2 and compose(f, g) as building blocks, and search for a combination of these that approximates the data, the approximation will be good for all real numbers.
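
Rough sketch of what I mean, taking the cos(x^2) case (an sklearn MLP as an arbitrary stand-in for the approximator; none of the specifics matter):

```python
# An MLP fits cos(x^2) well on its training interval but extrapolates poorly
# outside it, while the explicit composition compose(cos, square) is exact
# for any real x.
import numpy as np
from sklearn.neural_network import MLPRegressor

def compose(f, g):
    return lambda x: f(g(x))

def square(x):
    return x ** 2

target = compose(np.cos, square)                   # the "true" program: cos(x^2)

x_train = np.linspace(-2, 2, 2000).reshape(-1, 1)  # data only on [-2, 2]
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net.fit(x_train, target(x_train).ravel())

for x in [1.0, 1.9, 5.0, 10.0]:                    # last two lie outside [-2, 2]
    pred = net.predict(np.array([[x]]))[0]
    print(f"x={x:5.1f}  true={target(x):+.3f}  mlp={pred:+.3f}")
```

Inside [-2, 2] the fit should be near-exact; at x = 5 or 10 the net's output is essentially arbitrary, while compose(np.cos, square) is exact everywhere by construction.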

In the same way, you can learn a concept like "subject, preposition, object A, transitive verb, object B", where e.g. subject = "raccoon", preposition = "in a", object A = "spacesuit", transitive verb = "playing" and object B = "poker", by approximating it with a neural network, and it will work well if you have enough data in some high-dimensional subspace. But it won't work for arbitrary substitutions. Is it fair to call that crude compositionality?
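
To put the substitution point in toy form (purely illustrative):

```python
# A symbolic template accepts any slot filling, including combinations
# that never co-occur in a training set.
template = "{subject} {preposition} {object_a} {verb} {object_b}"

fillings = [
    # common co-occurrences: spacesuits are worn, poker is played
    dict(subject="a raccoon", preposition="in a", object_a="spacesuit",
         verb="playing", object_b="poker"),
    # rare pairing, but compositionally just as valid
    dict(subject="a raccoon", preposition="on a", object_a="poker chip",
         verb="holding", object_b="a spacesuit"),
]
for f in fillings:
    print(template.format(**f))
```

The template treats both fillings identically; a model that only approximates the training distribution tends to handle the first and garble the second.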

4

actualsnek t1_j4931de wrote

Text-to-image generation models do anecdotally appear to be better than image-text matching models at compositional tasks, but if you look closely at generated images, you'll notice compositional failures. They often attach properties to entities the text didn't ascribe them to, or swap the described relation between entities for a more common one.

Try a prompt like "man with dog ears running in the park", and it'll generate images of a man with a dog (sometimes with amplified ears) running in the park. Why? Because these models don't have the underlying ability to build compositional representations; they simply approximate their training data distribution.

Examples like "a raccoon in a spacesuit playing poker" often do well because spacesuits are only ever worn and poker is only ever played (i.e. relations that are common in the training distribution). Try a prompt like "a raccoon sitting on a poker chip and holding a spacesuit" and you'll see pretty drastic failures.

All this being said, generative models *still* appear better than discriminative models for vision-language compositionality tasks, and our current work is exploring approaches to impart this ability to discriminative models to solve tasks like Winoground.

3

visarga t1_j46a4x8 wrote

Would a dataset engineering approach work here? Generate and solve training problems with compositional structure; after enough examples it should generalise.

2

actualsnek t1_j493haq wrote

We're exploring some data augmentation approaches right now (see my response to u/giga-chad99) but how would you propose generating those problems with compositional structure?

1

visarga t1_j4cqkkb wrote

Sometimes you can exploit asymmetric difficulty. For example, factorising polynomials is hard, but multiplying a bunch of degree-1 polynomials is easy. So you can generate data for free, and it will be very diverse. The data has a compositional structure, so the task forces the model to apply rules correctly rather than overfit.
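
Something like this, with sympy as a stand-in (the degrees and ranges are arbitrary):

```python
# Multiplying degree-1 factors is the easy direction, so
# (expanded polynomial -> factored form) training pairs come for free.
import random
import sympy as sp

x = sp.symbols("x")

def make_pair(max_degree=4, root_range=5):
    degree = random.randint(2, max_degree)
    roots = [random.randint(-root_range, root_range) for _ in range(degree)]
    factored = sp.Integer(1)
    for r in roots:
        factored *= (x - r)               # easy direction: just multiply
    return sp.expand(factored), factored  # (model input, model target)

random.seed(0)
for _ in range(3):
    expanded, factored = make_pair()
    print(f"factor {expanded}  ->  {factored}")
```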

Taking derivatives and integrals is similar - easy one way, hard the other way. And solving the task will teach the model something about symbolic manipulation.
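
Same trick for integration: differentiate random expressions (the basis here is an arbitrary choice) and you get (integrand, antiderivative) pairs for free.

```python
# Differentiation is easy, integration is hard: differentiate a random
# expression, then train on (derivative -> original expression) pairs.
import random
import sympy as sp

x = sp.symbols("x")
basis = [sp.sin(x), sp.cos(x), sp.exp(x), x**2, x**3, sp.log(x + 2)]

random.seed(0)
for _ in range(3):
    f = sum(random.randint(1, 3) * random.choice(basis) for _ in range(3))
    print(f"integrate {sp.diff(f, x)}  ->  {f}")
```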

More generally you can use an external process, a simulator, an algorithm or a search engine to obtain a transformation of input X to Y, then learn to predict Y from X or X from Y. "Given this partial game of chess, predict who wins" and such. If X has compositional structure, solving the task would teach the model how to generalise, because you can generate as much data as necessary to force it not to overfit.
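
For the chess case, a crude sketch using the python-chess package, with random self-play standing in for a real simulator or engine:

```python
# The "external process" here is just python-chess playing random moves;
# keep a partial prefix as input X and the final result as label Y.
import random
import chess

random.seed(0)
board = chess.Board()
moves = []
while not board.is_game_over():
    move = random.choice(list(board.legal_moves))
    moves.append(board.san(move))        # record the move before pushing it
    board.push(move)

partial_game = " ".join(moves[:20])      # input X: a partial game
outcome = board.result()                 # target Y: "1-0", "0-1" or "1/2-1/2"
print(partial_game, "->", outcome)
```

In practice you'd want a proper engine rather than random play to get meaningful labels, but the data-generation shape is the same: the external process supplies the X -> Y transformation, and the model learns to predict it.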

2