actualsnek

actualsnek t1_j495737 wrote

Crazy that this has even a single upvote, and proof that this subreddit is no longer the community for academic discourse it once was. Do you know who Paul Smolensky is? He practically invented the term "neuro-symbolic" and was virtually the only researcher seriously working on it in the 20 years leading up to the deep learning revolution. Harmonic Grammar, Optimality Theory, Tensor Product Representations. Please pick up just about any article on connectionism from before 2010.

No, this is not a new term for neuro-symbolic computing (which is now just a buzzword applicable to half of the field); it's a specific theoretical take on how compositional structure could be captured by vectorial representations.
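For anyone unfamiliar, the core mechanism behind Smolensky's Tensor Product Representations is binding "filler" vectors to "role" vectors via outer products and superposing the results. A minimal numpy sketch (dimensions and variable names are purely illustrative, not from any particular paper):

```python
import numpy as np

# Tensor Product Representation, minimal sketch.
# A structure is encoded as sum_i (filler_i (x) role_i).
rng = np.random.default_rng(0)
d = 8  # dimensionality of filler and role vectors (arbitrary choice)

# Orthonormal role vectors; orthogonality makes unbinding exact
roles, _ = np.linalg.qr(rng.normal(size=(d, d)))
r_agent, r_patient = roles[:, 0], roles[:, 1]

# Filler vectors standing in for the symbols "dog" and "man"
f_dog, f_man = rng.normal(size=d), rng.normal(size=d)

# Bind each filler to its role with an outer product, then superpose:
# this encodes the structure {agent: dog, patient: man} in one tensor
T = np.outer(f_dog, r_agent) + np.outer(f_man, r_patient)

# Unbind: since the roles are orthonormal, T @ r_agent recovers f_dog
recovered = T @ r_agent
```

The point is that the composite vector `T` carries genuine structure: you can query it by role and recover the original filler, which is exactly the kind of compositional behavior the post is about.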

2

actualsnek t1_j4931de wrote

Text2image generation models do anecdotally appear to be better than image-text matching models at compositional tasks, but if you look closely at some generated images, you'll notice compositional failures. They often attach properties to entities the text did not ascribe them to, or swap the described relation between entities for a more common relation between those entities.

Try a prompt like "man with dog ears running in the park", and it'll generate images of a man with a dog (sometimes with amplified ears) running in the park. Why? Because models don't have the underlying ability to create compositional representations, they instead simply approximate their training data distribution.

Examples like "a raccoon in a spacesuit playing poker" often do well because spacesuits are only ever worn and poker is only ever played (i.e. relations that are common in the training distribution). Try a prompt like "a raccoon sitting on a poker chip and holding a spacesuit" and you'll see pretty drastic failures.

All this being said, generative models *still* appear better than discriminative models for vision-language compositionality tasks, and our current work is exploring approaches to impart this ability onto discriminative models to solve tasks like Winoground.

3

actualsnek t1_j44m1z9 wrote

Compositionality is increasingly a significant area of concern across many subfields of deep learning. Winoground recently showed that all state-of-the-art vision-language models drastically fail to comprehend compositional structure, a feature which many linguists would argue is fundamental to the expressive power of language.

Smolensky is also a great guy and was affiliated with the PDP group that popularized backprop in the '80s. The best path to neurosymbolic computing & compositional reasoning remains unclear, but Smolensky and his student Tom McCoy have done some great work over the last few years exploring how symbolic structures are implicitly represented in neural nets.

12