Submitted by These-Assignment-936 t3_10xjwac in MachineLearning

Just finished reading the Stanford/Google survey paper (https://arxiv.org/abs/2206.07682) on emergent abilities of large language models. It made me wonder: do image generation models have emergent abilities, too? Do we know?

I can't quite wrap my head around what such an ability would even look like. Figured maybe other folks had given this a think.

82

Comments


nielsrolf t1_j7stpek wrote

Parti (https://parti.research.google/) showed that being able to spell is an emergent ability. That is the only one I know of, but another I could imagine is compositional understanding (a blue box between a yellow sphere and a green box), though it's more likely that this is a data issue. Working out of distribution (a green dog) is another potential candidate. Interesting question
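If you want to poke at these yourself, here's a minimal sketch of the kind of probing prompts, assuming the Hugging Face diffusers library and a public Stable Diffusion checkpoint (Parti itself isn't released, so the checkpoint name here is just a stand-in):

```python
# Minimal sketch: probe spelling, compositionality, and out-of-distribution
# prompts with an open text-to-image model. Assumes `diffusers`, `transformers`,
# and `torch` are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    'a sign that says "EMERGENCE"',                         # spelling
    "a blue box between a yellow sphere and a green box",   # compositionality
    "a green dog",                                          # out of distribution
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"probe_{i}.png")
```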

48

edjez t1_j7t9rp3 wrote

Another emergent capability, which depends on the model architecture (I don't think Stable Diffusion could have it, but DALL-E does), is generating written letters / "captions" that look like gibberish to us but actually correspond to internal language embeddings for real-world clusters of concepts.

2

andreichiffa t1_j7t9ul8 wrote

I am pretty sure that was an Anthropic paper first (Predictability and Surprise in Large Generative Models). Makes me truly wonder WTF exactly is going on in Google lately.

As to your question, no one has stacked enough attention layers yet, but there is a very high probability that someone will. Someone already mentioned the ability to spell, but it could also potentially help with things such as hands, the number of hands/feet/legs/arms/paws/tails, and other details that make a lot of generated images today disturbing.

The issue will most likely be with finding enough data, given that unlike text, most images on the internet are copyrighted (cough Getty cough).

6

ID4gotten t1_j7taz5k wrote

Some 3-dimensional understanding and up/down/gravity seem possible. I think examples of light/shadow/reflection have already been shown. I can't see how it could ever do full ray tracing, but maybe there are heuristics (or overfitting) to be found.

19

DigThatData t1_j7tb03a wrote

I'm not sure that's an emergent ability so much as it is explicitly what the model is being trained to learn. It's not surprising to me that there is a "painting signature" concept it has learned and samples from when it generates gibberish of a particular length and size in the bottom-right corner (for example). That sounds like one of the easier "concepts" it would have learned.

11

master3243 t1_j7tmpsz wrote

Exactly. The CLIP part at the front of the whole DALL-E model is trained to take any English text and map it to an embedding space.

It's completely natural (and it would probably be surprising if it didn't happen) that CLIP would map (some) gibberish words to a part of the embedding space that is sufficiently close in L2 distance to the projection of a real word.

In that case, the diffusion model would decode that gibberish word into an image similar to the one generated from the real word.
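A rough way to check that intuition, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint (the actual DALL-E text encoder isn't available, so this is only illustrative):

```python
# Sketch: measure how far a gibberish caption lands from real captions
# in CLIP's text embedding space (L2 distance). Illustrative only.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# real caption, arbitrary gibberish, real caption
texts = ["a photo of a bird", "vicootes plimf", "a photo of a car"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)  # shape (3, 512)

# If the gibberish happens to embed close to one of the real captions,
# a decoder conditioned on it would likely produce a similar image.
print("dist to 'bird':", torch.norm(emb[1] - emb[0]).item())
print("dist to 'car': ", torch.norm(emb[1] - emb[2]).item())
```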

2

mongoosefist t1_j7tt6a2 wrote

Emergent behaviour is called such because we don't yet have the ability to predict it; we can only observe it and deduce where it emerged after the fact. So the fact that you can't wrap your head around what such an ability would look like makes perfect sense!

If we're speculating, I'd put my money on /u/ID4gotten's answer. I bet one of these models starts integrating some intuition of physical laws.

13

londons_explorer t1_j7u38tk wrote

Shadows and the way light interacts/reflects/refracts seem to be emergent behaviour of diffusion image models.

Ask for "A koala next to a glistening wine glass", and you'll probably get cool optical effects on the koala that the model has never seen before.

13

the_new_scientist t1_j7vu5fk wrote

Yes, the DINO paper showed that the ability to perform segmentation emerges from self-supervised vision transformers.

https://arxiv.org/abs/2104.14294
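For anyone curious, a rough sketch of the effect they show, assuming PyTorch with the facebookresearch/dino torch.hub entry (the `get_last_selfattention` helper comes from that repo):

```python
# Sketch: the [CLS] token's self-attention in a DINO ViT often looks like
# an object segmentation mask, even though the model was never trained on masks.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = preprocess(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, num_heads, tokens, tokens)

# Attention from the [CLS] token to each 16x16 patch, one 14x14 map per head.
cls_attn = attn[0, :, 0, 1:].reshape(attn.shape[1], 14, 14)
```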

Edit: oops, didn't realize you said image generation models, thought you asked for just vision models.

5

currentscurrents t1_j7wk84r wrote

While those are on the same topic, they're very different papers. The Anthropic paper spends most of its time going on about safety/bias/toxicity, while the Google paper is focused on more useful things like the technical abilities of the models.

1

visarga t1_j7yg9mo wrote

Combining objects and styles never seen together in the training set in a plausible way (a baby daikon radish in a tutu walking a dog).

1

_eminorhan_ t1_j7zwlu9 wrote

People should be more skeptical of "emergent abilities" in big models: 1) papers claiming such abilities generally use undertrained small models by Chinchilla scaling standards (compute is not controlled, plus suboptimal hyperparameter choices for the small models), and 2) these papers generally use a semilog-x plot to demonstrate "emergence", but even a linear relationship will look exponential in such a plot. I'm not sure I'd want to call a simple linear relationship "emergent".
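To see the second point concretely, here is a tiny matplotlib sketch: a metric that grows perfectly linearly with compute already looks like a sharp jump once the x-axis is log-scaled.

```python
# Sketch: a purely linear relationship looks "emergent" on a semilog-x plot.
import numpy as np
import matplotlib.pyplot as plt

compute = np.logspace(0, 6, 200)   # "compute" spanning six orders of magnitude
metric = 1e-6 * compute            # metric grows *linearly* with compute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(compute, metric)          # linear axes: a plain straight line
ax1.set_title("linear x-axis")
ax2.semilogx(compute, metric)      # log x-axis: looks like a sudden jump
ax2.set_title("log x-axis")
plt.tight_layout()
plt.show()
```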

2