Submitted by These-Assignment-936 t3_10xjwac in MachineLearning

Just finished reading the Stanford/Google survey paper (https://arxiv.org/abs/2206.07682) on emergent abilities of large language models. It made me wonder: do image generation models have emergent abilities, too? Do we know?

I can't quite wrap my head around what such an ability would even look like. Figured maybe other folks had given this a think.

82

Comments


nielsrolf t1_j7stpek wrote

Parti (https://parti.research.google/) showed that being able to spell is an emergent ability. That's the only one I know of, but another I could imagine is compositional generation (a blue box between a yellow sphere and a green box), though it's more likely that that's a data issue. Working out of distribution (a green dog) is another potential candidate. Interesting question
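
Not Parti-specific (Parti itself isn't publicly available), but here's a rough sketch of how you could probe an open diffusion model with these kinds of prompts via Hugging Face diffusers; the model id and prompts are just placeholders:

```python
# Sketch: probe a text-to-image model with compositional and
# out-of-distribution prompts. Model id and prompts are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a blue box between a yellow sphere and a green box",  # compositionality
    "a green dog",                                          # out of distribution
]

for prompt in prompts:
    image = pipe(prompt).images[0]
    image.save(prompt.replace(" ", "_") + ".png")
```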

48

These-Assignment-936 OP t1_j7t1t2v wrote

Wow that’s very cool!

7

nielsrolf t1_j7twdn2 wrote

I thought about it again, and another candidate is inheriting LLM capabilities: if you prompt it for "a screenshot of a Python method that does xyz", the best solution would be an image that contains working code.

9

visarga t1_j7yftij wrote

There are language models that work without tokens: they operate on the raw pixels of images of text. I can't find the link, Google is not helping me much.

2

ID4gotten t1_j7taz5k wrote

Some 3-dimensional understanding and a sense of up/down/gravity seem possible. I think examples of light/shadow/reflection have already been shown. I can't see how it could ever do full ray tracing, but maybe there are heuristics (or overfitting) to be found.

19

mongoosefist t1_j7tt6a2 wrote

Emergent behaviour is called such because we don't yet have the ability to predict it; we can only observe it and deduce where it emerged after the fact. So, the fact that you can't wrap your head around what such an ability would look like makes perfect sense!

If we're speculating I'd put my money on /u/ID4gotten 's answer. I bet one of these models starts integrating some intuition of physical laws.

13

londons_explorer t1_j7u38tk wrote

Shadows and the way light interacts/reflects/refracts seem to be emergent behaviours of diffusion image models.

Ask for "A koala next to a glistening wine glass", and you'll probably get cool optical effects on the koala that the model has never seen before.

13

Insecure--Login t1_j82e1bn wrote

>and you'll probably get cool optical effects on the koala that the model has never seen before

How could we be absolutely certain the model has never seen said effects?

1

londons_explorer t1_j835xx0 wrote

You'd search the training image database for pictures of koalas with wine glasses. There won't be many examples in there, so you check each one.
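
For a first pass you wouldn't even need to eyeball everything; here's a rough sketch of pre-filtering with CLIP so only a handful of candidates need manual review (the directory, model id, and threshold are made up for illustration):

```python
# Sketch: pre-filter a large image set with CLIP so only a handful of
# "koala + wine glass" candidates need manual review.
# The directory, model id, and threshold are made up for illustration.
from pathlib import Path
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a koala next to a wine glass"
candidates = []

for path in Path("training_images").glob("*.jpg"):
    image = Image.open(path).convert("RGB")
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        score = model(**inputs).logits_per_image.item()  # image-text similarity logit
    if score > 25.0:  # arbitrary cutoff; tune it on a small labelled sample
        candidates.append((path, score))

# Highest-scoring candidates first, for manual inspection
for path, score in sorted(candidates, key=lambda c: -c[1])[:50]:
    print(f"{score:.1f}  {path}")
```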

2

Insecure--Login t1_j86i4gp wrote

You would have to search millions to billions of images manually; that sounds very expensive. And searching using a detection model is not accurate enough.

1

andreichiffa t1_j7t9ul8 wrote

I am pretty sure that was an Anthropic paper first (Predictability and Surprise in Large Generative Models). Makes me truly wonder WTF exactly is going on in Google lately.

As to your question, no one has stacked enough attention layers yet, but there's a very high probability that someone will. Someone already mentioned the ability to spell, but it could potentially help with things such as hands, the number of hands/feet/legs/arms/paws/tails, and other things that make a lot of generated images today disturbing.

The issue will most likely be with finding enough data, given that, unlike text, most images on the internet are copyrighted (cough Getty cough).

6

currentscurrents t1_j7wk84r wrote

While those are on the same topic, they're very different papers. The Anthropic paper spends most of its time going on about safety/bias/toxicity, while the Google paper is focused on more useful things like the technical abilities of the models.

1

the_new_scientist t1_j7vu5fk wrote

Yes, the DINO paper showed that the ability to perform segmentation emerges from self-supervised vision transformers.

https://arxiv.org/abs/2104.14294

Edit: oops, didn't realize you said image generation models, thought you asked for just vision models.
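
A minimal sketch of what that observation looks like in practice, pulling the [CLS] self-attention out of a DINO ViT (model id and image URL are just examples, not the paper's exact visualization code):

```python
# Sketch: the [CLS] self-attention of a DINO ViT acts like a rough
# segmentation mask even though it was never trained for segmentation.
# Model id and image URL are just examples, not the paper's exact code.
import requests
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("facebook/dino-vits8")
model = ViTModel.from_pretrained("facebook/dino-vits8", add_pooling_layer=False)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention of the [CLS] token over image patches in the last layer,
# averaged over heads, reshaped into a coarse spatial map.
cls_attn = out.attentions[-1][0].mean(dim=0)[0, 1:]  # (num_patches,)
side = int(cls_attn.numel() ** 0.5)
seg_map = cls_attn.reshape(side, side)
print(seg_map.shape)  # e.g. torch.Size([28, 28]) for a 224x224 input
```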

5

edjez t1_j7t9rp3 wrote

Another emergent capability - and this depends on the model architecture; for example, I don't think Stable Diffusion could have it, but DALL-E does - is generating written letters / "captions" that look like gibberish to us but actually correspond to internal language embeddings for real-world clusters of concepts.

2

DigThatData t1_j7tb03a wrote

i'm not sure that's an emergent ability so much as it is explicitly what the model is being trained to learn. it's not surprising to me that there is a "painting signature" concept it has learned and samples from when it generates gibberish of a particular length and size in the bottom right corner (for example). that sounds like one of the easier "concepts" it would have learned.

11

master3243 t1_j7tmpsz wrote

Exactly, the CLIP component at the front of the DALL-E model is trained to take any English text and map it to an embedding space.

It's completely natural (and probably surprising if it doesn't happen) that CLIP would map (some) gibberish words to a part of the embedding space that is sufficiently close, in L2 distance, to the embedding of a real word.

In that case, the diffusion model would decode that gibberish word into an image similar to one generated from the real word.
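
A toy sketch of the idea: embed a real caption and a made-up gibberish string with CLIP's text encoder and compare their distance (the gibberish here is invented, not an actual "hidden vocabulary" token):

```python
# Toy sketch: embed a real caption and a made-up gibberish string with
# CLIP's text encoder and compare them. The gibberish is invented here,
# not an actual "hidden vocabulary" token.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a bird", "vharntil ozquib"]  # real caption vs. gibberish
inputs = tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)  # (2, 512) text embeddings

l2 = torch.dist(emb[0], emb[1])  # L2 distance in embedding space
cos = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"L2 distance: {l2:.3f}, cosine similarity: {cos:.3f}")
```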

2

amnezzia t1_j7tav4g wrote

You mean it takes a mean vector of a cluster and makes up a word for it?

1

Mescallan t1_j7tblf5 wrote

"Word" might not be correct, as it implies a consistent alphabet, but semantics aside, yes, I believe that is what is happening.

1

xenophobe3691 t1_j7xm8wm wrote

Sounds like that story of the guy from 40k who pretty much looked for the underlying connections between all the different kinds of beauty and joy. He found “It” alright…

1

_eminorhan_ t1_j7zwlu9 wrote

People should be more skeptical of "emergent abilities" in big models: 1) papers claiming such abilities generally use undertrained small models by Chinchilla scaling standards (compute is not controlled, plus suboptimal hyperparameter choices for the small models), and 2) these papers generally use a semilogx plot to demonstrate "emergence", but even a linear relationship will look exponential on such a plot. I'm not sure I'd want to call a simple linear relationship "emergent".
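
Quick illustration of point 2 with made-up numbers: a strictly linear relationship between compute and performance looks like a sharp jump once the x-axis is logarithmic:

```python
# Toy illustration: a strictly linear relationship between compute and
# performance looks like a sudden "emergent" jump on a semilogx plot.
import numpy as np
import matplotlib.pyplot as plt

compute = np.linspace(1, 1e6, 500)   # hypothetical compute budget
performance = compute / 1e6          # perfectly linear in compute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(compute, performance)
ax1.set_title("linear x-axis: clearly linear")
ax2.semilogx(compute, performance)
ax2.set_title("log x-axis: looks 'emergent'")
for ax in (ax1, ax2):
    ax.set_xlabel("compute")
    ax.set_ylabel("performance")
plt.tight_layout()
plt.show()
```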

2

visarga t1_j7yg9mo wrote

Combining objects and styles never seen together in the training set in a plausible way (a baby daikon radish in a tutu walking a dog).

1