
moschles t1_j62iaxi wrote

GANs produce an image all at once, "cut from whole cloth."

Diffusion models use a trick: between rounds of incremental noise removal, they perform a super-resolution round.

Technically speaking, you could start from GAN output and then take it through rounds of super-resolution; the result would look a lot like what diffusion models produce. This raises the question of how the new details would be guided, or more technically, what the super-resolution features would be conditioned on. If you are going to condition them on text embeddings, you might as well condition the whole process on the same embedding, at which point you just have a diffusion model.

A second weakness of GANs is the narrowness of their variety: when made to produce outputs corresponding to a category like "dog", they tend to produce nearly the same dog each time (the well-known mode-collapse problem).


moschles t1_j4nch5w wrote

> Seems to be derived by observing that the most promising work in robotics today (where generating data is challenging) is coming from piggy-backing on the success of large language models (think SayCan etc).

There is nothing really magical being claimed here. The LLMs undergo unsupervised training, essentially by learning to undo distortions of the text. (One type of "distortion" is Cloze deletion, i.e. masking out tokens, but there are others in the panoply of text corruptions.)

Unsupervised training avoids the bottleneck of having to manually pre-label your dataset.

When we translate unsupervised training to the robotics domain, what does that look like? Perhaps "next word prediction" is analogous to "next second prediction" of a physical environment, and Cloze deletion has an analogue in the probabilistic in-painting done by existing diffusion models.

That's the way I see it. I'm not particularly sold on the idea that the pretraining would be a literal LLM trained on text, ported seamlessly to the robotics domain. If I'm wrong, set me straight.
