
visarga t1_j0whrvn wrote

DALL-E 1, Flamingo and Gato are like that. It is possible to concatenate the image tokens with the text tokens and have the model learn cross-modal inference.
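A minimal sketch of that idea in Python (everything here is hypothetical: the vocab sizes, the tiny transformer, and the discrete image tokenizer are stand-ins, not the actual packaging of any of those models):

```python
import torch
import torch.nn as nn

# Hypothetical vocab sizes: text tokens from a BPE tokenizer, image tokens
# from a discrete codebook (DALL-E 1 style dVAE).
TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8_192

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # One shared embedding table over the joint text + image vocabulary.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, text_ids, image_ids):
        # Offset image ids into their own slice of the vocabulary, then just
        # concatenate the two token streams along the sequence dimension.
        tokens = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        # Logits over the joint vocabulary; a real model would also add
        # positional embeddings and train autoregressively on next tokens.
        return self.head(h)

model = TinyMultimodalLM()
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))    # a caption
image_ids = torch.randint(0, IMAGE_VOCAB, (2, 32))  # a tokenised image
logits = model(text_ids, image_ids)
```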

Another way is to use a very large collection of text-image pairs and train a pair of encoders to match the right text to the right image (CLIP).
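Roughly, the training objective looks like this (a toy sketch of a CLIP-style symmetric contrastive loss, assuming you already have image and text embeddings from two encoders):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image/text pairs."""
    # Normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image to every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text; every other pair is a negative.
    targets = torch.arange(len(image_emb))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```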

Both approaches display generalisation; for example, CLIP works as a zero-shot image classifier, which is very convenient, and it can also guide diffusion models to generate images.
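Zero-shot classification with CLIP is just "which caption is most similar to this image". A short sketch using the Hugging Face wrappers (the checkpoint name is real; `photo.jpg` and the label set are placeholders):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```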

The BLIP model can even generate captions, which have been used to fix low-quality captions in training sets.
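Captioning with BLIP is similarly short via the Hugging Face wrappers (again, the image path is a placeholder):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg")  # e.g. an image whose original alt-text is junk
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```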
