
Tyanuh t1_j0vwmsi wrote

This is an interesting thought.

I would say it also lacks the ability to associate information about a concept across multiple "senses".

Once AI gains the ability to associate visual input with verbal input, for example, it will slowly build up a network of connections that is, in a sense, embodied, and actually connected to 'being' in an ontological sense.

9

visarga t1_j0whrvn wrote

DALL-E 1, Flamingo, and Gato already work like that. It is possible to concatenate the image tokens with the text tokens and have the model learn cross-modal inference over the combined sequence.
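A minimal sketch of that single-sequence idea, assuming toy vocabulary sizes and an off-the-shelf PyTorch transformer rather than any published configuration:

```python
import torch
import torch.nn as nn

# Toy sketch of the DALL-E-1-style setup: image patches are quantized into
# discrete codebook indices, then concatenated with text token ids so one
# autoregressive transformer models both modalities in a single sequence.
# All sizes and names here are illustrative.

TEXT_VOCAB, IMAGE_VOCAB, D_MODEL = 16384, 8192, 512

text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
image_emb = nn.Embedding(IMAGE_VOCAB, D_MODEL)
decoder = nn.TransformerEncoder(  # a causal mask makes this act as a decoder
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 64))     # caption tokens
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 256))  # quantized image tokens

# Concatenate the two modalities into one embedded sequence.
seq = torch.cat([text_emb(text_ids), image_emb(image_ids)], dim=1)
causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
hidden = decoder(seq, mask=causal)  # attention flows across both modalities
```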

Another way is to use a very large collection of text-image pairs and train a pair of models to match the right text to the right image (CLIP).
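A rough sketch of the contrastive objective behind that matching, with random tensors standing in for the outputs of the two encoders (real CLIP uses a ViT or ResNet for images and a transformer for text):

```python
import torch
import torch.nn.functional as F

# CLIP-style contrastive training: two encoders map a batch of matched
# (image, text) pairs into a shared embedding space, and a symmetric
# cross-entropy pushes each image toward its own caption and away from
# every other caption in the batch.

def clip_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Pairwise cosine similarities: entry (i, j) scores image i vs text j.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(len(logits))  # the diagonal holds the true pairs
    # Symmetric loss: pick the right text for each image, and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Random features stand in for encoder outputs on a batch of 8 pairs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```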

Both approaches display generalisation: CLIP, for example, works as a zero-shot image classifier, which is very convenient, and it can also guide diffusion models to generate images.
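Zero-shot classification with a pretrained CLIP checkpoint looks roughly like this through the Hugging Face transformers API (the image path and label set are placeholders):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score an image against a set of candidate captions with pretrained CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)

# logits_per_image scores the image against every candidate caption;
# softmax turns those scores into class probabilities, no fine-tuning needed.
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```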

The BLIP model can even generate captions, which has been used to fix low-quality captions in training sets.
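A captioning call with a pretrained BLIP checkpoint, again through the transformers API, might look like this (the image path is a placeholder); running it over a scraped dataset is one way to replace noisy alt-text captions:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Generate a caption for one image with a pretrained BLIP checkpoint.
processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

inputs = processor(images=Image.open("photo.jpg"), return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```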

4