Tyanuh t1_j0vwmsi wrote
Reply to comment by visarga in Prediction: De-facto Pure AGI is going to be arriving next year. Pessimistically in 3 years. by Ace_Snowlight
This is an interesting thought.
I would say what it also lacks is the ability to associate information about a concept across multiple "senses".
Once AI can associate visual input with verbal input, for example, it will slowly build up a network of connections that is, in a sense, embodied, and actually connected to 'being' in an ontological sense.
visarga t1_j0whrvn wrote
DALL-E 1, Flamingo and Gato are like that. It is possible to concatenate the image tokens with the text tokens and have the model learn cross-modal inference.
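A rough picture of the concatenation idea, as a minimal sketch: it assumes a single shared vocabulary in which discrete image codes (for example from an image tokenizer) are shifted past the end of the text vocabulary, so one sequence model can read both. The function names and vocabulary sizes are hypothetical and illustrative, not the exact layout of any of the models named above.

```python
def build_multimodal_sequence(text_ids, image_ids, text_vocab_size):
    """Concatenate text tokens and image tokens into one token stream.

    Assumption for illustration: image codes share one vocabulary with text
    and are offset by text_vocab_size so the two ranges never collide.
    """
    # Shift every image code past the end of the text vocabulary.
    shifted_image_ids = [text_vocab_size + i for i in image_ids]
    # Text first, image second: a causal model trained on this stream
    # learns to predict image tokens conditioned on the preceding text.
    return list(text_ids) + shifted_image_ids


def split_multimodal_sequence(sequence, num_text_tokens, text_vocab_size):
    """Inverse of build_multimodal_sequence: recover the two token lists."""
    text_ids = list(sequence[:num_text_tokens])
    image_ids = [t - text_vocab_size for t in sequence[num_text_tokens:]]
    return text_ids, image_ids


# Hypothetical sizes: 16384 text tokens, image codes in [0, 8192).
seq = build_multimodal_sequence([5, 17, 42], [0, 3, 8191], text_vocab_size=16384)
print(seq)  # [5, 17, 42, 16384, 16387, 24575]
```

Because both modalities live in one sequence, ordinary next-token training is all it takes for the model to start relating words to image content.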
Another way is to use a very large collection of text-image pairs and train a pair of encoders, one for images and one for text, to match each image with its own caption (CLIP).
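The matching objective can be sketched as a symmetric contrastive loss over a batch: each image should score highest against its own caption, and each caption against its own image. This is a minimal pure-Python sketch that assumes the two encoders already produced the embeddings; the function names and the default temperature of 0.07 are illustrative assumptions, not CLIP's actual code.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, text) pairs.

    Pair i is the positive for row i and column i; every other pair in the
    batch serves as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image -> text: each row should select its own caption.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy embeddings: correctly paired captions give a much lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Using the other pairs in the batch as negatives is what lets this scale to very large collections without any manual labels.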
Both approaches generalise. For example, CLIP works as a zero-shot image classifier, which is very convenient, and it can also guide diffusion models to generate images.
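Zero-shot classification with a CLIP-like model comes down to a nearest-neighbour lookup: embed one text prompt per class (e.g. "a photo of a dog"), embed the image, and pick the class whose prompt embedding is most similar. In this sketch the 3-d vectors are hypothetical stand-ins for real encoder outputs, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_emb, label_embs):
    """Return the label whose prompt embedding is closest to the image embedding.

    label_embs maps each class name to the text-encoder embedding of a prompt
    such as "a photo of a {label}". No classifier is trained for these labels.
    """
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))


# Toy prompt embeddings (hypothetical values, not real CLIP outputs).
labels = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(zero_shot_classify([0.8, 0.2, 0.1], labels))  # dog
```

Adding a new class only means writing a new prompt, which is why this is so convenient compared with retraining a classifier.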
The BLIP model can even generate captions, which are used to fix low-quality captions in the training set.
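The caption-cleaning loop can be sketched roughly as "generate, then filter": for each image, both the original web caption and a generated caption are candidates, and only those a matching model accepts are kept. This is a simplified illustration of the idea, not BLIP's implementation; `captioner`, `filter_score`, the threshold, and the toy data are all hypothetical stand-ins.

```python
def bootstrap_captions(dataset, captioner, filter_score, threshold=0.5):
    """Generate-then-filter caption cleanup (simplified sketch).

    dataset: list of (image, web_caption) pairs.
    captioner(image) -> synthetic caption (stand-in for a captioning model).
    filter_score(image, caption) -> match score in [0, 1] (stand-in for an
    image-text matching model). Both callables are assumptions.
    """
    cleaned = []
    for image, web_caption in dataset:
        # Candidates: the original noisy caption plus a generated one.
        for caption in (web_caption, captioner(image)):
            # Keep only candidates the filter judges to match the image.
            if filter_score(image, caption) >= threshold:
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins: "images" are strings, the filter checks for the object word.
data = [("img_dog", "best deals click here"), ("img_cat", "a cat on a sofa")]
generated = {"img_dog": "a dog running on grass", "img_cat": "a cat sleeping"}


def toy_captioner(image):
    return generated[image]


def toy_filter(image, caption):
    return 1.0 if image.split("_")[1] in caption else 0.0


print(bootstrap_captions(data, toy_captioner, toy_filter))
```

Here the spammy web caption for the dog image is dropped and replaced by the generated one, while the already-good cat caption survives alongside its synthetic counterpart.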