Ortus14 t1_j2mfz2n wrote

It's called Flamingo. It can't do all of that yet but it can solve problems that combine text and images.

https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model

If I remember correctly, OpenAI also has the goal of combining vision and LLM systems on their path to creating more and more general AIs.

2

airduster_9000 t1_j2mg9rc wrote

People assume that GPT-4 might be multi-modal and be able to handle more than just text. Since it's OpenAI, combining GPT, CLIP and DALL-E at some point seems a given.

1

Akimbo333 t1_j2mimns wrote

Wow that's cool! What is CLIP?

1

airduster_9000 t1_j2mrgtl wrote

CLIP is the eyes that let it see images - not just read text and symbols.


GPT = Generative Pre-trained Transformer. GPT-3 is an autoregressive language model that uses deep learning to produce human-like text. Given an initial text as a prompt, it produces text that continues the prompt.

ChatGPT = A version of GPT-3.5 fine-tuned for chat.

DALL-E = DALL-E (stylized as DALL·E) and DALL-E 2 are deep learning models developed to generate digital images from natural language descriptions, called "prompts".

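CLIP = CLIP works in roughly the opposite direction from DALL-E: instead of generating an image from text, it scores how well a given image matches candidate text descriptions, so it can pick the best-fitting caption or label for an image. It does not write a description from scratch. Read more here: https://openai.com/blog/clip/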
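To make the CLIP part a bit more concrete, here is a rough sketch of the matching idea: an image and several candidate captions are each turned into a vector, and the caption whose vector points in the most similar direction wins. The vectors below are made-up toy numbers, not real CLIP embeddings, and `rank_captions` and the `temperature` value are hypothetical names and settings chosen for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_captions(image_vec, caption_vecs, temperature=0.07):
    """Rank candidate captions by similarity to the image vector.

    The temperature is an illustrative assumption; smaller values make
    the best match stand out more sharply.
    """
    names = list(caption_vecs)
    scaled = [cosine(image_vec, caption_vecs[n]) / temperature for n in names]
    probs = softmax(scaled)
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values; a real model would produce these).
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a diagram": [0.2, 0.1, 0.9],
}

ranking = rank_captions(image_vec, caption_vecs)
best_caption = ranking[0][0]
print(best_caption)
```

In a real system the vectors would come from trained image and text encoders; the ranking step is the only part sketched here.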

2

Akimbo333 t1_j2miqzn wrote

Oh so it can only do images. Now that's disappointing! But still cool though!

1