Submitted by Empty-Revolution7570 t3_11sfj5s in MachineLearning

The newly released GPT-4 allows users to upload images, but we're still far from having a truly capable multimodal model. So we built this project as a feasibility study (and for fun!) to see how much we can do just by tuning prompts. In short, we try to "connect" different models (vision, audio, etc.) via carefully designed prompts.

Multimedia GPT connects your OpenAI GPT with vision and audio. You can now send images, videos (in development), and even audio recordings using your OpenAI API key. We base our project on Microsoft's Visual ChatGPT, which achieves some success just by tuning the prompts.
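To give a rough idea of what "connecting" models via prompts can look like, here is a minimal sketch. It is not the project's actual code: the tool names, the stub functions, and the `Action:` / `Action Input:` reply format are all assumptions made for illustration, and the stubs stand in for real vision and audio models.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


# Hypothetical stand-ins for real vision/audio models (assumption:
# each tool takes a file path and returns text the LLM can read).
def caption_image(path: str) -> str:
    return f"[caption of {path}]"


def transcribe_audio(path: str) -> str:
    return f"[transcript of {path}]"


TOOLS: Dict[str, Tool] = {
    t.name: t
    for t in (
        Tool("image_caption", "Describe the image at the given file path.", caption_image),
        Tool("audio_transcribe", "Transcribe the audio at the given file path.", transcribe_audio),
    )
}


def build_prompt(user_message: str) -> str:
    """Describe the available tools to a text-only LLM inside the prompt."""
    tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in TOOLS.values())
    return (
        "You cannot see images or hear audio directly, but you can call tools.\n"
        f"Available tools:\n{tool_lines}\n"
        "To call a tool, reply with exactly:\n"
        "Action: <tool name>\nAction Input: <file path>\n\n"
        f"User: {user_message}"
    )


# Assumed reply format; a real agent framework may use a different one.
ACTION_RE = re.compile(r"Action:\s*(\S+)\s*\nAction Input:\s*(.+)")


def dispatch(llm_reply: str) -> Optional[str]:
    """Run the tool the LLM asked for, or return None for a plain answer."""
    match = ACTION_RE.search(llm_reply)
    if match is None:
        return None
    name, arg = match.group(1), match.group(2).strip()
    tool = TOOLS.get(name)
    if tool is None:
        return f"Unknown tool: {name}"
    return tool.run(arg)
```

In a full loop, the text returned by `dispatch` would be appended to the conversation as an observation and sent back to the LLM, which then answers the user in plain language.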

Check out our project here! We also have a cool demo where Multimedia GPT successfully understands a person telling a story!

https://preview.redd.it/6x6pjamt30oa1.png?width=3024&format=png&auto=webp&v=enabled&s=30f6c9e5b9329642ebda40241f4ac2aca464c4d8

https://preview.redd.it/3dr5tamt30oa1.png?width=2950&format=png&auto=webp&v=enabled&s=9b3fc71822a7b1f9bc008ffb57b49b6b2c4bfb6d

Any suggestions are appreciated~

1

Comments

MysteryInc152 t1_jcdthob wrote

Are you using GPT-Vision? Or is there a separate assortment of visual foundation models?

2

ml_head t1_jcf3e64 wrote

So, the model recognized the Cinderella story in the audio. But how do we know that the summary was generated from the audio, and not from prior knowledge of the story? I know that these models can do this task. However, for the demo I would use an original story instead.

1

ml_head t1_jcjyon2 wrote

I'm sure that it does. And it would be a better demo of the technology. Maybe keep the Cinderella story too, since some people wouldn't read your original story and wouldn't be able to tell if the summary is good. You might want to add an image of your original story in a format that wouldn't be easy to OCR, like a weird font on a noisy background. That way you make the story available to humans while taking measures to hide it from any web crawler used by language models.

1