Submitted by Fabulous-Let-822 t3_115btl3 in MachineLearning

With the advent of Stable Diffusion, Midjourney, and DALL-E, and upcoming text-to-video models from Google and Meta, what will the major challenges in computer vision be? It feels like once text-to-video models get released, visual reasoning will be mostly solved, and the only thing left to do will be to improve model accuracy and efficiency from there. I am fairly new to computer vision and would love to learn about possible new areas of research. Thank you in advance!

19

Comments

Ulfgardleo t1_j912luv wrote

Computer vision is a much broader problem domain than text-to-image or text-to-video. AFAIK, 3D pose estimation under occlusion is still an unsolved problem.

34

master3243 t1_j91xkeo wrote

There's way more to computer vision than what you listed.

Long-form video understanding is still incredibly limited. Compared to the ability of current SOTA LLMs to understand very long text, and the various advances in text summarization, video understanding seems to have an incredibly long way to go.

Our current models can understand relatively simple actions (sitting/standing/dancing). Compared to text, however, we want to reach a level where we can understand entire scenes in a movie, or maybe even an entire movie, although that's more of a fantasy currently. Not to mention 3D input (instead of a 2D projection), which adds extra complexity.

22

Comfortable_Use_5033 t1_j91z9ee wrote

Semantic synthesis. I know it has made a lot of progress with text-to-image diffusion models, but what I notice is that not much work is invested in semantic generation, especially video generation. Or maybe I have just missed something.

2

slashdave t1_j92o93s wrote

Generative models for text-to-video don't have much to do with the reverse task, video-to-text (labeling).

3

buyIdris666 t1_j93m0ol wrote

Video will remain unsolved for a while.

LLMs came first because the bit rate is lowest. A sentence of text is only a few hundred bits of information.

Now, image generation is getting good. It's still not perfect. The models are larger because there's maybe 100x more information in a high-res image than in a paragraph of text.

Video is even harder: 30 high-res images a second. Making long, coherent, believable videos takes an enormous amount of data and processing power.
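To put rough numbers on the bit-rate argument above, here is a back-of-the-envelope sketch. Every size in it (sentence length, paragraph length, image resolution, bit depth) is an assumption picked for illustration, not a measurement. It counts raw, uncompressed bits, which overstates actual information content since both text and images compress well.

```python
# All sizes below are illustrative assumptions, not measurements.
SENTENCE_CHARS = 40          # assumed length of a short sentence
PARAGRAPH_CHARS = 500        # assumed length of a paragraph
BITS_PER_CHAR = 8            # plain ASCII, uncompressed

IMAGE_W, IMAGE_H = 1024, 1024  # assumed "high-res" image
BITS_PER_PIXEL = 24            # 8 bits per RGB channel, uncompressed
FPS = 30                       # frame rate mentioned in the comment

sentence_bits = SENTENCE_CHARS * BITS_PER_CHAR
paragraph_bits = PARAGRAPH_CHARS * BITS_PER_CHAR
image_bits = IMAGE_W * IMAGE_H * BITS_PER_PIXEL
video_bits_per_second = image_bits * FPS

print(f"sentence:        {sentence_bits:>12,} bits")
print(f"paragraph:       {paragraph_bits:>12,} bits")
print(f"image:           {image_bits:>12,} bits")
print(f"1 s of video:    {video_bits_per_second:>12,} bits")
print(f"image/paragraph: {image_bits / paragraph_bits:,.0f}x (raw)")
```

With these assumed sizes the sentence does land at a few hundred bits. The raw image-to-paragraph ratio depends entirely on the resolution and paragraph length you plug in.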

5

uwashingtongold t1_j94qte2 wrote

Grounded visual understanding for QA about spatial concepts.

1

currentscurrents t1_j96n0v8 wrote

Isn't that doing pretty well these days? CNNs can not only segment, but even semantically label every pixel in an image.

On a practical level, I have used Photoshop's new object select and love it. It does a better job at masking than I do.

2

currentscurrents t1_j96pkvw wrote

> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

That's actually not true. LLMs like GPT-3 have 175B parameters, while Stable Diffusion has about 890 million.

Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.

In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.
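For scale, the two parameter counts quoted in this comment can be put side by side. The snippet uses only the figures given above (175B and 890 million) and makes no claim about any other model.

```python
# Parameter counts as quoted in the comment above.
LLM_PARAMS = 175e9               # "175B parameters"
STABLE_DIFFUSION_PARAMS = 890e6  # "890 million"

ratio = LLM_PARAMS / STABLE_DIFFUSION_PARAMS
print(f"The LLM is roughly {ratio:.0f}x larger by parameter count.")
```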

7

Ulfgardleo t1_j976icn wrote

We can do image segmentation, but segmentation uncertainties are a bit iffy. We can do pixel-wise uncertainties, but that really is not what we want, because neighbouring pixels are not independent. E.g., if you have a detect-and-segment task with an uncertain detection, your segmentation masks should reflect that sometimes "nothing" is detected and thus there is nothing to segment. I think we have not progressed there beyond Ising model variations.
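A toy sketch of why pixel-wise uncertainty falls short in the detect-and-segment case described above. It assumes a hypothetical detector that fires with probability 0.5 and an object covering 16 pixels. Both samplers below have the same per-pixel marginal (0.5), but only the joint sampler produces masks that are either the whole object or nothing.

```python
import random

N_PIXELS = 16   # pixels covered by the hypothetical object (assumption)
P_DETECT = 0.5  # assumed probability that the detection is real

def sample_joint(rng):
    """Sample a whole mask at once: full object or nothing."""
    detected = rng.random() < P_DETECT
    return [1 if detected else 0] * N_PIXELS

def sample_independent(rng):
    """Sample each pixel on its own from the per-pixel marginal."""
    return [1 if rng.random() < P_DETECT else 0 for _ in range(N_PIXELS)]

def is_plausible(mask):
    """A mask is plausible only if it is all-object or all-empty."""
    return all(v == 1 for v in mask) or all(v == 0 for v in mask)

rng = random.Random(0)
n = 1000
joint_ok = sum(is_plausible(sample_joint(rng)) for _ in range(n)) / n
indep_ok = sum(is_plausible(sample_independent(rng)) for _ in range(n)) / n

print(f"plausible masks, joint sampling:       {joint_ok:.3f}")
print(f"plausible masks, independent sampling: {indep_ok:.3f}")
```

Independent sampling almost always yields a speckled mask that matches neither hypothesis, which is the sense in which per-pixel uncertainties miss the dependence between neighbouring pixels.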

9

currentscurrents t1_j99iq9v wrote

Video has even less information density, since frames are similar to each other! Video codecs can get crazy compression rates like 99% on slow-moving video.

But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
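The frame-similarity point can be illustrated with a toy experiment. This is only a sketch: real video codecs use motion compensation and transform coding, not byte-wise deltas with zlib. The synthetic video here (a fixed noisy background with a small block moving one pixel per frame) is an assumption chosen to mimic slow-moving content.

```python
import random
import zlib

W, H, N_FRAMES = 64, 64, 30  # assumed toy video dimensions

def make_frames(seed=0):
    """Fixed random background with a 4x4 bright block moving right."""
    rng = random.Random(seed)
    background = bytes(rng.randrange(256) for _ in range(W * H))
    frames = []
    for t in range(N_FRAMES):
        frame = bytearray(background)
        for y in range(10, 14):
            for x in range(t, t + 4):
                frame[y * W + x] = 255
        frames.append(bytes(frame))
    return frames

def intra_only_size(frames):
    """Compress every frame independently (no use of frame similarity)."""
    return sum(len(zlib.compress(f)) for f in frames)

def delta_coded_size(frames):
    """Compress the first frame, then only byte-wise frame differences."""
    total = len(zlib.compress(frames[0]))
    for prev, cur in zip(frames, frames[1:]):
        delta = bytes((c - p) % 256 for p, c in zip(prev, cur))
        total += len(zlib.compress(delta))
    return total

frames = make_frames()
intra = intra_only_size(frames)
inter = delta_coded_size(frames)
saving = 1 - inter / intra

print(f"intra-only: {intra} bytes, delta-coded: {inter} bytes")
print(f"saving from exploiting frame similarity: {saving:.1%}")
```

Note that the encoder still has to touch every pixel of every frame to compute the deltas, which mirrors the memory point above: low information density does not reduce the amount of data that must be processed.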

2