With the advent of Stable Diffusion, Midjourney, and DALL-E, and with upcoming text-to-video models from Google and Meta, what will the major challenges in computer vision be? It feels like once text-to-video models are released, visual reasoning will be mostly solved, and the only thing left will be to improve model accuracy and efficiency from there. I am fairly new to computer vision and would love to learn about possible new areas of research. Thank you in advance!

Comments

ml-research t1_j90suwh wrote on February 18, 2023 at 10:30 AM

#1,855,072

Finding open problems

Ulfgardleo t1_j912luv wrote on February 18, 2023 at 12:38 PM

#1,855,763

Computer vision is a much broader problem domain than text-to-image or text-to-video. AFAIK, 3D pose estimation under occlusion is still an unsolved problem.

master3243 t1_j91xkeo wrote on February 18, 2023 at 4:46 PM

#1,858,523

There's way more to computer vision than what you listed.

Long-form video understanding is still incredibly limited. Compared to the current SOTA ability of LLMs to understand very long text, and the various advancements in text summarization, video understanding seems to have a long way to go.

Our current models can understand relatively simple actions (sitting/standing/dancing). Compared to text, however, we want to reach a level where we can understand entire scenes in a movie, or maybe even an entire movie, although that's more of a fantasy currently. Not to mention 3D input (instead of a 2D projection of the scene), which adds extra complexity.

Comfortable_Use_5033 t1_j91z9ee wrote on February 18, 2023 at 4:57 PM

#1,858,682

Semantic synthesis. I know it has made a lot of progress with text-to-image diffusion models, but I notice that not much work is invested in semantic generation, especially video generation. Or maybe I have just missed something.

slashdave t1_j92o93s wrote on February 18, 2023 at 7:49 PM

#1,860,782

Generative models for text to video don't have much to do with the reverse, video to text (label).

stringerbell50 t1_j92tizo wrote on February 18, 2023 at 8:26 PM

#1,861,242

Image Segmentation.

buyIdris666 t1_j93m0ol wrote on February 19, 2023 at 12:00 AM

#1,863,689

Video will remain unsolved for a while.

LLMs came first because text has the lowest bit rate. A sentence of text is only a few hundred bits of information.

Now, image generation is getting good. It's still not perfect. The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

Video is even harder: 30 high-res images a second. Making long, coherent, believable videos takes an enormous amount of data and processing power.
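
As a rough back-of-envelope check on those numbers, here is a minimal Python sketch of raw, uncompressed sizes. Every constant in it (80 characters per sentence, 8 bits per character, a 1024x1024 RGB image, 30 frames per second) is an assumption chosen for illustration. Raw size is only an upper bound on information content, not a measure of it.

```python
# Back-of-envelope raw sizes. All constants are illustrative assumptions,
# not figures taken from the thread.
CHARS_PER_SENTENCE = 80      # assumed average sentence length
BITS_PER_CHAR = 8            # assumed plain 8-bit character encoding
IMAGE_WIDTH = 1024           # assumed "high res" image width in pixels
IMAGE_HEIGHT = 1024          # assumed "high res" image height in pixels
CHANNELS = 3                 # RGB
BITS_PER_CHANNEL = 8
FRAMES_PER_SECOND = 30       # the frame rate mentioned above


def sentence_bits() -> int:
    """Raw bits needed to store one sentence of text."""
    return CHARS_PER_SENTENCE * BITS_PER_CHAR


def image_bits() -> int:
    """Raw bits needed to store one uncompressed RGB image."""
    return IMAGE_WIDTH * IMAGE_HEIGHT * CHANNELS * BITS_PER_CHANNEL


def video_bits_per_second() -> int:
    """Raw bits needed to store one second of uncompressed video."""
    return image_bits() * FRAMES_PER_SECOND


if __name__ == "__main__":
    print("sentence:", sentence_bits(), "bits")
    print("image:", image_bits(), "bits")
    print("video:", video_bits_per_second(), "bits per second")
```

With these assumed numbers, a sentence is 640 bits, one image is about 25 million bits, and one second of video is about 755 million bits.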

uwashingtongold t1_j94qte2 wrote on February 19, 2023 at 5:47 AM

#1,867,013

Grounded vision understanding for QA relating to spatial concepts.

currentscurrents t1_j96n0v8 wrote on February 19, 2023 at 5:37 PM

#1,872,371

Replying to stringerbell50 (#1,861,242)

Isn't that doing pretty good these days? CNNs can not only segment, but even semantically label every pixel in an image.

On a practical level, I have used Photoshop's new object select and love it. It does a better job at masking than I do.

currentscurrents t1_j96pkvw wrote on February 19, 2023 at 5:54 PM

#1,872,583

Replying to buyIdris666 (#1,863,689)

> The models are larger because there's maybe 100x the information in a high res image than a paragraph of text.

That's actually not true. Today's largest LLMs have around 175B parameters (GPT-3, for example), while Stable Diffusion has about 890 million.

Images contain a lot of pixels, but most of those pixels are easy to predict and don't contain much high-level information. A paragraph of text can contain many complex abstract ideas, while an image usually only contains a few objects with simple relationships between them.
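
To make "most pixels are easy to predict" a bit more concrete, here is a small Python sketch that uses general-purpose compression as a crude proxy for predictability. It is only an illustration under assumptions: the two synthetic 256x256 grayscale images (a smooth ramp and pure noise) are hypothetical stand-ins, and zlib is not how image models or image codecs measure information.

```python
import os
import zlib

WIDTH, HEIGHT = 256, 256  # assumed size of the synthetic grayscale images


def gradient_image(width: int = WIDTH, height: int = HEIGHT) -> bytes:
    """Smooth left-to-right ramp: every pixel is predictable from its neighbours."""
    return bytes((x * 255) // (width - 1) for _ in range(height) for x in range(width))


def noise_image(width: int = WIDTH, height: int = HEIGHT) -> bytes:
    """Uniform random bytes: no pixel can be predicted from any other."""
    return os.urandom(width * height)


def compression_ratio(raw: bytes) -> float:
    """Compressed size divided by raw size (lower means more redundancy)."""
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    print("smooth ramp:", round(compression_ratio(gradient_image()), 4))
    print("pure noise: ", round(compression_ratio(noise_image()), 4))
```

The ramp shrinks to a small fraction of its raw size, while the noise barely compresses at all, even though both images have the same number of pixels. Natural images sit somewhere in between.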

In many image generators (like Imagen), the language model they use to understand the prompt is several times bigger than the diffuser they use to generate the image.

Ulfgardleo t1_j976icn wrote on February 19, 2023 at 7:52 PM

#1,874,004

Replying to currentscurrents (#1,872,371)

We can do image segmentation, but segmentation uncertainties are a bit iffy. We can do pixel-wise uncertainties, but that really is not what we want, because neighbouring pixels are not independent. E.g., if you have a detect-and-segment task with an uncertain detection, your segmentation masks should reflect that sometimes "nothing" is detected and thus there is nothing to segment. I think we have not progressed there beyond Ising model variations.
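
A toy Python sketch of why pixel-wise uncertainties fall short in the detect-and-segment case. Everything in it is hypothetical: a 16-pixel object, a 50% detection probability, and masks that are either the whole object or empty. It is not any particular model's uncertainty method, only an illustration of the independence problem.

```python
import random

OBJECT_PIXELS = 16  # hypothetical number of pixels covered by the object
P_DETECT = 0.5      # hypothetical probability that the detector fires
N_SAMPLES = 2000


def sample_joint(rng: random.Random) -> list:
    """Detect-then-segment: the whole mask appears or disappears together."""
    detected = rng.random() < P_DETECT
    return [1 if detected else 0] * OBJECT_PIXELS


def pixel_marginals(samples: list) -> list:
    """Per-pixel probability of being foreground, estimated from samples."""
    return [sum(s[i] for s in samples) / len(samples) for i in range(OBJECT_PIXELS)]


def sample_independent(rng: random.Random, marginals: list) -> list:
    """Treat each pixel's marginal as an independent coin flip."""
    return [1 if rng.random() < p else 0 for p in marginals]


def fraction_all_or_nothing(samples: list) -> float:
    """Share of sampled masks that are entirely empty or entirely full."""
    hits = sum(1 for s in samples if sum(s) in (0, OBJECT_PIXELS))
    return hits / len(samples)


rng = random.Random(0)
joint_samples = [sample_joint(rng) for _ in range(N_SAMPLES)]
marginals = pixel_marginals(joint_samples)
independent_samples = [sample_independent(rng, marginals) for _ in range(N_SAMPLES)]

if __name__ == "__main__":
    print("marginal of first pixel:", marginals[0])
    print("joint, all-or-nothing:", fraction_all_or_nothing(joint_samples))
    print("independent, all-or-nothing:", fraction_all_or_nothing(independent_samples))
```

Both sets of samples share the same per-pixel marginals (about 0.5 everywhere), but only the joint samples are always "full object" or "nothing". The independent samples are almost always speckled masks that the detector would never produce, which is the information a pixel-wise uncertainty map throws away.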

buyIdris666 t1_j97eom6 wrote on February 19, 2023 at 8:48 PM

#1,874,696

Replying to currentscurrents (#1,872,583)

Interesting! I didn't realize that

Ol_OLUs22 t1_j97s5vv wrote on February 19, 2023 at 10:23 PM

#1,875,668

adversarial examples

currentscurrents t1_j99iq9v wrote on February 20, 2023 at 7:39 AM

#1,881,232

Replying to buyIdris666 (#1,874,696)

Video has even less information density, since frames are similar to each other! Video codecs can get crazy compression rates like 99% on slow-moving video.

But you still have to process a lot of pixels, so text-to-video generators are held back by memory requirements.
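
A tiny Python sketch of both points, under made-up assumptions: a 64x64 clip of 10 frames in which an 8x8 white square moves one pixel to the right per frame. Counting changed pixels is only a crude stand-in for what real codecs do (motion compensation, transforms, entropy coding), so treat the numbers as illustrative.

```python
WIDTH, HEIGHT = 64, 64  # assumed frame size
SQUARE = 8              # assumed side length of the moving square
N_FRAMES = 10           # assumed clip length


def make_frame(offset: int) -> list:
    """Black frame with a white SQUARE x SQUARE block whose left edge is at `offset`."""
    frame = [[0] * WIDTH for _ in range(HEIGHT)]
    for y in range(SQUARE):
        for x in range(offset, offset + SQUARE):
            frame[y][x] = 255
    return frame


def changed_fraction(a: list, b: list) -> float:
    """Fraction of pixels that differ between two frames."""
    changed = sum(1 for row_a, row_b in zip(a, b)
                  for pa, pb in zip(row_a, row_b) if pa != pb)
    return changed / (WIDTH * HEIGHT)


frames = [make_frame(t) for t in range(N_FRAMES)]
fractions = [changed_fraction(frames[t], frames[t + 1]) for t in range(N_FRAMES - 1)]
total_pixels = N_FRAMES * WIDTH * HEIGHT

if __name__ == "__main__":
    print("changed fraction per step:", fractions[0])
    print("pixels a model still has to process:", total_pixels)
```

In this toy clip only 16 of 4,096 pixels change between consecutive frames (under 0.4%), yet a generator that works on raw frames still has to hold and process all 40,960 pixels.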