master3243 t1_j91xkeo wrote on February 18, 2023 at 4:46 PM

There's way more to computer vision than what you listed.

Long form video understanding is still incredibly limited. Compared to the current SOTA capabilities of LLM to understand very long text and the various advancements in text summarization, video understanding seems to have an incredibly long ways to go.

Our current models can understand relatively very simple actions (sitting/standing/dancing) however compared to text, we want to reach a level where we can understand entire scenes in a movie or maybe even an entire movie, although that's more of a fantasy currently. Not to mention all the 3D input (instead of a projection 2D image) which adds extra complexity.