Submitted by Fabulous-Let-822 t3_115btl3 in MachineLearning
master3243 t1_j91xkeo wrote
There's way more to computer vision than what you listed.
Long-form video understanding is still incredibly limited. Compared to the current SOTA capabilities of LLMs to understand very long text and the various advancements in text summarization, video understanding seems to have an incredibly long way to go.
Our current models can understand relatively simple actions (sitting/standing/dancing). Compared to text, however, we want to reach a level where we can understand entire scenes in a movie or maybe even an entire movie, although that's more of a fantasy currently. Not to mention all the 3D input (instead of a projected 2D image), which adds extra complexity.