Viewing a single comment thread. View all comments

ThirdMover t1_jdhvx8i wrote

>GPT-4 is potentially missing a vital feature to take this one step further: Visual Grounding - the ability to say where inside an image a specific element is, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?

You could just ask it to move a cursor around until it's on the specified element. I'd be shocked if GPT-4 couldn't do that.


MjrK t1_jdiflsw wrote

I'm confident that someone can fine-tune an end-to-end vision-tranformer that can extract user interface elements from photos and enumerate interaction options.

Seems like such an obviously-useful tool and Vit-22B should be able to handle it, or many other Computer Vision tools on Hugging Face... I would've assumed some grad student somewhere is already hacking away at that.

But then also, compute costs are a b**** but generating training data set should be somewhat easy.

Free research paper idea, I guess.


modcowboy t1_jdkz6of wrote

Probably would be easier for the LLM to interact with the website directly through the inspect tool vs machine vision training.


MjrK t1_jdm4ola wrote

For many (perhaps these days, most) use cases, absolutely! The advantage of vision in some others might be interacting more directly with the browser itself, as well as other applications, and multi-tasking... perhaps similar to the way we use PCs and mobile devices to accomplish more complex tasks


plocco-tocco t1_jdj9is4 wrote

It woulde be quite expensive to do tho. You have to do inference very fast with multiple images of your screen, don't know if it is even feasible.


ThirdMover t1_jdjf69i wrote

I am not sure. Exactly how does inference scale with the complexity of the input? The output would be very short, just enough tokens for the "move cursor to" command.


plocco-tocco t1_jdjx7qz wrote

The complexity of the input wouldn't change in this case since it's just a screen grab of the display. Just that you'd need to do inference at a certain frame rate to be able to detect the cursor, which isn't that cheap with GPT-4. Now, I'm not sure what the latency or cost would be, I'd need to get access to the API to answer it.


MassiveIndependence8 t1_jdl9oq9 wrote

You’re actually suggesting putting every single frame into gpt-4? It’ll cost you a fortune after 5 seconds of running it. Plus the latency is super high, it might takes you an hour to process a “5 seconds” worth of images.


ThirdMover t1_jdlabwm wrote

What do you mean by "frame"? How many images do you think GPT-4 would need to get a cursor where it needs to go? I'd estimate four or five should be plenty.