MjrK t1_jdiflsw wrote on March 24, 2023 at 4:50 PM

Reply to comment by ThirdMover in [D] I just realised: GPT-4 with image input can interpret any computer screen, any userinterface and any combination of them. by Balance-

I'm confident that someone can fine-tune an end-to-end vision transformer that can extract user interface elements from photos and enumerate interaction options.

Seems like such an obviously useful tool, and ViT-22B should be able to handle it, as should many other computer vision tools on Hugging Face... I would've assumed some grad student somewhere is already hacking away at that.

But then again, compute costs are a b****, though generating a training data set should be somewhat easy.

Free research paper idea, I guess.

modcowboy t1_jdkz6of wrote on March 25, 2023 at 3:49 AM

Probably would be easier for the LLM to interact with the website directly through the inspect tool vs machine vision training.
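To illustrate the DOM route, here is a minimal sketch that assumes the page's HTML is already available as a string. It uses only Python's standard-library `html.parser`. The names `INTERACTIVE_TAGS`, `InteractiveElementLister`, and `list_interactions` are hypothetical, and the tag list is a deliberately incomplete assumption about what counts as interactive; it is not how any particular LLM tool works.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags treated as "interactive".
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementLister(HTMLParser):
    """Collects (tag, attributes) pairs for interactive elements."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag in INTERACTIVE_TAGS:
            self.elements.append((tag, dict(attrs)))


def list_interactions(html):
    """Return the interactive elements found in an HTML string, in document order."""
    parser = InteractiveElementLister()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = (
        '<form><input type="text" name="q">'
        '<button id="go">Search</button></form>'
        '<a href="/about">About</a>'
    )
    for tag, attrs in list_interactions(sample):
        print(tag, attrs)
```

A list like this could be serialized into a prompt so the model picks an element to act on. It would miss anything rendered in a canvas or made clickable only by scripts, which is one case where the vision approach might still be useful.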

MjrK t1_jdm4ola wrote on March 25, 2023 at 12:37 PM

For many (perhaps these days, most) use cases, absolutely! In other cases, the advantage of vision might be interacting more directly with the browser itself, as well as with other applications, and multi-tasking... perhaps similar to the way we use PCs and mobile devices to accomplish more complex tasks.

[deleted] t1_jdjk1iy wrote on March 24, 2023 at 9:12 PM

[removed]