BinarySplit t1_jdh9zu6 wrote
Reply to [D] I just realised: GPT-4 with image input can interpret any computer screen, any user interface and any combination of them. by Balance-
GPT-4 is potentially missing a vital feature needed to take this one step further: Visual Grounding - the ability to say where a specific element is inside an image, e.g. if the model wants to click a button, what X,Y position on the screen does that translate to?

Other MLLMs do have it though, e.g. One-For-All. I guess it's only a matter of time before we can get MLLMs to provide a layer of automation over desktop applications...
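To make that concrete, here's a rough Python sketch of the loop a grounding-capable MLLM would unlock. Note that `ground_element` is a made-up placeholder for the model call (nothing like it exists in GPT-4's API today), while `pyautogui` is a real screen-automation library:

```python
import pyautogui  # real cross-platform mouse/keyboard automation library


def ground_element(screenshot, query):
    """Hypothetical placeholder: query a grounding-capable MLLM for a
    normalized (x0, y0, x1, y1) bounding box of the UI element that
    matches `query` in the given screenshot."""
    raise NotImplementedError("plug in a grounding model here")


def click_element(query: str) -> None:
    screen_w, screen_h = pyautogui.size()   # logical screen size
    screenshot = pyautogui.screenshot()     # capture the screen as a PIL image
    # Ask the (hypothetical) grounding model where the element is.
    x0, y0, x1, y1 = ground_element(screenshot, query)
    # Click the center of the predicted box, scaled to screen coordinates.
    pyautogui.click(int((x0 + x1) / 2 * screen_w),
                    int((y0 + y1) / 2 * screen_h))
```

Using normalized coordinates means the click lands in the right place even when the screenshot's pixel resolution differs from the logical screen size (e.g. on HiDPI displays).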