Relevant_Ad7319 t1_jee3ah6 wrote on March 31, 2023 at 10:40 AM

Reply to comment by basilgello in Language Models can Solve Computer Tasks (by recursively criticizing and improving its output) by rationalkat

Task demonstrating in form of screen recordings? It says their approach only needs a few examples but Chatgpt doesn’t even work with videos as input right?

basilgello t1_jeecmqt wrote on March 31, 2023 at 12:18 PM

Correct, GPT4 is not meant to accept videos as input. And probably not screencasts but explained step-by-step prompts. For example, look at page 18 table 6: it is LangChain-like prompt. First, they define actions and tools and then language model puts the output which is actually high-level API call in some form. Using RPA as API, you get mouse clicker based on HTML context. Another thing HTML pages are crafted manually, and system still does not understand the unseen pages.

SgathTriallair t1_jeertpn wrote on March 31, 2023 at 2:16 PM

Given that it can accept images, they may be able to shoehorn videos in. The next version we use as a base will need multi modality equal to humans (i.e. all of our senses) in order to relocate all of what we do.