Comments


Laser_Plasma t1_j4pws5y wrote

The whole "benchmark" is just a Readme? What is this nonsense

2

mrconter1 OP t1_j4q4o7t wrote

I will upload the data and accompanying website soon. What do you think about the idea?

−3

Laser_Plasma t1_j4q544u wrote

I think ideas are cheap (“benchmark of AGI-like capabilities”), and this particular execution of the idea (closing a window in a browser?) isn’t really good in any way

2

mrconter1 OP t1_j4qctlb wrote

The thing is that there are a lot of other screenshots + instructions as well. What wouldn't a system that can get 100% on this benchmark be able to do?

−3

Dendriform1491 t1_j4qs2sf wrote

Your unconquerable benchmark is below the level of achievement attained by research from 1970

https://en.m.wikipedia.org/wiki/SHRDLU

2

WikiSummarizerBot t1_j4qs4st wrote

SHRDLU

>SHRDLU was an early natural-language understanding computer program, developed by Terry Winograd at MIT in 1968–1970. In the program, the user carries on a conversation with the computer, moving objects, naming collections and querying the state of a simplified "blocks world", essentially a virtual box filled with different blocks. SHRDLU was written in the Micro Planner and Lisp programming language on the DEC PDP-6 computer and a DEC graphics terminal. Later additions were made at the computer graphics labs at the University of Utah, adding a full 3D rendering of SHRDLU's "world".


1

navillusr t1_j4qumlu wrote

  1. If you list instructions step by step, the model doesn’t require reasoning to solve the problem. This is testing a very basic form of intelligence.
  2. Adept.ai can already solve more complex challenges than this (but still nowhere near AGI). They use a chatbot to automate simple tasks in common programs using LLMs.
  3. There's a benchmark that already tests tasks like this: MiniWoB++

1

mrconter1 OP t1_j4rdm59 wrote

Adept AI is restricted to the web and also does not use raw pixels as input...

1

navillusr t1_j4rhitt wrote

The distinctions you're drawing, pixels vs selenium output and browser vs OS, are far less significant than the complexity of the tasks (step-by-step vs entire processes). What they've achieved is strictly harder for humans than what you are testing. We can argue about whether perception or planning is harder for current technology (computer vision is far more developed than AI planning right now), but I think you need to reconsider the formulation of your tasks. It seems like they are designed to be easy enough for modern methods to solve.

On another note, most interesting tasks can’t be completed with just an x,y mouse location output. Why did you decide to restrict the benchmark to such a limited set of tasks?

1

mrconter1 OP t1_j4rsus2 wrote

Really appreciate your feedback.

> The distinctions you’re drawing, pixels vs selenium output and browser vs os, are far less significant than the complexity of the tasks (step-by-step vs entire processes). What they’ve achieved is strictly harder for humans than what you are testing. We can argue whether perception or planning are harder for current technology (the computer vision is far more developed than AI planning right now), but I think you need to reconsider the formulation of your tasks. It seems like they are designed to be easy enough for modern methods to solve.

I'm not sure about this. Being able to make the correct next click on a large, diversified benchmark of screenshots is extremely difficult for a computer today. It would need to be able to:

  • Choose the next chess move if I am in a chess application
  • Recognize the color palette icon on the keyboard if I ask it to change the color of the keyboard
  • Recognize the Gmail icon if I say "send an email"
  • Change keyboard mode if I ask it to write an exclamation mark
  • Press the "2" key if I ask it to type the number of consuls that traditionally held office at the same time in ancient Rome.

That's way outside what current models can do, but humans could do it easily. This benchmark would be extremely simple and intuitive for humans to complete (even with far-fetched goals), yet there is no model today capable of even knowing that you should press at the new-line location given a screenshot and the instruction "Add line". A rough sketch of how a single item could be scored is below.
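
Concretely, scoring one item could look something like this sketch. The item format, the field names, and the predict_click function are just my own placeholders, not a finished spec:

```python
# A minimal sketch of how one item could be scored, assuming a hypothetical
# item format (screenshot + instruction + target click region). All names
# here (BenchmarkItem, predict_click) are placeholders, not a finished spec.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class BenchmarkItem:
    screenshot_path: str                   # raw pixels, e.g. a PNG of the full screen
    instruction: str                       # e.g. "Send an email"
    target_box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) of the correct click area

def is_correct(item: BenchmarkItem, click: Tuple[int, int]) -> bool:
    """An item counts as solved if the predicted click lands inside the target box."""
    x, y = click
    x_min, y_min, x_max, y_max = item.target_box
    return x_min <= x <= x_max and y_min <= y <= y_max

def score(items: List[BenchmarkItem],
          predict_click: Callable[[str, str], Tuple[int, int]]) -> float:
    """Fraction of items where the model's single predicted click hits the target."""
    hits = sum(is_correct(item, predict_click(item.screenshot_path, item.instruction))
               for item in items)
    return hits / len(items)
```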

> On another note, most interesting tasks can’t be completed with just an x,y mouse location output. Why did you decide to restrict the benchmark to such a limited set of tasks?

I wrote about this in the README. There is no particular reason; it's just easier to explain the idea to people this way. I think the most powerful variant of this idea would take a series of frames (video context) plus instructions and output one of the following actions:

  • Click
  • Press (X seconds)
  • Move from P1 to P2 (X seconds)

The benchmark is simple enough to understand and explain that you can start to envision what such a model would be able to do, or, much more interestingly, what it would not be able to do. A rough sketch of that richer action space is below.
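
Something like the following, where the three action types mirror the list above and every field name is just an illustration:

```python
# A rough sketch of a (video context + instruction) -> action interface.
# The three action types mirror the list above; field names are illustrative only.
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int
    y: int

@dataclass
class Press:
    x: int
    y: int
    seconds: float  # how long to hold the press

@dataclass
class Move:
    x1: int         # P1
    y1: int
    x2: int         # P2
    y2: int
    seconds: float  # how long the movement should take

Action = Union[Click, Press, Move]

# A model would then map (frames, instruction) -> Action, e.g. (hypothetical):
# action = model.act(frames=last_n_screenshots, instruction="Add line")
```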

If you have any more feedback or thoughts, please reply. I wish more people were interested, but either the idea sucked or I need to create something interactive for people.

1

blose1 t1_j4td3lq wrote

>Recognize the Gmail icon if I say "send an email"

This is not testing intelligence; this is testing whether a human was trained on computer usage, knows what e-mail is, and has used Gmail before.

Someone from a tribe in Africa would fail your test even though he is human and intelligent; train him on this task like you would train a current-gen multimodal system and he would pass your benchmark. You train an LLM in combination with an image model and an RL model, train it on instruction following using the inputs you described, and now it understands what it sees and can follow what you want it to do. A very rough sketch of that kind of pipeline is below.
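
A high-level sketch of what I mean, with placeholder encoders and a simple supervised click head (the RL part is left out for brevity):

```python
# A high-level sketch of the pipeline described above: an image encoder plus a
# language model fused into a click head. Everything here is a placeholder
# (no specific pretrained models, and the RL fine-tuning stage is omitted).
import torch
import torch.nn as nn

class ScreenClicker(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, hidden: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a pretrained ViT, output shape (batch, hidden)
        self.text_encoder = text_encoder      # e.g. a pretrained LLM encoder, output shape (batch, hidden)
        self.head = nn.Sequential(            # fuse both embeddings into a click position
            nn.Linear(hidden * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
            nn.Sigmoid(),                     # normalized (x, y) in [0, 1]
        )

    def forward(self, screenshot: torch.Tensor, instruction_tokens: torch.Tensor) -> torch.Tensor:
        img = self.vision_encoder(screenshot)            # (batch, hidden)
        txt = self.text_encoder(instruction_tokens)      # (batch, hidden)
        return self.head(torch.cat([img, txt], dim=-1))  # predicted click, rescaled to pixels downstream
```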

1

mrconter1 OP t1_j4tuaal wrote

> This is not testing intelligence; this is testing whether a human was trained on computer usage, knows what e-mail is, and has used Gmail before.

I don't think it's binary. I think intelligence is a large part here.

> Someone from a tribe in Africa would fail your test even though he is human and intelligent;

Could you train a bird to pass all questions on this benchmark? No. Because it's not as intelligent as a human.

> train him on this task like you would train a current-gen multimodal system and he would pass your benchmark. You train an LLM in combination with an image model and an RL model, train it on instruction following using the inputs you described, and now it understands what it sees and can follow what you want it to do.

So solving this benchmark is an easy problem? How long do you think it will take until we have a model that can casually solve all the instructions I gave in the previous comment?

1

mrconter1 OP t1_j4r8miw wrote

  1. An LLM test does not require reasoning because it generates one word at a time?
  2. It can't.
  3. This might be interesting though.

0

mrconter1 OP t1_j4rdf3t wrote

MiniWoB++ is restricted to website-related things, not the OS, and it does not take raw pixels as input.

0

navillusr t1_j4rexbm wrote

This is wrong: WoB/MiniWoB++ has a 160x210px observation. Also, some OSes (Chrome OS) are almost entirely web-based, so this distinction is minimal.

1

mrconter1 OP t1_j4rintd wrote

Yeah, you're right. My approach seems to be a bit more general and should be less work.

1