Submitted by Ok-Variety-8135 t3_11c9zum in singularity

Imagine a language model that communicates in a special language whose tokens are made up of two parts of information:

  1. Sensation (like seeing, hearing, touching, etc.)
  2. Actuator state (like controlling the body's movements or speaking).

When the model is speaking, the predicted token guides a robot’s behavior: the sensation part becomes the robot’s imagination/thought, and the actuator state decides the robot’s movement. If the actuator state includes the microphone state, then the robot is actually speaking.

When the model is hearing, the next token comes from the robot’s body: its environment sensors and actuator sensors.

And the model decides by itself when to speak and when to hear.

All tokens, whether spoken or heard, form a “conversation history”. The history is evaluated by a reward model that defines the purpose of the robot.

The model will update its weights continuously using reinforcement learning and the evaluated conversation history.

Old conversation history can be deleted after being encoded into the model weights.
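
Put as a rough sketch (all class and method names here are hypothetical placeholders for the pieces described above, not an existing implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SensorimotorToken:
    """One 'word' of the brain-body language."""
    sensation: list       # encoded vision / audio / touch features
    actuator_state: list  # joint targets, microphone/speech state, ...

def run_step(model, body, reward_model, history: List[SensorimotorToken]):
    """One timestep of the proposed loop. `model`, `body` and `reward_model`
    are hypothetical objects standing in for the language model, the robot,
    and the reward model that defines the robot's purpose."""
    if model.wants_to_speak(history):
        token = model.predict_next(history)   # "speaking": the model emits the token
        body.apply(token.actuator_state)      # the actuator part drives the robot
        # token.sensation plays the role of the robot's imagination/thought
    else:
        token = body.read_sensors()           # "hearing": the token comes from the body
    history.append(token)

    reward = reward_model.score(history)      # evaluate the conversation history
    model.reinforce(history, reward)          # RL update of the weights

    # Old history can be dropped once it has been encoded into the weights.
    if len(history) > 4096:
        del history[:-4096]
    return history
```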


In short, a “ChatGPT” that models the language between brain and body.

12

Comments


HeinrichTheWolf_17 t1_ja2nmcd wrote

I don't think it has to be AGI for it to kick off the Singularity.

7

HeinrichTheWolf_17 t1_ja3qwsi wrote

I wouldn’t say it’s AGI, but I would also say you don’t need AGI to kick off the Singularity. 😉

6

turnip_burrito t1_ja2f02y wrote

Sure, you can do it if you have enough data, and a powerful enough computer.

Idk how you're going to do reinforcement learning to update the transformer weights though (I assume you want to use a transformer?). That's a lot of computation. The bigger your model is, the slower this update step will be.
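
To make the cost concern concrete, here's a toy REINFORCE-style online update on a small transformer (everything here is made up for illustration, and the sizes are tiny): every single timestep needs a full forward and backward pass through the whole model, which scales badly as the model grows.

```python
import torch
import torch.nn as nn

# Toy policy over discrete sensorimotor tokens; sizes are arbitrary.
vocab, d_model = 1024, 256
embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)
head = nn.Linear(d_model, vocab)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-5)

def online_update(history: torch.Tensor, action: int, reward: float):
    """One policy-gradient update after a single timestep."""
    h = encoder(embed(history.unsqueeze(0)))   # (1, T, d_model)
    logits = head(h[:, -1])                    # next-token logits
    logp = torch.log_softmax(logits, dim=-1)[0, action]
    loss = -reward * logp                      # REINFORCE-style loss
    opt.zero_grad()
    loss.backward()                            # full backward pass, every timestep
    opt.step()

online_update(torch.randint(0, vocab, (32,)), action=7, reward=1.0)
```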

Are you separating hearing and speaking/moving in time? Like are they separate steps that can't happen at the same time? My question then is why not make them simultaneous?

5

visarga t1_ja3637d wrote

A recent approach saves past experience data and loads it back for in-context learning. The model itself can be task-generic, so it learns by collecting new data.
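
Something like this (purely illustrative; `embed_fn` and the store are hypothetical stand-ins for whatever embedding model and memory you'd actually use):

```python
from typing import List, Tuple
import numpy as np

# Hypothetical experience store: (embedding, text of a past episode).
experience_store: List[Tuple[np.ndarray, str]] = []

def remember(embed_fn, episode_text: str) -> None:
    """Save a past episode so it can be recalled later."""
    experience_store.append((embed_fn(episode_text), episode_text))

def build_prompt(embed_fn, current_situation: str, k: int = 3) -> str:
    """Recall the k most similar past episodes and put them in-context.
    The frozen, task-generic model 'learns' only through this prompt."""
    q = embed_fn(current_situation)

    def similarity(entry):
        e = entry[0]  # cosine similarity between query and stored embedding
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9))

    recalled = sorted(experience_store, key=similarity, reverse=True)[:k]
    memories = "\n".join(text for _, text in recalled)
    return (f"Relevant past experience:\n{memories}\n\n"
            f"Current situation:\n{current_situation}")
```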

6

Kolinnor t1_ja2hvkm wrote

My (non-expert) take :

The problem is that there are many black boxes with that.

LLMs work well when we have a huge amount of data to train the model on. In an oversimplified way, LLMs predict the next word based on the previous data they've seen. But how do you "predict the next action you'll take"? If we had a massive amount of "sensation --> action" data (probably just like what the human brain accumulates during life?), then that would be possible. I haven't heard of a way to achieve that today, and I think it's more complicated than that anyway.

I think what you're suggesting is kinda like what they try to do with Google's SayCan: but as you can see, for the moment there's no easy way to link LLMs with physical action. LLMs manage to create plausible scenarios of what's happening, or what could be some consequences of action X, but practically it's not usable yet.

There's also the fact that, as someone pointed out earlier, there are issues with continuous learning, such as catastrophic forgetting. Many brilliant minds are actively trying to overcome those issues, but it's no easy feat.

3

WikiSummarizerBot t1_ja2hwrn wrote

Catastrophic interference

>Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to abruptly and drastically forget previously learned information upon learning new information. Neural networks are an important part of the network approach and connectionist approach to cognitive science. With these networks, human capabilities such as memory and learning can be modeled using computer simulations. Catastrophic interference is an important issue to consider when creating connectionist models of memory.


2

visarga t1_ja36ih0 wrote

We can have a model trained on a large video dataset and then fine-tuned for various tasks, like GPT-3.

Using YouTube as training data, we'd get video + audio, which decompose into image, movement, body pose, intonation, text transcript, and metadata, all in parallel. This dataset could dwarf the text datasets we have now, and it would have lots of information that doesn't get captured in text, such as the physical movements for achieving a specific task.
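
Roughly, each clip could be decomposed into a parallel record like this (the fields are illustrative, not any particular dataset's schema):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoClipSample:
    """One training example decomposed from a video clip."""
    frames: np.ndarray        # (T, H, W, 3) raw images
    optical_flow: np.ndarray  # (T, H, W, 2) movement between frames
    body_pose: np.ndarray     # (T, J, 3) estimated keypoints of people in frame
    audio: np.ndarray         # (S,) waveform, carrying intonation
    transcript: str           # ASR text, roughly aligned to the frames
    metadata: dict            # title, description, tags, ...
```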

I think the OP was almost right. The multi-modal AI will be a good base for the next step, but it needs instruction tuning and RLHF. Just pre-training is not enough.

One immediate application I see is automating desktop activities. After watching many hours of screencasts from YouTube, the model would learn how to use apps and solve tasks at first sight, like GPT-3.5, but not limited to just text.

2

Professional-Song216 t1_ja59fx0 wrote

If it can learn new skills almost on the fly, it's proto-AGI, but I think that may take a bit of time.

2

Nervous-Newt848 t1_ja5miqj wrote

First off... very interesting... but just so you know, that wouldn't be a language model anymore.

They don't really have a term for that other than multimodal... multimodal world model???

Models can't speak or hear when they want to. It's just not part of their programming.

They respond to input.

So if they are receiving continuous input... theoretically they should be continuously outputting...

The whole conversation history could be saved into a database.

Reward models are currently given texts with scores made by humans; it's called RLHF, or Reinforcement Learning from Human Feedback... AI doesn't do the scoring... That's for language models, though...

How could they know what is good and what is bad???

Now, for world models, reinforcement learning works differently... which is probably what you're referring to... I won't go into it because it's pretty complex...

Updating its weights continuously is currently impossible due to an energy-inefficiency problem with the von Neumann hardware architecture... basically traditional CPUs and GPUs... More basically, it requires too many computations and too much electricity to continuously "backpropagate" (data science word) over the input data...

Conversations shouldn't be encoded into a language model either, imo... Because of "hallucinations", it may make some things up that didn't happen.

Querying a database of old conversations is better and will always be more accurate.

In order for an AGI to truly be AGI, by definition it needs to be able to learn any task... This is currently possible manually, server-side, through backpropagation... but this is not possible continuously, like how human brains work...

Humans continuously learn...

An AI neural network learns by being fed data manually through a command-line interface... This is called "training"... data science terminology.

An AI neural network model is then "deployed", aka opened and run on a single GPU or multiple GPUs, depending on model size... When a language model is running, it is said to be in "inference mode"... More terminology.

We need an entirely different hardware architecture in order to run AI neural networks in training and inference mode simultaneously...

Photonics or neuromorphic computing, perhaps a combination of both... These seem like the way forward, in my opinion.

2

turnip_burrito t1_ja5y3hb wrote

I agree with all of this, but just to be a bit over-pedantic on one bit:

> Models can't speak or hear when they want to. It's just not part of their programming.

As you said, it's not part of their programming in today's models. In general though, it wouldn't be too difficult to construct a new model that judges at each timestep, based on both external stimuli and internal hidden states, when to speak/interrupt or to listen intently. Actually, at first glance such a thing sounds trivial.
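
For instance, a minimal sketch of such a gating head (names and sizes are made up; this isn't from any existing model):

```python
import torch
import torch.nn as nn

class SpeakOrListenGate(nn.Module):
    """Tiny illustrative head: at each timestep, decide whether to emit an
    action token ("speak") or keep reading sensor tokens ("listen"),
    conditioned on the external stimulus and the model's hidden state."""

    def __init__(self, stimulus_dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(stimulus_dim + hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, stimulus: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # Probability of choosing to speak at this timestep.
        return torch.sigmoid(self.gate(torch.cat([stimulus, hidden], dim=-1)))

gate = SpeakOrListenGate(stimulus_dim=128, hidden_dim=256)
p_speak = gate(torch.randn(1, 128), torch.randn(1, 256))
speak = bool(p_speak > 0.5)
```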

1

Borrowedshorts t1_ja3hcfq wrote

Yes probably. You don't need to learn anything to be generally intelligent if you've already been trained on the entirety of human knowledge.

1

AsheyDS t1_ja3hr4s wrote

How does it generalize across tasks, concepts, etc?

1

Lawjarp2 t1_ja3o9i0 wrote

It should have persistence. That is hard to achieve when the model is big and slow, especially if the model gets even slower from adding persistence and multimodality.

1

Superschlenz t1_ja653um wrote

Where would you get one trillion tokens for touch and actuator state (proprioception) from?

As another commenter said, you also forgot reward/goal/feelings.

1