
Kolinnor t1_ja2hvkm wrote

My (non-expert) take:

The problem is that there are many black boxes involved here.

LLMs work well when we have a huge amount of data to train on. In an oversimplified way, an LLM predicts the next word based on the data it has seen. But how do you "predict the next action you'll take"? If we had a massive amount of "sensation --> action" data (probably much like what the human brain accumulates over a lifetime), that would be possible. I haven't heard of a way to collect that at scale today, and I suspect it's more complicated than that anyway.
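To make the analogy concrete, here's a toy sketch (mine, not anyone's actual system): "predict the next action given the sensation" uses basically the same supervised next-step objective as an LLM; the missing piece is the web-scale dataset, which the fake tensors below stand in for.

```python
# Toy sketch: next-token prediction and "sensation -> action" prediction
# share the same recipe: a model maps a context to a distribution over the
# next step. The hard part is that no web-scale (sensation, action) corpus
# exists, so the data here is random placeholders.
import torch
import torch.nn as nn

SENSATION_DIM, NUM_ACTIONS = 32, 8     # hypothetical sizes

policy = nn.Sequential(                # stand-in for a large sequence model
    nn.Linear(SENSATION_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake "sensation -> action" pairs; in reality this is the missing dataset.
sensations = torch.randn(256, SENSATION_DIM)
actions = torch.randint(0, NUM_ACTIONS, (256,))

for _ in range(10):                    # same next-step objective as an LLM
    logits = policy(sensations)
    loss = loss_fn(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```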

I think what you're suggesting is kinda like what they're trying to do with Google's SayCan: but as you can see, for the moment there's no easy way to link LLMs with physical action. LLMs manage to produce plausible descriptions of what's happening, or of what the consequences of action X could be, but in practice it's not usable yet.
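For illustration, here's a rough sketch of the SayCan-style idea as I understand it: the LLM ranks candidate skills by how well they fit the instruction, and a separate "affordance" model ranks them by whether the robot can actually do them right now. Both scoring functions below are made-up stand-ins, not Google's actual code.

```python
# Hedged sketch of the SayCan-style combination; the two scorers are
# hypothetical placeholders for an LLM and a learned value function.
def llm_score(instruction: str, skill: str) -> float:
    """Stand-in for an LLM's estimate that `skill` helps with `instruction`."""
    return 1.0 if skill.split()[-1] in instruction else 0.1

def affordance_score(skill: str) -> float:
    """Stand-in for a value function estimating the skill's feasibility now."""
    return {"pick up sponge": 0.9, "pick up coke": 0.2}.get(skill, 0.5)

def choose_skill(instruction: str, skills: list) -> str:
    # Combine the two scores multiplicatively and pick the best skill.
    return max(skills, key=lambda s: llm_score(instruction, s) * affordance_score(s))

print(choose_skill("clean up the spilled coke with a sponge",
                   ["pick up sponge", "pick up coke", "go to kitchen"]))
# -> "pick up sponge": the LLM alone can't break the tie between sponge and
#    coke, but the affordance score says the sponge is actually graspable.
```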

There's also the fact that, as someone pointed out earlier, there are issues with continual learning, such as catastrophic forgetting. I think many brilliant minds are actively trying to overcome those issues, but it's no easy feat.
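Here's a toy demonstration of catastrophic forgetting (my own sketch, not from any paper): train one network on task A, then on task B, and watch the accuracy on task A collapse because plain gradient descent overwrites the old weights.

```python
# Toy illustration of catastrophic forgetting with two random linear tasks.
import torch
import torch.nn as nn

def make_task(seed):
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(512, 16, generator=g)
    w = torch.randn(16, generator=g)
    return x, (x @ w > 0).long()        # a random linear labeling

def train(model, x, y, steps=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(0)
xb, yb = make_task(1)

train(model, xa, ya)
print("task A accuracy after training on A:", accuracy(model, xa, ya))  # near 1.0
train(model, xb, yb)
print("task A accuracy after training on B:", accuracy(model, xa, ya))  # much lower
```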

3

WikiSummarizerBot t1_ja2hwrn wrote

Catastrophic interference

>Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to abruptly and drastically forget previously learned information upon learning new information. Neural networks are an important part of the network approach and connectionist approach to cognitive science. With these networks, human capabilities such as memory and learning can be modeled using computer simulations. Catastrophic interference is an important issue to consider when creating connectionist models of memory.


2

visarga t1_ja36ih0 wrote

We could have a model pre-trained on a large video dataset and then fine-tuned for various tasks, the way GPT-3 is.

Using YouTube as training data, we'd get video + audio, which decompose into images, movement, body pose, intonation, text transcript, and metadata, all in parallel. This dataset could dwarf the text datasets we have now, and it would contain a lot of information that doesn't get captured in text, such as the physical movements required to achieve a specific task.
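Just to illustrate the decomposition, a hypothetical record for one such clip might look like this (field names are mine, not from any existing dataset):

```python
# Hypothetical structure of one multimodal training example decomposed
# from a YouTube clip; purely illustrative.
from dataclasses import dataclass, field

@dataclass
class VideoClipSample:
    video_id: str                    # source video identifier
    frames: list                     # image frames (arrays or tensors)
    audio: list                      # raw waveform chunks
    transcript: str                  # ASR or caption text
    body_poses: list = field(default_factory=list)  # per-frame pose estimates
    metadata: dict = field(default_factory=dict)    # title, tags, channel, ...
```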

I think the OP was almost right. A multimodal model would be a good base for the next step, but it would need instruction tuning and RLHF; pre-training alone is not enough.

One immediate application I can see: automating desktop activities. After watching many hours of screencasts from YouTube, the model would learn how to use apps and solve tasks on first sight, like GPT-3.5 does, but not limited to just text.

2