harharveryfunny t1_irn8jzr wrote on October 9, 2022 at 3:40 PM

Reply to comment by SejaGentil in [D] Why can't language models, like GPT-3, continuously learn once trained? by SejaGentil

The way a neural net learns is by comparing the neural net's current output, for a given input, to the correct/preferred output, then "back-propagating" this difference (error) information backwards through the net to incrementally update all the weights (that represent what it has learned).

During training you know the correct/preferred output for any input since this is provided by your training data, which consists of (input, output) pairs. For each training pair, and corresponding output error, the network's weights are only updated a *little* bit, since you want to take ALL the training samples into account. You don't want to make the net totally correct for one sample at the expense of being wrong for the others, so the way this is handled is by repeating all the training samples multiple times with small updates until the net has been tweaked to minimize errors for ALL of them.
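To make the "small updates, repeated over all samples" idea concrete, here is a minimal sketch using a one-weight "network" (prediction = w * x). The training pairs, learning rate and epoch count are made-up values for illustration, not anything GPT-3 actually uses.

```python
# Toy "network": a single weight w, so the prediction is w * x.
# The training pairs follow y = 3 * x; the values are invented for illustration.
training_pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

def train(pairs, epochs=200, learning_rate=0.01):
    w = 0.0  # start out knowing nothing
    for _ in range(epochs):                # repeat ALL the samples many times
        for x, target in pairs:
            output = w * x
            error = output - target        # difference from the correct output
            gradient = error * x           # d(0.5 * error**2) / dw
            w -= learning_rate * gradient  # only a *little* update per sample
    return w

w = train(training_pairs)
print(round(w, 3))
```

With a single weight there is nothing to propagate backwards through; in a multi-layer net the same error signal is passed back layer by layer via the chain rule.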

If we're talking specifically about a language model like GPT-3, then the training data consists of sentences and the training goal is "predict next word" based on what it has seen so far. For example, if one training sample was the sentence "the cat sat on the mat", then after having seen "the cat" the correct output is "sat", and if the net has seen "the cat sat on the", then the correct output is "mat".
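As a rough sketch of how one sentence becomes several "predict next word" training examples (the function name is hypothetical, and real models work on sub-word tokens rather than whole words):

```python
def next_word_pairs(sentence):
    words = sentence.split()
    # For each position, the input is everything seen so far and the
    # target is the word that actually comes next in the sentence.
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

So a single six-word sentence yields five (input, output) pairs, including "the cat" -> "sat" and "the cat sat on the" -> "mat".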

So, with that background, there are two problems to having GPT-3 learn continuously, not just during training:

1. When you are (after training) using GPT-3 to generate text, you have no idea what possible words it should be outputting next! You start with an initial "prompt" (sentence) and use GPT-3 to "predict next word", then if you want another word of output you take that first generated word and feed it back in, and now have GPT-3 predict the *next* word, etc. This isn't like training where you already know the whole sentence ahead of time - when actually using the model you are just generating one word at a time, and have no idea what *should* come next. Since you have no idea of what is correct, you can't derive an error to update the model.
2. Even if you somehow could come up with an error for each word that GPT-3 generates, then how much should you update the weights? Just like during training you don't want to make a big update and make the net correct for the current input but wrong for all other inputs, but unlike training you can't handle this by just updating a little and then re-presenting the entire training set (plus whatever else you've fed into GPT-3 since then) to make updates for those too. This is what another reply is referring to as the "catastrophic forgetting" problem - how would you, after training, continue learning (i.e. continue updating the model's weights) without disrupting everything it has already learned?
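The generate-and-feed-back loop from the first point can be sketched like this. TOY_MODEL is a hypothetical stand-in for a trained model (a lookup on the last word only, where a real model looks at the whole context); the point is just the shape of the loop.

```python
# Hypothetical stand-in for a trained model: maps the last word seen to the
# word it predicts next. A real model conditions on the whole context.
TOY_MODEL = {"the": "cat", "cat": "sat", "sat": "on", "on": "a", "a": "mat"}

def predict_next_word(words):
    return TOY_MODEL.get(words[-1])

def generate(prompt, max_new_words=5):
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = predict_next_word(words)
        if next_word is None:    # the toy model has no prediction
            break
        words.append(next_word)  # feed the generated word back in
        # Note: nothing here says what the word *should* have been,
        # so there is no error available to back-propagate.
    return " ".join(words)

print(generate("the"))
```

Compare this with a training loop, where every prediction has a known target next to it. Here the only thing available is the model's own output.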

The reason our brains *can* learn continuously is because they are "solving" a different problem. Our brain is also learning to "predict next thing", but in this case the "thing" it is predicting is current sensory inputs/etc - the never-ending stream of reality we are exposed to. So, our brain can predict what comes next, and there always will be an actual next "thing" that is happening/being experienced for it to compare to. If our brain was wrong ("surprised") by what actually happened vs what was predicted, then it can use that to update itself to do better next time.

It's not totally clear how our brains handle the "catastrophic forgetting" problem, but it certainly indicates it is using weights in a bit of a different way to our artificial neural networks. It may be related to the idea of "sparsity".
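To make the "catastrophic forgetting" problem concrete, here is a minimal sketch with a one-weight model. It is a deliberately extreme case (the two made-up tasks directly conflict, and one weight cannot represent both), but it shows the mechanism: continuing to train on new data alone overwrites what was learned before.

```python
def sgd(w, pairs, epochs=200, learning_rate=0.01):
    # Plain per-sample gradient descent on squared error for prediction w * x.
    for _ in range(epochs):
        for x, target in pairs:
            w -= learning_rate * (w * x - target) * x
    return w

def mean_squared_error(w, pairs):
    return sum((w * x - t) ** 2 for x, t in pairs) / len(pairs)

# Two invented "tasks" for illustration.
task_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # y = 2x
task_b = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]  # y = -x

w = sgd(0.0, task_a)                  # original training
error_a_before = mean_squared_error(w, task_a)

w = sgd(w, task_b)                    # keep learning, but only on new data
error_a_after = mean_squared_error(w, task_a)

print(error_a_before, error_a_after)
```

The error on task A is essentially zero after the first phase and large after the second, because the old samples were never re-presented to pull the weight back.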

SejaGentil OP t1_irq5qyz wrote on October 10, 2022 at 4:47 AM

I see. I understand all that you said, thanks for the info. I do disagree with the way things are done and feel a little sceptical about our whole approach now, but, of course, not being part of the field, my opinion doesn't matter at all. At least now I understand it from your point of view.

harharveryfunny t1_irrfflr wrote on October 10, 2022 at 2:00 PM

Well, any scientific or engineering field is going to progress from simple discoveries and techniques to more complex ones, and the same applies to artificial neural networks.

If you look at the history of ANNs, they were originally limited to a single layer ("Perceptron") until the discovery of how multi-layer networks (much more powerful!) could be trained via back-propagation and SGD, which is what has led to the amazing capabilities we have today.

The history of ANNs is full of sceptics who couldn't see the promise of where the technology was heading, but rather than being sceptical I think today's amazing capabilities should make you optimistic that they will continue to become more powerful as new techniques continue to be discovered.

SejaGentil OP t1_irrtyrx wrote on October 10, 2022 at 3:42 PM

If I may ask a last question, why layers? Why not a graph where each neuron may interact with each other neuron, exactly like the brain? Of course not all edges need to exist, each neuron could have just a few connections to keep the number of synapses controlled; the point is to eliminate the layering, which looks artificial.

harharveryfunny t1_irslrit wrote on October 10, 2022 at 6:48 PM

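GPT-3 isn't a plain stack of fully-connected layers - the proper name for its architecture is a "transformer". It is still built from stacked blocks, but each block is highly structured (attention plus a feed-forward sub-layer). Nowadays there are many different architectures in use.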

The early focus on layers was because there are simple things, such as learning an XOR function, that a single layer network can't do, but originally (50 years ago) no-one knew how a multi-layer network could be trained. The big breakthrough therefore was when c. 1980 the back-propagation algorithm was invented which solved the multi-layer "credit assignment" training problem (and also works on any network shape - graphs, etc).
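The XOR limitation can be illustrated with a small sketch. The two-layer weights below are hand-picked rather than learned, and the grid search over single-unit weights is only an illustration, not a proof (the general result is that XOR is not linearly separable, so no single threshold unit can compute it).

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def single_unit(w1, w2, b, x1, x2):
    # One threshold neuron: a single "layer" with two inputs.
    return step(w1 * x1 + w2 * x2 + b)

def two_layer(x1, x2):
    # Hand-picked weights (not learned): hidden units compute OR and AND,
    # and the output unit fires for "OR but not AND", which is XOR.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

def computes_xor(fn):
    return all(fn(x1, x2) == y for (x1, x2), y in XOR_TABLE.items())

# Try a coarse grid of weights for the single unit; none reproduces XOR.
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
single_layer_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if computes_xor(lambda x1, x2: single_unit(w1, w2, b, x1, x2))
]

print(len(single_layer_solutions))  # 0
print(computes_xor(two_layer))      # True
```

Adding one hidden layer is enough here; what was missing historically was a way to learn such hidden weights automatically, which is the gap back-propagation filled.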

The "modern neural net era" really dates to the ImageNet image recognition competition in 2012 (only 10 years ago!) when a multi-layer neural net beat older non-ANN image recognition approaches by a significant margin. For a number of years after that the main focus of researchers was performing better on this ImageNet challenge with ever more elaborate and deeper multi-layer networks.

Today, ImageNet is really a solved problem, and the focus of neural nets has shifted to other applications such as language translation, speech recognition, generative language models (GPT-3) and recent text-to-image and text-to-video networks. These newer applications are more demanding and have required more sophisticated architectures to be developed.

Note that our brain actually has a lot of structure to it - it's not just one giant graph where any neuron could connect to any other neuron. For example, our visual cortex actually has what is roughly a layered architecture (V1-V5), which is why multi-layer nets have done so well in vision applications.

SejaGentil OP t1_irstl4q wrote on October 10, 2022 at 7:41 PM

Thanks for this overview, it makes a lot of sense. Do you have any ideas as to why GPT-3, DALL-E and the like are so bad at generating new insights and logical reasoning? My feeling is that these networks are very good at recalling, like a very dumb human that compensates for it with a Wikipedia-size memory. For example, if I attempt to prompt something like this on GPT-3:

This is a logical question. Answer it using exact, mathematical reasoning.

There are 3 boxes, A, B, C.
I take the following actions, in order:
- I put 3 balls on box A.
- I move 1 ball from box A to box C.
- I swap the contents of box A and box B.
How many balls are on each box?

It will fail miserably. Trying to teach it any kind of programming logic is a complete failure; it isn't able to get very basic questions right. Asking step by step doesn't help. For me, the main goal of AGI is to be able to teach a computer how to prove theorems in a proof assistant like Agda, and let it be as apt as myself. But GPT-3 is as unapt as every other AI, and it seems like scaling won't do anything about that. That's why, to me, it feels like AI as a whole is making 0 progress towards (my concept of) AGI, even though it is doing amazing feats in other realms, and that's quite depressing. I use GPT-3 Codex a lot when coding, but only when I need to do some kind of repetitive trivial work, like converting formats. Anything that needs any sort of reasoning is out of its reach. Similarly, DALL-E is completely unable to generate new image concepts (like a cowboy riding an ostrich, a cow with a duck beak...).
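For reference, the box puzzle above has a definite answer that can be checked by simulating the steps. This is a minimal sketch; it assumes all three boxes start empty, which the prompt does not state explicitly, and the helper names are made up.

```python
# Assumption: all boxes start empty (the prompt does not say so explicitly).
boxes = {"A": 0, "B": 0, "C": 0}

def put(box, count):
    boxes[box] += count

def move(source, destination, count):
    boxes[source] -= count
    boxes[destination] += count

def swap(first, second):
    boxes[first], boxes[second] = boxes[second], boxes[first]

put("A", 3)        # I put 3 balls on box A.
move("A", "C", 1)  # I move 1 ball from box A to box C.
swap("A", "B")     # I swap the contents of box A and box B.

print(boxes)  # {'A': 0, 'B': 2, 'C': 1}
```

So the expected answer is A = 0, B = 2, C = 1: simple state tracking, which is exactly the kind of thing the model has to carry implicitly through its word predictions.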

harharveryfunny t1_irtbnwz wrote on October 10, 2022 at 9:47 PM

GPT-3 isn't an attempt at AI. It's literally just a (very large) language model. The only thing that it is designed to do is "predict next word", and it's doing that in a very dumb way via the mechanism of a transformer - just using attention (tuned via the massive training set) to weight the recently seen words to make that prediction. GPT-3 was really just an exercise in scaling up to see how much better (if at all) a "predict next word" language model could get if the capacity of the model and size of the training set were scaled up.
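As a rough illustration of what "using attention to weight the recently seen words" means, here is a sketch of single-head scaled dot-product attention in plain Python. The 2-d vectors are made up; a real transformer uses learned projections to produce the query, key and value vectors, many heads, many stacked blocks, and masking so that a word cannot look at later words.

```python
import math

def softmax(scores):
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each previously seen word by how well its key matches the query.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # weights are positive and sum to 1
    # The output is the weighted mix of the value vectors.
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Made-up 2-d vectors standing in for three previously seen words.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights, mixed = attend(query, keys, values)
print(weights)
print(mixed)
```

The third word gets the largest weight because its key lines up best with the query. Training is what tunes the vectors so that the useful words get the high weights.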

We would expect GPT-3 to do a good job of predicting next word in a plausible way (e.g. "the cat sat on the" => mat), since that is literally all it was trained to do, but the amazing, and rather unexpected, thing is that it can do so much more ... Feed it "There once was a unicorn", and it'll start writing a whole fairy tale about unicorns. Feed Codex "Reverse the list order" and it'll generate code to perform that task, etc. These are all emergent capabilities - not things that it was designed to do, but things that it needed to learn to do (and evidently was capable of learning, via its transformer architecture) in order to get REALLY good at its "predict next word" goal.

Perhaps the most mind blowing Codex capability was the original release demo video from OpenAI where it had been fed the Microsoft Word API documentation, then was able to USE that information to write code to perform a requested task ("capitalize first letter of each word" if I remember correctly)... So think about it - it was only designed/trained to "predict next word", yet is capable of "reading API documentation" to write code to perform a requested task !!!

Now, this is just a language model, not claiming to be an AI or anything else, but it does show you the power of modern neural networks, and perhaps give some insight into the relationship between intelligence and prediction.

DALL-E isn't claiming to be an AI either, and has a simple flow-through architecture. It basically just learns a text embedding which it maps to an image embedding which is then decoded to the image. To me it's more surprising that something so simple works as well as it does, rather than disappointing that it only works for fairly simple types of compositional requests. It certainly will do its best to render things it was never trained on, but you can't expect it to do very well with things like "two cats wrestling" since it has no knowledge of cats' anatomy, 3-D structure, or how their joints move. What you get is about what you'd expect given what the model consists of. Again, it's a pretty simple flow-through text-to-image model, not an AI.

For any model to begin to meet your expectations of something "intelligent" it's going to have to be designed with that goal in the first place, and that is still in the future. So, GPT-3 is perhaps a taste of what is to come... if a dumb language model is capable of writing code(!!!), then imagine what a model that is actually designed to be intelligent should be capable of ...

SejaGentil OP t1_iruye6s wrote on October 11, 2022 at 6:10 AM

Just answering to thank you for all the info, I don't have any more questions for now.