
gurenkagurenda t1_j8v5h18 wrote

I'm not sure what you mean by "recognize the concept", but ChatGPT certainly does model whether or not statements are true. You can test this by asking it about different situations and whether they're plausible. It's clearly not just flipping a coin.

For example, if I ask it:

> I built a machine out of motors, belts, and generators, and when I put 50W of power in, I get 55W of power out. What do you think of that?

It gives me a short lecture on thermodynamics and tells me that what I'm saying can't be true. It suggests that there is probably a measurement error. If I swap the numbers, it tells me that my machine is 91% efficient, which it reckons sounds pretty good.
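(And that 91% figure checks out: with the numbers swapped, 50 W out for 55 W in is 50/55 ≈ 0.909, or about 91% efficient.)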

The problem is just that ChatGPT's modeling of the world is really spotty. It does model whether or not statements are true; it's just not great at it.

4

TheBigFeIIa t1_j8v6by0 wrote

Ah, the forest has been missed for the trees; my original statement was not clear enough. ChatGPT is able to unintentionally lie to you because it is not aware of its own fallibility.

The practical upshot is that it can generate a response that is confident but completely false, due to incomplete information or poor modeling. It is on the user to be smart enough to tell the difference.

12

gurenkagurenda t1_j8v7eme wrote

I think I see what you're getting at, although it's hard for me to say how to make that statement more precise. I've noticed that if I outright ask it "Where did you screw up above?" after it makes a mistake, it will usually identify the error, although it will often fail to correct it properly (mistakes in the transcript seem to be "sticky"; once it has stated something as true, it tends to keep restating it, even if it acknowledges that it's wrong). On the other hand, if I ask it "Where did you screw up?" when it hasn't made a mistake, it will usually just make something up, then restate its correct conclusion with some trumped-up justification.

I wonder if this is something that OpenAI could semi-automatically train out of it with an auxiliary model, the same way they taught it to follow instructions by creating a reward model.
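For reference, the reward-model trick from the InstructGPT work is roughly: collect human comparisons between pairs of responses, train a separate scorer on them, then fine-tune the chat model against that scorer. A toy sketch of just the scoring step (made-up dimensions, random tensors standing in for real response embeddings, nothing like OpenAI's actual code):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Tiny stand-in for the scorer: maps a response embedding to a scalar score."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb):
        return self.score(emb).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Pretend labelers preferred `chosen` over `rejected` in each pair
# (e.g. an honest "I'm not sure" over a confident fabrication).
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

# Pairwise ranking loss: push the chosen response's score above the rejected one's.
loss = -nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

An auxiliary model for "admits mistakes properly" would presumably reuse the same machinery with different comparison data.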

0

TheBigFeIIa t1_j8vb4qa wrote

An error being “sticky” is a great way to put it as far as the modeling goes. It gets at a more fundamental problem: the reward structure doesn’t optimize for objective truth, and instead rewards responses that are plausible or pleasing but not necessarily factual.

I do wonder if there is any way to generate a confidence estimate alongside its answers, and to allow “I don’t know” as a valid response when confidence is low. In some cases a truthful acknowledgement that it doesn’t have an answer may be more useful than a made-up response.
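Something like this, maybe. A hypothetical sketch of that gating idea (the `generate_with_confidence` backend is entirely made up, and where the confidence number actually comes from is the hard part):

```python
# Hypothetical sketch: gate the answer on a confidence score and abstain below
# a threshold. `generate_with_confidence` is a made-up stand-in, not a real API.

def answer_or_abstain(question, generate_with_confidence, threshold=0.7):
    answer, confidence = generate_with_confidence(question)
    if confidence < threshold:
        return "I don't know."  # truthful abstention instead of a confident guess
    return answer

# Example with a fake backend that reports low confidence:
print(answer_or_abstain("Who won the 2031 World Cup?",
                        lambda q: ("Brazil", 0.35)))  # -> "I don't know."
```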

3

gurenkagurenda t1_j8voslg wrote

Log probabilities are the actual output of the model (although what those probabilities directly mean once you're using reinforcement learning seems sort of nebulous), and I wonder whether uncertainty about actual facts shows up as lower probabilities in the top-scoring tokens. If so, you could imagine encoding those scores in the output itself (ultimately hidden from the user), so that the model can keep track of its past uncertainty. With training, it might learn to interpret what those low-scoring tokens imply, anywhere from "I'm not sure I'm using this word correctly" to "this one piece might be mistaken" to "this one piece might be wrong, and if so, everything after it is wrong".
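Roughly what I'm picturing, as a sketch: take the per-token log probabilities (the completions-style API exposes something like this via its `logprobs` option; the ChatGPT interface doesn't), convert them to probabilities, and annotate anything that falls below some threshold. The tokens, numbers, and threshold here are invented for illustration:

```python
import math

def flag_uncertain_tokens(tokens_with_logprobs, threshold=0.5):
    """Return the text with low-probability tokens annotated."""
    out = []
    for token, logprob in tokens_with_logprobs:
        p = math.exp(logprob)
        out.append(f"{token}[p={p:.2f}?]" if p < threshold else token)
    return "".join(out)

# Fake per-token logprobs for a generated sentence:
sample = [("The", -0.05), (" machine", -0.2), (" is", -0.1),
          (" 91", -1.4), ("%", -0.02), (" efficient", -0.3)]
print(flag_uncertain_tokens(sample))
# -> "The machine is 91[p=0.25?]% efficient"
```

In the real thing the annotations would be fed back into the context rather than shown to the user, so the model could condition on its own earlier uncertainty.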

2