Comments

Borrowedshorts t1_jdu1o78 wrote

So if you're using this for academic research, you can put in your original prompt and then tell it to only return references with a confidence score > .5. Neat little trick.

45

MysteryInc152 t1_jdu4sl2 wrote

In the GPT-4 technical report, we see that base GPT-4 has really good calibration, i.e. its confidence directly correlates with its ability to solve problems. But apparently the RLHF they did knocked that out some.

60

he_who_floats_amogus t1_jdu8479 wrote

You could do that, but if it's just hallucinating the confidence scores then it really isn't very neat. The language model gets very high reward for hallucinated responses on things like confidence scores in particular, because hallucinating figures like these still produces very coherent responses.

72

SoylentRox t1_jdu9ya6 wrote

So this is an open-domain hallucination:

> Closed domain hallucinations refer to instances in which the model is instructed to use only information provided in a given context, but then makes up extra information that was not in that context. For example, if you ask the model to summarize an article and its summary includes information that was not in the article, then that would be a closed-domain hallucination. Open domain hallucinations, in contrast, are when the model confidently provides false information about the world without reference to any particular input context.

They handled this via:

> For tackling open-domain hallucinations, we collect real-world ChatGPT data that has been flagged by users as being not factual, and collect additional labeled comparison data that we use to train our reward models.

Not very productive. The best way to check references would be to use a plugin plus instructions to the model to "check references". The machine also needs RL training so that it will actually use the plugin, and use it correctly the first time.

17

BullockHouse t1_jduba6v wrote

Keeping a second GPT-4 window open and asking it to verify information from the first seems to work pretty well. The models fail by guessing when uncertain, but they have no incentive to cooperate and back up one another's guesses. The problem with your solution is that it probably biases the model towards making up some papers just to fit the prompt and have a mix.

62

WarAndGeese t1_jdubx7q wrote

Also, if the second neural network is running as a separate internet-connected application, it can go out and verify the output of the first, send back its results, and tell the first to either change or remove each paper that it cannot find and verify. The second neural network can make errors as well, but through these interconnected systems errors can be reduced substantially.

6

nemesit t1_jdud09h wrote

Just have it give you the DOIs as a list, along with a script to verify that they exist.
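Something like this minimal sketch would do it, assuming the model's output has been pasted in as bare DOIs, one per line (the example DOIs below are only illustrative; the resolver at doi.org returns 404 for DOIs that don't exist):

```python
# Minimal sketch: check whether each DOI resolves at doi.org.
# A 404 from the resolver suggests the DOI was hallucinated.
import urllib.error
import urllib.request

dois = """
10.48550/arXiv.1412.6980
10.1000/this-doi-does-not-exist
""".split()

for doi in dois:
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=10)
        print(f"OK       {doi}")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(f"MISSING  {doi}")
        else:
            # Some publishers reject HEAD requests; treat this as inconclusive.
            print(f"CHECK    {doi}  (HTTP {e.code})")
```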

3

master3243 t1_jdue84p wrote

> The problem with your solution is that it probably biases the model towards making up some papers just to fit the prompt and have a mix.

That's a very important point. Adding an extra condition (if 'p' then 'q') to the prompt biases the model towards doing 'p' and then doing 'q' to fulfil the prompt, even though the condition would still be satisfied if it simply avoided doing 'p'.

For a more concrete example, here's me asking ChatGPT to write two essays:

1. Write a paragraph about zoos. [Figure] (Notice how no elephants are mentioned.)

2. Write a paragraph about zoos with the (if 'p' then 'q') condition. [Figure] (Notice how only this answer mentions elephants.)

46

Colecoman1982 t1_jdug1m3 wrote

Yea, but what's its confidence score for its confidence score calculation? /s

22

was_der_Fall_ist t1_jdugi0b wrote

I’ve heard the RLHF change explained as actually a good thing, though. Here’s an example:

Say you ask it a question to which it assigns 90% probability to answer X and 10% probability to answer Y. Base GPT-4 gives the answers in these proportions: 90% of the time it says X and 10% of the time it says Y.

But if it’s 90% sure the answer is X, you don’t want it to say Y is the answer at all, even 10% of the time! It’s better for it to always say X. (Though the best behavior may be to give a thorough account of its respective probability assessments.) So RLHF improves the model’s behavior by decoupling the rate at which it gives each answer from the probability it assigns to that answer.
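A toy sketch of the difference, using the same made-up 90/10 numbers:

```python
# Toy sketch: a calibrated base model effectively *samples* answers in
# proportion to its subjective probabilities; the RLHF'd "helpful" model
# just returns the most probable answer every time.
import random

beliefs = {"X": 0.9, "Y": 0.1}  # made-up probabilities from the example above

def calibrated_answer():
    return random.choices(list(beliefs), weights=list(beliefs.values()))[0]

def helpful_answer():
    return max(beliefs, key=beliefs.get)

print([calibrated_answer() for _ in range(10)])  # mostly X, occasionally Y
print([helpful_answer() for _ in range(10)])     # always X
```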

27

astrange t1_jdujlcf wrote

This is why people are wrong when they say GPT "just outputs the most probable next word". It's the most probable /according to itself/, and the model has been trained to lie such that the most useful word is the most probable one.

23

was_der_Fall_ist t1_jduk3s8 wrote

They’re also not realizing that even if the goal is to produce the most probable/useful next word, that doesn’t preclude the neural network from doing other complicated operations in order to figure out the most probable/useful word.

39

knome t1_jduvetc wrote

After learning about the model's token-at-a-time nature, where it sees its own output as it generates, I asked it to start printing outputs and then following them with a judgment on whether they were correct. It usually caught itself.

2

light24bulbs t1_jduwgqt wrote

Yeah, like it's actually using a huge amount of brain power to figure out what the next word is. Just because that's how it works doesn't mean it's not intelligent.

If you want to be really good at figuring out what the next word is, you have to be really smart.

7

yaosio t1_jduzcus wrote

It can also return hallucinated results from a real source. I've had Bing Chat fabricate paragraphs from real papers. The sidebar can see pages and documents, but even when it's looking at the PDF of the paper it will still make things up.

3

Alhoshka t1_jdv22wn wrote

Is ChatGPT's free version already running on GPT-4?

Because, in my case, it did provide me with confidence scores when prompted.

> Prompt:
> Suggest me some papers on gradient descent optimization and include your confidence score that they actually exist

> Answer:
> Sure! Here are five papers on gradient descent optimization:

> 1. "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Ba. Confidence score: 10/10.

> 2. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization" by John Duchi, Elad Hazan, and Yoram Singer. Confidence score: 9/10.

> 3. "On the Convergence of Adam and Beyond" by Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Confidence score: 8/10.

> 4. "Proximal Algorithms" by Neal Parikh and Stephen Boyd. Confidence score: 9/10.

> 5. "Stochastic Gradient Descent Tricks" by Leon Bottou. Confidence score: 10/10.

> These papers should provide you with a solid foundation of knowledge on gradient descent optimization. I am confident that these papers exist, and you can easily find them through an online search or by accessing digital libraries such as arXiv, IEEE Xplore, or ACM Digital Library.

PS: ChatGPT did provide me with non-existent citations in the past. Asking for a confidence score is a neat prompt trick. Thanks, OP!

2

pale2hall t1_jdv97t1 wrote

That's helpful. I built a music rec prompt / prompt generator, and I had to use Spotify's API and some fuzzy matching to double check if it hallucinated.
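The fuzzy-matching half of that check can be as simple as difflib; the track names below are hypothetical, and in practice you'd compare against whatever your search endpoint actually returns:

```python
# Sketch of the fuzzy-match step: compare a model-suggested track name against
# titles returned by a search API and flag likely hallucinations.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

suggested = "Midnight Rain (Acoustic Version)"        # hypothetical model output
search_results = ["Midnight Rain", "Purple Rain"]      # hypothetical API results

best = max(search_results, key=lambda title: similarity(suggested, title))
if similarity(suggested, best) < 0.5:                  # threshold chosen arbitrarily
    print(f"Possible hallucination: no close match for {suggested!r}")
else:
    print(f"Closest real match: {best!r}")
```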

1

muskoxnotverydirty t1_jdvak20 wrote

We've already seen similar prompts such as telling it to say "I don't know" when it doesn't know, and then priming it with examples of it saying "I don't know" to nonsense. Maybe there's something to the added work of getting an output and then iteratively self-critiquing to get to a better final output.

I wonder if they could be using this idea to automatically and iteratively generate and improve their training dataset at scale, which would create a sort of virtuous cycle of improve dataset -> improve LLM -> repeat.
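A rough sketch of that generate-then-self-critique loop; `call_llm` is a hypothetical stand-in for whatever chat-completion call you're using, and the prompts are only illustrative:

```python
# Rough sketch of an iterative self-critique loop. `call_llm` is a hypothetical
# placeholder for a real chat-completion API call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your chat-completion API of choice")

def answer_with_self_critique(question: str, rounds: int = 2) -> str:
    draft = call_llm(question)
    for _ in range(rounds):
        critique = call_llm(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any factual errors or unsupported claims, or reply 'none'."
        )
        if critique.strip().lower() == "none":
            break
        draft = call_llm(
            f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the issues in the critique."
        )
    return draft
```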

2

mizmato t1_jdvgcla wrote

I've seen too many posts on Reddit trying to justify X by saying ChatGPT told them to do it (e.g., asking ChatGPT to do their taxes and then submitting the results). LLMs are something else.

5

brierrat t1_jdvi6rh wrote

The 3rd citation hallucinated the authors but got the title correct. The actual authors of "On the Convergence of Adam and Beyond" are Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar.

7

SkinnyJoshPeck t1_jdvk16j wrote

This is an important thing I've been telling everyone I can about - people talk about how GPT kills education because someone can just ask for a paper and never do the work themselves to learn.

This is a language model, not an encyclopedia, or a quantitative machine, or some other use. It fakes sources; it has no concept of right/wrong or truth vs untruth. It doesn't reason between sources.

The beauty of it is, frankly, it's ability to mimic (at this point) a pseudo-intellectual, haha. Kids are going to turn in papers sourced like they talked to their conspiracy theory uncle, and it will be the "watermark" of AI written papers. It can't reason, it can't generate opinions, thus it can't write a paper. We're long from that (if we could ever get there anyways).

49

SkinnyJoshPeck t1_jdvpkge wrote

but as others are saying, who knows if those confidence scores aren’t also just generated to look like confidence scores. we should ask it for a bunch of confidence scores for sources and see what the actual classification metrics are.. it could just be assuming the further a source is from the top, the less likely it is to be a real source. i don’t see how it could possibly have an understanding that isn’t completely binary since it seems to be generating the fake sources itself.

imo, it’s a bit sketchy if it only identifies its own fake sources with anything less than 100% accuracy - it basically implies two things: there’s a secondary model for true v. false that’s detached from its generative stuff (why wouldn’t it have something that says “this isn’t a great response, maybe i should admit that”); and it seems to have the ability to deceive lol

8

antonivs t1_jdvqdpc wrote

One thing I wonder about is how it arrives at those confidence scores. They're also presumably just the output of the language model, so why should they be correlated with the actual existence of the papers in question?

11

iJeff t1_jdvsctx wrote

Although it can seem to work to some degree, this does seem to be the case. Bing Chat is generally a better option for this, because it will provide a citation for its claims. Visiting those citations can help you figure out whether it was merely hallucinating.

1

gnramires t1_jdvt5u2 wrote

I don't think this is accurate. I think it's clear that truth is an important concept in human conversations, and it seems advanced models can clearly learn and model truth as an abstract concept, and probably have an internal representation of reality that aids in their overall "job" of text completion.

Indeed, this alone does not guarantee that text completion tasks will really reflect reality, the true state of the world (again, because text completion can occur in any context). However, with good prompts, and with the aid of reinforcement learning, I believe the "neural circuits" and neural representations associated with truth, distinguishing what's real or not, and building internal models of reality get exercised and prioritized. In this way, a chat model trained for, and encouraged through prompts towards, truth-telling does have a genuine notion of truth and a capability to understand reality -- although clearly not perfect by any means yet.

−2

Peleton011 t1_jdvtqq0 wrote

Unless I'm wrong somewhere, LLMs work with probabilities; they output the most likely response based on training.

They definitely could be able to show you how likely a given paper is as a response, and given that the real papers would be part of the training set, answers it's less sure of are statistically going to be less likely to be true.

3

sineiraetstudio t1_jdvvvdb wrote

... that's not what's happening though? The calibration error causes it to increase its confidence in low-accuracy answers and decrease it in medium-to-high-accuracy answers, making it more likely to output wrong answers. Seems like maybe you're confusing it with using a different sampler? Something like top-p already does what you mentioned.
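(For anyone unfamiliar: top-p / nucleus sampling truncates the next-token distribution to the smallest set of tokens whose cumulative probability reaches p, then samples from that set. A bare-bones sketch with made-up numbers:)

```python
# Bare-bones nucleus (top-p) sampling: keep the smallest set of tokens whose
# cumulative probability reaches p, then sample from that truncated set.
import random

def top_p_sample(probs: dict, p: float = 0.9) -> str:
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

made_up_probs = {"X": 0.72, "Y": 0.18, "Z": 0.06, "W": 0.04}
print(top_p_sample(made_up_probs, p=0.9))  # only X and Y survive the cutoff
```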

1

metigue t1_jdw08fp wrote

Doesn't GPT-4 have some kind of reinforcement learning already baked in though? I asked it what "green as gravy" meant and it responded with a hallucination about it being a widely used expression and examples of its usage. I said "Nice try, but green as gravy is not a widely used expression is it?" It clarified that it is not a widely used expression and it made the stuff up as a possible definition of green as gravy.

Edit: Tried again just now and it still works. Leave system on default and try the user message: What is the meaning of "green as gravy"

1

was_der_Fall_ist t1_jdw2fud wrote

I’m pretty much just quoting Paul Christiano, alignment researcher at ARC and previously OpenAI, in a comment thread on this LessWrong post.

Someone comments pretty much the same thing the person I replied to did:

> “GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced.” What??? This is so weird and concerning.

To which Paul replies:

> If I ask a question and the model thinks there is an 80% chance the answer is "A" and a 20% chance the answer is "B," I probably want the model to always say "A" (or even better: "probably A"). I don't generally want the model to say "A" 80% of the time and "B" 20% of the time.

>In some contexts that's worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it's not really lying---it's also the behavior you'd expect out of a human who is trying to be helpful.

>More precisely: when asked a question the pre-trained model outputs a probability distribution over what comes next. If prompted correctly you get its subjective probability distribution over the answer (or at least over the answer that would appear on the internet). The RLHF model instead outputs a probability distribution over what to say next which is optimized to give highly-rated responses. So you'd expect it to put all of its probability mass on the best response.

>… If it is forced to say either "yes" or "no" the RLHF model will just give the more likely answer 100% of the time, which will show up as bad calibration on this graph. The point is that for most agents "the probability you say yes" is not the same as "the probability you think the answer is yes." This is the case for pretrained models.
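For what it's worth, extracting the answer probabilities directly from the logits looks roughly like the sketch below. This uses the HuggingFace transformers API with GPT-2 purely as a stand-in, since GPT-4's pre-trained logits aren't exposed this way:

```python
# Sketch: estimate confidence in "yes" vs "no" from next-token probabilities
# instead of asking the model to state a number. GPT-2 is only a stand-in here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: Is the Earth larger than the Moon? Answer (yes or no):"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

yes_id = tokenizer.encode(" yes")[0]
no_id = tokenizer.encode(" no")[0]
probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)
print(f"P(yes) = {probs[0]:.2f}, P(no) = {probs[1]:.2f}")  # renormalized over the two options
```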

6

was_der_Fall_ist t1_jdw2ya2 wrote

Check out this LessWrong thread in the comments.

Paul Christiano, alignment researcher at ARC/ previously OpenAI, explains the RLHF change the exact way I did (because I was pretty much quoting him), and someone replies:

> Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.

And Paul replies:

> Yes, I think you are misunderstanding figure 8. I don't have inside information, but without explanation "calibration" would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedback, under a careful training regime it would probably get modestly better.)

5

SoylentRox t1_jdw2yey wrote

It is not learning from your chats. Apparently OpenAI does farm information from ChatGPT queries specifically for RL runs. And I was mentioning that in order for "plugin" support to work even sorta OK, the machine absolutely has to learn from its mistakes.

Remember all it knows is a plugin claims to do something by a description. The machine needs to accurately estimate if a particular user request is going to actually be satisfied by a particular plugin and also how to format the query correctly the first time.

Without this feature it would probably just use a single plugin, ignoring all the others, or get stuck emitting malformed requests a lot and just guess the answer like it does now.

2

adventuringraw t1_jdw6enx wrote

You're right that there isn't a system yet that has the power of a LLM without the risk of hallucinated 'facts' woven in, but I don't think it's fair to say 'we're a long ways from that'. There's a ton of research going into different ways to approach this problem, approaches involving a tool using LLM seem likely to work even in the relatively short term (production models in the next few years, say) and that's only one approach.

I certainly don't think it's a /given/ that this problem will be solved soon, I wouldn't bet money that you're wrong about it taking a long time to get it perfect. But I also wouldn't bet money that you're right, given all the progress being made on multiple fronts towards solving this, and given the increasingly extreme focus by so many researchers and companies on this problem, and especially given the fact that solutions like this are both promising and seemingly realistic. After all, if there's a sub-system to detect that an arxiv search should be used to verify a reference before giving it, you could at least eliminate halucinated examples in this narrow area. The downside then might just be an incomplete overview of available papers, but it could eliminate any false papers from what the user sees.

All that said, this only fixes formal citations with a somewhat bespoke system. Fixing ALL inaccurate facts probably won't be possible with even dozens of 'tools'... that'll take more of what you're thinking, I imagine: something like a truly general learned knowledge graph embedded as a system component. I know there's work on that too, but when THAT's fully solved (like, TRULY solved, where modular elements of the world can be inferred from raw sensory data, and facts accumulated about their nature from interaction and written content) we'll be a lot closer to something that's arguably AGI, so... yeah. I think you're right about that being a fair ways away at least (hopefully).

13

Ph0masta t1_jdw91ky wrote

I wonder if it's using Bing Search to check whether its predicted results are actually real.

1

TotallyNotGunnar t1_jdwbg7n wrote

Students are already doing that with research engines. When I graded lab papers in grad school, I swear 80% of the students wrote down whatever they remembered from class and then backfilled their citations using Google Scholar results.

3

sineiraetstudio t1_jdwbuig wrote

I don't see how this is arguing that it's a good thing; it's just a justification (which I'd expect from Paul Christiano, he's a huge fan of RLHF). The model is becoming overconfident in its answers - how could you possibly spin that as a positive?

1

was_der_Fall_ist t1_jdwdxut wrote

My understanding is that rather than being overconfident in their answers, they simply produce the answer they’re most confident in instead of differentially saying each answer proportional to how confident they are. This seems similar to how humans work — if you ask me a yes or no question and I’m 80% sure the answer is yes, I’m going to say “yes” every time; I’m not going to say “no” 20% of the times you ask me, even though I assign a 20% chance that “no” is correct. In other words, the probability I say yes is not the same as the probability I assign to yes being correct. But I admit there are subtleties to this issue with which I am unfamiliar.

4

sineiraetstudio t1_jdws2iv wrote

(The graph doesn't give enough information to determine whether it's actually becoming more confident in its high-confidence answers, but it sounds like a reasonable enough rationale.)

I'm not sure I understand what distinction you're trying to draw. The RLHF'd version assigns higher confidence to answers than it actually gets correct, unlike the original pre-trained version. That's literally the definition of overconfidence.

You might say that this is more "human-like", but being human-like doesn't mean that it's good. If you want only the most likely answer, you can already do this via the sampler, while on the other hand calibration errors are a straight-up downside, as Paul Christiano explicitly mentions in the part you quoted. If you need accurate confidence scores (because you e.g. only want to act if you're certain), being well-calibrated is essential.
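(And "well-calibrated" is a measurable property: bucket answers by stated confidence and compare average confidence with accuracy per bucket, e.g. expected calibration error. A toy sketch with made-up numbers:)

```python
# Toy expected calibration error (ECE): bucket predictions by confidence and
# compare average confidence with accuracy in each bucket. Numbers are made up.
import numpy as np

confidences = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50])
correct     = np.array([1,    1,    0,    1,    1,    0,    1,    0])

def expected_calibration_error(conf, corr, n_bins=5):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bucket's share of samples
    return ece

print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```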

2

elkhornslew t1_jdwu61u wrote

What’s its confidence in its confidence scores?

1

was_der_Fall_ist t1_jdwz4qw wrote

I think you make a good point. We probably need better methods of post-training LLMs. But the current regime does still seem to be sometimes more useful than the pre-trained model, which Christiano also says; it's only in some contexts that this behavior is worse. I'm not sure it's really better than top-p sampling, though. But RLHF models do seem pretty useful.

2

Chabamaster t1_jdxaqdd wrote

The fact that people call wrong answers a hallucination now seems very weird to me because it sounds like a marketing term to make the model seem smarter/conscious

6

gBoostedMachinations t1_jdxh438 wrote

Confabulate. It confabulates. It doesn’t hallucinate. I can’t believe “hallucinate” is the word that stuck lol. Usually computer scientists and tech bros are cleverer than this.

4

Rioghasarig t1_jdxrp3y wrote

No, they were right about the base model of GPT, as the base model was trained simply to predict the next word. ChatGPT and GPT-4 have evolved beyond that (with things like RLHF).

2

Rioghasarig t1_jdxs956 wrote

I really don't think your experiment makes much sense. Even if we could determine the confidence level of GPT, there's no reason to believe asking it for its confidence level is an effective way of determining the actual confidence. As other people have asked, the obvious question is "what's your confidence in these confidence reports?" The logic is baseless.

2

ypxkap t1_jdxwirl wrote

the bing chat thing is interesting because it can’t seem to tell when it can’t see the whole page, e.g. if you ask it “what’s the last line of this webpage” you’ll get some line x words in (usually ~1100 words for me but it’s been a while since i checked). if you then send text from after the “last sentence”, it will act like it’s been looking at it the whole time, but as far as i can tell it has no capacity to notice the text otherwise. i asked it to summarize a chat log txt file i had loaded into edge and it included in the summary that there was an advertisement for an iphone 14 and also that “user threatened to harm the AI”, neither of which were present in the text file. that gives me the impression that it’s seeing something completely different from what edge is displaying, something that also includes instructions on how to respond in some scenarios, including being threatened?

2

WarAndGeese t1_jdy5z29 wrote

I'll call them applications rather than neural networks or LLMs for simplicity.

The first application is just what OP is doing and what people are talking about in this thread, that is, asking for sources.

The second application has access to research paper databases, through some API presumably. For each reference that the first application outputs, the second application queries it against the databases. If it gets a match, it returns a success. If it does not find the paper (this could be because it doesn't exist or because the title was too different from that of a real paper; either case is reasonable), it outputs that it was not found. For each paper that was not found, it outputs "This paper does not exist, please correct your citation". That output is then fed back into the first application.

Now, this second application could be a sort of database query or it could just consist of a second neural network being asked "Does this paper exist?". The former might work better but the latter would also work.

The separation is for simplicity's sake; I guess you could have one neural network doing both things. As long as each call to the neural network is well defined it doesn't really matter. The neural network wouldn't have memory between calls, so functionally it should be the same. Nevertheless I say two in the same way that you can have two microservices running on a web application. It can be easier to maintain and just easier to think about.
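A minimal version of that second application as a plain database query could look like the sketch below, here against the public arXiv API (the cited titles are just examples; exact-phrase title search is only a heuristic, so combining it with fuzzy matching on the results would be more robust):

```python
# Minimal sketch of the "second application": for each title the first model
# cites, query the public arXiv API and report whether a matching entry exists.
# Titles below are examples, not output from a real run.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_has_title(title: str) -> bool:
    query = urllib.parse.urlencode(
        {"search_query": f'ti:"{title}"', "max_results": "5"}
    )
    with urllib.request.urlopen(f"{ARXIV_API}?{query}", timeout=10) as resp:
        feed = ET.parse(resp)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    titles = [" ".join((e.text or "").split())  # normalize whitespace in returned titles
              for e in feed.findall(".//atom:entry/atom:title", ns)]
    return any(title.lower() in t.lower() for t in titles)

for cited in ["Adam: A Method for Stochastic Optimization",
              "A Totally Made Up Paper About Gradient Descent"]:
    status = "found" if arxiv_has_title(cited) else "not found; ask the model to correct it"
    print(f"{cited!r}: {status}")
```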

2

astrange t1_jdy6d4f wrote

But nobody uses the base model, and when they did use it, it was only interesting because it fails to predict the next word and therefore generates new text. A model that successfully predicts the next word all the time given existing text would be overfitting, since it would only produce things you already have.

1

tt54l32v t1_jdyc1h3 wrote

So the second app might fare better leaning towards a search engine instead of an LLM, but some LLM involvement would ultimately be better, to allow for less precise matches of the specific set of searched words.

Seems like the faster and more seamless one could make this, the closer we get to AGI. To create and think, it almost needs to hallucinate and then check for accuracy. Is any of this already taking place in any models?

1

bartvanh t1_jdyd6om wrote

Ugh, yes it's so frustrating to see people not realizing this bit all the time. And also kind of painful to imagine that (presumably - correct me if I'm wrong) all those internal "thoughts" are probably discarded after each word, only to be painstakingly reconstructed almost identically for predicting the next word.

3

xcviij t1_jdyl914 wrote

It's simply designed to give you the best response. Whether that be real or fake, it's incredible at understanding things and responding.

1

sineiraetstudio t1_jdymf8q wrote

Oh, RLHF absolutely has all sorts of benefits (playing with top-p only makes answers more consistent - but sometimes you want to optimize for something different than "most likely"), so it's definitely here to stay (for now?), it's just not purely positive. Ideally we'd have a RLHF version that's still well calibrated (or even better, some way to determine confidence without looking at logits that also works with chain of thought prompting).

2

gembancud t1_jdyxpsg wrote

On a more constructive note, using agents or plugins to interface with an actual repository of information is probably the most efficient way to interleave information retrieval and fact-checking.

1

Rioghasarig t1_jdz24za wrote

People were using the base model when it first came out, and some people are still using it today. The game AI Dungeon still runs on what is essentially a transformer trained on next-token prediction. So it would be accurate to say "it just (attempts to) output the next most probable word".

1

quantic-dream t1_jdz2gq9 wrote

I'm a noob in ML, but as I understand it, GPT generates one word at a time. Could it be that, for example, one particular word somewhere in the middle was generated with a confidence of 0.1, the highest GPT could get at that point, and everything after that word becomes a hallucination?

1

probablynotmine t1_jdz84jf wrote

Sounds like a conspiracy theorist answer: "this is the scientific proof/source, and it might or might not exist"

1

ntaylor- t1_je11iqf wrote

But eventually, after RLHF, GPT-4 is one final fixed model and still presumably uses a generate function that predicts next tokens based on the previous ones, as base GPT models/any autoregressive model does. At least that's what it seems to be doing.

1

ntaylor- t1_je11vt1 wrote

Fairly sure the "final" GPT-4 model is still using a generate function that predicts one token at a time. Just the training was good and complicated via RLHF. After training it's not doing any "complicated operations".

1

NigroqueSimillima t1_je2l4j3 wrote

It absolutely has a concept of right or wrong. Ask it basic true or false questions and it will get them right most of the time.

In fact I asked it for grammar mistakes in your post and it noticed you used the incorrect form of "its" in your 3rd paragraph, and used "anyways" when it should be "anyway".

Seems like it knows right from wrong.

>It doesn't reason between sources.

It doesn't have access to sources; it only has access to its own memory.

This is like if you asked me a question and I answered correctly, then you asked for sources and I tried to remember where I got it from. I could tell you sources that I think are right but are actually wrong due to my own memory degradation. Human memory is also very unreliable, but they're very good at making up things that "sound" like they could be right to them.

People "hallucinate" facts all the time.

0

was_der_Fall_ist t1_je3ng6m wrote

Maybe that’s part of the benefit of using looped internal monologue/action systems. By having them iteratively store thoughts and actions in their context window, they no longer have to use the weights of the neural network to “re-think” every thought each time they predict a token. They could think more effectively by using their computation to do other operations that take the internal thoughts and actions as their basis.

1

was_der_Fall_ist t1_je6lfl9 wrote

Why are matrix multiplications mutually exclusive with complicated operations?

A computer just goes through a big series of 0s and 1s, yet through layers of abstraction they accomplish amazing things far more complicated than a naive person would think 0s and 1s could represent and do. Why not the same for a massive neural network trained via gradient descent to maximize a goal by means of matrix multiplication?

1