Submitted by mettle t3_10oyllu in MachineLearning
The general consensus seems to be that large language models, and ChatGPT in particular, have a problem with accuracy and hallucination. Compared to what is often unclear, but let's say compared to other NLP approaches to question answering and language understanding, or to Google Search.
I haven't really been able to find any reliable sources documenting this accuracy problem, though.
The SuperGLUE benchmark has GPT-3 ranked #24: not terrible, but outperformed by older models like T5, which seems odd. On GLUE, nothing. On SQuAD, nothing.
So, I'm curious:
- Is there any benchmark or metric reflecting the seeming step-function improvement made by ChatGPT that's got everyone so excited? I definitely feel like there's a difference between GPT-3 and ChatGPT, but is it measurable, or is it just vibes?
- Is there any metric showing ChatGPT's problem with fact hallucination and accuracy?
- Am I off the mark here looking at question-answering benchmarks as an assessment of LLMs?
Thanks
Jean-Porte t1_j6hif9e wrote
T5 is fine-tuned on supervised classification and trained to output labels. That's why it outperforms GPT-3.
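To make that concrete, here's a rough sketch of the two output styles. The checkpoint names (`t5-small`, `gpt2`) are just small stand-ins for the leaderboard models, and the prompts are illustrative:

```python
from transformers import pipeline

# The original T5 checkpoints were multi-task trained with GLUE task
# prefixes, so they emit the label string directly.
nli = pipeline("text2text-generation", model="t5-small")
out = nli("mnli hypothesis: A person is making music. "
          "premise: A man is playing guitar.")
print(out[0]["generated_text"])  # typically "entailment" -- directly scorable

# A purely generative LM answers in free form, which a GLUE-style
# scorer can't grade without an extra parsing step.
lm = pipeline("text-generation", model="gpt2")
out = lm("Premise: A man is playing guitar. Hypothesis: A person is "
         "making music. True, false, or undetermined? Answer:",
         max_new_tokens=10)
print(out[0]["generated_text"])  # free-form continuation, not a clean label
```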
Generative models are not as good as discriminative models at discriminative tasks. A carefully tuned DeBERTa is probably better than ChatGPT, but ChatGPT has a user-friendly text interface. And GLUE-style evaluation is not charitable to ChatGPT's capabilities: the model might internally store the answer, yet its free-form output can be misaligned with the format the benchmark expects.
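You can see the misalignment from the scoring side. A hedged sketch (this label-extraction heuristic is made up for illustration, not any benchmark's official scorer):

```python
import re

LABELS = ("entailment", "neutral", "contradiction")

def extract_label(reply: str) -> str | None:
    """Map a free-form chat reply onto a fixed label set, if possible."""
    text = reply.lower()
    for label in LABELS:
        if re.search(rf"\b{label}\b", text):
            return label
    return None  # no recognisable label -> scored as wrong

print(extract_label("I'd call that a clear entailment."))              # "entailment"
print(extract_label("Yes, the hypothesis follows from the premise."))  # None
```

A correct-but-chatty reply maps to None and gets zero credit, which drags the benchmark number down without the model actually knowing any less.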
I always wonder why we don't try to scale up discriminative models. DeBERTa-xxlarge is "only" 1.3B parameters, and it outperforms T5 13B.
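The mechanics of scaling one up are straightforward; the obstacle is the pre-training compute. A toy sketch (the config numbers here are made up, and a real scale-up would need a matching pre-training run):

```python
from transformers import DebertaV2Config, DebertaV2ForSequenceClassification

# Hypothetical scaled-up config: raise these numbers to actually scale.
# Instantiation allocates real memory, so this is kept laptop-sized.
config = DebertaV2Config(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    num_labels=3,  # e.g. an MNLI-style head
)
model = DebertaV2ForSequenceClassification(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 0.4B with these toy sizes
```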