Submitted by mettle t3_10oyllu in MachineLearning
The general consensus seems to be that large language models, and ChatGPT in particular, have a problem with accuracy and hallucination. Compared to what is often unclear, but let's say compared to other NLP approaches to question answering and language understanding, or to Google Search.
I haven't really been able to find any reliable sources documenting this accuracy problem, though.
The SuperGLUE benchmark has GPT-3 ranked #24: not terrible, but outperformed by older models like T5, which seems odd. On GLUE, nothing. On SQuAD, nothing.
So, I'm curious:
- Is there any benchmark or metric reflecting the seeming step-function improvement made by ChatGPT that's got everyone so excited? I definitely feel like there's a difference between GPT-3 and ChatGPT, but is it measurable, or is it just vibes?
- Is there any metric showing ChatGPT's problem with fact hallucination and accuracy?
- Am I off the mark here looking at question-answering benchmarks as an assessment of LLMs?
Thanks
Jean-Porte t1_j6hif9e wrote
T5 is fine-tuned on supervised classification and trained to output labels. That's why it outperforms GPT-3.
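To make that concrete, here's a rough sketch of the two output styles. The checkpoint names (`t5-small`, `gpt2`) are just small stand-ins for the leaderboard models, and the prompts are illustrative:

```python
from transformers import pipeline

# The original T5 checkpoints were multi-task trained with GLUE task
# prefixes, so they emit the label string directly.
nli = pipeline("text2text-generation", model="t5-small")
out = nli("mnli hypothesis: A person is making music. "
          "premise: A man is playing guitar.")
print(out[0]["generated_text"])  # typically "entailment" -- directly scorable

# A purely generative LM answers in free form, which a GLUE-style
# scorer can't grade without an extra parsing step.
lm = pipeline("text-generation", model="gpt2")
out = lm("Premise: A man is playing guitar. Hypothesis: A person is "
         "making music. True, false, or undetermined? Answer:",
         max_new_tokens=10)
print(out[0]["generated_text"])  # free-form continuation, not a clean label
```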
Generative models are not as good as discriminative models at discriminative tasks. A carefully tuned DeBERTa is probably better than ChatGPT, but ChatGPT has a user-friendly text interface. And GLUE-style evaluation is not charitable to ChatGPT's capabilities: the model might internally store the answer, yet its free-form output can be misaligned with the format the benchmark expects.
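You can see the misalignment from the scoring side. A hedged sketch (this label-extraction heuristic is made up for illustration, not any benchmark's official scorer):

```python
import re

LABELS = ("entailment", "neutral", "contradiction")

def extract_label(reply: str) -> str | None:
    """Map a free-form chat reply onto a fixed label set, if possible."""
    text = reply.lower()
    for label in LABELS:
        if re.search(rf"\b{label}\b", text):
            return label
    return None  # no recognisable label -> scored as wrong

print(extract_label("I'd call that a clear entailment."))              # "entailment"
print(extract_label("Yes, the hypothesis follows from the premise."))  # None
```

A correct-but-chatty reply maps to None and gets zero credit, which drags the benchmark number down without the model actually knowing any less.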
I always wonder why we don't try to scale up discriminative models. DeBERTa-xxlarge is "only" 1.3B parameters, and it outperforms T5 13B.
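The mechanics of scaling one up are straightforward; the obstacle is the pre-training compute. A toy sketch (the config numbers here are made up, and a real scale-up would need a matching pre-training run):

```python
from transformers import DebertaV2Config, DebertaV2ForSequenceClassification

# Hypothetical scaled-up config: raise these numbers to actually scale.
# Instantiation allocates real memory, so this is kept laptop-sized.
config = DebertaV2Config(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    num_labels=3,  # e.g. an MNLI-style head
)
model = DebertaV2ForSequenceClassification(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 0.4B with these toy sizes
```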