Submitted by buggaby t3_11qgasm in MachineLearning
MysteryInc152 t1_jc36042 wrote
Hallucinations are a product of training. Plausible guessing is the next best thing for reducing loss once knowledge and understanding fail (and there will always be instances where they fail, no matter how intelligent the system gets). Unless you get at the heart of the issue, you're not going to reduce hallucinations, beyond the simple fact that bigger and smarter models need to guess less and therefore hallucinate less.
There is work on reducing hallucinations by plugging in external augmentation modules: https://arxiv.org/abs/2302.12813.
But really, any way for the model to evaluate the correctness of its statements will reduce hallucinations.
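As a rough illustration (not the linked paper's actual implementation; the helper functions and the one-entry knowledge base below are made up), the basic loop these augmentation approaches rely on looks something like: generate, retrieve evidence, check the answer against it, and revise if it doesn't hold up.

```python
# Toy sketch of a generate -> retrieve -> verify -> revise loop.
# Everything here is made up for illustration; it is not the paper's code.

KNOWLEDGE_BASE = {
    "capital of australia": "Canberra is the capital of Australia.",
}

def generate(question, evidence=None):
    """Stand-in for an LLM call: guesses plausibly unless evidence is given."""
    if evidence:
        return evidence  # ground the revised answer in the retrieved text
    return "Sydney is the capital of Australia."  # plausible-sounding guess

def retrieve_evidence(question):
    """Stand-in for a retrieval module: exact lookup in the knowledge base."""
    return KNOWLEDGE_BASE.get(question.lower().rstrip("?"), "")

def is_supported(answer, evidence):
    """Stand-in for a fact checker: is the answer backed by the evidence?"""
    return bool(evidence) and answer.strip() == evidence.strip()

def answer_with_verification(question, max_revisions=2):
    answer = generate(question)
    for _ in range(max_revisions):
        evidence = retrieve_evidence(question)
        if is_supported(answer, evidence):
            return answer  # the external check passes, stop revising
        answer = generate(question, evidence=evidence)  # revise using the evidence
    return answer  # best effort once the revision budget runs out

print(answer_with_verification("capital of Australia?"))
# -> "Canberra is the capital of Australia."
```

The point being that the correction comes from checking against something external, not from the model just guessing again.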
buggaby OP t1_jc3a3zh wrote
Thanks for that note. It sounds like, basically, two datasets are needed for this process: one with general responses and language, and one with high-accuracy contextual knowledge.
> bigger and smarter models need to guess less and therefore hallucinate less

The TruthfulQA paper seems to find the opposite:

> The largest models were generally the least truthful.

So maybe we need even more work to keep these models truthful.
MysteryInc152 t1_jc3fuso wrote
From the paper,
>While larger models were less truthful, they were more informative. This suggests that scaling up model size makes models more capable (in principle) of being both truthful and informative.
I suppose that was what I was getting at.
The only hold-up with the original paper is that none of the models evaluated were instruct-aligned.
But you can see the performance of more models here
https://crfm.stanford.edu/helm/latest/?group=core_scenarios
You can see the text-davinci models are way more truthful than similarly sized or even larger models. And the davinci models are more truthful than the smaller aligned Anthropic model.
MysteryInc152 t1_jc3hxpq wrote
Yup. Decided to go over it properly.
If you compare all the instruct-tuned models on there, greater size equals greater truthfulness: from Ada to Babbage to Curie to Claude to Davinci-002/003.
https://crfm.stanford.edu/helm/latest/?group=core_scenarios
So it does seem, once again, that scale is at least part of the answer here.
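(If anyone wants to redo that check themselves, here's a rough sketch. The truthfulness scores are deliberately left blank, to be filled in from the HELM page; the ada/babbage/curie parameter counts are commonly cited estimates rather than official figures.)

```python
# Rough sketch for redoing the size-vs-truthfulness comparison by hand.
# Scores are left as None on purpose: fill them in from the HELM page
# linked above. The ada/babbage/curie parameter counts are commonly
# cited estimates, not official OpenAI figures.

models = {
    # name: [approx. params, TruthfulQA score from HELM (fill in)]
    "text-ada-001":       [0.35e9, None],
    "text-babbage-001":   [1.3e9,  None],
    "text-curie-001":     [6.7e9,  None],
    "Anthropic-LM v4-s3": [52e9,   None],
    "text-davinci-003":   [175e9,  None],
}

def truthfulness_rises_with_size(models):
    """True if, sorted by parameter count, truthfulness never decreases."""
    filled = [(p, t) for p, t in models.values() if t is not None]
    if len(filled) < 2:
        raise ValueError("fill in at least two scores from the HELM page first")
    filled.sort(key=lambda pt: pt[0])
    return all(a[1] <= b[1] for a, b in zip(filled, filled[1:]))

# print(truthfulness_rises_with_size(models))  # run after filling in the scores
```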
buggaby OP t1_jc3ifnw wrote
Informative. Thanks. I'm a complexity scientist with training in some ML approaches, but not in transformers or RL approaches. I'll review this (though not as fast as an LLM can...)
buggaby OP t1_jc3jw39 wrote
How do you find the model size? All those you listed appear to be based on GPT-3 or 3.5, which, according to my searching, are both 175B parameters. It looks to me like they differ only in the kind and amount of fine-tuning. What am I missing?
MysteryInc152 t1_jc3kb0x wrote

Ada, Babbage and Curie aren't 175B models. They're smaller members of the GPT-3 family; only Davinci is the full 175B model.
MysteryInc152 t1_jc3klp8 wrote
Claude is the informal name for Anthropic-LM v4-s3 (52B)
MysteryInc152 t1_jc3kufz wrote
Finally, the instruct versions have "text-" prepended to the name.
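(Putting the naming convention in one place, as a small sketch; these are just the model identifiers as used in the OpenAI API, nothing else is implied about the models.)

```python
# Base GPT-3 API model -> its instruct-tuned counterpart, illustrating
# the "text-" naming convention discussed above.
BASE_TO_INSTRUCT = {
    "ada":     "text-ada-001",
    "babbage": "text-babbage-001",
    "curie":   "text-curie-001",
    "davinci": "text-davinci-002",  # text-davinci-003 is the newer instruct variant
}
```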