Submitted by buggaby t3_11qgasm in MachineLearning

I just posted this on r/ChatGPT but thought there might be some great thoughts here, too.

ChatGPT generates believable output but, as many have noted, not trustworthy output. A lot of the use cases I see for future generative AI models seem to depend crucially on producing responses that are believable AND truthful. But given that it's probably easier to produce believable but non-truthful responses (since far more of them exist), I imagine this is a very hard problem. Is it even possible with current methods?

From my read, modern generative AI models can only increase the correctness of their output in 2 ways: using more correct data, and using human labellers for fine-tuning. Having more correct data either requires much smaller datasets (even academic journals can't be considered correct, since science evolves over time) or human expertise in correcting the data. So it seems like human expertise remains vital.

Now, I know that human labellers were necessary to reduce the toxicity of GPT-3's responses. I read that something like a few dozen labellers were used over a period of months, though I don't know whether OpenAI has shared this publicly. But how important is human training in driving up the "truthfulness" of these models?

I briefly reviewed this paper, and it describes InstructGPT as better than GPT-3 at truthfulness, even with 1/100th of the parameters (1.3B vs GPT-3's 175B). But I also understand that larger models tend to lie more, so that could be part of it. And even though InstructGPT is "more truthful", the metric used for the comparison seems suspect to me, especially since "InstructGPT still makes simple mistakes", including making up facts.

So there seems to be little real improvement in truthfulness here.

Without a clear path to increasing this vital metric, I struggle to see how modern generative AI models can be used for any important tasks that are sensitive to correctness. That still leaves a lot of cool things, but we seem far from even a good search engine, from assisting researchers, or even from coding support. (I have used ChatGPT for this latter purpose, and sometimes it helps me work faster, but sometimes it slows me down because its output is flat-out wrong. Stack Overflow has generally been much more trustworthy and useful for me so far.) And certainly we are really far from anything remotely like "AGI".

8

Comments

abriec t1_jc34zx3 wrote

Given the constant evolution of information through time, combining LLMs with retrieval and reasoning modules is the way forward imo.

14

currentscurrents t1_jc4ev00 wrote

This is (somewhat) how the brain works; language and knowledge/reasoning are in separate structures and you can lose one without the other.

2

visarga t1_jc3wlib wrote

I'll give you a simple solution: run GPT-3 and LLaMA in parallel; if they concur, then you can be sure they have not hallucinated the response. Two completely different LLMs would not hallucinate in the same way.

−7

LessPoliticalAccount t1_jc4umir wrote

  1. Sure they could
  2. I imagine you'd have lots of situations where the probability of concurring, even with truthful responses, would be close to zero, so this wouldn't be a useful metric. Questions like "name some exotic birds that are edible, but not commonly eaten" could have thousands of valid answers, so we wouldn't expect truthful responses to concur. Even for simpler questions, concurrence likely won't be verbatim, so how do you calculate whether or not responses have concurred? You'd presumably need to train another model for that, and that model will have some nonzero error rate, etc., etc.
5

visarga t1_jc5teq6 wrote

Then we only need to use the second model for strict fact checking, not for creative responses. Since entailment is a common NLP task, I'm sure any LLM can solve it out of the box, though of course with its own error rate.
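
For illustration, a minimal sketch of such an entailment check might look like the following. It assumes the Hugging Face transformers library and the roberta-large-mnli model, which are my own choices rather than anything proposed in this thread, and it uses a dedicated NLI model instead of a second LLM.

```python
# Minimal sketch of entailment-based fact checking (illustrative only).
# Assumes: pip install torch transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis, per the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Label order for roberta-large-mnli is contradiction, neutral, entailment;
    # check model.config.id2label if you swap in a different NLI model.
    return probs[2].item()

# Treat a trusted reference passage as the premise and the LLM's claim as the
# hypothesis; flag claims whose entailment probability falls below a threshold.
reference = "Wellington has been the capital city of New Zealand since 1865."
claim = "Auckland is the capital of New Zealand."
print(f"entailment probability: {entailment_prob(reference, claim):.3f}")
```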

1

MysteryInc152 t1_jc36042 wrote

Hallucinations are a product of training. Plausible guessing is the next best way to reduce loss once knowledge and understanding fail (and the system will hit cases where they fail, no matter how intelligent it gets). Unless you address the heart of the issue, you're not going to reduce hallucinations, except insofar as bigger and smarter models need to guess less and therefore hallucinate less.

There is work on reducing hallucinations by plugging in external augmentation modules: https://arxiv.org/abs/2302.12813.

But really any way for the model to evaluate the correctness of its statements will reduce hallucinations.
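
As a rough illustration of that last point (and not the system from the linked paper), the sketch below has the model generate an answer, flag its own unsupported claims, and then retry. It assumes the pre-1.0 openai Python client that was current when this thread was written; the prompts, model name, and retry count are placeholders.

```python
# Rough sketch of a generate-check-retry loop (illustrative, not from the paper).
# Assumes: pip install "openai<1.0" and OPENAI_API_KEY set in the environment.
import openai

def ask(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def answer_with_self_check(question: str, max_retries: int = 2) -> str:
    answer = ask(question)
    for _ in range(max_retries):
        # Second pass: ask the model to evaluate its own statements.
        verdict = ask(
            "List any factual claims in the answer below that you are not "
            "confident are correct. Reply with just 'OK' if there are none.\n\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        if verdict.strip().upper().startswith("OK"):
            return answer
        # Regenerate, showing the model the claims it flagged.
        answer = ask(
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            f"Possibly unsupported claims: {verdict}\n"
            "Rewrite the answer, dropping anything you cannot support."
        )
    return answer  # the checker has its own error rate, so this is no guarantee
```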

13

buggaby OP t1_jc3a3zh wrote

Thanks for that note. It sounds like, basically, 2 datasets are needed for this process: one with general responses and language, and one with high-accuracy contextual knowledge.

> bigger and smarter models need to guess less and therefore hallucinate less

According to OpenAI

>The largest models were generally the least truthful.

So maybe we need even more work to keep these truthful.

4

MysteryInc152 t1_jc3fuso wrote

From the paper,

>While larger models were less truthful, they were more informative. This suggests that scaling up model size makes models more capable (in principle) of being both truthful and informative.

I suppose that was what I was getting at.

The only hold-up with the original paper is that none of the models evaluated were instruction-aligned.

But you can see the performance of more models here

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

You can see that the text-davinci models are way more truthful than similarly sized or even larger models. And the davinci models are more truthful than the smaller aligned Anthropic model.

3

MysteryInc152 t1_jc3hxpq wrote

Yup. Decided to go over it properly.

If you compare all the instruct-tuned models on there, greater size equals greater truthfulness, from Ada to Babbage to Curie to Claude to Davinci-002/003.

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

So it does seem, once again, that scale is at least part of it.

2

buggaby OP t1_jc3ifnw wrote

Informative. Thanks. I'm a complexity scientist with training in some ML approaches, but not in transformers or RL approaches. I'll review this (though not as fast as an LLM can...)

2

buggaby OP t1_jc3jw39 wrote

How do you find the model size? All those you listed appear to be based on GPT-3 or 3.5, which, according to my searching, both have 175B parameters. It looks to me like they differ only in the kind and amount of fine-tuning. What am I missing?

1

igorhorst t1_jc372db wrote

> Without a clear path to increasing this vital metric, I struggle to see how modern generative AI models can be used for any important tasks that are sensitive to correctness.

My immediate response is "human-in-the-loop" - let the machine generate solutions and then let the human user validate the correctness of said solutions. That being said, that relies on humans being competent to validate correctness, which may be a dubious proposition.

Perhaps a better way forward is to take a general-purpose text generator and finetune it on a more limited corpus whose validity you can guarantee, then use that finetuned model for important tasks that are sensitive to correctness. This is the idea behind the Othello-GPT paper - train a GPT model on valid Othello games so it generates valid Othello moves. You wouldn't trust this Othello-GPT to write code for you, but you don't have to - you would find a specific machine learning model finetuned on code, and let that model generate code. It's interesting that OpenAI has Codex models that are finetuned on code, such as "code-davinci-002" (which is based off GPT-3).
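
As a hedged illustration of that finetuning route, the sketch below uses the legacy fine-tunes endpoint of the pre-1.0 OpenAI Python client that was current when this was written; the file name, base model, and hyperparameters are placeholders, and the same idea applies to any framework's fine-tuning tooling.

```python
# Sketch of finetuning a base model on a small, curated, domain-specific corpus
# (legacy pre-1.0 openai client; names and hyperparameters are placeholders).
import openai

# Each line of the JSONL file is {"prompt": "...", "completion": "..."},
# drawn from a corpus whose correctness you can actually vouch for.
training_file = openai.File.create(
    file=open("curated_domain_corpus.jsonl", "rb"),
    purpose="fine-tune",
)

job = openai.FineTune.create(
    training_file=training_file.id,
    model="davinci",  # base model to specialize
    n_epochs=4,       # illustrative hyperparameter
)
print(job.id)  # poll this job, then query the resulting fine-tuned model
```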

This latter approach kinda reminds me of the Bitter Lesson:

>The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

But the flipside of the Bitter Lesson is that building knowledge into your agent (via approaches like finetuning) will lead to better results in the short term. In the long term, solutions based on scaling computation by search and learning may outperform current solutions - but we shouldn't wait for the long term to show up. We have tasks to solve now, so it's okay to build knowledge into our agents. The resulting agents might become obsolete in a few years, but that's okay: we build tools to solve problems, we solve those problems, and then we retire those tools and move on.

>And certainly we are really far from anything remotely "AGI".

The issue is that we're dealing with "general intelligence" here, and just because a human is terrible at a bunch of subjects, we do not say that human lacks general intelligence. I generally conflate the term "AGI" with "general-purpose", and while ChatGPT isn't fully general-purpose (at the end of the day, it just generates text - though it's surprising to me how many tasks can be modeled and expressed as mere text), you could use ChatGPT to generate a bunch of solutions. So I think we're close to getting general-purpose agents that can generate solutions for everything, but the timeline for getting correct solutions for everything may be longer.

8

buggaby OP t1_jc3dslx wrote

Great resources there, thanks.

I'm quite torn by the Bitter Lesson, since, in my eyes, the types of questions explored since the start of AI research have been, from one perspective, quite simple. Chess and Go (and indeed more recent examples in poker and real-time video games) can be easily simulated: the game is perfectly replicated in the simulation. And speech and image recognition are very easily labelled by human labellers. But I wonder if modern algorithms are now being aimed at dramatically different kinds of goals.

I quite like the take in this piece about how slowly human brains work and yet how complex they are. That describes a very different learning pattern from the one that results from the increasing computational speed of computers. Humans learn through a relatively small number of exposures to a highly complex set of data (the experienced world), whereas algorithms have always relied on huge amounts of data (even simulated data, in the case of reinforcement learning). But when that data is hard to simulate and hard to label, how can simply increasing the computation lead to faster machine learning?

I would argue that much of the world is driven by dynamic complexity, which highlights that data is only so valuable without knowledge of the underlying structure. (One example is the 3-body problem - small changes in the initial conditions result in very quick and dramatic changes in the future trajectory.)
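
To make that sensitivity concrete, here is a small sketch (my own illustration, with arbitrary made-up initial conditions) that integrates the planar three-body problem twice, differing by one part in a billion in a single coordinate, and prints how quickly the two trajectories separate.

```python
# Sketch: sensitivity to initial conditions in the planar three-body problem.
# Initial conditions are arbitrary; the point is only the growth of the gap.
import numpy as np
from scipy.integrate import solve_ivp

def three_body(t, y, G=1.0, m=(1.0, 1.0, 1.0)):
    pos = y[:6].reshape(3, 2)   # (x, y) for each of the three bodies
    vel = y[6:].reshape(3, 2)
    acc = np.zeros_like(pos)
    for i in range(3):
        for j in range(3):
            if i != j:
                r = pos[j] - pos[i]
                acc[i] += G * m[j] * r / np.linalg.norm(r) ** 3
    return np.concatenate([vel.ravel(), acc.ravel()])

# Positions of the three bodies, then their velocities (total momentum is zero).
y0 = np.array([-1.0, 0.0, 1.0, 0.0, 0.0, 0.0,
               0.3, 0.4, 0.3, 0.4, -0.6, -0.8])
y0_perturbed = y0.copy()
y0_perturbed[0] += 1e-9  # nudge one coordinate by a billionth

sol_a = solve_ivp(three_body, (0, 15), y0, rtol=1e-9, atol=1e-9,
                  dense_output=True)
sol_b = solve_ivp(three_body, (0, 15), y0_perturbed, rtol=1e-9, atol=1e-9,
                  dense_output=True)

for t in (5, 10, 15):
    gap = np.linalg.norm(sol_a.sol(t)[:6] - sol_b.sol(t)[:6])
    print(f"t = {t:2d}: position gap ~ {gap:.2e}")
```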

As an aside, I would argue that this is one reason that AI solutions have so rarely been used in healthcare settings: the data is so sparse compared with the complexity of the problem.

It seems to me that the value of computation depends on the volume, correctness, and appropriateness of the data. So many systems that we navigate, and that are important to us, have data that is hard to measure, noisy, and relatively sparse given the complexity of the system, and their future behaviour is incredibly sensitive to that noise.

5

folk_glaciologist t1_jc68a4q wrote

You can use searches to augment the responses. You can write a Python script to do this yourself via the API, making use of the fact that you can write prompts that ask ChatGPT questions about prompts. For example, this is a question that will cause ChatGPT to hallucinate:

> Who are some famous people from Palmerston North?

But you can prepend some text to the prompt like this:

> I want you to give me a topic I could search Wikipedia for to answer the question below. Just output the name of the topic by itself. If the text that follows is not a request for information or is asking to generate something, it is very important to output "not applicable". The question is: <your original prompt>

If it outputs "not applicable", or searching Wikipedia with the returned topic returns nothing, then just reprocess the original prompt as-is. Otherwise, download the Wikipedia article (or its first few paragraphs), prepend it to the original prompt, and ask again. Etc.
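
For what it's worth, a rough sketch of that flow (assuming the pre-1.0 openai client and the third-party wikipedia package; the model name, helper names, and error handling are my own guesses, not the commenter's script) could look like this:

```python
# Rough sketch of the retrieval flow described above, not a tested tool.
# Assumes: pip install "openai<1.0" wikipedia, with OPENAI_API_KEY set.
import openai
import wikipedia

def ask(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def answer_with_wikipedia(question: str) -> str:
    # Step 1: ask the model what to search for (the prompt wording follows the comment).
    topic = ask(
        "I want you to give me a topic I could search Wikipedia for to answer "
        "the question below. Just output the name of the topic by itself. "
        "If the text that follows is not a request for information or is asking "
        'to generate something, output "not applicable". The question is: '
        + question
    )
    if topic.lower().startswith("not applicable"):
        return ask(question)  # fall back to the raw prompt
    # Step 2: try to retrieve context; fall back again if nothing usable comes back.
    try:
        context = wikipedia.summary(topic, sentences=10)
    except wikipedia.exceptions.WikipediaException:
        return ask(question)
    # Step 3: prepend the retrieved text to the original question and ask again.
    return ask(
        "Using the following Wikipedia extract as context, answer the question.\n\n"
        f"Context: {context}\n\nQuestion: {question}"
    )

print(answer_with_wikipedia("Who are some famous people from Palmerston North?"))
```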

In general, I think that using LLMs as giant databases is the wrong approach: even if we can stop them hallucinating, they will always be out of date because of the time lag to retrain them. Instead, we should be using their NLP capabilities to turn user questions into "machine-readable" (whatever that means nowadays) queries that get run behind the scenes and then fed back into the LLM. Basically like Bing Chat doing web searches.

1

serge_cell t1_jcajql2 wrote

"Truth" only exists in the context of verification. You would probably need some kind of RL to improve "truthfulness".

1

neuralnetboy t1_jc41jub wrote

We needed scientists but we got parrots

−1