TheBigFeIIa t1_j8vb4qa wrote

An error being “sticky” is a great way to put it as far as the modeling goes. It gets at a more fundamental problem: the reward structure does not optimize for objective truth, and instead rewards responses that are plausible or pleasing but not necessarily factual.

I do wonder if there is any way to generate a confidence estimate alongside answers, and to allow “I don’t know.” as a valid response when confidence is low. In some cases a truthful acknowledgement that there is no answer may be more useful than a made-up response.
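
To make the "I don't know" idea concrete, here is a rough sketch of what abstaining below a confidence cutoff could look like. It assumes you already have some probability attached to each candidate answer, which is the hard part and is not solved here. The function name, the 0.6 threshold, and the example numbers are all made up for illustration.

```python
def answer_or_abstain(candidates, threshold=0.6):
    """Return the most probable answer, or "I don't know." if confidence is low.

    `candidates` maps each candidate answer to an estimated probability.
    Both the mapping and the threshold are hypothetical inputs.
    """
    if not candidates:
        return "I don't know."
    # Pick the candidate with the highest estimated probability.
    best, confidence = max(candidates.items(), key=lambda kv: kv[1])
    return best if confidence >= threshold else "I don't know."


# One clear winner, so the answer is returned.
confident = answer_or_abstain({"Paris": 0.92, "Lyon": 0.05})

# Probability mass is spread out, so the function abstains.
unsure = answer_or_abstain({"1947": 0.38, "1949": 0.35, "1951": 0.27})
```

Where the cutoff sits is a judgment call: set it higher and you get fewer made-up answers but more refusals.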

3

gurenkagurenda t1_j8voslg wrote

Log probabilities are the actual output of the model (although what those probabilities directly mean once you're using reinforcement learning seems sort of nebulous), and I wonder if uncertainty about actual facts is reflected in lower probabilities in the top scoring tokens. If so, you could imagine encoding the scores in the actual output (ultimately hidden from the user), so that the model can keep track of its past uncertainty. You could imagine that with training, it might be able to interpret what those low scoring tokens imply, from "I'm not sure I'm using this word correctly" to "this one piece might be mistaken" to "this one piece might be wrong, and if so, everything after it is wrong".
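
As a toy illustration of encoding the scores in the output, something like the sketch below would tag each low-probability token with its probability. The tokens and log probabilities are invented by hand, not real model output, and the `[p=...]` marker format and the 0.5 cutoff are arbitrary assumptions. Whether a low token probability actually tracks factual uncertainty is exactly the open question.

```python
import math


def annotate_uncertainty(tokens, logprobs, threshold=0.5):
    """Tag tokens whose probability falls below `threshold`.

    `tokens` and `logprobs` are parallel lists; each log probability is the
    natural log of the probability the model gave the token it emitted.
    """
    annotated = []
    for token, lp in zip(tokens, logprobs):
        p = math.exp(lp)  # convert log probability back to a probability
        if p < threshold:
            annotated.append(f"{token}[p={p:.2f}]")
        else:
            annotated.append(token)
    return " ".join(annotated)


# Hand-written example values: only the final token is low confidence.
tokens = ["The", "treaty", "was", "signed", "in", "1847"]
logprobs = [-0.01, -0.05, -0.02, -0.10, -0.03, -1.90]

marked = annotate_uncertainty(tokens, logprobs)
print(marked)  # The treaty was signed in 1847[p=0.15]
```

The annotated string is the sort of thing that could be fed back into the context and hidden from the user, leaving the model to learn what the markers imply.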

2