was_der_Fall_ist t1_jdwz4qw wrote

I think you make a good point. We probably need better methods of post-training LLMs. But the current regime does still seem to be sometimes more useful than the pre-trained model, which Christiano also says; it's only in some contexts that this behavior is worse. I'm not sure it's really better than top-p sampling, though. But RLHF models do seem pretty useful.
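For readers unfamiliar with top-p (nucleus) sampling, here is a minimal sketch of the filtering step, assuming a toy next-token distribution. The function name `top_p_filter` and the example tokens are made up for illustration and are not taken from any particular library.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose total
    probability reaches p, then renormalize over that set."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / total for token, prob in kept.items()}


# Hypothetical next-token distribution, for illustration only.
next_token = {"yes": 0.5, "no": 0.3, "maybe": 0.15, "banana": 0.05}
filtered = top_p_filter(next_token, p=0.75)
print(filtered)
```

Lowering `p` only trims the unlikely tail of whatever distribution the model already has; it does not change which answers the model considers likely in the first place.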

2

sineiraetstudio t1_jdymf8q wrote

Oh, RLHF absolutely has all sorts of benefits (playing with top-p only makes answers more consistent, but sometimes you want to optimize for something other than "most likely"), so it's definitely here to stay (for now?). It's just not purely positive. Ideally we'd have an RLHF version that's still well calibrated (or, even better, some way to determine confidence that doesn't require looking at logits and also works with chain-of-thought prompting).
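As a rough illustration of what "well calibrated" means here, the sketch below computes a simple binned expected calibration error (ECE): the gap between how confident a model says it is and how often it is actually right. The function name, the bin count, and the toy data are assumptions for illustration, not measurements from any real model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy|
    over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: both "models" claim 90% confidence on ten answers.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(calibrated, overconfident)
```

A model that says 90% and is right about 9 times out of 10 scores near zero, while one that says 90% and is right only half the time scores much higher. The confidence values can come from anywhere, including a number the model states in its answer rather than its logits.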

2