
sineiraetstudio t1_jdymf8q wrote

Oh, RLHF absolutely has all sorts of benefits (playing with top-p only makes answers more consistent, but sometimes you want to optimize for something other than "most likely"), so it's definitely here to stay (for now?). It's just not purely positive. Ideally we'd have an RLHF version that's still well calibrated (or, even better, some way to determine confidence without looking at logits that also works with chain-of-thought prompting).
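To make "well calibrated" concrete, here's a rough sketch of expected calibration error (ECE), one common way to measure it: bucket answers by stated confidence and compare each bucket's average confidence to its actual accuracy. The function name, the 10-bin default, and the toy numbers are my own assumptions for illustration, not results from any real model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], the model's stated probability of being right
    correct:     bools, whether each answer was actually right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many samples it holds.
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data (made up): both "models" get 4 of 5 answers right.
outcomes = [True, True, True, True, False]
calibrated = expected_calibration_error([0.8] * 5, outcomes)
overconfident = expected_calibration_error([0.99] * 5, outcomes)
```

With these toy inputs the model that says 0.8 scores close to zero, while the one that says 0.99 is off by roughly the gap between 0.99 and its 80% accuracy. Note that this needs a per-answer confidence number from somewhere, which is exactly the hard part if you can't (or don't want to) read it off the logits.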
