bigblueboo t1_iznegx5 wrote
Reply to comment by FerretDude in [R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist
I’ve been wondering: why/how is it better to train a reward model on human preferences and do RL than to just do supervised fine-tuning on that human data? Is there an intuition, an empirical finding, or a logistical reason?