bigblueboo t1_iznegx5 wrote
Reply to comment by FerretDude in [R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist
I’ve been wondering: why/how is it better to train a reward model on human preferences and do RL than to just do supervised fine-tuning on that human data? Is there an intuition, an empirical finding, or a logistical reason?