Submitted by robotphilanthropist t3_zh2u3k in MachineLearning
bigblueboo t1_iznegx5 wrote
Reply to comment by FerretDude in [R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist
I’ve been wondering: why/how is it better to train a reward model on human preferences and do RL than just doing supervised fine-tuning on that human data? Is there an intuition, empirical finding, or logistical reason?
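For concreteness, here's a minimal sketch (my own, just to frame the question) contrasting the two objectives I mean: token-level cross-entropy on human demonstrations for SFT vs. a Bradley-Terry-style pairwise loss for a reward model that RL then optimizes against. Names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    # Supervised fine-tuning: maximize likelihood of the human-written tokens.
    # logits: (batch, seq, vocab), target_ids: (batch, seq)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

def reward_model_loss(chosen_reward, rejected_reward):
    # Reward modelling: learn a scalar score from pairwise human preferences
    # (Bradley-Terry style); the policy is then trained with RL against this score.
    # chosen_reward / rejected_reward: (batch,) scores for the preferred and
    # dispreferred completion of the same prompt.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with random tensors, just to show the shapes involved.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(sft_loss(logits, targets))
print(reward_model_loss(torch.randn(2), torch.randn(2)))
```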