Submitted by JClub t3_10emf7a in MachineLearning
JClub OP t1_j4uc0bg wrote
Reply to comment by velcher in [D] RLHF - What type of rewards to use? by JClub
PPO's clipped objective keeps each gradient update smaller than in most other RL algorithms. I get that the reward measures the human's preference, but that doesn't answer my question 🤔: what kind of rewards work best with PPO?
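For reference, here's a minimal sketch of the clipped surrogate objective I mean, assuming a PyTorch-style setup (the function name and arguments are just for illustration, not any particular library's API):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss in the style of PPO (Schulman et al., 2017)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] bounds how far one update can move the policy.
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # The element-wise minimum is a pessimistic bound, so overly large
    # policy changes stop contributing extra gradient signal.
    return -torch.min(unclipped, clipped).mean()
```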