JClub OP t1_j4uc0bg wrote on January 18, 2023 at 8:41 AM

Reply to comment by velcher in [D] RLHF - What type of rewards to use? by JClub

PPO's clipped objective always keeps the gradient update smaller than in other RL algorithms. I get that the reward measures the human's preference, but that does not answer my question 🤔: what kind of rewards work best for PPO?
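
To make "PPO's formula" concrete, here is a minimal sketch of the per-sample clipped surrogate term, assuming the standard PPO-clip objective min(r·A, clip(r, 1−ε, 1+ε)·A). The function name `ppo_clipped_term` and the value `eps=0.2` are illustrative assumptions, not taken from any particular RLHF codebase.

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO-clip surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new-policy probability / old-policy probability for the action
    advantage -- advantage estimate (derived from the reward signal)
    eps       -- clip range; 0.2 is an assumed, illustrative value
    """
    # Clamp the probability ratio into [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # With a positive advantage, the term stops growing once ratio > 1 + eps.
    for ratio in (0.5, 1.0, 1.5, 3.0):
        print(ratio, ppo_clipped_term(ratio, advantage=2.0))
```

With a positive advantage, the term is capped at (1 + eps) times the advantage however large the ratio gets. The clip bounds how far the policy can move per update, but the reward scale still enters through the advantage.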