JClub OP t1_j4uc0bg wrote

PPO's clipped objective keeps each gradient update smaller than in most other RL algorithms. I get that the reward measures human preference, but that doesn't answer my question 🤔: what kind of rewards work best with PPO?
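
To make the "smaller update" point concrete, here is a rough sketch of the standard PPO clipped surrogate for a single sample. The function names and the `eps=0.2` default are my own illustrative choices, not taken from any particular library. It shows that once the probability ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favors, the gradient with respect to the ratio drops to zero:

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new_policy_prob / old_policy_prob for the sampled action
    advantage -- advantage estimate for that action
    eps       -- clip range (0.2 is only an illustrative default)
    """
    clipped_ratio = clip(ratio, 1.0 - eps, 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grad_wrt_ratio(ratio, advantage, eps=0.2):
    """(Sub)gradient of the clipped surrogate with respect to the ratio.

    It equals the advantage while the unclipped term is the active minimum,
    and 0 once the clipped (constant) term takes over.
    """
    clipped_ratio = clip(ratio, 1.0 - eps, 1.0 + eps)
    if ratio * advantage <= clipped_ratio * advantage:
        return advantage
    return 0.0


if __name__ == "__main__":
    for r, a in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
        print(r, a, ppo_clipped_objective(r, a), grad_wrt_ratio(r, a))
```

With a positive advantage and a ratio of 1.5, the clipped term is active, so that sample contributes no gradient. With a negative advantage and the same ratio, the unclipped term is still active and the update goes through.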