
velcher t1_j4ts9n0 wrote

Disclaimer: I'm a deep RL person, so I'm speaking from a pure RL viewpoint. I have never trained an LLM with RLHF (yet ;) ).

You can think of rewards as a way of expressing preferences to the model. Then you can reason about what types of rewards to use.

- Binary: either the output is good or bad. There is no preference among the good outputs (they are all 1) or among the bad outputs (they are all 0).
- Scale of 1-5: there are five preference levels in increasing order. In particular, the rank-1 choice is worth exactly 1 more reward (see the aside below for what the raw value does) than the rank-2 choice.
- Ranking 4 different model outputs: not sure what you mean here.

A rough sketch of how these feedback types map to scalar rewards is below.
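To make the distinction concrete, here is a minimal sketch (mine, not from any RLHF codebase); the function names and the rank-to-score convention are illustrative assumptions:

```python
def binary_reward(is_good: bool) -> float:
    # Binary feedback: every good output gets the same reward,
    # so the policy sees no preference among good (or among bad) outputs.
    return 1.0 if is_good else 0.0

def likert_reward(rating: int) -> float:
    # 1-5 scale: five preference levels, each exactly one reward unit apart.
    assert 1 <= rating <= 5
    return float(rating)

def ranking_rewards(ranked_outputs: list[str]) -> dict[str, float]:
    # Ranking k outputs: best gets k-1, ..., worst gets 0. This is just one
    # convention; in RLHF, rankings are more often used to fit a separate
    # reward model from pairwise comparisons than consumed directly as rewards.
    k = len(ranked_outputs)
    return {out: float(k - 1 - i) for i, out in enumerate(ranked_outputs)}
```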

Aside: Reward scale can affect the RL process. RL policies are commonly trained through something called the "policy gradient", which weights the policy update by the scale of the return (sum of rewards). So the larger your reward scale, the larger this gradient. Rewards that are too large can blow up the gradient and lead to an unstable policy; rewards that are too small produce tiny gradients and therefore slow-to-converge policies. The effect of reward scale can be counteracted by the learning rate or by reward normalization, but all of this needs to be tuned for the specific task.
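As a concrete illustration of the scaling issue, here is a minimal REINFORCE-style sketch (assumed tensor shapes, not from any actual RLHF setup): the policy-gradient loss is weighted by the return, so multiplying all rewards by a constant multiplies the gradient by the same constant, and normalizing returns is one standard way to remove that dependence.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi(a_t | s_t) for the actions taken, shape [T]
    # returns:   sum of (discounted) rewards from step t onward, shape [T]
    # The update is weighted by the return, so the gradient scales with the reward scale.
    return -(log_probs * returns).mean()

def normalize_returns(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardizing returns removes the raw reward scale from the gradient,
    # avoiding a learning-rate re-tune for every new reward scheme.
    return (returns - returns.mean()) / (returns.std() + eps)
```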

Reward scaling can also interact with your RL algorithm, particularly if it uses an entropy term to encourage exploration (e.g. SAC's entropy bonus, or the entropy coefficient in PPO): the entropy term does not scale with the rewards, so changing the reward scale changes how much exploration it effectively buys.
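For example (again just a sketch; the 0.01 coefficient is a made-up default), the policy term below grows with the reward scale while the entropy term does not:

```python
import torch

def loss_with_entropy_bonus(log_probs: torch.Tensor,
                            returns: torch.Tensor,
                            entropy: torch.Tensor,
                            entropy_coef: float = 0.01) -> torch.Tensor:
    policy_loss = -(log_probs * returns).mean()    # scales with the rewards
    entropy_bonus = entropy_coef * entropy.mean()  # does not scale with the rewards
    return policy_loss - entropy_bonus             # subtracting the bonus encourages exploration
```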


JClub OP t1_j4uc0bg wrote

PPO's clipped objective generally keeps the gradient updates smaller than those of other RL algorithms. I get that the reward measures human preference, but that doesn't answer my question 🤔: what kinds of rewards work best with PPO?
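For reference, here is a bare-bones sketch of the standard PPO clipped surrogate I'm referring to (not tied to any particular implementation); the clip keeps the probability ratio near 1, which is why the updates stay conservative:

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # clipping bounds the update size
```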
