koolaidman123 t1_j6wtmdj wrote
- Outputs are not ranked 1-5; they're compared two at a time, head to head, and the RM predicts which one humans favor
- Empirically, they found RL outperformed supervised fine-tuning (SFT) on human evaluations, meaning humans generally preferred the RLHF model over the SFT model. The SFT model was fine-tuned on the top-ranked answer
As to why RL outperforms SFT, not a lot of orgs have the resources to test this (yet). I've heard a plausible theory from AI2 that the main difference is that SFT uses a token-level loss, whereas the RL loss applies to the entire sequence, so maybe instead of RL being "better", it's that the next-token prediction objective is worse.

Researchers I've spoken with don't believe RL is the critical component enabling these models, and think we could eventually discover the right training regime that lets SFT perform on par with (or better than) RL.
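To make the token-level vs. sequence-level distinction concrete, here is a minimal PyTorch-style sketch. The shapes, names, and the plain REINFORCE-style objective are simplifying assumptions for illustration (InstructGPT actually uses PPO with a learned reward model and a KL penalty), not anyone's real training code:

```python
import torch
import torch.nn.functional as F

# logits: (batch, seq_len, vocab) from the language model
# targets / sampled_tokens: (batch, seq_len) token ids
# reward: (batch,) one scalar reward per full completion

def sft_token_loss(logits, targets):
    # Supervised fine-tuning: cross-entropy applied to every next-token
    # prediction independently (credit assignment is per token).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def sequence_level_rl_loss(logits, sampled_tokens, reward):
    # REINFORCE-style objective: a single scalar reward for the whole sampled
    # sequence scales the log-probability of that entire sequence.
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    seq_logp = token_logp.sum(dim=-1)                                       # (batch,)
    return -(reward * seq_logp).mean()
```

The point is only that SFT assigns credit per token, while the RL objective ties one scalar reward to the whole sampled completion.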
alpha-meta OP t1_j6wvgbr wrote
Thanks for the response! I just double-checked the InstructGPT paper and you were right regarding the rankings -- they are pairwise, and I am not sure why I thought otherwise.
Regarding the updates at the sentence level, that makes sense. That would also make it more of a discrete problem, which you probably can't backpropagate through (otherwise, you would be back to token-level).
was_der_Fall_ist t1_j6xz6wj wrote
ChatGPT had labelers rank outputs from best to worst, not head to head. (Different from InstructGPT, maybe?)
“A prompt and several outputs are generated. A labeler ranks the outputs from best to worst.”
koolaidman123 t1_j6y07he wrote
Have you even read the InstructGPT paper?
> In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
>
> In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss. Specifically, the loss function for the reward model is:
>
> $$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right] \tag{1}$$
>
> where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons.
You know that figure you're referencing comes from the InstructGPT paper... right?
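For what it's worth, that loss is simple to write down. A minimal PyTorch sketch, treating all $\binom{K}{2}$ comparisons from one prompt as a single batch element (the function and variable names are my own, not OpenAI's):

```python
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards, ranking):
    """
    rewards: (K,) tensor of scalar RM outputs r_theta(x, y_k), one forward pass
             per completion for a single prompt x.
    ranking: the K completion indices ordered best-to-worst by the labeler.
    Implements loss = -1/C(K,2) * sum over pairs of log sigmoid(r_winner - r_loser).
    """
    pairs = list(itertools.combinations(ranking, 2))  # earlier index in the ranking wins
    winners = torch.tensor([w for w, _ in pairs])
    losers = torch.tensor([l for _, l in pairs])
    return -F.logsigmoid(rewards[winners] - rewards[losers]).mean()
```

Because the K scalar rewards are computed once and reused across every pair, each completion needs only one forward pass through the RM, which is the efficiency point the quoted passage makes.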
[deleted] t1_j6x0948 wrote
[deleted]
koolaidman123 t1_j6x2b05 wrote
Sure, you can have multiple ways of ranking, but:
- the InstructGPT paper strictly uses pairwise ranking
- asking annotators to rank however many passages 1 to K in one shot is much more difficult and subject to noise than asking for pairwise comparisons
crt09 t1_j6y5x4t wrote
This paper seems very relevant: https://arxiv.org/abs/2205.13636. I haven't read it closely enough to give strong opinions with confidence, but it seems to beat PPO with a token-level loss that works similarly to the Upside-Down Reinforcement Learning paper: you give a target reward between 1 and 5 as an input token before the prompt and train the model, with the standard LM loss on an existing target output labeled with that 1-5 reward rank, to produce a response of the corresponding quality. Then during inference you just prepend 1 to the prompt and it outputs a high-quality response.
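If I'm reading that right, the objective is basically reward-conditioned language modeling. A hypothetical sketch, assuming a HuggingFace-style model/tokenizer and made-up `<rank_k>` control tokens (none of this is the paper's actual code):

```python
import torch
import torch.nn.functional as F

# Hypothetical special tokens <rank_1> ... <rank_5>, with <rank_1> marking the
# best-ranked responses, assumed to have been added to the tokenizer/vocab.

def reward_conditioned_loss(model, tokenizer, prompt, response, rank):
    # Training: prepend the rank token the response actually received, then
    # apply the ordinary next-token LM loss over the sequence.
    text = f"<rank_{rank}> {prompt} {response}"
    ids = torch.tensor([tokenizer.encode(text)])
    logits = model(ids).logits[:, :-1, :]   # predict token t+1 from tokens up to t
    targets = ids[:, 1:]
    # (In practice you would likely mask the rank/prompt tokens out of the loss.)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def generate_best(model, tokenizer, prompt, max_new_tokens=128):
    # Inference: condition on the best rank token to ask for a high-quality response.
    ids = torch.tensor([tokenizer.encode(f"<rank_1> {prompt}")])
    return model.generate(ids, max_new_tokens=max_new_tokens)
```

The appeal is that this keeps the ordinary token-level LM loss while still injecting the human preference signal, just through conditioning rather than through a PPO update.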
mtocrat t1_j6zin88 wrote
Supervised fine-tuning seems inherently limited here. You regress toward the best answer in the set, but that's it. RLHF can improve beyond that, up to the point where the generalization capabilities of the reward model fail.