buzzbuzzimafuzz t1_j4u5jrz wrote

I think what OpenAI and Anthropic typically do is provide evaluators with two possible responses and have them select which one is better. If you use numerical ratings instead, it can be hard to calibrate them across labelers. From the original paper, "Deep Reinforcement Learning from Human Preferences" (2017):

> We ask the human to compare short video clips of the agent’s behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.

ChatGPT seems to be trained on a combination of expert-written examples and upvotes/downvotes on individual messages.
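
For concreteness, here's a minimal sketch (with made-up field names and text, not OpenAI's or Anthropic's actual schema) of what a single pairwise preference record could look like once a labeler has picked the better of two candidate responses:

```python
# Hypothetical pairwise preference record; field names and contents are illustrative.
comparison = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "chosen": "Plants catch sunlight with their leaves and use it to turn water and air into food.",
    "rejected": "Photosynthesis is the light-dependent reduction of CO2 via the Calvin cycle.",
}
```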

9

JClub OP t1_j4uc8lc wrote

Yes, that makes sense! But can you really combine thumbs-up/down feedback with a 1-5 scale, for example? Wouldn't it be even harder to make the two signals work together when training the model?

1

koolaidman123 t1_j4uuko0 wrote

chatgpt (assuming they use the same training as instructgpt) doesn't use a numerical scale; everything is a comparison between 2 (out of k) sampled outputs for a prompt, so everything reduces to pairwise comparisons
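
As an illustration (an assumed helper, not code from the InstructGPT release), here's how a labeler's ranking of k sampled outputs can be expanded into the (k choose 2) pairwise comparisons the reward model is trained on:

```python
from itertools import combinations

def ranking_to_pairs(outputs_ranked_best_to_worst):
    """Expand a ranking of k outputs into all (k choose 2) (chosen, rejected) pairs."""
    return [
        {"chosen": better, "rejected": worse}
        for better, worse in combinations(outputs_ranked_best_to_worst, 2)
    ]

# 4 ranked outputs -> 6 pairwise comparisons
print(len(ranking_to_pairs(["best", "good", "ok", "worst"])))  # 6
```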

1

JClub OP t1_j4v057p wrote

yeah, InstructGPT is like that. But how do you compute a reward score for each output in this ranking scenario?

1

koolaidman123 t1_j4v2uyq wrote

it's just a binary pairwise comparison of which of the 2 outputs is preferred. Read the InstructGPT paper or the wandb post: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model
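
For reference, a minimal PyTorch sketch of the pairwise reward-model loss described in the InstructGPT paper (and in the linked wandb post): maximize the log-probability that the preferred output gets the higher scalar reward. The reward values below are placeholders.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen, reward_rejected):
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy scalar rewards the reward model might assign to a batch of pairs
r_chosen = torch.tensor([1.2, 0.3, -0.5])
r_rejected = torch.tensor([0.4, 0.9, -1.0])
print(pairwise_reward_loss(r_chosen, r_rejected))
```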

2

JClub OP t1_j4v5d0y wrote

Ah right, then you can just use the model's reward directly, or pass it through a sigmoid so that the reward lies between 0 and 1!

Do you think the sigmoid is needed?
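
To make the two options concrete, a tiny sketch (a placeholder scalar stands in for the reward model's output); whether the squashing actually helps PPO in practice is the open question here:

```python
import torch

raw_reward = torch.tensor(1.7)               # placeholder: scalar score from the reward model
bounded_reward = torch.sigmoid(raw_reward)   # optional: squash into (0, 1)
print(raw_reward.item(), round(bounded_reward.item(), 3))
```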

2