
mtocrat t1_j6zk1ka wrote

Let's say your initial model is quite racist and only outputs extremely or moderately racist responses. If you rank those against each other and do supervised training on that dataset, you train it to mimic the moderately racist style. However, you might plausibly train a model from those rankings that learns to judge what racism is and extrapolates, rating answers free of it as even better. Then you optimize the original model with respect to that judge model to get that style.
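
To make the extrapolation argument concrete, here is a toy sketch, not anyone's actual training setup. It assumes each response can be summarized by a single hypothetical "offensiveness" score in [0, 1], that the initial model only produces scores in [0.5, 1.0], and that the judge is a linear Bradley-Terry reward model fit on pairwise rankings. All names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D "offensiveness" score per response. The initial model only
# produces responses in the moderate-to-extreme range [0.5, 1.0].
samples = rng.uniform(0.5, 1.0, size=(200, 2))

# Rank each pair: the less offensive response is the preferred one.
preferred = samples.min(axis=1)
rejected = samples.max(axis=1)

# Supervised imitation of the preferred responses stays inside the observed
# range, i.e. it lands on the "moderate" style.
imitation_target = preferred.mean()

# Bradley-Terry judge with a linear reward r(x) = w * x, fit by gradient
# descent on the loss -log(sigmoid(r(preferred) - r(rejected))).
w = 0.0
lr = 0.5
diff = preferred - rejected  # always <= 0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-w * diff))  # P(preferred beats rejected)
    grad = -np.mean((1.0 - p) * diff)    # d(loss)/dw
    w -= lr * grad

def reward(x):
    """Learned reward for a response with offensiveness score x."""
    return w * x

# Optimize against the judge over candidates that include unseen scores < 0.5.
candidates = np.linspace(0.0, 1.0, 11)
best = candidates[np.argmax(reward(candidates))]

print("imitation target:", imitation_target)
print("learned weight w:", w)
print("best candidate under the judge:", best)
```

In this toy setting the judge learns a negative weight, so it scores a response with offensiveness 0 above anything it saw during training, while plain imitation stays in the moderate range. Whether a real reward model extrapolates this cleanly is exactly the "plausibly" in the comment above.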
