mtocrat t1_j6zk1ka wrote on February 3, 2023 at 12:14 AM

Let's say your initial model is quite racist and outputs only extremely or moderately racist choices. If you rank those against each other and do supervised training on that dataset you train it to mimic the moderately racist style. You might however plausibly train a model from this that can judge what racism is and extrapolate to judge answers free of it to be even better. Then you optimize with respect to that model to get that style