Ouitos t1_j54nh7v wrote on January 20, 2023 at 10:56 AM

Reply to comment by JClub in [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF) by JClub

Yes, but if you have a ratio of 0.6, you then take the min of 0.6 * R and 0.8 * R, which is 0.6 * R, i.e. just ratio * R. In the end, the clip is only effective one way, and the 0.8 lower limit is never used. Or maybe R has a particular property that makes this not as straightforward?
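To make the arithmetic concrete, here is a minimal sketch of that min/clip term. It assumes the standard PPO-style form min(ratio * R, clip(ratio, 1 - eps, 1 + eps) * R) with eps = 0.2 (inferred from the 0.8 lower limit), and that R may take either sign, as an advantage-like quantity would. The function name `clipped_term` is made up for illustration and is not taken from any particular implementation.

```python
def clipped_term(ratio, R, eps=0.2):
    """min(ratio * R, clip(ratio, 1 - eps, 1 + eps) * R) for scalars.

    Assumption: R is a reward/advantage-like value that can be negative.
    """
    # Clamp the ratio into [1 - eps, 1 + eps]
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Keep the more pessimistic (smaller) of the two candidates
    return min(ratio * R, clipped_ratio * R)


if __name__ == "__main__":
    for R in (1.0, -1.0):
        for ratio in (0.6, 1.0, 1.4):
            print(f"R={R:+.1f} ratio={ratio:.1f} -> {clipped_term(ratio, R):+.2f}")
```

With R = 1 and ratio = 0.6 this returns 0.6, so the 0.8 bound is indeed not used, exactly as described above. With R = -1 and ratio = 0.6 it returns -0.8, so under the assumption that R can be negative, the lower limit does come into play there, while the 1.2 upper limit is the one that goes unused.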

JClub OP t1_j57rrn6 wrote on January 21, 2023 at 12:09 AM

Ah yes, you're right. I actually don't know why, but you can check the implementation and ask about it on GitHub.