Submitted by JClub t3_10fh79i in MachineLearning
You've probably heard about ChatGPT, and maybe that it was trained with RLHF and PPO, but you might not really understand how that process works. If so, check out my Gist on Reinforcement Learning from Human Feedback (RLHF): https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093
No hard maths, straight to the point and simplified. Hope it helps!
dataslacker t1_j4xd5aj wrote
That’s a nice explanation, but I’m still unclear on the motivation for RL. You say the reward isn’t differentiable, but since it’s just a label that tells us which of the outputs is best, why not simply use that output for supervised training?
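For context, here is a toy PyTorch sketch of the distinction the question touches on (my own illustration, not from the Gist; the model, the "preferred" token, and the reward value are placeholders): supervised fine-tuning imitates one fixed labeller-chosen answer, while the RL objective can score any output the model samples itself, with the reward entering only as a scalar weight on the log-probability, so it never needs to be differentiable.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
policy_head = torch.nn.Linear(hidden, vocab_size)   # stand-in for an LM's output head
prompt_repr = torch.randn(1, hidden)                 # stand-in for the encoded prompt

# Option A: supervised fine-tuning -- imitate the labeller-preferred output directly.
preferred_token = torch.tensor([42])                 # hypothetical "best" next token
logits = policy_head(prompt_repr)
sft_loss = F.cross_entropy(logits, preferred_token)
print("supervised loss:", sft_loss.item())

# Option B: RL-style update -- sample from the model, then weight the update by a reward.
dist = torch.distributions.Categorical(logits=policy_head(prompt_repr))
sampled_token = dist.sample()                        # the model's own (possibly unseen) output
reward = torch.tensor(0.7)                           # scalar reward-model score; no gradient flows through it
pg_loss = -(reward * dist.log_prob(sampled_token)).mean()
print("policy-gradient loss:", pg_loss.item())
```

This is only the basic REINFORCE-style surrogate; PPO builds on the same idea but adds ratio clipping and a KL penalty against the original model to keep updates stable.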