mtocrat t1_j6zin88 wrote
Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Supervised fine-tuning seems inherently limited here: you regress toward the best answer in the ranked set, but that's the ceiling. RLHF can improve beyond it, up to the point where the generalization capabilities of the reward model break down.
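A toy sketch of that gap (everything here is hypothetical: a scalar stands in for an "answer" and a made-up function stands in for the reward model; real RLHF optimizes a language model with policy-gradient methods, not hill-climbing):

```python
# Toy illustration: SFT is capped at the best demonstration in the set,
# while optimizing against a reward model can go beyond it.

def reward(answer: float) -> float:
    # Hypothetical reward model: prefers answers near 10.
    return -(answer - 10) ** 2

# Ranked demonstration set the supervised fine-tuner sees.
demonstrations = [2.0, 5.0, 7.0]

# SFT: at best, imitate the top-ranked demonstration.
sft_answer = max(demonstrations, key=reward)

# RL-style optimization: climb the reward model, starting from SFT.
rl_answer = sft_answer
step = 0.5
for _ in range(100):
    if reward(rl_answer + step) > reward(rl_answer):
        rl_answer += step

# The optimized answer beats anything in the demonstration set --
# but only because the (toy) reward model generalizes past it.
```

The caveat in the comment shows up here too: once the policy moves into regions the reward model was never trained on, the reward signal stops being trustworthy.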