
liyanjia92 OP t1_jdj87cv wrote

SFT takes a bit longer, probably 8-12 hours (I'd need to check TensorBoard to verify). The reward model is faster because it only needs 1 epoch, just a couple of hours. RLHF is the slowest because of its complexity (4 models interacting with each other); I probably need to improve the "make_experiment" part of the code, since the GPU is also often idle, so it could take days to do just 1 epoch. I didn't finish tuning because even doing RLHF on only ~10K examples already outperforms the SFT model in terms of "human" preference. A rough sketch of why that step is slow is below.
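To give a rough idea: here's a minimal sketch of what an experience-making step in PPO-style RLHF typically looks like. The names (`TinyLM`, `TinyScorer`, `make_experience`) are hypothetical placeholders, not the actual "make_experiment" code from the repo; the point is just the sequential sampling plus the three extra forward passes across 4 models that leave the GPU idle.

```python
# Minimal sketch of a PPO-style experience step, assuming 4 separate models
# (actor, frozen reference, reward model, critic). All names are placeholders.
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Stand-in language model: embeds tokens, predicts vocab logits."""

    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                      # (B, T) -> (B, T, vocab)
        return self.head(self.emb(ids))

    @torch.no_grad()
    def generate(self, ids, max_new=8):
        # Token-by-token sampling is inherently sequential, which is one
        # reason the GPU sits idle during experience collection.
        for _ in range(max_new):
            probs = self(ids)[:, -1].softmax(-1)
            ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=1)
        return ids


class TinyScorer(nn.Module):
    """Stand-in reward model / critic: one scalar per sequence."""

    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, ids):                      # (B, T) -> (B,)
        return self.head(self.emb(ids).mean(1)).squeeze(-1)


def token_logprobs(model, ids):
    """Log-probs the model assigns to each sampled token (teacher forcing)."""
    logp = model(ids)[:, :-1].log_softmax(-1)    # position t predicts token t+1
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)


@torch.no_grad()
def make_experience(actor, ref_model, reward_model, critic, prompts, kl_coef=0.1):
    # 1) Sample responses from the actor (the sequential bottleneck).
    responses = actor.generate(prompts)
    # 2) Three more forward passes, one per auxiliary model: four models
    #    interacting on one GPU is what drags this phase out.
    logp = token_logprobs(actor, responses)
    ref_logp = token_logprobs(ref_model, responses)
    rewards = reward_model(responses)
    values = critic(responses)
    # 3) Per-token KL penalty keeps the actor near the SFT reference policy.
    shaped_rewards = rewards - kl_coef * (logp - ref_logp).sum(-1)
    return responses, logp, shaped_rewards, values


if __name__ == "__main__":
    actor, ref_model = TinyLM(), TinyLM()
    reward_model, critic = TinyScorer(), TinyScorer()
    prompts = torch.randint(0, 100, (4, 5))      # 4 dummy prompts
    out = make_experience(actor, ref_model, reward_model, critic, prompts)
    print([t.shape for t in out])
```

The usual ways to cut the idle time are sampling with bigger batches (plus a KV cache) and batching the three scoring passes together, which is roughly what I mean by improving that part of the code.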
