Submitted by liyanjia92 t3_120csub in MachineLearning

hey folks, happy Friday! I'd like to get some feedback on my recent project: a minimal example of using RLHF on language models to improve human alignment.

The goal is to compare against vanilla GPT-2 and supervised fine-tuned GPT-2 to see how much RLHF can benefit small models. I also hope this project can serve as an example of the minimum requirements for building an RLHF training pipeline for LLMs.
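To make the three stages concrete, here's a minimal sketch of the objective each stage optimizes (placeholder function names, not the repo's actual API; the real training loops batch over the demonstration, preference, and prompt datasets):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids):
    # Stage 1: supervised fine-tuning -- plain next-token cross-entropy
    # on human demonstration dialogues.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def reward_loss(reward_chosen, reward_rejected):
    # Stage 2: reward model -- pairwise ranking loss; the human-preferred
    # completion should score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def ppo_policy_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    # Stage 3: RLHF with PPO -- clipped surrogate objective driven by the
    # reward model (the KL penalty against the SFT model is omitted here).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```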

Github: https://github.com/ethanyanjiali/minChatGPT

Demo: https://colab.research.google.com/drive/1LR1sbWTyaNAmTZ1g1M2tpmU_pFw1lyEX?usp=sharing

Thanks a lot for any suggestions and feedback!

77

Comments


G_fucking_G t1_jdifa1c wrote

Very interesting.

Quick question. How long does training take? For:

  • SFT Model
  • Reward Model
  • RLHF

I saw you used one 3090Ti, so was it done in hours/days/weeks?

6

Extension-Mastodon67 t1_jdivgzf wrote

I guess gpt2-medium is too small for RLHF.

PPO output: 

Human: If I have 3 bananas and 4 pears. How many fruits do I have?

Assistant: 

3 bananas: 1 pears, 1⁄2 cup of sugar, 2 tbsp of butter, and 1⁄2 tsp of cinnamon
4 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of cinnamon
5 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of cinnamon
6 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of

--------------

Vanilla output: 

Human: If I have 3 bananas and 4 pears. How many fruits do I have?

Assistant: 

Me: 3 bananas: 46 fruits (1 pie, half banana, 3 bananas)

Assistant: 46 pears: 39 fruits (1 pie, half orange, 1 pear)

Me: 47 oranges: 38 fruits (1 pie, half banana, 4 oranges)

Assistant: 47 pears: 43 fruits (1 pie, half banana, 6 oranges)

Me: 49 apples: 4 fruit (1 pie, half banana, 2 apples)

Assistant: 49 pears: 38 fruits (3 apples, 1 pear, 2 apples)

Me: 50 nectarines: 3 fruit (1 pie, half orange, 1 pear)

Assistant: 50 pears: 37 fruits (1 pie, half orange, 4

1

liyanjia92 OP t1_jdj7h0x wrote

Thanks for trying it out! This is a good example of the difference between RLHF'ed GPT-2 medium and vanilla GPT-2 medium. You can see that vanilla GPT-2 medium outputs complete garbage, while the RLHF version tends to come up with some answer for the human (although it failed).

The way I see it, the pre-trained model encodes knowledge of the world, and RLHF is just a way to align the model with human preferences for how to interact with the world.

You might have seen this tweet before: https://twitter.com/geoffreyhinton/status/1636110447442112513?s=20

So with GPT-2 medium, what we're really doing here is parenting a dumb kid, instead of a "supernaturally precocious child" like GPT-3. What interests me is that RLHF actually does help parent this dumb kid to be more socially acceptable.

In other words, if we had discovered the power of alignment and RLHF earlier, we might have seen the ChatGPT moment much sooner, back when GPT-2 came out in 2019.

I'm also thinking of doing the same with LLaMA to maybe get a nanoChatGPT that could actually be useful for a real-life application. Stay tuned!

4

liyanjia92 OP t1_jdj87cv wrote

SFT is a bit longer, probably 8-12 hours (I need to check the TensorBoard to verify). The reward model is faster because it only needs 1 epoch, just a couple of hours. RLHF is the slowest because of its complexity (4 models interacting with each other); I probably need to improve the "make_experiment" part of the code, since the GPU is also often idle. So it could take days just to do 1 epoch. I didn't finish tuning because even doing RLHF on maybe 10K examples already outperforms the SFT model in terms of "human" preference.
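For context, the reason four models are involved: in each rollout step the policy (actor), a value head (critic), the reward model, and a frozen copy of the SFT model all run forward passes. Roughly, and with illustrative method names rather than the actual classes in the repo:

```python
import torch

@torch.no_grad()
def make_experience(actor, critic, reward_model, sft_model, prompt_ids):
    # Illustrative rollout ("experience") step with the four models:
    #   actor        -- policy being optimized (initialized from the SFT model)
    #   critic       -- value head estimating expected reward
    #   reward_model -- scores the full prompt + completion sequence
    #   sft_model    -- frozen reference policy used for the KL penalty
    response_ids = actor.generate(prompt_ids)          # sample a completion
    sequence = torch.cat([prompt_ids, response_ids], dim=-1)

    logprobs = actor.log_probs(sequence)               # for the PPO ratio
    ref_logprobs = sft_model.log_probs(sequence)       # for the KL penalty
    values = critic(sequence)                          # per-token value estimates
    reward = reward_model(sequence)                    # scalar reward per sequence

    # Penalize the policy for drifting too far from the SFT model.
    kl = (logprobs - ref_logprobs).sum(dim=-1)
    reward = reward - 0.02 * kl                        # 0.02 is an arbitrary coefficient
    return sequence, logprobs, values, reward
```

Most of the idle GPU time likely comes from the autoregressive `generate` call, which the other three forward passes have to wait on.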

5

Puzzleheaded_Acadia1 t1_jdjvola wrote

I have a question: can I fine-tune the gpt-neo-x 125M model on a chat dataset to give decent, human-like answers? Because when I run it, it just gives me random characters.

1

liyanjia92 OP t1_jdjwfnh wrote

It may be better to submit an issue on GitHub so that I can point you to some code with context. If you are talking about my code, you need to convert the weights and load them into the GPT class before running SFT training; otherwise there might be a mismatch in weights and it could just output random stuff.
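To illustrate what I mean by converting the weights (just a sketch; the helper below is hypothetical, so check the repo for the actual conversion code): you copy the pretrained HuggingFace GPT-2 weights into the custom GPT module before SFT, keeping in mind that HuggingFace stores some layers as Conv1D with transposed weights.

```python
import torch
from transformers import GPT2LMHeadModel

def load_hf_weights(my_gpt, hf_name="gpt2-medium"):
    # Hypothetical helper: copy HuggingFace GPT-2 weights into a custom GPT
    # module so SFT starts from the pretrained checkpoint instead of a random
    # init (a random init is what produces gibberish output).
    hf_state = GPT2LMHeadModel.from_pretrained(hf_name).state_dict()
    # HuggingFace implements these layers as Conv1D, so their weights are
    # stored transposed relative to a regular nn.Linear.
    transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
                  "mlp.c_fc.weight", "mlp.c_proj.weight")
    with torch.no_grad():
        for name, param in my_gpt.state_dict().items():
            # Assumes the custom module reuses the HuggingFace parameter names.
            src = hf_state[name]
            if name.endswith(transposed):
                src = src.t()
            param.copy_(src)
    return my_gpt
```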

2

liyanjia92 OP t1_jdjx0zs wrote

The project explores whether RLHF can help smaller models also output something natural in a human/assistant conversation.

you can take a look at this Get Started section for more details: https://github.com/ethanyanjiali/minChatGPT#get-started

In short: SFT is supervised fine-tuning; the reward model is the one used to generate a reward given the language model's output (action) in reinforcement learning; RLHF means using human feedback to set up that reinforcement learning; and an epoch means the model sees all the data once.
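If it helps make "reward model" concrete, the usual setup is a language-model trunk with a scalar head on top; the class below is a generic sketch, not the exact one in the repo:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Generic sketch: a language-model trunk plus a scalar head. Given a
    # tokenized (prompt + response) sequence, it returns one number -- the
    # "reward" used by the RL step.
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone            # e.g. a GPT-2 trunk returning hidden states
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)   # (batch, seq_len, hidden_size)
        # Score the sequence from the last token's hidden state.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)
```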

https://web.stanford.edu/class/cs224n/ could be a good class if you are new; they have a YouTube version from 2021 (although they probably didn't talk about RLHF back then).

3

blueSGL t1_jdl02u6 wrote

>So with GPT-2 medium, what we really do here is to parent a dumb kid, instead of a "supernaturally precocious child" like GPT-3. What interested me is that RLHF does actually help to parent this dumb kid to be more socially acceptable.

> In other words, if we discover the power of alignment and RLHF earlier, we might foresee the ChatGPT moment much earlier when GPT-2 is out in 2019.

That just reads to me as capability overhang. If there is "one simple trick" to make the model "behave," what's to say this is the only one (or that the capabilities derived from the current behavior modification are the 'best they can be')? Scary thought.

2