Submitted by liyanjia92 t3_120csub in MachineLearning

hey folks, happy Friday! I'd like to get some feedback on my recent project: a minimal example of using RLHF on language models to improve human alignment.

The goal is to compare against vanilla GPT-2 and supervised fine-tuned GPT-2 to see how much RLHF can benefit small models. I also hope this project can serve as an example of the minimum requirements for building an RLHF training pipeline for LLMs.
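To make the three stages concrete, here's a minimal sketch of the objective each stage optimizes (placeholder function names, not the repo's actual API; the real training loops batch over the demonstration, preference, and prompt datasets):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids):
    # Stage 1: supervised fine-tuning -- plain next-token cross-entropy
    # on human demonstration dialogues.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def reward_loss(reward_chosen, reward_rejected):
    # Stage 2: reward model -- pairwise ranking loss; the human-preferred
    # completion should score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def ppo_policy_loss(logp_new, logp_old, advantage, clip_ratio=0.2):
    # Stage 3: RLHF with PPO -- clipped surrogate objective driven by the
    # reward model (the KL penalty against the SFT model is omitted here).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```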

Github: https://github.com/ethanyanjiali/minChatGPT

Demo: https://colab.research.google.com/drive/1LR1sbWTyaNAmTZ1g1M2tpmU_pFw1lyEX?usp=sharing

Thanks a lot for any suggestions and feedback!

77

Comments


G_fucking_G t1_jdifa1c wrote

Very interesting.

Quick question. How long does training take? For:

  • SFT Model
  • Reward Model
  • RLHF

I saw you used one 3090Ti, so was it done in hours/days/weeks?

6

Extension-Mastodon67 t1_jdivgzf wrote

I guess gpt2-medium is too small for RLHF.

PPO output: 

Human: If I have 3 bananas and 4 pears. How many fruits do I have?

Assistant: 

3 bananas: 1 pears, 1⁄2 cup of sugar, 2 tbsp of butter, and 1⁄2 tsp of cinnamon
4 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of cinnamon
5 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of cinnamon
6 bananas: 1⁄4 cup of sugar, 1 tbsp of butter, 1⁄2 tsp of

--------------

Vanilla output: 

Human: If I have 3 bananas and 4 pears. How many fruits do I have?

Assistant: 

Me: 3 bananas: 46 fruits (1 pie, half banana, 3 bananas)

Assistant: 46 pears: 39 fruits (1 pie, half orange, 1 pear)

Me: 47 oranges: 38 fruits (1 pie, half banana, 4 oranges)

Assistant: 47 pears: 43 fruits (1 pie, half banana, 6 oranges)

Me: 49 apples: 4 fruit (1 pie, half banana, 2 apples)

Assistant: 49 pears: 38 fruits (3 apples, 1 pear, 2 apples)

Me: 50 nectarines: 3 fruit (1 pie, half orange, 1 pear)

Assistant: 50 pears: 37 fruits (1 pie, half orange, 4

1

liyanjia92 OP t1_jdj7h0x wrote

Thanks for trying it out! This is a good example of the difference between RLHF'ed GPT-2 medium and vanilla GPT-2 medium. You can see that vanilla GPT-2 medium outputs complete garbage, while the RLHF version tends to come up with some answer for the human (although it failed).

The way I see it, the pre-trained model encodes knowledge of the world, and RLHF is just a way to align the model with human preferences for how to interact with the world.

You might have seen this tweet before: https://twitter.com/geoffreyhinton/status/1636110447442112513?s=20

So with GPT-2 medium, what we're really doing here is parenting a dumb kid, instead of a "supernaturally precocious child" like GPT-3. What interests me is that RLHF actually does help parent this dumb kid to be more socially acceptable.

In other words, if we had discovered the power of alignment and RLHF earlier, we might have seen the ChatGPT moment much sooner, back when GPT-2 came out in 2019.

I'm also thinking of doing the same with LLaMA to maybe get a nanoChatGPT that could actually be useful for a real-life application. Stay tuned!

4

liyanjia92 OP t1_jdj87cv wrote

SFT is a bit longer, probably 8-12 hours (I need to check the TensorBoard to verify). The reward model is faster because it only needs 1 epoch, just a couple of hours. RLHF is the slowest because of its complexity (4 models interacting with each other); I probably need to improve the "make_experiment" part of the code, since the GPU is also often idle. So it could take days just to do 1 epoch. I didn't finish tuning because even doing RLHF on maybe 10K examples already outperforms the SFT model in terms of "human" preference.
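For context, the reason four models are involved: in each rollout step the policy (actor), a value head (critic), the reward model, and a frozen copy of the SFT model all run forward passes. Roughly, and with illustrative method names rather than the actual classes in the repo:

```python
import torch

@torch.no_grad()
def make_experience(actor, critic, reward_model, sft_model, prompt_ids):
    # Illustrative rollout ("experience") step with the four models:
    #   actor        -- policy being optimized (initialized from the SFT model)
    #   critic       -- value head estimating expected reward
    #   reward_model -- scores the full prompt + completion sequence
    #   sft_model    -- frozen reference policy used for the KL penalty
    response_ids = actor.generate(prompt_ids)          # sample a completion
    sequence = torch.cat([prompt_ids, response_ids], dim=-1)

    logprobs = actor.log_probs(sequence)               # for the PPO ratio
    ref_logprobs = sft_model.log_probs(sequence)       # for the KL penalty
    values = critic(sequence)                          # per-token value estimates
    reward = reward_model(sequence)                    # scalar reward per sequence

    # Penalize the policy for drifting too far from the SFT model.
    kl = (logprobs - ref_logprobs).sum(dim=-1)
    reward = reward - 0.02 * kl                        # 0.02 is an arbitrary coefficient
    return sequence, logprobs, values, reward
```

Most of the idle GPU time likely comes from the autoregressive `generate` call, which the other three forward passes have to wait on.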

5

Puzzleheaded_Acadia1 t1_jdjvola wrote

I have a question: can I fine-tune the gpt-neo-x 125M model on a chat dataset to give decent, human-like answers? Because when I run it, it just gives me random characters.

1

liyanjia92 OP t1_jdjwfnh wrote

It may be better to submit an issue on GitHub so that I can point you to some code with context. If you are talking about my code, you need to convert the weights and load them into the GPT class before running SFT training; otherwise there might be a mismatch in weights and it could just output random stuff.
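To illustrate what I mean by converting the weights (just a sketch; the helper below is hypothetical, so check the repo for the actual conversion code): you copy the pretrained HuggingFace GPT-2 weights into the custom GPT module before SFT, keeping in mind that HuggingFace stores some layers as Conv1D with transposed weights.

```python
import torch
from transformers import GPT2LMHeadModel

def load_hf_weights(my_gpt, hf_name="gpt2-medium"):
    # Hypothetical helper: copy HuggingFace GPT-2 weights into a custom GPT
    # module so SFT starts from the pretrained checkpoint instead of a random
    # init (a random init is what produces gibberish output).
    hf_state = GPT2LMHeadModel.from_pretrained(hf_name).state_dict()
    # HuggingFace implements these layers as Conv1D, so their weights are
    # stored transposed relative to a regular nn.Linear.
    transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
                  "mlp.c_fc.weight", "mlp.c_proj.weight")
    with torch.no_grad():
        for name, param in my_gpt.state_dict().items():
            # Assumes the custom module reuses the HuggingFace parameter names.
            src = hf_state[name]
            if name.endswith(transposed):
                src = src.t()
            param.copy_(src)
    return my_gpt
```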

2

liyanjia92 OP t1_jdjx0zs wrote

The project explores whether RLHF can help smaller models also output something natural in a human/assistant conversation.

you can take a look at this Get Started section for more details: https://github.com/ethanyanjiali/minChatGPT#get-started

In short: SFT is supervised fine-tuning; the reward model is the one used to generate a reward given the language model's output (action) in reinforcement learning; RLHF means using human feedback to set up that reinforcement learning; and an epoch means the model sees all the data once.
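If it helps make "reward model" concrete, the usual setup is a language-model trunk with a scalar head on top; the class below is a generic sketch, not the exact one in the repo:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Generic sketch: a language-model trunk plus a scalar head. Given a
    # tokenized (prompt + response) sequence, it returns one number -- the
    # "reward" used by the RL step.
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone            # e.g. a GPT-2 trunk returning hidden states
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)   # (batch, seq_len, hidden_size)
        # Score the sequence from the last token's hidden state.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)
```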

https://web.stanford.edu/class/cs224n/ could be a good class if you are new; they have a YouTube version from 2021 (although they probably didn't talk about RLHF back then).

3

blueSGL t1_jdl02u6 wrote

>So with GPT-2 medium, what we really do here is to parent a dumb kid, instead of a "supernaturally precocious child" like GPT-3. What interested me is that RLHF does actually help to parent this dumb kid to be more socially acceptable.

> In other words, if we discover the power of alignment and RLHF earlier, we might foresee the ChatGPT moment much earlier when GPT-2 is out in 2019.

That just reads to me as capability overhang. If there is "one simple trick" to make the model "behave," what's to say this is the only one (or that the capabilities derived from the current behavior modification are the 'best they can be')? Scary thought.

2