generatorman_ai t1_jc5q5z0 wrote

That's great; it's been hard to find people who are actually fine-tuning LLaMA. Would you mind sharing your experience for the benefit of the open-source community?

  1. Did you train the full-precision weights?
  2. Did you use memory optimizations like xformers, 8-bit Adam (from bitsandbytes), gradient checkpointing etc.?
  3. How much VRAM does it take for a batch size of 1?
  4. hh seems to be a preference dataset for RLHF rather than a text corpus - how did you use it as a fine-tuning dataset?
  5. Did you first do instruction fine-tuning (using something like FLAN or Self-Instruct) or just the hh directly?

kittenkrazy t1_jc5sesx wrote

  1. Used accelerate fp16 mixed precision with DeepSpeed ZeRO stage 2 (a rough config sketch follows this list).
  2. No xformers, no 8-bit Adam (although I did test it and it works), and no gradient checkpointing on this run (but it does work).
  3. With a sequence length of 2048 I did a batch size of 1 with 8 GPUs and accumulation of 4. This was on A6000s, so 48 GB of VRAM per card. Currently training a LoRA on the 30B with the base model loaded in 8-bit, and I can only fit a batch size of 1 with a sequence length of 350. Once this one trains I'm going to try a run with the model split across the cards so I can crank up the sequence length. I'll also be training the PPO phase, so having enough VRAM for that will be a requirement lol.
  4. If you check out the trlx repo, they have an example of how they trained SFT and PPO on the hh dataset; it's basically that, but with LLaMA (a sketch of loading hh for SFT is also below): https://github.com/CarperAI/trlx/blob/main/examples/hh/sft_hh.py
  5. Just the hh directly. From the results it seems like it might possibly be enough, but I might also try instruction tuning and then running the whole process from that base. I will also be running the reinforcement learning with a LoRA, using this as an example: https://github.com/lvwerra/trl/tree/main/examples/sentiment/scripts/gpt-neox-20b_peft
  • I’m also thinking maybe sharing lora weights instead of the direct model is a possible way around the license issue?
5

generatorman_ai t1_jc5u7w2 wrote

Wow, 392 gigs for batch size 1? This is for 7B? That is an order of magnitude more than I was expecting. Sounds like even with full memory optimizations, we're far away from the 16 GB goal.

Good idea on the LoRA: since it's a completely separate set of weights, I don't see how it could come under the license. In fact, LoRAs do work on weights different from the base model they were trained on (e.g. LoRAs trained on base Stable Diffusion work when applied to heavily fine-tuned SD models), so it's not even necessarily tied to the LLaMA weights.

kittenkrazy t1_jc5v4is wrote

Training a LoRA should be significantly cheaper, especially combined with DeepSpeed CPU offloading and loading the base model in 8-bit. You can probably get it to train on consumer cards.
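
Something like this for the 8-bit base + LoRA part, in the spirit of the gpt-neox-20b_peft example linked above (the checkpoint path and LoRA hyperparameters here are only illustrative, not the exact values used):

```python
# Sketch: train a LoRA on top of an 8-bit quantized base model (peft + bitsandbytes).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",  # placeholder path to a converted LLaMA checkpoint
    load_in_8bit=True,       # quantize the frozen base weights with bitsandbytes
    device_map="auto",       # spread layers across the available GPUs
)
base = prepare_model_for_int8_training(base)  # fp32 norms/head, gradient checkpointing

lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices require grad
```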

And yup, completely separate unless you decide to merge them with the main model weights for faster inference/training another Lora on top/etc.
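
If you do merge, it's roughly this with peft (a sketch; paths are placeholders and it assumes a peft version that has merge_and_unload):

```python
# Sketch: fold trained LoRA weights back into the base model for faster inference.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")     # adapter from save_pretrained()

merged = model.merge_and_unload()  # adds the low-rank update into the frozen weights
merged.save_pretrained("path/to/merged-model")
```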

Hopefully people will share around LoRAs for all sorts of plug-and-play personalities and fine-tuned abilities, and it'll be like Stable Diffusion but with personal assistants.

generatorman_ai t1_jc5vc5r wrote

Probably I'm misinterpreting: you mean a batch size of 1 per GPU across 8 GPUs, so it's actually 48 GB per card with no optimizations (except fp16 and ZeRO-2). That sounds more reasonable, though probably still several gigs too large for 16 GB even with common optimizations.
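
For anyone else budgeting VRAM, a very rough back-of-envelope for why a 48 GB card is plausible for 7B fp16 + Adam under ZeRO-2 (not from this thread; the activation term in particular is just a guess for seq 2048 with no gradient checkpointing):

```python
# Very rough per-GPU memory estimate: 7B params, fp16 + Adam, ZeRO stage 2, 8 GPUs.
params, n_gpus, GB = 7e9, 8, 1024**3

weights     = params * 2 / GB            # fp16 weights, replicated on every GPU
grads       = params * 2 / n_gpus / GB   # fp16 gradients, sharded by ZeRO-2
optim       = params * 12 / n_gpus / GB  # fp32 master + Adam m,v (12 bytes/param), sharded
activations = 10.0                       # GUESS for seq 2048, batch 1, no checkpointing

total = weights + grads + optim + activations
print(f"~{total:.0f} GB per GPU "
      f"(weights {weights:.1f}, grads {grads:.1f}, optim {optim:.1f}, acts ~{activations:.0f})")
# -> roughly 13 + 1.6 + 9.8 + ~10 ≈ 34 GB, which leaves headroom on a 48 GB A6000
```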
