kittenkrazy t1_jc5v4is wrote

Training a LoRA should be significantly cheaper, especially combined with DeepSpeed CPU offloading and training with the base model in 8-bit. You can probably get it to train on consumer cards.
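
Roughly, that recipe looks like the sketch below (a minimal example with peft + bitsandbytes; the model path and LoRA hyperparameters are placeholders, not values from an actual run):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_name = "path/to/llama-7b-hf"  # placeholder: local LLaMA weights in HF format

# Load the frozen base model in 8-bit to roughly halve VRAM vs fp16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# Only the small LoRA matrices are trained; the 8-bit base stays frozen
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```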

And yup, completely separate unless you decide to merge them with the main model weights for faster inference, training another LoRA on top, etc.
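
Merging can be done with peft's `merge_and_unload`, something like this (paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in fp16 (merging needs the unquantized weights)
base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf", torch_dtype=torch.float16
)

# Attach the trained adapter, then fold its deltas into the base weights
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()

# The result is a plain transformers model with no runtime LoRA overhead
merged.save_pretrained("path/to/merged-model")
```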

Hopefully people will share LoRAs around for all sorts of plug-and-play personalities and fine-tuned abilities, and it'll be like Stable Diffusion but with personal assistants.

kittenkrazy t1_jc5sesx wrote

  1. Used Accelerate fp16 mixed precision with DeepSpeed ZeRO-2 (see the launch sketch after this list).
  2. No xFormers; no 8-bit Adam, although I did test it and it works; and no gradient checkpointing on this run, but that works too.
  3. With a sequence length of 2048 I used a batch size of 1 per GPU across 8 GPUs with gradient accumulation of 4, for an effective batch size of 32. This was on A6000s, so 48 GB of VRAM per card. I'm currently training a LoRA on the 30B with the base model in 8-bit, and I can only fit a batch of 1 at a sequence length of 350. Once this run finishes I'm going to try splitting the model across the cards so I can crank up the sequence length. I'll also be training the PPO phase, so having enough VRAM will be a requirement lol.
  4. If you check out the trlx repo they have some examples, including how they trained SFT and PPO on the HH dataset. So it's basically that but with LLaMA (see the trlx sketch after this list): https://github.com/CarperAI/trlx/blob/main/examples/hh/sft_hh.py
  5. Just the HH dataset directly. From the results it seems like it might be enough, but I might also try instruction tuning first and then running the whole process from that base. I'll also be running the reinforcement learning with a LoRA, using this as an example (see the PPO sketch below): https://github.com/lvwerra/trl/tree/main/examples/sentiment/scripts/gpt-neox-20b_peft
  • I'm also thinking that sharing LoRA weights instead of the full model might be a way around the license issue?
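
For point 1, the Accelerate + DeepSpeed ZeRO-2 setup looks roughly like this in code (a sketch; in practice you'd usually set the same options through `accelerate config`, and the CPU offload line is optional):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer state and gradients across the 8 GPUs;
# offloading the optimizer to CPU trades speed for extra VRAM headroom
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=4,
    offload_optimizer_device="cpu",  # optional CPU offload
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)

# model, optimizer, and train_loader defined as usual; prepare() wraps
# them for DeepSpeed. Launch with: accelerate launch --num_processes 8 train.py
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```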
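
For point 4, adapting the trlx HH SFT script to LLaMA is roughly the sketch below. This leans on trlx's samples-based supervised training entry point; the model path and the `load_hh_*` helpers are placeholders, so check the linked sft_hh.py for the real data handling and config:

```python
import trlx

# Placeholder helpers: each sample is a full HH dialogue string,
# e.g. "\n\nHuman: ...\n\nAssistant: ..."
train_samples = load_hh_samples("train")
eval_prompts = load_hh_prompts("test")

trainer = trlx.train(
    "path/to/llama-7b-hf",  # placeholder: local LLaMA weights in HF format
    samples=train_samples,   # supervised fine-tuning on the chosen responses
    eval_prompts=eval_prompts,
)
trainer.save_pretrained("path/to/sft-checkpoint")
```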
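
And for point 5, the trl PPO-with-LoRA setup from that example looks roughly like this (a sketch; the paths and hyperparameters are placeholders):

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

# Policy = frozen 8-bit base + trainable LoRA + a small value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "path/to/sft-checkpoint",  # placeholder: the SFT model from the previous step
    load_in_8bit=True,
    device_map="auto",
    peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")

ppo_config = PPOConfig(batch_size=8, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

# Each PPO step takes prompt tensors, sampled response tensors, and scalar
# rewards (from the reward model trained on HH):
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

A nice side effect of the LoRA policy: trl can recover the reference model by temporarily disabling the adapter, so no separate ref_model copy has to sit in VRAM.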

kittenkrazy t1_jc53y6c wrote

There's been a pull request up on the transformers repo, so it's actually been relatively easy to finetune/LoRA. I'm currently running a chat version of LLaMA 7B in 4-bit locally, finetuned on Anthropic's HH dataset. (You also don't need DAN or anything, but that's probably the reason for the license and why they originally only released to researchers.) You should be able to get the 30B running on a 24 GB VRAM card with quantization. The future is crazy. We want to release it but don't quite know how with the current license. However Stanford decides to release their model should set a precedent, though.
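
For scale: 30B parameters at 4 bits is roughly 15 GB of weights, which is why it fits in 24 GB with room for activations. Quantized loading looks roughly like this with bitsandbytes (a sketch assuming a transformers version with 4-bit support; paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization: ~0.5 bytes per weight, so ~15 GB for a 30B model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",  # placeholder path
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-30b-hf")
```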
