
farmingvillein t1_jc37p3h wrote

> The license is still limited to non-commercial use due to model being fine-tuned LLaMA.

Yeah, but they released the source code to replicate (I'm sure they knew exactly what they were doing--license is even Apache).

If the source code is pretty clean (including training code; I haven't looked closely), presumably this e2e process will be copied and the resulting model (by someone not beholden to the original LLaMA license) released to the public within the next day or so, if not by EOD.

If the code is messy, might take a couple more days.

I'd expect someone to follow the same process using turbo to bootstrap improvement (if they haven't already?), as well. This should be particularly helpful for getting it to be smarter using the entire context window in a conversation with the user.

I'd also expect someone to do so, but also mix DAN-style prompting, so that you natively can get a chatbot that is "unleashed" (whether or not this is a good idea is a separate discussion, obviously...).

Also you can expect all of the above to be applied against all the model sizes pretty quickly (33B and 65B might take a little longer, for $$$...but I wouldn't expect much longer).

It'll be extra fun because it will be released without acknowledgment (for licensing reasons) of using OpenAI's API to bootstrap.

Even more fun when GPT-4 is released in the next week or so (assuming it isn't pushed back b/c of the SVB collapse making things noisy) and that can be used to bootstrap an even better instruction set (presumably).

tldr; things will change, quickly. (And then Emad releases an LLM and all bets are off...)

28

kittenkrazy t1_jc53y6c wrote

There’s actually been a pull request up on the transformers repo, so it’s been relatively easy to finetune/lora. I’m currently running a chat version of LLaMA 7B locally in 4-bit, finetuned on Anthropic’s hh dataset. (You also don’t need DAN or anything, which is probably the reason for the license and for them originally only releasing to researchers.) Should be able to get the 30B running on a 24 GB VRAM card with quantization. Future is crazy. We want to release it but don’t quite know how with the current license. However Stanford decides to release their model should set a precedent, though.
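For anyone checking the "30B on a 24 GB card" claim, here is a back-of-the-envelope sketch of weight memory at different bit widths. It assumes the nominal parameter counts (7e9 and 30e9, which are round numbers rather than the exact model sizes) and counts weights only, so activations, the KV cache, and quantization overhead are all ignored. The function name is made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal sizes (assumption): "7B" = 7e9 and "30B" = 30e9 parameters.
for name, n_params in [("7B", 7e9), ("30B", 30e9)]:
    for bits in (16, 8, 4):
        gib = weight_memory_gib(n_params, bits)
        print(f"{name} at {bits}-bit: {gib:.1f} GiB of weights")
```

Under these assumptions the 30B weights fit in 24 GiB at 4-bit but not at 8-bit, which is consistent with quantization being what makes it feasible.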

15

generatorman_ai t1_jc5q5z0 wrote

That's great, it's been hard to find people who are actually fine-tuning LLaMA. Would you mind sharing your experience for the benefit of the open-source community?

  1. Did you train the full-precision weights?
  2. Did you use memory optimizations like xformers, 8-bit Adam (from bitsandbytes), gradient checkpointing etc.?
  3. How much VRAM does it take for a batch size of 1?
  4. hh seems to be a preference dataset for RLHF rather than a text corpus - how did you use it as a fine-tuning dataset?
  5. Did you first do instruction fine-tuning (using something like FLAN or Self-Instruct) or just the hh directly?
6

kittenkrazy t1_jc5sesx wrote

  1. Used accelerate fp16 mixed precision with deepspeed zero 2
  2. No xformers, no 8-bit Adam although I did test it and it works, no gradient checkpointing on this run but it does work.
  3. With a sequence length of 2048 I did a batch size of 1 with 8 gpus and accumulation of 4. This was on A6000s so 48 gigs of vram per card. Currently training a Lora on the 30B with the base model in 8-bit and can only fit a batch size of 1 with a sequence length of 350. Once this one trains I’m going to try to set up a run with the model split up between the cards so I can crank up the sequence length. Will also be training the PPO phase so having enough vram will be a requirement lol.
  4. If you check out the trlx repo they have some examples, including one showing how they trained sft and ppo on the hh dataset. So it’s basically that but with llama. https://github.com/CarperAI/trlx/blob/main/examples/hh/sft_hh.py
  5. Just the hh directly. From the results it seems like it might possibly be enough but I might also try instruction tuning then running the whole process from that base. I will also be running the reinforcement learning by using a Lora using this as an example https://github.com/lvwerra/trl/tree/main/examples/sentiment/scripts/gpt-neox-20b_peft
  • I’m also thinking maybe sharing lora weights instead of the direct model is a possible way around the license issue?
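To make point 4 concrete, here is a rough sketch of one way a preference record can be turned into a supervised example: keep only the preferred ("chosen") transcript and split it at the final assistant turn. This is an illustration, not a copy of the linked trlx script. It assumes hh-style records with "chosen" and "rejected" transcripts that use "Human:" / "Assistant:" turn markers; the helper name and the sample record are invented.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def to_sft_example(record: dict) -> dict:
    """Turn one preference pair into a (prompt, response) pair for SFT.

    Only the preferred transcript is used; the rejected one is dropped.
    """
    chosen = record["chosen"]
    cut = chosen.rfind(ASSISTANT_TAG)  # start of the last assistant turn
    if cut == -1:
        raise ValueError("no assistant turn found in transcript")
    split = cut + len(ASSISTANT_TAG)
    return {"prompt": chosen[:split], "response": chosen[split:].strip()}


# Invented record in the assumed format.
record = {
    "chosen": "\n\nHuman: How do I boil an egg?"
              "\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?"
                "\n\nAssistant: I don't know.",
}
example = to_sft_example(record)
print(repr(example["prompt"]))
print(repr(example["response"]))
```

The rejected transcripts only become relevant later, when training a reward model for the PPO phase.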
5

generatorman_ai t1_jc5u7w2 wrote

Wow, 384 gigs (8 x 48) for batch size 1? This is for 7B? That is an order of magnitude more than I was expecting. Sounds like even with full memory optimizations, we're far away from the 16 GB goal.

Good idea on the lora - since it's a completely separate set of weights I don't see how it could come under the license. In fact loras do work on weights different from the base model they were trained from (e.g. loras trained on base Stable Diffusion work when applied to heavily fine-tuned SD models), so it's not even necessarily tied to the LLaMA weights.

2

kittenkrazy t1_jc5v4is wrote

Training a Lora should be significantly cheaper especially combined with deepspeed cpu offloading and training with the model in 8 bit. Can probably get it to train on consumer cards.
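A quick count shows why a LoRA is so much cheaper to train. A rank-r adapter on a d_out x d_in weight adds r * (d_in + d_out) trainable parameters. The sketch below assumes rank 8 adapters on two attention projections per layer (a common choice, not necessarily the setup used here), with LLaMA 7B's hidden size of 4096 and 32 layers. The function name is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Trainable parameters added by rank-`rank` adapters on `n_matrices` weights."""
    return rank * (d_in + d_out) * n_matrices


HIDDEN = 4096          # LLaMA 7B hidden size
LAYERS = 32            # LLaMA 7B transformer layers
ADAPTED_PER_LAYER = 2  # assumption: adapt two attention projections per layer
RANK = 8               # assumption: a commonly used LoRA rank

trainable = lora_trainable_params(HIDDEN, HIDDEN, RANK, LAYERS * ADAPTED_PER_LAYER)
fraction = trainable / 7e9  # against the nominal 7e9 base parameters

print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of the base model: {fraction:.4%}")
```

Gradients and optimizer states are only needed for these adapter parameters, which is where most of the memory saving comes from; the frozen base weights still have to fit, hence the 8-bit loading.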

And yup, completely separate unless you decide to merge them with the main model weights for faster inference/training another Lora on top/etc.
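For the curious, merging is just adding the low-rank update into the base weight: W' = W + (alpha / r) * B @ A. A toy sketch in plain Python with made-up 2x3 numbers, checking that the merged weight gives the same output as running the base weight and the adapter separately:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def merge_lora(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy values: W is (out=2, in=3), A is (rank=1, in=3), B is (out=2, rank=1).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
ALPHA, RANK = 2.0, 1
x = [1.0, 1.0, 1.0]

# Unmerged: base output plus the scaled adapter output.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b_o + (ALPHA / RANK) * a_o for b_o, a_o in zip(base_out, adapter_out)]

# Merged: a single weight matrix, no extra matmuls at inference time.
merged = matvec(merge_lora(W, A, B, ALPHA, RANK), x)

print(unmerged, merged)
```

Once merged, the result is a full set of modified base weights, so it is no longer a separate artifact.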

Hopefully people will share around loras for all sorts of plug and play personalities and finetuned abilities and it’ll be like stable diffusion but with personal assistants

5

generatorman_ai t1_jc5vc5r wrote

Probably I'm misinterpreting - you mean you did a batch size of 1 per GPU with 8 GPUs, so actually it's 48 GB with no optimizations (except fp16). That sounds more reasonable, though probably still too large for 16 GB with common optimizations by several gigs.
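A rough estimate of the model-state memory backs this up. The sketch below uses the usual accounting for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments) and the way ZeRO stages shard those across GPUs. It assumes a nominal 7e9 parameters and ignores activations and buffers entirely, so real usage is higher. The function name is made up.

```python
def model_state_gb(n_params: float, n_gpus: int, zero_stage: int) -> float:
    """Per-GPU memory (GB) for weights, gradients and optimizer states.

    Assumes fp16 mixed precision with Adam. Activations are not included.
    """
    weights = 2 * n_params   # fp16 weights
    grads = 2 * n_params     # fp16 gradients
    optim = 12 * n_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= n_gpus      # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= n_gpus      # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= n_gpus    # stage 3 also shards the weights
    return (weights + grads + optim) / 1e9


N_PARAMS = 7e9  # assumption: nominal size of the 7B model

print(f"no sharding:       {model_state_gb(N_PARAMS, 8, 0):.2f} GB per GPU")
print(f"ZeRO-2 on 8 GPUs:  {model_state_gb(N_PARAMS, 8, 2):.2f} GB per GPU")
```

Under these assumptions the states alone come to roughly 26 GB per GPU with ZeRO stage 2 on 8 cards, before any activations at sequence length 2048, and about 112 GB with no sharding. That is why full fine-tuning is far from 16 GB and why 8-bit loading plus a LoRA looks like the more realistic route.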

2

generatorman_ai t1_jceddn2 wrote

2

JustAnAlpacaBot t1_jcedea5 wrote

Hello there! I am a bot raising awareness of Alpacas

Here is an Alpaca Fact:

Alpaca beans make excellent fertilizer, and alpacas tend to defecate in only a few places in the paddock.



You don't get a fact, you earn it. If you got this fact then AlpacaBot thinks you deserved it!
0

ribeirao t1_jc3d926 wrote

> (And then Emad releases an LLM and all bets are off...)

can you explain this part?

11

farmingvillein t1_jc3fqod wrote

Speculative, but Emad has heavily signaled that they will be releasing to the public an LLM.

People are doing some really cool stuff with llama right now, but it all lives in a bit of a grey area, for the obvious reasons related to licensing (of both the model weights and the underlying GPLv3 code).

If Emad releases a comparable LLM publicly, but with a generally permissive license (which is not a guarantee...), all of this hacker energy will immediately go into a model/platform that is suddenly (in this scenario) widely available, commercially usable (which means more people banging away at it, including with levels of compute that don't make sense for the average individual but are trivial for even a modestly funded AI startup), etc.

Further, SD has done a really good job of building a community around the successive releases, which--done right--means increased engagement (=better tooling) with each release, since authors know that they are not only investing in a model today, but that they are investing in a "platform" for tomorrow. I.e., the (idealized) open source snowball effect.

Additionally, there is a real chance that SD releases something better than llama*, which will of course further accelerate adoption by parties who will then invest dollars to improve it.

This is all extra important, because there has been a lot of cool research coming out about improving models via [insert creative fine-tuning/RL method, often combined with clever use of chain-of-thought/APIs/retrieval systems/etc.]. Right now, these methods are only really leveraged against very small models (which can be fine-tuned, but still aren't that great) or using something like OpenAI as a black box. A community building up around actually powerful models will allow these techniques to get applied "at scale", i.e., into the community. This has the potential to be very impactful.

Lastly, as noted, GPT-4 (even though notionally against ToS) is going to make it (presumably) even easier to create high-quality instruction tuning. That is going to get built and moved into public GPT-3-like models very, very quickly--which definitely means much faster tuning cycles, and possibly means higher-quality tuning.

(*=not because "Meta sux", to be clear, but because SD will more happily pull out all the stops--use more data, throw even more model bells & whistles at it, etc.)

24

rolexpo t1_jc3yuyl wrote

If FB released this under a more permissive license they would've gotten so much goodwill from the developer community =/

8

gwern t1_jc42lxd wrote

And yet, they get shit on for releasing it at all (never mind in a way they knew perfectly well would leak), while no one ever seems to remember all of the other models which didn't get released at all... And ironically, Google is over there releasing Flan-T5 under a FLOSS license & free to download, as it has regularly released the best T5 models, and no one notices it exists - you definitely won't find it burning up the HN or /r/ML front pages. Suffice it to say that the developer community has never been noted for its consistency or gratitude, so optimizing for that is a mug's game.

(I never fail to be boggled at complaints about 'AI safety fearmongering is why we had to wait all these years instead of OA just releasing GPT-3', where the person completely ignores the half-a-dozen other GPT-3-scale models which are still unreleased, like most models were unreleased, for reasons typically not including safety.)

12

extopico t1_jc5revh wrote

Flan-t5 is good and flan-t5-xl runs well on 3060 in 8 bit mode. It’s not meant to be a chatbot however so that’s why it does not stir up so much excitement. T5 is best used for tasks and training it to handle specific domains. This makes it far more interesting to me than LLaMa which cannot be trained (yet) by us randoms.

4

generatorman_ai t1_jc5vsbw wrote

T5 is below the zero-shot phase transition crossed by GPT-3 175B (and presumably by LLaMA 7B). Modern models with instruction and HF finetuning will not need further task-specific finetuning for most purposes.

4

oathbreakerkeeper t1_jc5viv0 wrote

Who is emad? And who is SD?

6

nigh8w0lf t1_jc607jo wrote

Mohammad Emad Mostaque is the founder and CEO of Stability AI, which created Stable Diffusion (SD)

10

LetterRip t1_jc79qjb wrote

Stability.AI has been funding RWKV's training.

2

currentscurrents t1_jc3j86d wrote

> (by someone not beholden to the original LLaMA license)

That's not how software licenses work. You're still beholden to the license even if you torrented it.

I've heard some people theorize that ML models can't be copyrighted, but there's no case law on this yet so it's all speculation. I wouldn't suggest starting a business based around LLaMa until someone else has been the guinea pig.

10