LetterRip

LetterRip t1_jc3864s wrote

The source code and the weights are under different licenses.

The LLaMA license in the request form appears to be the same.

Relevant part here

> a. Subject to your compliance with the Documentation and Sections 2, 3, and 5, Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes. The foregoing license is personal to you, and you may not assign or sublicense this License or any other rights or obligations under this License without Meta’s prior written consent; any such assignment or sublicense will be void and will automatically and immediately terminate this License.

https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform

as linked from

https://github.com/facebookresearch/llama

2

LetterRip t1_jbks0mg wrote

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

Thanks for sharing your results. It is being tuned to longer context lengths; the current checkpoint is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be a 6k and 8k as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify - was one of those meant to be 'former' and the other 'latter'?
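For reference, here is a rough sketch of how one might measure that per-position effect (hypothetical checkpoint name and input file; any causal LM and any sufficiently long document would do):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices for illustration: a small Pythia checkpoint and a local text file.
name = "EleutherAI/pythia-1.4b"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)

text = open("long_document.txt").read()            # assumed to be >2048 tokens long
ids = tok(text, return_tensors="pt").input_ids[:, :2048]

with torch.no_grad():
    logits = model(ids).logits                     # (1, seq_len, vocab)

# Per-token negative log-likelihood: token t is predicted from positions < t.
nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

ppl_early = nll[:1024].mean().exp().item()         # perplexity on the first 1024 tokens
ppl_late = nll[1024:].mean().exp().item()          # perplexity on the tokens after them
print(f"ppl first 1024: {ppl_early:.2f}   ppl after 1024: {ppl_late:.2f}")
```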

3

LetterRip t1_jbkmk5e wrote

> He makes it sound extraordinary

The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.

> The most extraordinary claim I got stuck up on was "infinite" ctx_len. One of the biggest limitations of transformers today is imo their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAi etc would want to investigate?

Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite 'in theory', in practice the 'effective context length' is about double the trained context length.

> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.

> Blink DL mentioned that when training with GPT Mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine tune it properly. (Unclear right now how to do so)

https://github.com/ArEnSc/Production-RWKV
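To illustrate what 'not having a fixed context length' means at inference time, here is a toy recurrent decoding loop. This is not RWKV's actual architecture, just a generic RNN sketch: the hidden state is a fixed-size vector updated token by token, so nothing in the mechanics caps how many tokens you can feed in; the limit is what the trained state can usefully retain.

```python
import torch
import torch.nn as nn

# Toy recurrent language model (NOT RWKV itself): embedding -> GRU cell -> output head.
vocab, dim = 50_000, 512
embed = nn.Embedding(vocab, dim)
cell = nn.GRUCell(dim, dim)
head = nn.Linear(dim, vocab)

state = torch.zeros(1, dim)                   # fixed-size hidden state
tokens = torch.randint(0, vocab, (10_000,))   # prompt could be arbitrarily long

with torch.no_grad():
    for t in tokens:                          # per-step memory stays constant
        state = cell(embed(t).unsqueeze(0), state)
    next_token_logits = head(state)           # predict the next token from the state

# In practice the *effective* context is bounded by what training taught the state
# to retain - roughly double the trained context length, as noted above.
```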

3

LetterRip t1_jbkdshr wrote

Here is what the author stated in the thread,

> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!

I think they knew it existed (i.e. they knew there was a deep learning project named RWKV), but they appear not to have known it met their scaling needs.

2

LetterRip t1_jbjphkw wrote

> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.

This was posted by DeepMind a month ago,

https://www.reddit.com/r/MachineLearning/comments/10ja0gg/r_deepmind_neural_networks_and_the_chomsky/

I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.

So prior to a month ago they didn't know it existed (edit - or at least not much more than that it existed) or that it happened to meet their use case.

> RWKV 7B came out 7 months ago but the concept has been promoted by the developer much longer.

There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

> 2) This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.

Until it had proved itself there was no motivation to put in the effort to figure it out. The lower the effort threshold, the more likely people are to have a look; the larger the threshold, the more likely people are to invest their limited time in the hundreds of other interesting bits of research that come out each week.

> If your idea is truly good you will get at attention sooner or later anyways.

Or be ignored for all time till someone else discovers the idea and gets credit for it.

In this case the idea has started to catch on and be discussed by 'the Big Boys'; people are cautiously optimistic and are investing time to start learning about it.

> I don't buy the argument that it's too new or hard to understand.

It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to worth more than minimal effort to understand it. Without a paper that exceeded the minimal effort threshold. Now it has proven itself with the 14B that it seems to scale. So people are beginning to invest the effort.

> It does not work as well as the developer claim or have some other flaw that makes it hard to scale for example (time judge of this)

No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.

The DeepMind paper lamenting the need for a fast way to train RNN models was posted only about a month ago.

4

LetterRip t1_jbjfiyg wrote

  1. The larger models (3B, 7B, 14B) have only been released quite recently

  2. Information about the design has been fairly scarce/hard to track down because no paper has been written on it and submitted

  3. People want to know that it actually scales before investing work into it.

  4. Mostly people are learning about it from the release links posted to Reddit, and those posts haven't been written in a way that attracts interest.

13

LetterRip t1_javpxbv wrote

> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit

GPT-3 came out in 2020 (they had their initial price then a modest price drop early on).

FlashAttention is from June 2022.

Fairly lossless quantization (especially int4) has only been figured out recently. Tim Dettmers' LLM.int8() is from August 2022.

https://arxiv.org/abs/2208.07339
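As a rough illustration of the core idea (absmax scaling, which LLM.int8() builds on before adding its outlier handling), here is a minimal int8 round-trip sketch:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-row symmetric absmax quantization: map each row into [-127, 127].
    scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096)              # a stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

rel_err = ((w - w_hat).abs().mean() / w.abs().mean()).item()
print(f"storage: {q.numel()} bytes (int8) vs {w.numel() * 2} bytes for the same matrix in fp16, "
      f"mean relative error ~ {rel_err:.2%}")
```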

> That seems large, which paper has that?

See

https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg

>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.

https://github.com/HazyResearch/flash-attention
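For a sense of where the quadratic memory goes, here is a back-of-the-envelope calculation (hypothetical batch and head counts) of the attention-score matrix that standard attention materializes and FlashAttention avoids storing:

```python
# Memory for the (seq_len x seq_len) attention scores in fp16 (2 bytes),
# summed over batch elements and heads. FlashAttention never materializes
# this matrix, so its extra memory grows only linearly with seq_len.
batch, heads = 8, 16                         # hypothetical sizes for illustration

def score_matrix_gib(seq_len: int) -> float:
    return batch * heads * seq_len * seq_len * 2 / 2**30

for n in (1024, 2048, 4096):
    print(f"seq_len={n}: ~{score_matrix_gib(n):.2f} GiB of attention scores")
# Doubling seq_len quadruples this term, which is why the savings grow with
# sequence length (10x at 2K and 20x at 4K in the figure above).
```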

1

LetterRip t1_jal4vgs wrote

Yep, or a mix between the two.

GLM-130B was quantized to int4, OPT and BLOOM to int8:

https://arxiv.org/pdf/2210.02414.pdf

Often you'll want to keep the first and last layer as int8 and can do everything else in int4. You can quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8 bits for weights and 4 bits for biases (or vice versa?).

Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "Advanced Concepts: Quantization Below 8 Bits":

https://arxiv.org/pdf/2103.13630.pdf

Here is a talk on auto48 (automatic mixed int4/int8 quantization)

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
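A minimal sketch of the kind of mixed scheme described above (hypothetical layer list; real tools such as auto48 choose bit widths from measured sensitivity rather than position alone):

```python
import torch

def absmax_quantize(w: torch.Tensor, bits: int):
    # Symmetric absmax quantization to the given bit width (127 levels for int8, 7 for int4).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp((w / scale).round(), -qmax, qmax)
    return q, scale

# Hypothetical stack of layer weights: keep the first and last at int8,
# quantize everything in between to int4.
layers = {f"layer_{i}": torch.randn(1024, 1024) for i in range(12)}

quantized = {}
for i, (name, w) in enumerate(layers.items()):
    bits = 8 if i in (0, len(layers) - 1) else 4
    quantized[name] = (*absmax_quantize(w, bits), bits)
```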

11

LetterRip t1_jajezib wrote

June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had only been introduced weeks prior, so the pricing was set based on float16 hardware.

Memory efficient attention is from a few months ago.

ChatGPT was just introduced a few months ago.

The question was how OpenAI could be making a profit. If they were making a profit on GPT-3's 2020 pricing, then they should be making 90% more profit per token on the new pricing.

52

LetterRip t1_jaj1kp3 wrote

> I have no idea how OpenAI can make money on this.

Quantizing to mixed int8/int4 gives roughly a 70% hardware reduction and a 3x speed increase compared to float16, with essentially no loss in quality.

A * 0.3 / 3 = 0.1 * A, i.e. about 10% of the cost.

Switch from quadratic to memory efficient attention. 10x-20x increase in batch size.

So we are talking about it taking roughly 1% of the resources against a 10x price reduction - they should be about 90% more profitable per token than when they introduced GPT-3.
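Spelling out that arithmetic with the same rough factors:

```python
# Relative serving cost per token, starting from float16 at GPT-3's 2020 launch.
cost = 1.0
cost *= 0.3        # ~70% hardware reduction from mixed int8/int4 quantization
cost /= 3          # 3x speed increase from the same quantization
print(cost)        # 0.1 -> "10% of the cost"

cost /= 10         # ~10x larger batches from memory-efficient attention
print(cost)        # 0.01 -> "about 1% of the resources"

price = 1.0 / 10   # the new API price is ~10x lower per token
# Revenue per token fell ~10x while cost fell ~100x, which is the basis for the
# claim that per-token margins should be better than at launch.
```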

edit - see MS DeepSpeed MII - showing a 40x per token cost reduction for Bloom-176B vs default implementation

https://github.com/microsoft/DeepSpeed-MII

Also there are additional ways to reduce cost not covered above - pruning, graph optimization, and teacher-student distillation. I think teacher-student distillation is extremely likely given reports that it has difficulty with more complex prompts.

252