LetterRip

LetterRip t1_jc3864s wrote

The source code and the weights are under different licenses.

The LLaMA license in the request form appears to be the same.

Relevant part here

> a. Subject to your compliance with the Documentation and Sections 2, 3, and 5, Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes. The foregoing license is personal to you, and you may not assign or sublicense this License or any other rights or obligations under this License without Meta’s prior written consent; any such assignment or sublicense will be void and will automatically and immediately terminate this License.

https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform

as linked from

https://github.com/facebookresearch/llama

2

LetterRip t1_jbks0mg wrote

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

Thanks for sharing your results. It is being tuned to longer context lengths; the current checkpoint is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be a 6k and 8k as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify - was one of those meant to be 'former' and the other 'latter'?
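For reference, here is a rough sketch of how one might measure that per-position effect (hypothetical checkpoint name and input file; any causal LM and any sufficiently long document would do):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices for illustration: a small Pythia checkpoint and a local text file.
name = "EleutherAI/pythia-1.4b"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)

text = open("long_document.txt").read()            # assumed to be >2048 tokens long
ids = tok(text, return_tensors="pt").input_ids[:, :2048]

with torch.no_grad():
    logits = model(ids).logits                     # (1, seq_len, vocab)

# Per-token negative log-likelihood: token t is predicted from positions < t.
nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

ppl_early = nll[:1024].mean().exp().item()         # perplexity on the first 1024 tokens
ppl_late = nll[1024:].mean().exp().item()          # perplexity on the tokens after them
print(f"ppl first 1024: {ppl_early:.2f}   ppl after 1024: {ppl_late:.2f}")
```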

3

LetterRip t1_jbkmk5e wrote

> He makes it sound extraordinary

The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.

> The most extraordinary claim I got stuck up on was "infinite" ctx_len. One of the biggest limitations of transformers today is imo their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAi etc would want to investigate?

Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite 'in theory', in practice the 'effective context length' is about double the trained context length.

> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.

> Blink DL mentioned that when training with GPT Mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine tune it properly. (Unclear right now how to do so)

https://github.com/ArEnSc/Production-RWKV
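To illustrate what 'not having a fixed context length' means at inference time, here is a toy recurrent decoding loop. This is not RWKV's actual architecture, just a generic RNN sketch: the hidden state is a fixed-size vector updated token by token, so nothing in the mechanics caps how many tokens you can feed in; the limit is what the trained state can usefully retain.

```python
import torch
import torch.nn as nn

# Toy recurrent language model (NOT RWKV itself): embedding -> GRU cell -> output head.
vocab, dim = 50_000, 512
embed = nn.Embedding(vocab, dim)
cell = nn.GRUCell(dim, dim)
head = nn.Linear(dim, vocab)

state = torch.zeros(1, dim)                   # fixed-size hidden state
tokens = torch.randint(0, vocab, (10_000,))   # prompt could be arbitrarily long

with torch.no_grad():
    for t in tokens:                          # per-step memory stays constant
        state = cell(embed(t).unsqueeze(0), state)
    next_token_logits = head(state)           # predict the next token from the state

# In practice the *effective* context is bounded by what training taught the state
# to retain - roughly double the trained context length, as noted above.
```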

3

LetterRip t1_jbkdshr wrote

Here is what the author stated in the thread,

> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!

I think they knew it existed (i.e. they knew there was a deep learning project named RWKV), but they appear not to have known it met their scaling needs.

2

LetterRip t1_jbjphkw wrote

> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.

This was posted by DeepMind a month ago,

https://www.reddit.com/r/MachineLearning/comments/10ja0gg/r_deepmind_neural_networks_and_the_chomsky/

I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.

So prior to a month ago they didn't know it existed (edit - or at least not much more than that it existed) or that it happened to meet their use case.

> RWKV 7B came out 7 months ago but the concept has been promoted by the developer much longer.

There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

> 2) This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.

Until it had proved itself there was no motivation to put in the effort to figure it out. The lower the effort threshold, the more likely people are to have a look; the larger the threshold, the more likely people are to invest their limited time in the hundreds of other interesting bits of research that come out each week.

> If your idea is truly good you will get at attention sooner or later anyways.

Or be ignored for all time till someone else discovers the idea and gets credit for it.

In this case the idea has started to catch on and be discussed by 'the Big Boys'; people are cautiously optimistic and are investing time to start learning about it.

> I don't buy the argument that it's too new or hard to understand.

It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to worth more than minimal effort to understand it. Without a paper that exceeded the minimal effort threshold. Now it has proven itself with the 14B that it seems to scale. So people are beginning to invest the effort.

> It does not work as well as the developer claim or have some other flaw that makes it hard to scale for example (time judge of this)

No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.

The DeepMind paper lamenting the need for a fast way to train RNN models was posted only about a month ago.

4

LetterRip t1_jbjfiyg wrote

  1. The larger models (3B, 7B, 14B) have only been released quite recently

  2. Information about the design has been fairly scarce/hard to track down because no paper has been written on it and submitted

  3. People want to know that it actually scales before investing work into it.

  4. Mostly people are learning about it from the release links posted to Reddit, and those posts haven't been written in a way that attracts interest.

13

LetterRip t1_javpxbv wrote

> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit

GPT-3 came out in 2020 (they had their initial price then a modest price drop early on).

FlashAttention is from June 2022.

Fairly lossless quantization (especially int4) has only been figured out recently. Tim Dettmers' LLM.int8() is from August 2022.

https://arxiv.org/abs/2208.07339
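As a rough illustration of the core idea (absmax scaling, which LLM.int8() builds on before adding its outlier handling), here is a minimal int8 round-trip sketch:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-row symmetric absmax quantization: map each row into [-127, 127].
    scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096)              # a stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

rel_err = ((w - w_hat).abs().mean() / w.abs().mean()).item()
print(f"storage: {q.numel()} bytes (int8) vs {w.numel() * 2} bytes for the same matrix in fp16, "
      f"mean relative error ~ {rel_err:.2%}")
```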

> That seems large, which paper has that?

See

https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg

>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.

https://github.com/HazyResearch/flash-attention
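For a sense of where the quadratic memory goes, here is a back-of-the-envelope calculation (hypothetical batch and head counts) of the attention-score matrix that standard attention materializes and FlashAttention avoids storing:

```python
# Memory for the (seq_len x seq_len) attention scores in fp16 (2 bytes),
# summed over batch elements and heads. FlashAttention never materializes
# this matrix, so its extra memory grows only linearly with seq_len.
batch, heads = 8, 16                         # hypothetical sizes for illustration

def score_matrix_gib(seq_len: int) -> float:
    return batch * heads * seq_len * seq_len * 2 / 2**30

for n in (1024, 2048, 4096):
    print(f"seq_len={n}: ~{score_matrix_gib(n):.2f} GiB of attention scores")
# Doubling seq_len quadruples this term, which is why the savings grow with
# sequence length (10x at 2K and 20x at 4K in the figure above).
```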

1

LetterRip t1_jal4vgs wrote

Yep, or a mix between the two.

GLM-130B was quantized to int4, OPT and BLOOM to int8:

https://arxiv.org/pdf/2210.02414.pdf

Often you'll want to keep the first and last layer as int8 and can do everything else in int4. You can quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8 bits for weights and 4 bits for biases (or vice versa?).

Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "Advanced Concepts: Quantization Below 8 Bits":

https://arxiv.org/pdf/2103.13630.pdf

Here is a talk on auto48 (automatic mixed int4/int8 quantization)

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
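A minimal sketch of the kind of mixed scheme described above (hypothetical layer list; real tools such as auto48 choose bit widths from measured sensitivity rather than position alone):

```python
import torch

def absmax_quantize(w: torch.Tensor, bits: int):
    # Symmetric absmax quantization to the given bit width (127 levels for int8, 7 for int4).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp((w / scale).round(), -qmax, qmax)
    return q, scale

# Hypothetical stack of layer weights: keep the first and last at int8,
# quantize everything in between to int4.
layers = {f"layer_{i}": torch.randn(1024, 1024) for i in range(12)}

quantized = {}
for i, (name, w) in enumerate(layers.items()):
    bits = 8 if i in (0, len(layers) - 1) else 4
    quantized[name] = (*absmax_quantize(w, bits), bits)
```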

11

LetterRip t1_jajezib wrote

June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had only been introduced weeks prior, so the pricing was set based on float16 hardware.

Memory efficient attention is from a few months ago.

ChatGPT was just introduced a few months ago.

The question was how OpenAI could be making a profit. If they were making a profit on GPT-3's 2020 pricing, then they should be making 90% more profit per token on the new pricing.

52

LetterRip t1_jaj1kp3 wrote

> I have no idea how OpenAI can make money on this.

Quantizing to mixed int8/int4 gives roughly a 70% hardware reduction and a 3x speed increase compared to float16, with essentially no loss in quality.

A * 0.3 / 3 = 0.1 * A, i.e. about 10% of the cost.

Switch from quadratic to memory efficient attention. 10x-20x increase in batch size.

So we are talking about it taking roughly 1% of the resources against a 10x price reduction - they should be about 90% more profitable per token than when they introduced GPT-3.
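Spelling out that arithmetic with the same rough factors:

```python
# Relative serving cost per token, starting from float16 at GPT-3's 2020 launch.
cost = 1.0
cost *= 0.3        # ~70% hardware reduction from mixed int8/int4 quantization
cost /= 3          # 3x speed increase from the same quantization
print(cost)        # 0.1 -> "10% of the cost"

cost /= 10         # ~10x larger batches from memory-efficient attention
print(cost)        # 0.01 -> "about 1% of the resources"

price = 1.0 / 10   # the new API price is ~10x lower per token
# Revenue per token fell ~10x while cost fell ~100x, which is the basis for the
# claim that per-token margins should be better than at launch.
```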

edit - see MS DeepSpeed MII - showing a 40x per token cost reduction for Bloom-176B vs default implementation

https://github.com/microsoft/DeepSpeed-MII

Also there are additional ways to reduce cost not covered above - pruning, graph optimization, and teacher-student distillation. I think teacher-student distillation is extremely likely given reports that it has difficulty with more complex prompts.

252