I try the "Alpaca prompt" on RWKV 14B ctx8192, and to my surprise it works out of box without any finetuning (RWKV is a 100% RNN trained on 100% Pile v1 and nothing else):

https://preview.redd.it/fciatottq7oa1.png?width=1046&format=png&auto=webp&v=enabled&s=891904adbadefb5902b86f67098c852da88dc167

You are welcome to try it in RWKV 14B Gradio (click examples below the panel):

https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

Tips: try "Expert Response" or "Expert Long Response" or "Expert Full Response" too.

https://preview.redd.it/qo71b85vq7oa1.png?width=2516&format=png&auto=webp&v=enabled&s=c4b1717754d03e28b4bba01530672935407e7797

===================

ChatRWKV v2 is now using a CUDA kernel to optimize INT8 inference (23 token/s on 3090): https://github.com/BlinkDL/ChatRWKV

Upgrade to latest code and "pip install rwkv --upgrade" to 0.5.0, and set os.environ["RWKV_CUDA_ON"] = '1' in v2/chat.py to enjoy the speed.

The inference speed (and VRAM consumption) of RWKV is independent of ctxlen, because it's an RNN (note: currently the preprocessing of a long prompt takes more VRAM but that can be optimized because we can process in chunks).

Meanwhile I find the latest RWKV-4-Pile-14B-20230313-ctx8192-test1050 model can utilize a long ctx:

https://preview.redd.it/a68dw0hzq7oa1.png?width=398&format=png&auto=webp&v=enabled&s=307e4d7847cb03cab3930b3ea07e9b2f856c9b1c

Comments

You must log in or register to comment.

ThePerson654321 t1_jcjbrg5 wrote on March 17, 2023 at 6:36 AM

Wen paper?

bo_peng OP t1_jcjupnc wrote on March 17, 2023 at 11:00 AM

Soon :) working on it. Meanwhile take a look at https://github.com/ridgerchu/SpikeGPT which is a SNN version of RWKV, so has some explanation in the paper.

FallUpJV t1_jcje89y wrote on March 17, 2023 at 7:10 AM

I don't get this anymore if it's not the model size nor the transformer architecture then what is it?

Models were just not trained enough / not on the right data?

xEdwin23x t1_jcjfnlj wrote on March 17, 2023 at 7:31 AM

First, this is not a "small" model so size DOES matter. It may not be hundreds billion parameters but it's definitely not small imo.

Second, it always has been (about data) astronaut pointing gun meme.

FallUpJV t1_jclpydo wrote on March 17, 2023 at 7:09 PM

Yes it's definitely not small, I meant comparated to the models people have been paying attention to the most on the last few years I guess.

The astronaut pointing gun meme is a good analogy, almost a scary one, I wonder how much we could improve existing models with simply better data.

MysteryInc152 t1_jclpjzi wrote on March 17, 2023 at 7:07 PM

It's predicting language. as long as the structure can allow properly to learn to predict language, you're good to go.

turnip_burrito t1_jcoul9i wrote on March 18, 2023 at 12:41 PM

Yes, exactly. Everyone keeps leaving the architecture's inductive structural priors out of the discussion.

It's not all about data! The model matters too!

satireplusplus t1_jcp6bu4 wrote on March 18, 2023 at 2:20 PM

This model uses a "trick" to efficiently train RNNs at scale and I still I have to take a look to understand how it works. Hopefully the paper is out soon!

Otherwise size is what matters! To get there it's a combination of factors - the transformer architecture scales well and was the first architecture that allowed to train these LLMs cranked up to enormous sizes. Enterprise GPU hardware with lots of memory (40G, 80G) and frameworks like pytorch that make parallelizing training across multiple GPUs easy.

And OPs 14B model might be "small" by today's standard, but its still gigantic compared to a few years ago. It's ~27GB of FP16 weights.

Having access to 1TB of preprocessed text data that you can download right away without doing your own crawling is also neat (pile).

londons_explorer t1_jcj8p9y wrote on March 17, 2023 at 5:56 AM

Can we run things like this through github.com/OpenAI/evals?

They have now got a few hundred tests, which is a good way to gauge performance.

Taenk t1_jckzuxm wrote on March 17, 2023 at 4:23 PM

Sorry, I am not an expert, just an enthusiast, so this is a stupid question: Where can I see a list of these few hundred tests and is there some page where I can see comparisons between different models?

bo_peng OP t1_jcjuvg9 wrote on March 17, 2023 at 11:02 AM

Yeah that will be cool. You are welcome to try it and I can help.

The rwkv pip package: https://pypi.org/project/rwkv/

cipri_tom t1_jcjeehj wrote on March 17, 2023 at 7:13 AM

This is great! It just needs a name that's as great as the work

RWKV is a tongue twister. How about Ruckus?

cipri_tom t1_jcjr8rn wrote on March 17, 2023 at 10:17 AM

Man, ChatRNN
The stars would be pouring over the repo if you named it ChatRNN. People love an antagonist, and "going back to the old days" and proving that was better

bo_peng OP t1_jcjuejz wrote on March 17, 2023 at 10:57 AM

ChatRNN is indeed a great name :)

R W K V are the four major parameters in RWKV (similar to QKV for attention).

I guess you can pronounce it like "Rwakuv" (A bit like racoon)

schwagggg t1_jckar2a wrote on March 17, 2023 at 1:32 PM

i thought it was

“r - dub - kay - vi”

which is a little long but unique

LetterRip t1_jcl6axl wrote on March 17, 2023 at 5:03 PM

Rocky

gliptic t1_jcjpy0h wrote on March 17, 2023 at 10:00 AM

What's wrong with Arveycavey ;).

cipri_tom t1_jcjr74y wrote on March 17, 2023 at 10:17 AM

first vey is not vey ! :)

yehiaserag t1_jcj305q wrote on March 17, 2023 at 4:50 AM

How does that version compare to "RWKV-4-Pile-14B-20230228-ctx4096-test663"?

bo_peng OP t1_jcjuinf wrote on March 17, 2023 at 10:58 AM

More ctxlen and slightly better trained :) same speed & vram

[deleted] t1_jclws3b wrote on March 17, 2023 at 7:54 PM

[deleted]

yehiaserag t1_jcm31zk wrote on March 17, 2023 at 8:36 PM

We say RWKV for short, the rest of the stuff is for a specific version

[deleted] t1_jcs3icv wrote on March 19, 2023 at 3:09 AM

[removed]

[deleted] t1_jcs3lrc wrote on March 19, 2023 at 3:10 AM

[removed]

blueSGL t1_jcjga2i wrote on March 17, 2023 at 7:40 AM

Is it possible to split the model and do inference across multiple lower VRAM GPUs or does a single card have to have the minimum 16gig VRAM?

bo_peng OP t1_jcjuhix wrote on March 17, 2023 at 10:58 AM

Yes ChatRWKV v2 supports that :)

Take a look at the "strategy" guide: https://pypi.org/project/rwkv/

mikljohansson t1_jckedf9 wrote on March 17, 2023 at 2:00 PM

Very interesting work! I've been following this project for a while now

Can I ask a few questions?

What's the difference between RWKV-LM and ChatRWKV, e.g. is ChatRWKV mainly RWKV-LM but streamlined for inference and ease of use, or is there more differences?
Are you planning to fine tune on the Stanford Alpaca dataset (like was recently done for LLaMa and GPT-J to create instruct versions of them), or a similar GPT-generated instruction dataset? I'd love to see a instruct-tuned version of RWKV-LM 14B with a 8k+ context len!

bo_peng OP t1_jcmajpx wrote on March 17, 2023 at 9:26 PM

RWKV-LM is now mainly for training, while ChatRWKV is for optimal inference.
Someone in RWKV Discord tried it using LoRA (https://github.com/Blealtan/RWKV-LM-LoRA) and the result is quite nice. Join RWKV Discord for latest updates :)

sanderbaduk t1_jcjy47g wrote on March 17, 2023 at 11:38 AM

Does it have trouble stopping? I see it ramble on, e.g. https://imgur.com/a/e6k7pSP

bo_peng OP t1_jck4qkr wrote on March 17, 2023 at 12:41 PM

I manually disabled the <|endoftext|> token in the demo, so it can output irrelevant contents after a task is completed :)

yaosio t1_jckchbe wrote on March 17, 2023 at 1:46 PM

I like it's plan to make money. Did it learn from wallstreetbets?

acertainmoment t1_jcm22en wrote on March 17, 2023 at 8:29 PM

tried a small modification to one of the examples on huggingface :)

https://ibb.co/zNS3H1J