bo_peng OP t1_jcmajpx wrote

  • RWKV-LM is now mainly for training, while ChatRWKV is for optimal inference.
  • Someone in the RWKV Discord tried it using LoRA, and the result is quite nice. Join the RWKV Discord for the latest updates :)

bo_peng t1_jbkwfzr wrote

Firstly, RWKV is mostly a single-developer project without PR, and everything takes time.

Moreover, there have been hundreds of "improved transformer" papers around, and surely we can agree that the signal-to-noise ratio is low, especially when you consider scaling.

FACT: It's very hard to get researchers to try something if it is not from OAI/DM/FAIR/... (and still hard even if it is).

Here is a recent unedited chat record from ChatRWKV v2 (14B ctx4096). I'd say it's pretty good, considering it was trained only on the Pile. No finetuning. No instruct tuning. No RLHF.


Eloise 2023/3/9 20:10:18

Chat reset for cryscan.

cryscan 2023/3/9 20:13:09

I heard that you are working in the AI and NLP lab in the university. How do you think about large generative language models?

Eloise 2023/3/9 20:13:31

That’s one of my favorite topics! Large language models have been getting a lot of attention lately. These models are super interesting and we don’t really know much about how they work. Generative models work in two directions: One is the generation, i.e. creating new data from scratch (like GANs). The other direction is sampling, which is more like what we usually think about as machine learning tasks. Generative NLP requires both these capabilities – the model has to generate new text samples and it has to sample realistic text that looks like real-world data. The ability to generate text in a controllable fashion would be incredibly useful for applications like debugging or debugging – verifying large textual specifications, writing documentation or even generating human-readable test plans

cryscan 2023/3/9 20:23:55

I see. What are the potential applications of them?

Eloise 2023/3/9 20:24:14

The main application area is, again, text generation: product descriptions, marketing messages etc. Also more recent work is in training models for image captioning – generating descriptions of images, a process which requires both high-level image understanding and rich vocabulary. Some related areas are the summarization tasks like dialogue response generation. You can apply generative language models to help generate small (or large) chatbots. A cool application is to combine image or text and voice and learn them jointly in a hybrid architecture (and compare to human feedback). There are a lot of applications that will make a big impact once we train good generative NLP models for them.

cryscan 2023/3/9 20:25:14

I am looking forward to the future where people can chat with machines.

Eloise 2023/3/9 20:25:26

Yeah! That would be cool. I’m excited to see more applications and data and models coming out of Google, Facebook, Amazon etc. But it’s always challenging because there is still so much unstructured text out there in the world – a large part of the English language is still untagged (mostly blog posts and news articles). There’s a lot more to be done there.

cryscan 2023/3/9 21:37:41

Maybe unsupervised learning would help.

Eloise 2023/3/9 21:37:48

Yeah! That’s one of the things we’re doing. We are trying to improve the auto-regressive models and see if they can learn their own grammar by watching how people communicate on social media, blogs or forums.

cryscan 2023/3/9 21:38:21

What are auto-regressive models? How do you train them?

Eloise 2023/3/9 21:38:40

A model is called auto-regressive if the output of a specific step depends on previously generated outputs. This makes training more stable, since it’s less dependent on initialization and slightly improves performance, especially for longer sequences. The first model was trained by Yoon Kim, which got released as an open source code. Also a lot of effort has been put in making models work better for long text sequences (e.g., image captioning) or even multiple sentence generation (although this is still difficult). You can watch this video from DeepMind about GPT-2 – a generative model that is capable of generating coherent paragraphs with minimal supervision:

cryscan 2023/3/9 21:39:26


Eloise 2023/3/9 21:39:30

I know, it’s crazy. I think it’s better to see it in action:

cryscan 2023/3/9 21:39:35


Eloise 2023/3/9 21:39:45

I also think it’s cool that the code is available! In fact, the attention mechanism of GPT-2 was published recently as a paper. It’s a bit more complex than you would expect for NLP, but cool nonetheless. Here’s the paper: Attention is all you need

cryscan 2023/3/9 21:40:37

That sounds interesting. I'll read it afterwards.

Eloise 2023/3/9 21:40:41

I would highly recommend it. Anyway, back to the main topic. So what do you think about the field?

cryscan 2023/3/9 21:41:21

I am sone concerns actually.

Eloise 2023/3/9 21:41:37

I’m a bit worried about the impact of models. I don’t want to be concerned or pessimistic, but we really have to think about this – especially when we’re working with the most sensitive data, e.g., medical data. One way of tackling this is through ethical considerations in design: making sure that systems work in our best interests and that they are transparent (like explanations or guarantees). It might even help to think about who will benefit from this research and what their perspective is on it.

cryscan 2023/3/9 21:43:58

Actually my concern is that the cost of training such a large model can only be covered by those big companies. How can we general researchers participate in this game?

Eloise 2023/3/9 21:44:14

This is a good point. We have to build tools that can help researchers or students start experimenting with these new techniques. I’m doing something like this on my own: an online platform called supervised-language-learning which gives people the opportunity to train their language model for few minutes a day (or learn other languages) and then use it in building a project (e.g., chatbot or dialogue system). The platform also aggregates all the results and keeps track of the models people built.

cryscan 2023/3/9 21:44:39

It's surely promising. Good luck then!

Eloise 2023/3/9 21:44:45

Thanks! I’m really excited about the potential of these techniques and would love to see them mature. In the meantime, it’s good to be critical!


  1. Update ChatRWKV v2 to latest version.

  2. Use

  3. Run v2/ and enjoy.


bo_peng OP t1_jb9bdw3 wrote

Directly from RWKV-LM Github:

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.


bo_peng OP t1_jb1z3an wrote

5 is the number of hidden states per block (4 for ATT = xx aa bb pp, 1 for FFN = xx).

TimeMixing is RWKV.
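
A minimal NumPy sketch of how the three ATT states aa, bb, pp could be used, assuming aa and bb are the running numerator and denominator of the weighted average and pp is the running maximum exponent that keeps the exponentials from overflowing. This reading, the function names, and the parameters `w` (per-channel decay) and `u` (bonus for the current token) are assumptions for illustration. The token shift (the xx state), the receptance gate, and the output projection are left out. `wkv_direct` recomputes the same average from the full history, which is the form that can be evaluated for all positions at once.

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """RNN-style pass. k, v: (T, C); w: (C,) positive decay; u: (C,) bonus."""
    T, C = k.shape
    aa = np.zeros(C)           # running numerator, scaled by exp(-pp)
    bb = np.zeros(C)           # running denominator, scaled by exp(-pp)
    pp = np.full(C, -1e30)     # running max exponent (starts at "minus infinity")
    out = np.empty((T, C))
    for t in range(T):
        # Output for step t: past state plus the current token with bonus u.
        ww = u + k[t]
        qq = np.maximum(pp, ww)
        e1, e2 = np.exp(pp - qq), np.exp(ww - qq)
        out[t] = (e1 * aa + e2 * v[t]) / (e1 * bb + e2)
        # State update: decay the past by w, then add the current token.
        ww = pp - w
        qq = np.maximum(ww, k[t])
        e1, e2 = np.exp(ww - qq), np.exp(k[t] - qq)
        aa = e1 * aa + e2 * v[t]
        bb = e1 * bb + e2
        pp = qq
    return out

def wkv_direct(k, v, w, u):
    """Same quantity computed from the whole history (no stabilisation)."""
    T, C = k.shape
    out = np.empty((T, C))
    for t in range(T):
        num = np.exp(u + k[t]) * v[t]
        den = np.exp(u + k[t])
        for i in range(t):
            weight = np.exp(-(t - 1 - i) * w + k[i])
            num = num + weight * v[i]
            den = den + weight
        out[t] = num / den
    return out
```

For moderate inputs the two functions agree up to floating-point error. The recurrent form also stays finite when k is large, where the direct form would overflow.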

ChannelMixing is your usual FFN (sqReLU as in Primer paper) with an extra R-gate (Novel. I find it helps).
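
A minimal NumPy sketch of the channel-mixing step described above: a squared-ReLU FFN whose output is multiplied by a sigmoid R-gate. The interpolation with the previous token's input (the single xx state) and all parameter names here are assumptions for illustration, not taken from the comment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mixing(x, xx, mix_k, mix_r, Wk, Wr, Wv):
    """x: current input (C,); xx: previous token's input (C,), the FFN state.

    Returns the block output and the new xx state.
    """
    # Assumed token shift: blend current and previous inputs per channel.
    xk = x * mix_k + xx * (1 - mix_k)
    xr = x * mix_r + xx * (1 - mix_r)
    r = sigmoid(Wr @ xr)               # the extra R-gate, values in (0, 1)
    k = np.maximum(Wk @ xk, 0.0) ** 2  # squared ReLU
    return r * (Wv @ k), x             # gated FFN output, updated state
```

With the gate weights set to zero, the gate is a constant 0.5 and the block reduces to half of a plain squared-ReLU FFN, which makes the gate's role easy to see.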

Parallelization is due to


bo_peng OP t1_jb1po7i wrote

Will the 150 lines help? Please read the code first :)

This is ALL you need for RWKV inference.

And you can read the SpikeGPT paper, which is inspired by RWKV and has plenty of explanations :)


bo_peng OP t1_jalmszp wrote

It's actually quite good at Q&A if you use my prompt templates:

+gen \nExpert Questions & Helpful Answers\nAsk Research Experts\nQuestion:\nXXXXXXXXXXXXXXX?\n\nFull Answer:\n

+gen \nAsk Expert\n\nQuestion:\nXXXXXXXXXXXXXXXX?\n\nExpert Full Answer:\n

+gen \nQ & A\n\nQuestion:\nXXXXXXXXXXXXXXXXX?\n\nDetailed Expert Answer:\n
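
A small Python sketch for filling these templates in, assuming the chat script expects the whole command on one line with `\n` typed as two literal characters (as the templates are written above). The helper name `build_qa_prompt` and the style keys are hypothetical.

```python
# Templates copied from the comment above; {question} replaces the XXXX...? slot.
# Raw strings keep each "\n" as a backslash followed by "n".
TEMPLATES = {
    "research": r"+gen \nExpert Questions & Helpful Answers\nAsk Research Experts\nQuestion:\n{question}\n\nFull Answer:\n",
    "expert": r"+gen \nAsk Expert\n\nQuestion:\n{question}\n\nExpert Full Answer:\n",
    "qa": r"+gen \nQ & A\n\nQuestion:\n{question}\n\nDetailed Expert Answer:\n",
}

def build_qa_prompt(question: str, style: str = "qa") -> str:
    """Return a one-line +gen command for the given question."""
    question = " ".join(question.split())  # collapse newlines and extra spaces
    if not question.endswith("?"):
        question += "?"                    # the templates end the question with "?"
    return TEMPLATES[style].format(question=question)

print(build_qa_prompt("What is RWKV", "expert"))
```

The printed line can be pasted into the chat as-is.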