Submitted by bo_peng t3_10eh2f3 in MachineLearning
Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens) and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint.
RWKV is an RNN that also works as a linear transformer (or, equivalently, a linear transformer that also works as an RNN). So it has both a parallel mode and a serial mode, and you get the best of both worlds: fast training and VRAM-efficient inference.
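To make the dual-mode idea concrete, here is a toy sketch (not the actual RWKV code; the keys and values are made up, and it omits the time-decay and bonus terms of the real WKV operator) showing how a linear attention-style weighted average over a prefix can be computed either all at once in parallel (transformer-style training) or step by step with O(1) state (RNN-style inference):

```python
import math

# Hypothetical per-token keys and values for a 4-token sequence.
k = [0.1, -0.5, 0.3, 0.8]
v = [1.0, 2.0, 3.0, 4.0]

# Parallel mode: each position looks back over its whole prefix at once,
# so all positions can be computed simultaneously during training.
parallel = []
for t in range(len(k)):
    num = sum(math.exp(k[i]) * v[i] for i in range(t + 1))
    den = sum(math.exp(k[i]) for i in range(t + 1))
    parallel.append(num / den)

# Serial (RNN) mode: carry two running scalars of state per channel,
# so generation needs constant memory per step instead of the full history.
serial, num, den = [], 0.0, 0.0
for kt, vt in zip(k, v):
    num += math.exp(kt) * vt
    den += math.exp(kt)
    serial.append(num / den)

# Both modes produce identical outputs.
assert all(abs(a - b) < 1e-9 for a, b in zip(parallel, serial))
print(serial)
```

The key point is that the per-step update is a simple running sum, so the same model can be trained like a transformer and run like an RNN.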
At this moment, RWKV might be the only pure RNN that scales like standard transformers for language modeling, without using any QKV attention. It's also great at preserving long context (unlike LSTMs).
Moreover, you get a smooth, spike-free, carefree training experience (bf16 & Adam).
As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and there are few conversations in the Pile, so don't expect great quality. But it's already fun. Chat examples (using slightly earlier checkpoints):
And you can chat with the bot (or try free generation) in the RWKV Discord (link in the GitHub readme: https://github.com/BlinkDL/RWKV-LM ). This is an open-source project, so let's build it together.
currentscurrents t1_j4rcc3e wrote
Interesting! I haven't heard of RWKV before.
Getting rid of attention seems like a good way to increase training speed (since training all those attention heads at once is slow), but how can it work so well without attention?
Also, aren't RNNs usually slower to train than transformers, because they can't be parallelized?