Submitted by bo_peng t3_yxt8sa in MachineLearning

Hi everyone. I have finished training RWKV-4 7B (an attention-free RNN LLM) and it can match GPT-J (6B params) performance. Maybe RNN is already all you need :)

https://preview.redd.it/71cce2y75j0a1.png?width=1336&format=png&auto=webp&s=5af76abc4f42fd63f0194ee93f78db01c1b21d97

These are RWKV BF16 numbers. RWKV 3B is better than GPT-Neo 2.7B on everything (smaller RWKV models lag behind on LAMBADA). Note that GPT-J uses rotary embeddings and is therefore quite a bit better than GPT-Neo, so I expect RWKV to surpass it at 14B.

Previous discussion: https://www.reddit.com/r/MachineLearning/comments/xfup9f/r_rwkv4_scaling_rnn_to_7b_params_and_beyond_with/

RWKV has both an RNN mode and a GPT mode. The RNN mode is great for inference; the GPT mode is great for training. Both modes are faster than a usual transformer and save VRAM, because the self-attention mechanism is replaced by simpler (almost linear) formulas. Moreover, the hidden state is tiny in the RNN mode, so you can use it as an embedding of the whole context.
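For intuition, here is a minimal sketch (not the official CUDA kernel) of the per-token "WKV" recurrence that stands in for self-attention. It uses per-channel decay `w` and current-token bonus `u`; the real code parameterizes the decay differently, adds a numerical-stability trick, and also includes the time-shift and channel-mixing parts.

```python
import torch

def wkv_step(k_t, v_t, w, u, num, den):
    """One RNN-mode step; everything is a per-channel vector of shape (C,).

    k_t, v_t: key/value projections of the current token
    w, u:     learned decay and current-token bonus (decay assumed positive here)
    num, den: running weighted sums -- together, the whole recurrent state
    """
    # Output: exponentially decayed average of past values plus the current token.
    wkv = (num + torch.exp(u + k_t) * v_t) / (den + torch.exp(u + k_t))
    # Decay the past and fold the current token into the state.
    num = torch.exp(-w) * num + torch.exp(k_t) * v_t
    den = torch.exp(-w) * den + torch.exp(k_t)
    return wkv, num, den

C = 8                                   # toy channel count
w, u = torch.rand(C), torch.rand(C)     # stand-ins for learned parameters
num, den = torch.zeros(C), torch.zeros(C)
for k_t, v_t in zip(torch.randn(5, C), torch.randn(5, C)):
    out, num, den = wkv_step(k_t, v_t, w, u, num, den)
```

Since each step reads and writes only the fixed-size `(num, den)` state, per-token inference cost is constant in context length, which is where the speed and VRAM savings come from.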

Github: https://github.com/BlinkDL/RWKV-LM

Checkpt: https://huggingface.co/BlinkDL/rwkv-4-pile-7b

14B in progress (thanks to EleutherAI and Stability). Nice spike-free loss curves:

https://preview.redd.it/w4g7oqmi5j0a1.png?width=868&format=png&auto=webp&s=346d420fb879fd06470079eeaf2e4d3739536406

172

Comments


clauwen t1_iwqwjd0 wrote

I have to say, i really like that somebody is doing this, no matter the outcome.

45

ChuckSeven t1_iwqey5x wrote

what is the size of the opt model you are comparing with in that table?

20

Competitive-Rub-1958 t1_iwqmaic wrote

It does need more parameters to compensate (for instance, it has nearly a billion more parameters than GPT-J-6B without substantial performance gains) while losing out on LAMBADA (ignoring the weighted average, as I don't really understand the point of weighting it; it distorts the metrics).

It's an extremely interesting direction, but I fear that as you scale this model the scaling plot might start to flatten out, much like other RNN rewrites/variants. Hope further research is able to pinpoint the underlying issue and fix it. Till then, best of luck to OP! 👍

16

bo_peng OP t1_iwua2xh wrote

RWKV 7B is faster than GPT-J 6B, and RWKV actually scales great :)

If you check the table, RWKV is better than GPT-Neo on everything at 3B (while smaller RWKV models lag behind on LAMBADA).

But GPT-J uses rotary embeddings and is therefore quite a bit better than GPT-Neo, so I expect RWKV to surpass it at 14B.

Moreover, RWKV 3B becomes stronger after being trained on more tokens, and I am doing the same for the 7B model.

8

CKtalon t1_iwqk0b9 wrote

It’s written in the 2nd column (params)

4

violentdeli8 t1_iwrvqkf wrote

I cannot say how much I love that you are doing this!

10

m98789 t1_iwr0bdt wrote

How can it be used in a multi-label text classification task?

3

bo_peng OP t1_iwtscuw wrote

You can try the RNN hidden state: use it as an embedding of the text and train a classifier head on top.

2
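A minimal sketch of that suggestion, assuming you first run each text through the model in RNN mode and keep the final hidden state as a fixed embedding; the sizes, the frozen-backbone setup, and the linear head are illustrative, not part of the repo:

```python
import torch
import torch.nn as nn

HIDDEN = 4096   # illustrative hidden-state size
LABELS = 20     # number of labels in the task

head = nn.Linear(HIDDEN, LABELS)
loss_fn = nn.BCEWithLogitsLoss()    # independent sigmoid per label => multi-label
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# Placeholder batch: in practice each row is the final RNN hidden state
# obtained by feeding one document through RWKV token by token.
emb = torch.randn(32, HIDDEN)
targets = torch.randint(0, 2, (32, LABELS)).float()   # multi-hot label vectors

logits = head(emb)                  # (32, LABELS) raw scores
loss = loss_fn(logits, targets)
opt.zero_grad(); loss.backward(); opt.step()
```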

OverleafMan t1_iwrxwt5 wrote

Great results! What is the meaning of the first row in your table?

2

Ford_O t1_iwtrw98 wrote

How much faster is RNN inference than GPT-J?

2

bo_peng OP t1_iwts867 wrote

RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested with simple PyTorch code (no custom CUDA kernel), GPU utilization 45%, VRAM 7823M

GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (at ctxlen 1000), tested with HF Transformers, GPU utilization also 45% (interesting), VRAM 9655M

Moreover, RWKV-4 runs in bf16 and is faster than 16-bit GPT models.

Training speed: RWKV-4 1.5B BF16 ctxlen1024 = 106K tokens/s on 8xA100 40G.

8
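For reference, a generic sketch of how such sec/token numbers can be measured (not the script behind the figures above). The RNN step costs the same at any context length, while a transformer step re-attends over the growing context, so the gap widens with longer prompts:

```python
import time
import torch

def _sync():
    # Flush queued GPU work so wall-clock timing is meaningful.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def seconds_per_token(step_fn, n_tokens=200, warmup=20):
    """Average wall-clock time of one generation step (any callable)."""
    for _ in range(warmup):            # let kernels and caches settle first
        step_fn()
    _sync()
    t0 = time.time()
    for _ in range(n_tokens):
        step_fn()
    _sync()
    return (time.time() - t0) / n_tokens
```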

Ford_O t1_iwtx6nb wrote

Could you also measure the performance on CPU?

3

Ford_O t1_iwtxi5i wrote

How much smaller are the embeddings?

2

yazriel0 t1_iwvipkh wrote

Great stuff, and much needed!! Transformers are expensive.

Is the RNN mode suitable for the efficiently updatable neural networks (NNUE) used in tree-search games? This is where the next tree node's evaluation re-uses the previous node's.

2
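In principle the small recurrent state makes exactly this kind of incremental, fork-and-evaluate usage cheap. A hedged sketch of the idea, assuming a hypothetical wrapper `rwkv_step(state, token) -> (logits, new_state)` around the RNN-mode forward pass (not an actual API of RWKV-LM):

```python
import copy

def expand_node(parent_state, candidate_tokens):
    """Score each child with one incremental RNN step from the parent.

    The whole prefix lives in the small recurrent state, so forking a node
    is just copying that state -- no re-processing of earlier tokens, the
    same incremental-update property NNUE exploits in game trees.
    """
    children = []
    for tok in candidate_tokens:
        # rwkv_step is a hypothetical helper, not part of the repo.
        logits, child_state = rwkv_step(copy.deepcopy(parent_state), tok)
        children.append((tok, logits, child_state))
    return children
```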

WikiSummarizerBot t1_iwvir6n wrote

Efficiently updatable neural network

>An efficiently updatable neural network (NNUE, a Japanese wordplay on Nue, sometimes stylised as ƎUИИ) is a neural network-based evaluation function whose inputs are piece-square tables, or variants thereof like the king-piece-square table. NNUE is used primarily for the leaf nodes of the alpha–beta tree. While being slower than handcrafted evaluation functions, NNUE does not suffer from the 'blindness beyond the current move' problem. NNUE was invented by Yu Nasu and introduced to computer shogi in 2018.


1

guardiantesla t1_iwum4fl wrote

Interesting work, and I appreciate your effort. There are a few works that use convolutions as well (referred to as Conformer), but I'm not sure whether they have been compared against GPT-style models.

How do you train such large models (AWS, GCP, etc.)? And how much is the estimated cost?

1

turingbook t1_iww0d9x wrote

Why not write a paper?

0