bo_peng OP t1_j4rht4i wrote

RWKV is an RNN that also works as a linear transformer (or, equivalently, a linear transformer that also works as an RNN). So it has both a parallel mode and a serial mode, and you get the best of both worlds: fast parallel training and VRAM-efficient serial inference.

Almost all such "linear transformers" are bad at language modeling, but RWKV is the exception. The basic idea is somewhat similar to the Attention Free Transformer (https://arxiv.org/abs/2105.14103), plus lots of new ideas on top :)
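
To illustrate the parallel/serial duality, here is a minimal sketch of an AFT-style time mixing (positive key weights, causal weighted average of values) computed in both modes with identical results. This is my own simplified illustration: it omits RWKV's time-decay and other tricks, and the function names and shapes are not from the actual repo.

```python
import torch

def mix_parallel(k, v):
    """Parallel ("linear transformer") mode: whole sequence at once via cumulative sums.
    k, v: (T, C) tensors. Simplified AFT-style mixing, not the exact RWKV kernel."""
    w = torch.exp(k)                  # positive per-token, per-channel weights
    num = torch.cumsum(w * v, dim=0)  # running weighted sum of values
    den = torch.cumsum(w, dim=0)      # running normalizer
    return num / den

def mix_serial(k, v):
    """Serial (RNN) mode: same output, but only a small running state is kept,
    so autoregressive generation needs O(1) state instead of the full context."""
    num = torch.zeros(k.shape[1])
    den = torch.zeros(k.shape[1])
    out = []
    for t in range(k.shape[0]):
        w = torch.exp(k[t])
        num = num + w * v[t]
        den = den + w
        out.append(num / den)
    return torch.stack(out)

k, v = torch.randn(8, 4), torch.randn(8, 4)
assert torch.allclose(mix_parallel(k, v), mix_serial(k, v), atol=1e-5)
```

The parallel form is what you train with (one pass over the whole sequence), while the serial form is what makes generation cheap, since only the running numerator/denominator state has to be kept around.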


bo_peng OP t1_iwua2xh wrote

RWKV 7B is faster than GPT 6B, and RWKV actually scales well :)

If you check the table, RWKV at 3B is better than GPT-Neo on everything (while the smaller RWKV models lag behind on LAMBADA).

But GPT-J uses rotary position embeddings and is therefore quite a bit better than GPT-Neo, so I expect RWKV to surpass it at 14B.

Moreover, RWKV 3B becomes stronger after being trained on more tokens, and I am doing the same for the 7B model.


bo_peng OP t1_iwts867 wrote

RWKV-3 1.5B on an A40 (tf32): a constant 0.015 sec/token, tested using simple PyTorch code (no custom CUDA kernel), GPU utilization 45%, VRAM 7823 MB

GPT2-XL 1.3B on an A40 (tf32): 0.032 sec/token (at ctxlen 1000), tested using HF, GPU utilization also 45% (interesting), VRAM 9655 MB

Moreover, RWKV-4 uses bf16 and is faster than 16-bit GPT models.

Training speed: RWKV-4 1.5B, bf16, ctxlen 1024 = 106K tokens/s on 8×A100 40G.
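
For reference, here is a rough sketch of how sec/token can be measured for a HF model; this is my own illustrative script (model name, prompt construction, and token counts are placeholders), not the exact code behind the numbers above.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.backends.cuda.matmul.allow_tf32 = True  # tf32 matmuls, as in the A40 test above

name = "gpt2-xl"  # 1.3B parameters
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().eval()

# ~1000-token prompt of random ids, to roughly match the ctxlen 1000 setting above
ids = torch.randint(0, tok.vocab_size, (1, 1000)).cuda()

n_new = 100
torch.cuda.synchronize()
t0 = time.time()
with torch.no_grad():
    model.generate(ids, min_new_tokens=n_new, max_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
print(f"{(time.time() - t0) / n_new:.3f} sec/token")
```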
