
bo_peng OP t1_j4rht4i wrote

RWKV is an RNN that also works as a linear transformer (or we may say it's a linear transformer that also works as an RNN). So it has both a parallel mode and a serial mode, and you get the best of both worlds: the parallel mode is fast for training, and the serial mode saves VRAM at inference.
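To make the duality concrete, here is a minimal toy sketch (my own illustration, not RWKV's actual formulation): a simple linear-attention-style update computed once as an RNN carrying a constant-size state, and once in parallel over the whole sequence. Both produce identical outputs.

```python
import torch

T, D = 8, 4                      # sequence length, feature dim
k = torch.randn(T, D)            # keys
v = torch.randn(T, D)            # values

# Serial (RNN) mode: one step at a time, constant-size state.
num = torch.zeros(D)             # running sum of exp(k_s) * v_s
den = torch.zeros(D)             # running sum of exp(k_s)
out_serial = []
for t in range(T):
    num = num + k[t].exp() * v[t]
    den = den + k[t].exp()
    out_serial.append(num / den)
out_serial = torch.stack(out_serial)

# Parallel (transformer) mode: cumulative sums over the full sequence.
num_par = (k.exp() * v).cumsum(dim=0)
den_par = k.exp().cumsum(dim=0)
out_parallel = num_par / den_par

assert torch.allclose(out_serial, out_parallel, atol=1e-5)
```

The point of the serial form: the state (num, den) has a fixed size regardless of how long the sequence gets, which is what makes RNN-mode inference cheap.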

Almost all such "linear transformers" are bad at language modeling, but RWKV is the exception. The basic idea is a bit similar to the Attention Free Transformer (https://arxiv.org/abs/2105.14103). Then I added lots of new ideas :)
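For reference, here is a rough sketch of the AFT-style update from that paper (the names q, k, v, w are illustrative, and this is the non-causal full form, not what RWKV actually uses): each output is a weighted average of the values, where the weights combine a learned pairwise position bias w with the keys.

```python
import torch

T, D = 8, 4
q, k, v = torch.randn(3, T, D).unbind(0)
w = torch.randn(T, T)            # learned pairwise position bias

# weights[t, s] = exp(w[t, s] + k[s]), broadcast over the feature dim
weights = (w.unsqueeze(-1) + k.unsqueeze(0)).exp()        # (T, T, D)

# out_t = sigmoid(q_t) * sum_s weights[t, s] * v_s / sum_s weights[t, s]
out = torch.sigmoid(q) * (weights * v.unsqueeze(0)).sum(1) / weights.sum(1)
```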


mrconter1 t1_j4wq1zs wrote

How does the memory scale with the context window size?
