bo_peng OP t1_jb1z3an wrote on March 5, 2023 at 8:40 PM

Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng

The 5 is the number of hidden-state vectors per block: 4 for ATT (xx, aa, bb, pp) and 1 for FFN (xx).
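To make the 5-per-block layout concrete, here is a rough NumPy sketch of how such a state could be allocated. The slot order within a block, the sizes, the `init_state` name, and the very negative starting value for pp (assuming pp is a running max exponent kept for numerical stability) are all my assumptions for illustration, not something stated above.

```python
import numpy as np

N_LAYER, N_EMBD = 2, 4   # hypothetical sizes, just for the demo
SLOTS = 5                # hidden-state vectors per block

# Assumed slot order inside block i (illustrative only):
#   5*i+0 : FFN xx (previous input to the FFN)
#   5*i+1 : ATT xx (previous input to the ATT)
#   5*i+2 : ATT aa (running numerator)
#   5*i+3 : ATT bb (running denominator)
#   5*i+4 : ATT pp (assumed: running max exponent)
def init_state(n_layer, n_embd):
    state = np.zeros((n_layer * SLOTS, n_embd), dtype=np.float32)
    for i in range(n_layer):
        state[SLOTS * i + 4] = -1e30  # stands in for "minus infinity"
    return state

state = init_state(N_LAYER, N_EMBD)
print(state.shape)  # one row per slot, one column per channel
```

The whole recurrent memory is then a fixed-size array of `n_layer * 5` vectors, independent of how many tokens have been processed.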

TimeMixing is RWKV.

ChannelMixing is your usual FFN (sqReLU, as in the Primer paper) with an extra R-gate (novel; I find it helps).
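A minimal sketch of what that looks like for a single token, assuming the FFN input is an interpolation between the current and previous token (that previous token being the FFN's xx state) and that the R-gate is a sigmoid. The names `mix_k`, `mix_r`, `kw`, `vw`, `rw` and the sizes are hypothetical, chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mixing(x, prev_x, mix_k, mix_r, kw, vw, rw):
    # Blend current and previous token (assumed token-shift step).
    xk = x * mix_k + prev_x * (1 - mix_k)
    xr = x * mix_r + prev_x * (1 - mix_r)
    r = sigmoid(rw @ xr)                   # the extra R-gate, values in (0, 1)
    k = np.square(np.maximum(kw @ xk, 0))  # squared ReLU
    # Return the gated FFN output and the new xx state (the current input).
    return r * (vw @ k), x

rng = np.random.default_rng(0)
n_embd, n_hidden = 4, 16  # hypothetical sizes
x, prev_x = rng.normal(size=n_embd), rng.normal(size=n_embd)
mix_k, mix_r = rng.uniform(size=n_embd), rng.uniform(size=n_embd)
kw = rng.normal(size=(n_hidden, n_embd))
vw = rng.normal(size=(n_embd, n_hidden))
rw = rng.normal(size=(n_embd, n_embd))

out, new_prev = channel_mixing(x, prev_x, mix_k, mix_r, kw, vw, rw)
```

Without the `r *` factor this is a plain two-layer FFN with a squared-ReLU activation; the gate only rescales each output channel.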

Parallelization follows from this formula: https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png
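As a rough illustration of why a formula of this kind can be parallelized, the sketch below assumes the weighted-average form wkv_t = (sum_{i<t} exp(-(t-1-i)w + k_i) v_i + exp(u + k_t) v_t) / (same sums without v), with a per-channel decay w > 0 and a per-channel bonus u for the current token. That form, and the names `wkv_direct` and `wkv_recurrent`, are my assumptions here. The direct version computes every position independently (so positions can be processed in parallel), while the recurrent version carries only a numerator and a denominator. This is the plain version with no max-exponent trick, so it is not numerically safe for large k.

```python
import numpy as np

def wkv_direct(k, v, w, u):
    # Each time step is computed on its own from the whole prefix.
    T, C = k.shape
    out = np.empty((T, C))
    for t in range(T):
        age = (t - 1 - np.arange(t))[:, None]  # decay steps for each past token
        past = np.exp(-age * w + k[:t])        # weights of past tokens, shape (t, C)
        cur = np.exp(u + k[t])                 # weight of the current token
        out[t] = ((past * v[:t]).sum(0) + cur * v[t]) / (past.sum(0) + cur)
    return out

def wkv_recurrent(k, v, w, u):
    # Same quantity, carrying only a running numerator a and denominator b.
    T, C = k.shape
    a, b = np.zeros(C), np.zeros(C)
    out = np.empty((T, C))
    for t in range(T):
        cur = np.exp(u + k[t])
        out[t] = (a + cur * v[t]) / (b + cur)
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

rng = np.random.default_rng(0)
T, C = 6, 3  # hypothetical sequence length and channel count
k, v = rng.normal(size=(T, C)), rng.normal(size=(T, C))
w, u = rng.uniform(0.1, 1.0, size=C), rng.normal(size=C)

print(np.allclose(wkv_direct(k, v, w, u), wkv_recurrent(k, v, w, u)))
```

Both functions give the same values, which is the point: train with the form that exposes all time steps at once, run inference with the form that needs only a fixed-size state.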