LetterRip t1_jbkmk5e wrote
Reply to comment by ThePerson654321 in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> He makes it sound extraordinary
The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.
> The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI etc. would want to investigate?
Regarding the infinite context length: that applies to inference, and it is more accurately described as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length.
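To make the "no fixed context length" point concrete, here is a minimal sketch (not the actual RWKV code; `model_step` and `state_size` are hypothetical placeholders) of what RNN-mode inference looks like: the model consumes one token at a time and carries a fixed-size state, so nothing in the architecture imposes a maximum context length.

```python
import numpy as np

state_size = 1024  # hypothetical hidden-state width

def model_step(token_id, state):
    # Placeholder for one recurrent step: returns logits and the updated state.
    new_state = np.tanh(state + token_id * 1e-3)  # stand-in state update
    logits = new_state[:50]                       # stand-in output projection
    return logits, new_state

state = np.zeros(state_size)
for token_id in [17, 42, 7, 99]:  # an arbitrarily long prompt
    logits, state = model_step(token_id, state)

# The state stays the same size however many tokens are fed in; the "context
# length" is limited only by what the state can usefully retain, which is why
# the effective length in practice tops out near the trained length.
```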
> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.
> BlinkDL mentioned that when training in GPT mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is likely because the model doesn't know how to handle samples beyond that size. It implies that the hidden state could allow the prompt context to be effectively unbounded, if we can fine-tune it properly (it is unclear right now how to do so).
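For intuition on how a fixed-size state can "compress" an arbitrarily long context, here is a rough sketch of an AFT/RWKV-style linear recurrence. It assumes a per-channel decay `w` and per-step keys `k_t` and values `v_t`; all names and the exact update are illustrative, not the real RWKV implementation. The whole past is summarized by two running sums, so memory does not grow with sequence length, but old tokens fade as they are decayed away.

```python
import numpy as np

d = 8                    # channel dimension (illustrative)
w = 0.1 * np.ones(d)     # positive per-channel decay
num = np.zeros(d)        # running weighted sum of values
den = np.zeros(d)        # running sum of weights

def wkv_step(k_t, v_t, num, den):
    # Output is a weighted average of all values seen so far, including v_t.
    out = (num + np.exp(k_t) * v_t) / (den + np.exp(k_t))
    # Decay the old summary and fold in the new token -- O(d) per step,
    # hence linear complexity in sequence length overall.
    num = np.exp(-w) * num + np.exp(k_t) * v_t
    den = np.exp(-w) * den + np.exp(k_t)
    return out, num, den

for t in range(5000):    # far beyond any "trained" length
    k_t = np.random.randn(d)
    v_t = np.random.randn(d)
    out, num, den = wkv_step(k_t, v_t, num, den)

# num/den stay fixed-size, so the context window is "infinite" in form; in
# practice the decay means older tokens contribute less and less, which fits
# the observed deterioration not far beyond the trained context length.
```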