ThePerson654321 OP t1_jbk8kxy wrote
Reply to comment by farmingvillein in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
I'm basically just referring to the claims by the developer. He makes it sound extraordinary:
> best of RNN and transformer, great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
> Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.
The most extraordinary claim I got stuck on was the "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI etc. would want to investigate?
I definitely agree that there might be an incompatibility with the already existing transformer-specific infrastructure.
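To make the quoted claim concrete: an RNN generates one token at a time from a fixed-size state, so each step is matrix-vector products and memory does not grow with context length (which is where the "infinite" ctx_len framing comes from). A minimal toy sketch of this, with made-up shapes and update rules that are NOT the actual RWKV formulas:

```python
import numpy as np

# Toy recurrent step, purely illustrative -- not RWKV's real time-mixing math.
d = 8                                       # hidden size (hypothetical)
rng = np.random.default_rng(0)
W_in = rng.standard_normal((d, d)) * 0.1    # input projection
W_rec = rng.standard_normal((d, d)) * 0.1   # recurrent projection

def step(state, x):
    """One token of inference: two matrix-vector products,
    no matrix-matrix multiplications and no growing KV cache."""
    return np.tanh(W_in @ x + W_rec @ state)

state = np.zeros(d)
for _ in range(10_000):            # any number of tokens: state stays size d
    x = rng.standard_normal(d)     # stand-in for a token embedding
    state = step(state, x)

print(state.shape)                 # memory cost is constant in context length
```

A transformer, by contrast, attends over a cache of all previous tokens, so per-step cost and memory grow with the sequence; whether a fixed-size state actually *retains* arbitrarily old information is a separate question from being able to run it forever.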
But thanks for your answer. It might be one or more of the following:
- The larger organizations haven't noticed/cared about it yet
- I overestimate how good it is (from the developer's description)
- It has some unknown flaw that's not obvious to me and not stated in the repository's README.
- All the existing infrastructure is tailored for transformers and is not compatible with RWKV
At least we'll see in time.
ThePerson654321 OP t1_jbk6nb4 wrote
Reply to comment by farmingvillein in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
Thanks! I also find it very unlikely that nobody from a large organisation (OpenAI, Microsoft, Google Brain, DeepMind, Meta, etc.) would have noticed it.
ThePerson654321 OP t1_jbjz508 wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time. So prior to a month ago they didn't know it existed or happened to meet their use case.
That surprises me, considering his RWKV repo/repos have thousands of stars on GitHub.
I'm curious about what they responded with. What did they say?
> There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.
According to his claims (especially infinite ctx_len) it definitely was interesting. That it scales was pretty obvious even at 7B.
But your argument is basically that no large organization has noticed it yet.
My guess is that it actually has some unknown problem/limitation that makes it inferior to the transformer architecture.
We'll just have to wait. Hopefully you are right but I doubt it.
ThePerson654321 OP t1_jbjisn7 wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
-
Sure. RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer. Compared to, say, DALL-E 2 (which has exploded) and only came out 9 months ago, it still feels like some organization would have picked up RWKV if it were as useful as the developer claims.
-
This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.
-
Not necessarily. Google, OpenAI, DeepMind etc. test things that don't work out all the time.
-
Does not matter. If your idea is truly good, you will get attention sooner or later anyway.
I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.
I personally have two potential explanations for my question:
- It does not work as well as the developer claims, or it has some other flaw (e.g., it's hard to scale); time will be the judge of this
- The community is simply really slow to embrace it, for some unknown reason
I am leaning towards the first one.
Submitted by ThePerson654321 t3_11lq5j4 in MachineLearning
ThePerson654321 t1_jack0qw wrote
Reply to comment by TurbulentApricot6994 in Bio-computronium computer learns to play pong in 5 minutes by [deleted]
Check upvote/downvote ratio ^
ThePerson654321 t1_jabdk7a wrote
Reply to comment by No_Ninja3309_NoNoYes in Bio-computronium computer learns to play pong in 5 minutes by [deleted]
I agree! It's sad to see that Pong, the game that started it all, isn't taken seriously anymore. It deserves respect for paving the way for the entire gaming industry and being a damn good game. The mechanics are elegant, and it rewards skill and practice. We've become too obsessed with flashy graphics and complex mechanics, forgetting that sometimes the simplest things can be the most enjoyable. Let's remind people that Pong is a classic game that deserves to be celebrated and remembered.
ThePerson654321 t1_jabdads wrote
Reply to comment by TurbulentApricot6994 in Bio-computronium computer learns to play pong in 5 minutes by [deleted]
It's mine and fuck off
ThePerson654321 t1_j62nfth wrote
Reply to comment by ImpossibleSnacks in MusicLM: Generating Music From Text (Google Research) by nick7566
Yeah, it's amazing. I don't write songs or play any instruments, but it feels amazing that I'll be able to use this tool just like you and create music that's just as good. Looking forward to it.
ThePerson654321 t1_j280czc wrote
Reply to comment by artoftheproblem in [R] LAMBADA: Backward Chaining for Automated Reasoning in Natural Language - Google Research 2022 - Significantly outperforms Chain of Thought and Select Inference in terms of prediction accuracy and proof accuracy. by Singularian2501
It feels like we are hitting a wall though...
ThePerson654321 t1_j2804lq wrote
Reply to comment by veejarAmrev in [D] Is Anthropic influential in research? by adventurousprogram4
You should read LessWrong
ThePerson654321 t1_j280380 wrote
Reply to comment by AGI_aint_happening in [D] Is Anthropic influential in research? by adventurousprogram4
Why don't you think the issue rationalists try to raise is important in terms of AGI?
ThePerson654321 t1_iyiib9v wrote
Reply to comment by purplebrown_updown in OpenAI ChatGPT [R] by Sea-Photo5230
It still cannot tell a joke 😅
ThePerson654321 t1_iyanvoe wrote
Reply to comment by [deleted] in [r] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga
Oh boy here we go again...
ThePerson654321 t1_iyanu6k wrote
Reply to comment by learn-deeply in [r] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga
Do you think there's a reason why LW has a community dedicated to making fun of things they say, compared to, say, /r/machinelearning?
ThePerson654321 t1_iwy0ki4 wrote
Reply to comment by bo_peng in [R] RWKV-4 7B release: an attention-free RNN language model matching GPT-J performance (14B training in progress) by bo_peng
So again. What is the disadvantage with using your method?
ThePerson654321 t1_iwn3fdj wrote
Reply to comment by red75prime in MIT researchers solved the differential equation behind the interaction of two neurons through synapses to unlock a new type of fast and efficient artificial intelligence algorithms by Dr_Singularity
It really is amazing how naive you are.
ThePerson654321 t1_irbck16 wrote
Reply to comment by cleverestx in [R] Google announces Imagen Video, a model that generates videos from text by Erosis
They said the same thing about nuclear fusion reactors.
ThePerson654321 t1_iraro8a wrote
Are you being ironic?
ThePerson654321 t1_ir9adad wrote
Reply to comment by master3243 in [R] Google announces Imagen Video, a model that generates videos from text by Erosis
Perhaps a few seconds but never a full movie.
ThePerson654321 t1_ir98sqw wrote
Reply to comment by master3243 in [R] Google announces Imagen Video, a model that generates videos from text by Erosis
I find it difficult to believe we will achieve the same video fidelity compared to image generation.
ThePerson654321 t1_jcjbrg5 wrote
Reply to [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Wen paper?