Submitted by bo_peng t3_11f9k5g in MachineLearning

Hi everyone. ChatRWKV v2 can now split RWKV across multiple GPUs, or stream layers (compute layer by layer), so you can run RWKV 14B with as little as 3GB of VRAM. https://github.com/BlinkDL/ChatRWKV

Example:

'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32' = first 10 layers on cuda:0 in fp16, next 8 layers on cuda:1 in fp16, then the remaining layers on the CPU in fp32

'cuda fp16 *20+' = keep the first 20 layers on cuda in fp16, then stream the remaining layers through the GPU layer by layer
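
For example, the strategy string is passed straight to the RWKV constructor of the rwkv pip package described below (a minimal sketch; the model path is a placeholder for whichever checkpoint you downloaded):

from rwkv.model import RWKV

# Split across two GPUs in fp16, with the remaining layers on the CPU in fp32.
model = RWKV(model='/path/to/RWKV-4-Pile-14B-checkpoint',
             strategy='cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32')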

And RWKV is now a pip package: https://pypi.org/project/rwkv/

import os
os.environ['RWKV_JIT_ON'] = '1'
os.environ["RWKV_CUDA_ON"] = '0' # if '1' then compile CUDA kernel for seq mode (much faster)

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# download models: https://huggingface.co/BlinkDL
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
pipeline = PIPELINE(model, "20B_tokenizer.json") # find it in https://github.com/BlinkDL/ChatRWKV

ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
print(ctx, end='')

def my_print(s):
    print(s, end='', flush=True)

# For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
# https://platform.openai.com/docs/api-reference/parameter-details
args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7,
                     alpha_frequency = 0.25,
                     alpha_presence = 0.25,
                     token_ban = [0], # ban the generation of some tokens
                     token_stop = []) # stop generation whenever any of these tokens is seen

pipeline.generate(ctx, token_count=512, args=args, callback=my_print)

Right now all RWKV models are still trained with a GPT-like method, so they are limited by the ctxlen used in training, even though in theory they should have an almost infinite ctxlen (because they are RNNs). However, RWKV models can easily be finetuned to support longer ctxlens (and the larger models actually make use of the longer ctxlen). I have finetuned 1B5/3B/7B/14B to ctx4K, am now finetuning 7B/14B to ctx8K, and will do 14B to ctx16K after that :) All models are available at https://huggingface.co/BlinkDL

The core of RWKV is still mostly a one-man project, but a number of great developers are building on top of it, and you are welcome to join our community :)

89

Comments


satireplusplus t1_jaiwxlo wrote

Wow, nice, I will try it out!

Btw: if you want to format code in your post, you need to add 4 spaces in front of every line of the code; otherwise all newlines are lost.

Lines starting with four spaces are treated like code:

if 1 * 2 < 3:
    print("hello, world!")
7

bo_peng OP t1_jaixxp5 wrote

Thank you :) I was using the markdown mode instead because I didn't know this

2

KerfuffleV2 t1_jaiz1k8 wrote

Unfortunately, that doesn't work on the old reddit layout. We just see a garbled mess.

Here's a fixed version of the code/examples:


(not my content)

Example:

'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32' = first 10 layers on cuda:0 in fp16, next 8 layers on cuda:1 in fp16, then the remaining layers on the CPU in fp32

'cuda fp16 *20+' = keep the first 20 layers on cuda in fp16, then stream the remaining layers through the GPU layer by layer


import os
os.environ['RWKV_JIT_ON'] = '1'
os.environ["RWKV_CUDA_ON"] = '0' # if '1' then compile CUDA kernel for seq mode (much faster)

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# download models: https://huggingface.co/BlinkDL
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
pipeline = PIPELINE(model, "20B_tokenizer.json") # find it in https://github.com/BlinkDL/ChatRWKV

ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
print(ctx, end='')

def my_print(s):
    print(s, end='', flush=True)

# For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
# https://platform.openai.com/docs/api-reference/parameter-details
args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7,
                     alpha_frequency = 0.25,
                     alpha_presence = 0.25,
                     token_ban = [0], # ban the generation of some tokens
                     token_stop = []) # stop generation whenever you see any token here
pipeline.generate(ctx, token_count=512, args=args, callback=my_print)

I kind of want to know what happens in the story...

3

bo_peng OP t1_jaj2pr2 wrote

Strange, all spaces are lost even when I add 4 spaces in front of all code lines.

UPDATE: works in markdown editor :)

2

ID4gotten t1_jalb9vx wrote

It's not the best at Q&A or chat (yet), but kudos for all the work behind this super interesting approach. Maybe with time it will continue to improve, and I like seeing non-transformer methods showing some potential.

6

bo_peng OP t1_jalmszp wrote

It's actually quite good at Q&A if you use my prompt templates:

+gen \nExpert Questions & Helpful Answers\nAsk Research Experts\nQuestion:\nXXXXXXXXXXXXXXX?\n\nFull Answer:\n

+gen \nAsk Expert\n\nQuestion:\nXXXXXXXXXXXXXXXX?\n\nExpert Full Answer:\n

+gen \nQ & A\n\nQuestion:\nXXXXXXXXXXXXXXXXX?\n\nDetailed Expert Answer:\n
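
The +gen prefix is a ChatRWKV chat-mode command; if you are calling the rwkv pip package directly, the part after +gen is simply the prompt string. A rough sketch, reusing the pipeline, args, and my_print callback from the example in the post (the question itself is made up):

question = "What is the boiling point of water at sea level?"  # made-up example question
prompt = "\nExpert Questions & Helpful Answers\nAsk Research Experts\nQuestion:\n" + question + "\n\nFull Answer:\n"
pipeline.generate(prompt, token_count=256, args=args, callback=my_print)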

7

Select_Beautiful8 t1_jbifwzt wrote

I have one 6GB VRAM GPU; which model should I use?

1

bo_peng OP t1_jbij8ky wrote

Try 7B ctx4096 first

2

Select_Beautiful8 t1_jbijcjl wrote

I tried the 3B and it said out of memory. I'm now trying 1B5 and it loads correctly.

1

bo_peng OP t1_jbiq52c wrote

Please set "strategy" for you GPU.

Try this strategy for 3B first:

'cuda fp16i8 *12 -> cuda fp16' # first 12 layers cuda fp16i8, then cuda fp16

Reduce the 12 as much as you can to get better speed.

2

Select_Beautiful8 t1_jbk1hwd wrote

Thanks, I will try it

1

KerfuffleV2 t1_jboquv7 wrote

If it helps, I was able to get the 7B model going on a GTX 1060 with 6GB VRAM also. The strategy I used was cuda fp16i8 *16 -> cpu fp32. Starting out with about 1.2GB of VRAM already in use by other programs and the desktop environment, usage went up to about 5.6GB, which works out to roughly 0.275GB per layer. So on a 6GB card with fp16i8, even with totally free VRAM, it seems like you could load 21, maybe 22 layers at most, and about half that in the normal fp16 format. This model: RWKV-4-Pile-7B-20230109-ctx4096
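
A quick back-of-the-envelope helper for that (a sketch in Python; the per-layer figure is just my estimate from the run above, so treat the result as a rough upper bound):

# Rough estimate of how many fp16i8 layers fit in the VRAM you actually have free.
# 0.275 GB/layer is what I measured for the 7B ctx4096 model; fp16 needs roughly double.
free_vram_gb = 6.0 - 1.2   # card total minus what the desktop etc. already uses
gb_per_layer = 0.275
print(int(free_vram_gb / gb_per_layer))  # -> 17 layers in my case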

It generates a token every 2-3 seconds, which is too slow for interactive use but still pretty impressive considering the model size and how old the hardware is (my CPU is just a Ryzen 5 1600, too). It's also running half the layers on the CPU. By the way, it also uses about 14GB of RAM to run, so you'll need a decent amount of system memory available as well.

Tagging /u/bo_peng also in case this information is helpful for them. (One interesting thing I noticed is that the GPU was only busy about 50% of the time, I guess while the CPU layers were being run. I don't know if it's possible, but if there were some way to do both in parallel, it seems like it would roughly double the speed of token generation.)

2

Select_Beautiful8 t1_jbp7qq7 wrote

Thanks. I have a laptop 3060 and 16GB of RAM, and I successfully ran the 3B one; I will try with the 7B one.

1

Select_Beautiful8 t1_jbq9m13 wrote

No, I wasn't able to load the 7B model, it still says CUDA out of memory :(

1

KerfuffleV2 t1_jbqo9qh wrote

You might have to reduce the CUDA layers by 1-3, but with only 16GB RAM you're probably going to have trouble.

If you still run out of CUDA memory trying to load it, then maybe you're not setting the strategy correctly. How are you trying to change it?

2

Select_Beautiful8 t1_jbqpd5x wrote

How do I reduce the CUDA layers?

1

KerfuffleV2 t1_jbqtx6j wrote

Note: I'm just a random person on the internet, no affiliation to OP. I also don't really know what I'm doing here, so follow my advice at your own risk.

'cuda fp16i8 *16 -> cpu fp32' as the strategy means: use 16 fp16i8-format CUDA layers, then put the rest on the CPU (as fp32). So if you want to reduce how many layers go to the GPU, you'd reduce the "16" there.

Assuming we're talking about the same thing, you'd have the ChatRWKV repo checked out and be editing v2/chat.py

There should be a line like:

args.strategy = 'cuda fp16i8 *16 -> cpu fp32'

Either make sure the other lines setting args.strategy in that area are commented out, or make sure the one with the setting you want to use is the last one. (Otherwise the later assignment statements would override what you added.)
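
So, roughly, that spot in v2/chat.py would end up looking something like this (just a sketch; the exact surrounding lines in your copy of the file may differ):

# v2/chat.py (sketch): whichever args.strategy assignment runs last wins
# args.strategy = 'cuda fp16'                     # other examples left commented out
args.strategy = 'cuda fp16i8 *16 -> cpu fp32'     # 16 fp16i8 layers on the GPU, rest on CPU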

2

Select_Beautiful8 t1_jbqyth8 wrote

Thanks. I'm actually using the oobabooga text generation webui on github

1

KerfuffleV2 t1_jbr6r2f wrote

> I'm actually using the oobabooga text generation webui on github

I'm not familiar with that. It does seem like it can use RWKV and supports passing strategy though: https://github.com/oobabooga/text-generation-webui/wiki/RWKV-model#setting-a-custom-strategy

Are you already using that flag with the correct parameter?

2

Select_Beautiful8 t1_jbr867y wrote

Oh, it loaded. It was because I wrote "cuda fp32" instead of "cpu fp32" in the second half of the argument. Thanks

1

KerfuffleV2 t1_jbr95r5 wrote

No problem. fp16i8 uses about half the memory of fp16 (and a quarter of fp32), so what you had would not only use 4x as much memory, it would also try to put everything on the GPU!

2

Select_Beautiful8 t1_jbra2af wrote

ok so "cuda fp16i8 *16 -> cpu fp32" would be the most optimal argument for me?

1

KerfuffleV2 t1_jbrb0qa wrote

I'm definitely not qualified to answer a question like that. I'm just a person who managed to get it working on a 6GB VRAM GPU. Basically, as far as I understand it, the more you can run on the GPU, the better. So it really depends on what other stuff you have using your GPU's memory.

Like I mentioned, when I got it working I already had about 1.25G used by other applications and my desktop environment. From my calculations, it should be possible to fit 21, maybe 22 layers onto the GPU as long as nothing else is using it (so basically, you'd have to be in text mode with no desktop environment running).

If you're using Linux and an Nvidia card, you can try installing an application called nvtop; it can show stuff like VRAM usage, etc. The way to install it will be specific to your distribution, so I can't help you with that. If you're using Windows or a different OS I can't really help you either.

But anyway, if you can find how much VRAM you have free, you can look at how much of that loading 16 layers uses and calculate how many more you can add before you run out.

That's still not necessarily going to be optimal, though. I don't know how the speed/precision tradeoff between fp16 and fp16i8 works, or stuff like that. It's not impossible that there's some other combination of parameters that would be better in some way than just trying to fit as much as possible onto the GPU in fp16i8 format. You'd have to ask someone more knowledgeable for a real answer.

2

Select_Beautiful8 t1_jbrbor0 wrote

Thanks, I use Windows, but I want to do a dual boot

1

KerfuffleV2 t1_jbz7yfk wrote

I've been playing with this for a bit and I actually haven't found any case where fp16i8 worked better than halving the layers and using fp16.

If you haven't already tried it, give something like cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1 a try and see what happens. It's around twice as fast as cuda fp16i8 *16 -> cpu fp32 for me, which is surprising.

That one will use 7 fp16 layers on the GPU, and stream all the rest except the very last as fp16 on the GPU also. The 33rd layer gets run on the CPU. Not sure if that last part makes a big difference.

2

Select_Beautiful8 t1_jc0w1px wrote

This gave me the "out if memory" error again, which did not happen with the "cuda fp18i8 *16 -> cpu fp32" :(

1

KerfuffleV2 t1_jc18f6a wrote

Huh, that's weird. You can try reducing the first one from 7 to 6 or maybe even 5:

cuda fp16 *6 -> cuda fp16 *0+ -> cpu fp32 *1

Also, be sure to double check for typos. :) Any incorrect numbers/punctuation will probably cause problems. Especially the "+" in the second part.

2

Select_Beautiful8 t1_jc9lckr wrote

I just got time to try it, but it doesn't load, nor does it give an error message :( Thanks anyway for your help!

1

KerfuffleV2 t1_jc1jtg5 wrote

/u/bo_peng

I didn't want to clutter up the issue here: https://github.com/BlinkDL/ChatRWKV/issues/30#issuecomment-1465226569

In case this information is useful for you:

| strategy | time (s) | tps | tokens |
|---|---|---|---|
| cuda fp16 *0+ -> cuda fp16 *10 | 45.44 | 1.12 | 51 |
| cuda fp16 *0+ -> cuda fp16 *5 | 43.73 | 0.94 | 41 |
| cuda fp16 *0+ -> cuda fp16 *1 | 52.7 | 0.83 | 44 |
| cuda fp16 *0+ -> cpu fp32 *1 | 59.06 | 0.81 | 48 |
| cuda fp16i8 *12 -> cuda fp16 *0+ -> cpu fp32 *1 | 65.41 | 0.69 | 45 |

I ran the tests using this frontend: https://github.com/oobabooga/text-generation-webui

It was definitely using rwkv version 0.3.1

env RWKV_JIT_ON=1 python server.py \
  --rwkv-cuda-on \
  --rwkv-strategy STRATEGY_HERE \
  --model RWKV-4-Pile-7B-20230109-ctx4096.pth

For each test, I let it generate a few tokens first to let it warm up, then stopped it and let it generate a decent number. Hardware is a Ryzen 5 1600, 32GB RAM, GeForce GTX 1060 6GB VRAM.

Surprisingly, streaming everything as fp16 was still faster than putting 12 fp16i8 layers in VRAM. A 1060 is a pretty old card, so maybe it has unusual behavior dealing with that format. I'm not sure.

1

bo_peng OP t1_jc2alfm wrote

Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

2

KerfuffleV2 t1_jc3jith wrote

> Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

Nice, that makes a big difference! (And such a small change too.)

The highest speed I've seen so far is with something like cuda fp16i8 *15+ -> cuda fp16 *1 at about 1.21 tps (edit: I was mistaken, it was actually 1.17). Even cuda fp16i8 *0+ gets quite acceptable speed (0.85-0.88 tps) and uses around 1.3GB VRAM.

I saw your response on GitHub. Unfortunately, I don't use Discord so hopefully it's okay to reply here.

1

bo_peng OP t1_jc9gf72 wrote

Update ChatRWKV v2 & the rwkv pip package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x fp16i8 speed (and 10% less VRAM: now 14686MB for 14B instead of 16462MB, so you can put more layers on the GPU).
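
Following the layout of the example in the post, that flag goes in before rwkv.model is imported (a minimal sketch):

import os
os.environ['RWKV_JIT_ON'] = '1'
os.environ['RWKV_CUDA_ON'] = '1'  # compile the CUDA kernel (faster seq mode, less fp16i8 VRAM)
from rwkv.model import RWKV       # set the flags before this import, as in the example above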

2

KerfuffleV2 t1_jcadn3g wrote

Unfortunately, it doesn't compile for me: https://github.com/BlinkDL/ChatRWKV/issues/38

I'm guessing that even if you implement special support for lower compute versions, it will probably cancel out the speed (and maybe size) benefits.

1

bo_peng OP t1_jcb05e8 wrote

stay tuned :) will fix it

2

KerfuffleV2 t1_jccb5v1 wrote

Sounds good! The 4bit stuff seems pretty exciting too.

By the way, not sure if you saw it but it looks like PyTorch 2.0 is close to being released: https://www.reddit.com/r/MachineLearning/comments/11s58n4/n_pytorch_20_our_next_generation_release_that_is/

They seem to be claiming you can just drop in torch.compile() and see benefits with no code changes.
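
For reference, the advertised drop-in usage is basically a one-liner (a sketch with a stand-in module; whether it helps the existing JIT-scripted RWKV code is a separate question):

import torch
import torch.nn as nn

model = nn.Linear(16, 16)        # stand-in module just to show the call
compiled = torch.compile(model)  # PyTorch 2.0: claimed to need no other code changes
print(compiled(torch.randn(2, 16)).shape)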

1

bo_peng OP t1_jccc46c wrote

I am using torch JIT, so close ;)

1