KerfuffleV2 t1_jefkhxs wrote

> Something about these distillations feels fundamentally different than when interacting with the larger models.

It may not have anything to do with size. ChatGPT just adds a lot of comfort phrases to its responses instead of simply answering: "Hmm, this is an interesting challenge", "Let's see", etc. Some of that may come from the system prompt, and some from training specifically aimed at producing more natural-sounding responses.

All the "Hmm" and "interesting challenge" stuff that makes it sound like a person doesn't actually add any information relevant to answering the query, though. (Also, you may be paying for those extraneous tokens.)


KerfuffleV2 t1_jecbxy7 wrote

It's based on Llama, so it has basically the same problem as anything based on Llama. From the repo: "We plan to release the model weights by providing a version of delta weights that build on the original LLaMA weights, but we are still figuring out a proper way to do so." edit: Nevermind.

You will still probably need a way to get a hold of the original Llama weights (which isn't the hardest thing...)


KerfuffleV2 t1_jdy5tok wrote

> if chatgpt had memory, RAM, a network time clock, and a starting prompt, it would be sentient. So it already is.

I feel like you don't really understand how LLMs work. It's not like a mind sitting in a dark room: it literally doesn't do anything until you feed it a token. So there's nothing to be aware of; it's just a bunch of inert floating point numbers.

But even after you give it a token, it doesn't decide to say something. You basically get back a list of every predefined token with a probability associated with it, which might just be a large array of 30k-60k floats.

At that point, there are various strategies for picking a token. You can just pick the one that has the highest value from the whole list, you can pick one of the top X items from the list randomly, etc. That part of it involves very simple functions that basically any developer could write without too much trouble.
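As a sketch of the two strategies mentioned above (using a made-up 5-token vocabulary instead of a real model's 30k-60k entries):

```python
import random

# Hypothetical probabilities over a tiny 5-token vocabulary; a real
# model returns one of these for every token in its vocabulary.
probs = [0.05, 0.40, 0.30, 0.20, 0.05]

def greedy(probs):
    # Strategy 1: always take the single highest-probability token.
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k(probs, k, rng):
    # Strategy 2: restrict to the k most likely tokens, renormalize,
    # then sample randomly according to the remaining probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i])[-k:]
    total = sum(probs[i] for i in top)
    return rng.choices(top, weights=[probs[i] / total for i in top])[0]

rng = random.Random(0)
print(greedy(probs))         # 1
print(top_k(probs, 3, rng))  # one of token ids 1, 2, 3
```

That really is the whole trick: everything after the model spits out its probabilities is ordinary code like this.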

Now, I'm not an expert but I do know a little more than the average person. I actually just got done implementing a simple one based on the RWKV approach rather than transformers:

The first line is the prompt, the rest is from a very small (430M parameter) model:

In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.

The creatures even fought with each other!

The Tibet researchers are calling the dragons “Manchurian Dragons” because of the overwhelming mass of skulls they found buried in a mountain somewhere in Tibet.

The team discovered that the dragon family is between 80 and 140 in number, of which a little over 50 will ever make it to the top.

Tibet was the home of the “Amitai Brahmans” (c. 3800 BC) until the arrival of Buddhism. These people are the ancestor of the Chinese and Tibetan people.

According to anthropologist John H. Lee, “The Tibetan languages share about a quarter of their vocabulary with the language of the Tibetan Buddhist priests.” [end of text]


KerfuffleV2 t1_jd7sb4u wrote

llama.cpp and alpaca.cpp (and also related projects like llama-rs) only use the CPU. So not only are you not getting the most out of your GPU, it's not getting used at all.

I have an old GPU with only 6GB so running larger models on GPU isn't practical for me. I haven't really looked at that aspect of it much. You could start here:

Keep in mind you will need to be pretty decent with technical stuff to be able to get it working based on those instructions even though they are detailed.


KerfuffleV2 t1_jd7rjvf wrote

There are quantized versions at 8bit and 4bit. The 4bit quantized 30B version is 18GB so it will run on a machine with 32GB RAM.

The bigger the model, the more tolerant it seems to quantization so even 1bit quantized models are in the realm of possibility (would probably have to be something like a 120B+ model to really work).
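The rough arithmetic behind those sizes (ignoring per-block scale factors and runtime buffers, which is why the real files come out somewhat larger):

```python
# Back-of-the-envelope size of quantized weights: parameters times
# bits per parameter, converted to gigabytes.
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(30, 4))   # 15.0 GB raw; ~18GB on disk with overhead
print(weight_gb(120, 1))  # 15.0 GB for a hypothetical 1-bit 120B model
```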


KerfuffleV2 t1_jd57jq9 wrote

Be sure you're looking at the number of tokens when you're considering conciseness, since that's what actually matters. I.e. an emoji may have a compact representation on the screen, but that doesn't necessarily mean it tokenizes efficiently.

Just for example, "🧑🏾‍🚀" from one of the other comments is actually 11 tokens, while the word "person" is just one token.

You can experiment here: (non-OpenAI models will likely use a different tokenizer or tokenize text differently, but it will at least give you an idea.)
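This isn't a real tokenizer, but it shows why that emoji is so expensive: many tokenizers work at the UTF-8 byte level, and a composed emoji is several codepoints glued together with skin-tone modifiers and zero-width joiners:

```python
# The astronaut emoji is actually person + skin tone + zero-width
# joiner + rocket, which is a lot of raw material for a tokenizer.
emoji = "\U0001F9D1\U0001F3FE\u200D\U0001F680"  # 🧑🏾‍🚀

print(len(emoji))                     # 4 codepoints
print(len(emoji.encode("utf-8")))     # 15 UTF-8 bytes to tokenize
print(len("person".encode("utf-8")))  # 6 bytes, and one common token
```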

Also relevant: these models are trained to autocomplete text based on the probabilities in the text they were trained on. If you start using, or ask them to generate, text in a different format, it may well cause them to produce much lower quality answers (or understand less of what the user responded).


KerfuffleV2 t1_jd52brx wrote

> there's a number of efforts like llama.cpp/alpaca.cpp or openassistant but the problem is that fundamentally these things require a lot of compute, which you really cant step around.

It's honestly less than you'd expect. I have a Ryzen 5 1600 which I bought about 5 years ago for $200 (it's $79 now). I can run llama 7B on the CPU and it generates about 3 tokens/sec. That's close to what ChatGPT can do when it's fairly busy. Of course, llama 7B is no ChatGPT but still. This system has 32GB RAM (also pretty cheap) and I can run llama 30B as well, although it takes a second or so per token.

So you can't really chat in real time, but you can set it to generate something and come back later.

The 3 or 2 bit quantized versions of 65B or higher models would actually fit in memory. Of course, it would be even slower to run but honestly, it's amazing it's possible to run it at all on 5 year old hardware which wasn't cutting edge even back then.
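For a rough sense of what those speeds mean in practice, here's a quick back-of-the-envelope calculation (the 500-token response length is just an arbitrary example):

```python
# Time to generate a response at a given sustained generation speed.
def minutes(tokens, tokens_per_sec):
    return tokens / tokens_per_sec / 60

print(round(minutes(500, 3.0), 1))  # 7B on CPU at 3 tok/s: 2.8 minutes
print(round(minutes(500, 1.0), 1))  # 30B at ~1 tok/s: 8.3 minutes
```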


KerfuffleV2 t1_jd1kfyp wrote

Note: Not the same person.

> I would imagine the OpenGPT reponse is much longer because ... it is just bigger?

llama.cpp recently added a commandline flag to disable the end of message marker from getting generated, so that's one way you can try to force responses to be longer. (It doesn't always work, because the LLM can start generating irrelevant content.)
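With llama.cpp that might look something like this (the model path is just an example, and this assumes a build recent enough to have the flag):

```shell
# --ignore-eos keeps the end-of-message token from being sampled, so
# generation continues until the -n token limit instead of stopping early.
./main -m ./models/7B/ggml-model-q4_0.bin \
    -p "Once upon a time" \
    -n 256 \
    --ignore-eos
```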

The length of the response isn't directly related to the size of the model, but just having less information available/relevant could mean it has less to talk about in a response.

> GPT3 model is 128B, does it mean if we get trained model of GPT, and manage to run 128B locally, will it give us the same results?

If you have the same model and you give it the same prompt, you should get the same result. Keep in mind if you're using some other service like ChatGPT you aren't directly controlling the full prompt. I don't know about OpenGPT, but from what I know ChatGPT has a lot of special sauce not just in the training but other stuff like having another LLM write summaries for it so it keeps track of context better, etc.

> Last question, inference means that it gets output from a trained model.

Inference is running a model that's already been trained, as far as I know.

> If my understanding is correct, Alpaca.cpp or are a sort of 'front-end' for these model.

The model is a bunch of data that was generated by training. Something like llama.cpp is what actually uses that data: keeping track of the state, parsing user input into tokens that can be fed to the model, performing the math calculations that are necessary to evaluate its state, etc.
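As a toy illustration of that loop (everything here is invented: the "model" is just a hard-coded bigram table, nothing like a real LLM), the control flow a frontend runs looks roughly like this:

```python
import random

# Toy stand-in for a model: maps the current token to next-token
# probabilities. A real frontend gets these from evaluating the
# trained weights instead of looking them up in a dict.
BIGRAMS = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"down": 1.0},
}

def generate(prompt, max_tokens, rng):
    tokens = prompt.split()           # stand-in for real tokenization
    current = tokens[-1]
    out = []
    for _ in range(max_tokens):
        probs = BIGRAMS.get(current)  # "evaluate": probs per candidate
        if not probs:
            break
        words = list(probs)
        current = rng.choices(words, weights=[probs[w] for w in words])[0]
        out.append(current)           # feed the sampled token back in
    return " ".join(out)

print(generate("the", 3, random.Random(1)))  # "cat sat down" or "dog sat down"
```

The real work is obviously in evaluating the model's state, but the tokenize/evaluate/sample/feed-back loop is the same shape.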

"Gets its output from", "front end" sound like kind of weird ways to describe what's going on. Just as an example, modern video formats and compression for video/audio is pretty complicated. Would you say that a video player "gets its output" from the video file or is a front-end for a video file?

> The question I am trying to ask is, what is so great about llama.cpp?

I mean, it's free software that works pretty well and puts evaluating these models in reach of basically everyone. That's great. It's also quite fast for something running purely on CPU. What's not great about that?

> I know there is Rust version of it out, but it uses llama.cpp behind the scene.

I don't think this is correct. It is true that the Rust version is (or started out as) a port of the C++ version, but it's not using it behind the scenes. However, there's a math library called GGML that both programs use; it does the heavy lifting of the calculations for the data in the models.

> Is there any advantage of an inference to be written in Go or Python?

Same advantage as writing anything in Go, which is... Just about nothing in my opinion. See:

Seriously though, this is a very, very general question and could be asked about basically any project and any set of programming languages. There are strengths and weaknesses. Rust's strengths are high performance, the ability to do low-level stuff like C, and a lot of features aimed at writing very reliable software that handles edge cases. This comes at the expense of having to deal with all those details. On the other hand, a language like Python is very high level. You can just throw something together, ignore a lot of details, and it can still work (unless it runs into an unhandled case). It's generally a lot slower than languages like Rust, C, C++, and even Go.

However, for running LLMs, most of the processing is math calculations and that will mean calling into external libraries/modules that will be written in high performance languages like C, Rust, etc. Assuming a Python program is taking advantage of that kind of resource, I wouldn't expect it to be noticeably slow.

So, like a lot of the time, it comes down to the personal preference of the developer. The person who wrote the Rust version probably likes Rust, the person who wrote the C++ version likes C++, etc.


KerfuffleV2 t1_jcp7qcz wrote

I'm not sure I fully understand it, but it seems like it's basically just adding context to the prompt it submits with requests. For obvious reasons, the prompt can only get so big. It also requires making requests to OpenAI's embedding API, which isn't free, so it's both pushing in more tokens and making those extra requests.

I can definitely see how that approach could produce better results, but it's also not really unlimited memory. Note: I skimmed the source, but I'm not really a C++ person and I didn't actually set it up to use my OpenAI account via API.
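The retrieval idea can be sketched in a toy way. Everything below is made up for illustration: the embed() here just counts letters, where the real project would call an embedding API, but the flow (embed the stored notes, embed the query, prepend the closest match to the prompt) is the same:

```python
def embed(text):
    # Hypothetical stand-in: real embeddings come from a model, not
    # letter counts. Returns a 26-dimensional letter-frequency vector.
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

notes = ["the user's cat is named Mittens", "llama.cpp runs on the CPU"]
query = "what is the cat called?"

# Pick the stored note most similar to the query and stuff it into
# the prompt as extra context.
best = max(notes, key=lambda n: cosine(embed(n), embed(query)))
prompt = f"Context: {best}\nQuestion: {query}\nAnswer:"
print(prompt)
```

The catch is exactly the one above: only the top few matches fit in the prompt, so it's augmented context, not unlimited memory.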


KerfuffleV2 t1_jclo0oh wrote

I'm not an ML person, but it seems like that paper is just teaching the LLM to simulate a Turing machine. Actually making it respond normally while doing practical stuff like answering user queries would be a different thing.

Also, suppose the LLM has access to external memory. First, you have to teach it how to interact with that external memory (via special command sequences in its tokens, most likely). Then you have to teach it/take steps to make it appropriately note which things are important or not and store/retrieve them as necessary. All of this requires tokens for input/output, so it will increase processing time even when used perfectly, and those tokens will also consume the existing context window.

One really big thing with LLMs now is it seems like they don't (and maybe can't) know what they know/don't know. They just predict tokens, they can't really do introspection. Of course, they can be trained to respond that they don't know certain things, but getting the LLM to decide it needs to use the external memory doesn't seem like the simplest thing.

I mean, take humans as an example: Are you effective at taking notes, organizing them in a way that lets you easily recall them in the future, etc? It's not even an easy skill for humans to develop, and we're relatively good at knowing what we don't know.

Another thing is the paper you linked to says it set the temperature to 0, to make the responses very deterministic. Generally this makes them a lot less creative as well. If you turn up temperature, you potentially increase the chances that the LLM generates malformed queries for the external memory or stuff like that.
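For reference, this is roughly how temperature is typically applied to a model's output logits (a made-up 3-entry example; real models have tens of thousands of entries):

```python
import math

# Low temperature sharpens the distribution (near-deterministic as it
# approaches 0); high temperature flattens it, making sampling more
# adventurous but also more likely to produce malformed output.
def softmax_with_temperature(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print([round(p, 2) for p in softmax_with_temperature(logits, 0.1)])  # [1.0, 0.0, 0.0]
print([round(p, 2) for p in softmax_with_temperature(logits, 2.0)])  # much flatter
```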

Anyway, I don't know much about the technical side of increasing the context window but when the context window is bigger the thing can just use it as far as I know. Taking advantage of some sort of external memory system seems like it's a very, very complicated thing to solve reliably.

Again, note this is coming from someone that doesn't really know much about ML, LLMs, etc. I'm just a normal developer, so take all this with a grain of salt.


KerfuffleV2 t1_jccb5v1 wrote

Sounds good! The 4bit stuff seems pretty exciting too.

By the way, not sure if you saw it but it looks like PyTorch 2.0 is close to being released:

They seem to be claiming you can just drop in torch.compile() and see benefits with no code changes.


KerfuffleV2 t1_jc3jith wrote

> Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

Nice, that makes a big difference! (And such a small change too.)

The highest speed I've seen so far is with something like cuda fp16i8 *15+ -> cuda fp16 *1 at about 1.21tps edit: I was mistaken, it was actually 1.17. Even cuda fp16i8 *0+ gets quite acceptable speed (.85-.88tps) and uses around 1.3GB VRAM.

I saw your response on GitHub. Unfortunately, I don't use Discord so hopefully it's okay to reply here.


KerfuffleV2 t1_jc1jtg5 wrote


I didn't want to clutter up the issue here:

In case this information is useful for you:

strategy                                          time   tps   tokens
cuda fp16 *0+ -> cuda fp16 *10                    45.44  1.12  51
cuda fp16 *0+ -> cuda fp16 *5                     43.73  0.94  41
cuda fp16 *0+ -> cuda fp16 *1                     52.70  0.83  44
cuda fp16 *0+ -> cpu fp32 *1                      59.06  0.81  48
cuda fp16i8 *12 -> cuda fp16 *0+ -> cpu fp32 *1   65.41  0.69  45

I ran the tests using this frontend:

It was definitely using rwkv version 0.3.1

env RWKV_JIT_ON=1 python \
  --rwkv-cuda-on \
  --rwkv-strategy STRATEGY_HERE \
  --model RWKV-4-Pile-7B-20230109-ctx4096.pth

For each test, I let it generate a few tokens first to let it warm up, then stopped it and let it generate a decent number. Hardware is a Ryzen 5 1600, 32GB RAM, GeForce GTX 1060 6GB VRAM.

Surprisingly, streaming everything as fp16 was still faster than putting 12 fp16i8 layers in VRAM. A 1060 is a pretty old card, so maybe it has unusual behavior dealing with that format. I'm not sure.


KerfuffleV2 t1_jc18f6a wrote

Huh, that's weird. You can try reducing the first one from 7 to 6 or maybe even 5:

cuda fp16 *6 -> cuda fp16 *0+ -> cpu fp32 *1

Also, be sure to double check for typos. :) Any incorrect numbers/punctuation will probably cause problems. Especially the "+" in the second part.


KerfuffleV2 t1_jbz7yfk wrote

I've been playing with this for a bit and I actually haven't found any case where fp16i8 worked better than halving the layers and using fp16.

If you haven't already tried it, give something like cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1 a try and see what happens. It's around twice as fast as cuda fp16i8 *16 -> cpu fp32 for me, which is surprising.

That one will use 7 fp16 layers on the GPU, and stream all the rest except the very last as fp16 on the GPU also. The 33rd layer gets run on the CPU. Not sure if that last part makes a big difference.


KerfuffleV2 t1_jbrb0qa wrote

I'm definitely not qualified to answer a question like that. I'm just a person who managed to get it working on a 6GB VRAM GPU. Basically, as far as I understand, the more you can run on the GPU the better. So it really depends on what other stuff is using your GPU's memory.

Like I mentioned, when I got it working I already had about 1.25G used by other applications and my desktop environment. From my calculations, it should be possible to fit 21, maybe 22 layers onto the GPU as long as nothing else is using it (so basically, you'd have to be in text mode with no desktop environment running).

If you're using Linux and an Nvidia card then you can try install an application called nvtop — it can show stuff like VRAM usage, etc. The way to install it will be specific to your distribution, so I can't help you with that. If you're using Windows or a different OS I can't really help you either.

But anyway, if you can find how much VRAM you have free, you can look at how much of that loading 16 layers uses and calculate how many more you can add before you run out.
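That calculation can be sketched like this (all the numbers are hypothetical; plug in what nvtop or similar reports for your system):

```python
# If loading some number of layers used a known amount of VRAM, each
# layer costs roughly used / layers, so free VRAM divided by that
# per-layer cost is about how many more layers should fit.
def extra_layers(free_gb, used_gb, layers_loaded=16):
    per_layer = used_gb / layers_loaded
    return int(free_gb // per_layer)

# e.g. 16 layers using 3.5GB, with 1.2GB of VRAM still free:
print(extra_layers(1.2, 3.5))  # 5 more layers
```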

That's still not necessarily going to be optimal, though. I don't know how the speed/precision tradeoff between fp16 and fp16i8 works, or stuff like that. It's not impossible that some other combination of parameters would be better in some way than just trying to fit as much as possible onto the GPU in fp16i8 format. You'd have to ask someone more knowledgeable for a real answer.


KerfuffleV2 t1_jbr6r2f wrote

> I'm actually using the oobabooga text generation webui on github

I'm not familiar with that. It does seem like it can use RWKV and supports passing strategy though:

Are you already using that flag with the correct parameter?


KerfuffleV2 t1_jbqtx6j wrote

Note: I'm just a random person on the internet, no affiliation to OP. I also don't really know what I'm doing here, so follow my advice at your own risk.

cuda fp16i8 *16 -> cpu fp32 as the strategy means use 16 fp16i8 format CUDA layers and then put the rest on the CPU (as fp32). So if you want to reduce how many layers go to the GPU, you'd reduce "16" there.

Assuming we're talking about the same thing, you'd have the ChatRWKV repo checked out and be editing v2/

There should be a line like:

args.strategy = 'cuda fp16i8 *16 -> cpu fp32'

Either make sure any other lines setting args.strategy in that area are commented out, or make sure the one with the setting you want to use is the last one. (Otherwise the other variable assignment statements would override what you added.)