Viewing a single comment thread. View all comments

Straight-Comb-6956 t1_jd08cq1 wrote

LLaMA/Alpaca work just fine on CPU with llama.cpp/alpaca.cpp. Not very snappy (1-15 tokens/s depending on model size), but fast enough for me.

39

lurkinginboston t1_jd0zr7c wrote

I will assume you are much more knowledgeable than I am in this space. I have a few basic questions that have been bothering me since all the craze started around GPT and LLMs recently.

I managed to get Alpaca working on my end using the above link and got very good results. LLaMA's biggest takeaway was that it's able to produce quality comparable to GPT at a much lower compute cost. If this is the case, why is the output much shorter on LLaMA than what I get on OpenGPT? I would imagine the OpenGPT response is much longer because ... it is just bigger? What is the limiting factor preventing us from getting a generated response as long as GPT's?

ggml-alpaca-7b-q4.bin is only 4 gigabytes - I guess this is what it means by 4-bit and 7 billion parameters. Not sure if rumor or fact, but the GPT3 model is 128B. Does that mean that if we get the trained GPT model and manage to run 128B locally, it will give us the same results? Will it be possible to retrofit the GPT model into Alpaca.cpp with minor enhancements to get output JUST like OpenGPT? I have read that to fit 128B, it requires multiple Nvidia A100s.
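The file size roughly checks out as a back-of-the-envelope calculation (a sketch only; the real GGML file also stores quantization scale factors and metadata, so it comes out somewhat above the raw 4 bits per parameter):

```python
# Rough size of the raw weights for a 4-bit quantized 7B-parameter model.
params = 7_000_000_000
bits_per_param = 4
size_gb = params * bits_per_param / 8 / 1e9  # bits -> bytes -> gigabytes
print(f"~{size_gb:.1f} GB")  # ~3.5 GB of raw weights; overhead pushes it toward 4 GB
```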

Last question: inference means that it gets output from a trained model. Meta/OpenAI/Stability.ai have the resources to train a model. If my understanding is correct, Alpaca.cpp or https://github.com/ggerganov/llama.cpp are a sort of 'front-end' for these models. They allow us to provide an input and get an output by inference with the model. The question I am trying to ask is: what is so great about llama.cpp? Is it because it's in C? I know there is a Rust version of it out, but it uses llama.cpp behind the scenes. Is there any advantage to an inference engine being written in Go or Python?

10

KerfuffleV2 t1_jd1kfyp wrote

Note: Not the same person.

> I would imagine the OpenGPT reponse is much longer because ... it is just bigger?

llama.cpp recently added a command-line flag to stop the end-of-message marker from being generated, so that's one way you can try to force responses to be longer. (It doesn't always work, because the LLM can start generating irrelevant content.)
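The idea behind that flag can be sketched like this (a toy illustration with made-up logits and token IDs, not llama.cpp's actual API): before sampling, the end-of-sequence token's score is forced to negative infinity so it can never be picked, and generation keeps going.

```python
import math

def ban_eos(logits, eos_id):
    # With the EOS logit at -inf, softmax assigns it probability 0,
    # so the sampler can never choose to stop the response there.
    out = list(logits)
    out[eos_id] = -math.inf
    return out

logits = [1.2, 0.3, 2.5]  # hypothetical scores for a tiny 3-token vocabulary
eos_id = 2                # pretend token 2 is the end-of-sequence marker
banned = ban_eos(logits, eos_id)
best = max(range(len(banned)), key=lambda i: banned[i])
print(best)  # greedy sampling now picks token 0 instead of the EOS token
```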

The length of the response isn't directly related to the size of the model, but having less information available or relevant could mean it has less to talk about in a response.

> GPT3 model is 128B, does it mean if we get trained model of GPT, and manage to run 128B locally, will it give us the same results?

If you have the same model and you give it the same prompt, you should get the same result. Keep in mind that if you're using a service like ChatGPT, you aren't directly controlling the full prompt. I don't know about OpenGPT, but from what I know, ChatGPT has a lot of special sauce, not just in the training but in other machinery too, like having another LLM write summaries for it so it keeps track of context better, etc.

> Last question, inference means that it gets output from a trained model.

Inference is running a model that's already been trained, as far as I know.

> If my understanding is correct, Alpaca.cpp or https://github.com/ggerganov/llama.cpp are a sort of 'front-end' for these model.

The model is a bunch of data that was generated by training. Something like llama.cpp is what actually uses that data: keeping track of the state, parsing user input into tokens that can be fed to the model, performing the math calculations that are necessary to evaluate its state, etc.
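That loop can be sketched in miniature (a toy greedy-generation loop with stand-in `tokenize`/`eval_model` functions; a real runtime like llama.cpp does expensive tensor math where the lookup table is):

```python
# Toy sketch of what an inference runtime does: tokenize the input,
# repeatedly evaluate the model on the state, append the next token.

VOCAB = {"hello": 0, "world": 1, "<eos>": 2}
INV = {v: k for k, v in VOCAB.items()}

def tokenize(text):
    return [VOCAB[w] for w in text.split()]

def eval_model(tokens):
    # Stand-in for the heavy matrix math: a fixed rule saying
    # "hello" -> "world" -> "<eos>". A real model returns logits here.
    return {0: 1, 1: 2, 2: 2}[tokens[-1]]

def generate(prompt, max_tokens=8):
    tokens = tokenize(prompt)
    out = []
    for _ in range(max_tokens):
        nxt = eval_model(tokens)
        if nxt == VOCAB["<eos>"]:
            break
        tokens.append(nxt)
        out.append(INV[nxt])
    return " ".join(out)

print(generate("hello"))  # -> "world"
```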

"Gets its output from" and "front-end" sound like kind of weird ways to describe what's going on. Just as an example, modern video formats and compression for video/audio are pretty complicated. Would you say that a video player "gets its output" from the video file, or is a front-end for a video file?

> The question I am trying to ask is, what is so great about llama.cpp?

I mean, it's free software that works pretty well and puts evaluating these models in reach of basically everyone. That's great. It's also quite fast for something running purely on CPU. What's not great about that?

> I know there is Rust version of it out, but it uses llama.cpp behind the scene.

I don't think this is correct. It is true that the Rust version is (or started out as) a port of the C++ version, but it's not using it behind the scenes. However, there's a math library called GGML that both programs use; it does the heavy lifting of performing the calculations on the data in the models.

> Is there any advantage of an inference to be written in Go or Python?

Same advantage as writing anything in Go, which is... Just about nothing in my opinion. See: https://fasterthanli.me/articles/i-want-off-mr-golangs-wild-ride

Seriously though, this is a very, very general question and could be asked about basically any project and any set of programming languages. There are strengths and weaknesses. Rust's strengths are high performance, the ability to do low-level stuff like C, and a lot of features aimed at writing very reliable software that handles things like edge cases. This comes at the expense of having to deal with all those details. On the other hand, a language like Python is very high-level. You can just throw something together and ignore a lot of details, and it can still work (unless it runs into an unhandled case). It's generally a lot slower than languages like Rust, C, C++, and even Go.

However, for running LLMs, most of the processing is math calculations, and that will mean calling into external libraries/modules written in high-performance languages like C, Rust, etc. Assuming a Python program takes advantage of that kind of resource, I wouldn't expect it to be noticeably slow.
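That point is easy to demonstrate: in the snippet below, essentially all the work happens inside NumPy's compiled BLAS routines, so the interpreted Python layer barely matters (the matrix sizes are arbitrary, just big enough that the math dominates):

```python
import time
import numpy as np

# A matrix multiply like the ones that dominate LLM inference. NumPy
# dispatches this to a compiled BLAS routine; an equivalent pure-Python
# triple loop would be orders of magnitude slower for the same arithmetic.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)

t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0
print(c.shape, f"{elapsed * 1000:.1f} ms")
```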

So, like a lot of the time, it comes down to personal preference of what the developer wants to use. The person who wrote the Rust version probably like Rust. The person who wrote the C++ version likes C++, etc.

13

keeplosingmypws t1_jd5xygm wrote

I have the 16B parameter version of Alpaca.cpp (and a copy of the training data as well as the weights) installed locally on a machine with an Nvidia 3070 GPU. I know I can launch my terminal using the Discrete Graphics Card option, but I also believe this version was built for CPU use, and I'm guessing that I'm not getting the most out of my graphics card.

What’s the move here?

1

KerfuffleV2 t1_jd7sb4u wrote

llama.cpp and alpaca.cpp (and also related projects like llama-rs) only use the CPU. So not only are you not getting the most out of your GPU, it's not getting used at all.

I have an old GPU with only 6GB so running larger models on GPU isn't practical for me. I haven't really looked at that aspect of it much. You could start here: https://rentry.org/llama-tard-v2
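A rough VRAM estimate shows why (counting only the 4-bit quantized weights and ignoring context/activation overhead, so real usage is higher):

```python
# Approximate VRAM needed just to hold 4-bit quantized weights
# for common LLaMA model sizes, in gigabytes.
sizes = {}
for params_b in (7, 13, 30, 65):
    sizes[params_b] = params_b * 1e9 * 4 / 8 / 1e9  # bits -> GB
    print(f"{params_b}B -> ~{sizes[params_b]:.1f} GB")
```

By this estimate even a 13B model (~6.5 GB of weights alone) already overflows a 6 GB card.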

Keep in mind you will need to be pretty comfortable with technical stuff to get it working from those instructions, even though they are detailed.

1

keeplosingmypws t1_jd9wpwm wrote

Thanks for leading me in the right direction! I’ll letcha know if I get it working

1

Unlucky_Excitement_2 t1_jdavhcr wrote

Bro, what are you talking about? LOL. It's context length he's discussing. There are multiple ways (all of which I'm experimenting with):

  1. flash attention
  2. strided context window
  3. finetuning on a dataset with longer sequences

0

KerfuffleV2 t1_jdbrkc1 wrote

Uh, did you reply to the wrong person or something? Your post doesn't have anything to do with either mine or the parent.

3

gliptic t1_jd2bsc7 wrote

In fact, GPT3 is 175B. But GPT3 is old now and doesn't make effective use of those parameters.

1

uspmm2 t1_jd1jh1b wrote

are you talking about the 30b one?

1

Straight-Comb-6956 t1_jd1srkd wrote

Haven't tried the 30B model. 65B takes 900ms/token on my machine.

3

msgs t1_jd46yf9 wrote

do you have a link to a torrent/download for the 30B or 65B weights that works with Alpaca.cpp? Reddit DMs are fine if you don't want to post it publicly.

1

SpiritualCyberpunk t1_jd1m06i wrote

Idk why, but after the first answer to a question addressed to it, mine spewed out random nonsense. Literally unrelated things.

1