Submitted by head_robotics t3_1172jrs in MachineLearning

I've been looking into open source large language models to run locally on my machine.

Seems GPT-J and GPT-Neo are out of reach for me because of RAM / VRAM requirements.

What models would be doable with this hardware?:

CPU: AMD Ryzen 7 3700X 8-Core, 3600 MHz
RAM: 32 GB

GPUs:

  1. NVIDIA GeForce RTX 2070 8GB VRAM
  2. NVIDIA Tesla M40 24GB VRAM
220

Comments


catch23 t1_j9b9upb wrote

Could try something like this: https://github.com/Ying1123/FlexGen

This was only released a few hours ago, so there's no way you could have discovered it previously. It basically uses various offloading strategies when your machine has lots of ordinary CPU memory. The paper's authors were able to fit a 175B-parameter model on their lowly 16GB T4 GPU (in a machine with 200GB of system RAM).

56
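For a sense of what this kind of weight offloading looks like in practice, here's a minimal sketch using Hugging Face transformers with accelerate rather than FlexGen itself (FlexGen ships its own runtime); the model name and folder are placeholders:

```python
# Minimal offloading sketch with transformers + accelerate (both assumed installed).
# This is not FlexGen, just the same general idea: keep what fits on the GPU and
# spill the remaining layers to CPU RAM and disk.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder; pick whatever checkpoint you're targeting

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",         # fill the GPU first, then CPU RAM, then disk
    offload_folder="offload",  # directory for layers that don't fit in RAM
)

# Inputs can stay on the CPU; accelerate's hooks move tensors to the right device.
inputs = tokenizer("Open source LLMs I can run locally:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```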

smallfried t1_j9dtyf7 wrote

That is very interesting!

The paper is not yet on GitHub, but I'm assuming the hardware requirements are as mentioned: one beefy consumer GPU (a 3090) and a whole bunch of DRAM (>210GB)?

I've played with opt-175b and with a bit of twiddling it can actually generate some Python code :)

This is very exciting, as it brings these models into the prosumer hardware range!

8

catch23 t1_j9dxlze wrote

Their benchmark was done on a 16GB T4, which is anything but beefy. The T4 maxes out at 80W power consumption and was primarily marketed for model inference; it's the cheapest GPU offered by Google Cloud.

6

EuphoricPenguin22 t1_j9c51t7 wrote

Does that increase inference time?

1

catch23 t1_j9cd5tw wrote

It does look to be 20-100x slower for those huge models, but that's still bearable if you're the only user on the machine, and still better than nothing if you don't have lots of GPU memory.

14

EuphoricPenguin22 t1_j9ceqy4 wrote

Yeah, and DDR4 DIMMs are fairly inexpensive compared to upgrading to a GPU with more VRAM.

6

luaks1337 t1_j9cajyf wrote

Yes, at least if I read the documentation correctly.

1

Disastrous_Elk_6375 t1_j99ry6s wrote

GPT-NeoX should fit in 24GB VRAM with 8bit, for inference.

I managed to run GPT-J 6B on a 3060 w/ 12GB and it takes about 7.2GB of VRAM.

52
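For anyone wanting to reproduce that, here's a rough sketch of the 8-bit load with transformers, bitsandbytes, and accelerate (all assumed installed); the prompt is just an example:

```python
# 8-bit inference sketch: load_in_8bit quantizes the linear layers to int8 via
# bitsandbytes, which is what brings GPT-J 6B down to the ~7 GB VRAM range.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",   # required when loading in 8-bit
    load_in_8bit=True,   # int8 weights via bitsandbytes
)

prompt = "The best open model for a 12GB GPU is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```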

ArmagedonAshhole t1_j99tr0r wrote

>GPT-NeoX should fit in 24GB VRAM with 8bit, for inference.

GPT-NeoX-20B will fit in 24GB of VRAM, but it will almost instantly go out of memory once the context gets a bit bigger than the starting page of sentences.

30

Disastrous_Elk_6375 t1_j99xxfa wrote

Are there any rough numbers on prompt size vs. RAM usage after the model loads? I haven't played with GPT-NeoX yet.

10

ArmagedonAshhole t1_j9a1vq3 wrote

It depends mostly on settings, so no.

A small context, like 200-300 tokens, could work with 24GB, but then your AI will not remember and connect the dots well, which would make the model worse than a 13B one.

People are working right now on splitting work between the GPU (VRAM) and CPU (RAM) in 8-bit mode. I think offloading around 10% to RAM would make the model work well on a 24GB VRAM card. It would be a bit slower but still usable.

If you want, you can always load the whole model into RAM and run it on the CPU, but it is very slow.

12
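One off-the-shelf way to do a VRAM/RAM split today is the max_memory map in transformers/accelerate; here's a minimal fp16 sketch (not the 8-bit split mentioned above, and the limits below are placeholders to tune for your card):

```python
# Split a model between GPU VRAM and system RAM by capping per-device memory.
# Layers that don't fit under the GPU limit are placed in CPU RAM and shuttled
# to the GPU on the fly by accelerate's hooks (slower, but it runs).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "30GiB"},  # leave VRAM headroom for the KV cache
)
```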

gliptic t1_j99y0cp wrote

RWKV can run on very little VRAM with rwkvstic streaming and 8-bit. I haven't tested streaming, but I expect it's a lot slower. The 7B model sadly takes 8 GB even with 8-bit quantization.

39

avocadoughnut t1_j9a64k1 wrote

Yup. I'd recommend using whichever RWKV model fits in fp16/bf16 (apparently 8-bit is 4x slower and less accurate). I've been running GPT-J on a 24GB GPU for months (longer contexts are possible using accelerate), and I noticed massive speed increases when using fp16 (or bf16? I don't remember) rather than 8-bit.

16

wywywywy t1_j9apjs3 wrote

I had a 3070 with 8GB and I managed to run these locally through KoboldAI.

Meta OPT 2.7B
EleutherAI GPT-Neo 2.7B
BigScience Bloom 1.7B

32

xrailgun t1_j9aq903 wrote

Did you try anything larger that wouldn't run?

Also, any comments on those so far? Good? Bad? Easy? Etc.?

4

wywywywy t1_j9ar2tk wrote

I did test larger ones, but they didn't run. I can't remember which, probably GPT-J. I recently got a 3090, so I can load larger models now.

As for quality, my use case is simple (writing prompts to help with stories & articles), nothing sophisticated, and they worked well, until ChatGPT came along. I use ChatGPT instead now.

6

xrailgun t1_j9avboh wrote

Thanks!

I wish model publishers would indicate rough (V)RAM requirements...

4

wywywywy t1_j9b2kqu wrote

So, not scientific at all, but I've noticed that checkpoint file size * 0.6 is pretty close to the actual VRAM requirement for an LLM.

But you're right, it'd be nice to have a table handy.

11
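That rule of thumb as a tiny helper, purely the heuristic above and nothing measured:

```python
# Back-of-the-envelope VRAM estimate: checkpoint size on disk * 0.6.
import os

def rough_vram_gb(checkpoint_path: str, factor: float = 0.6) -> float:
    """Very rough VRAM estimate in GB from a checkpoint file's size."""
    return os.path.getsize(checkpoint_path) / 1e9 * factor

# e.g. a ~24 GB checkpoint file comes out to roughly 14-15 GB of VRAM by this heuristic
```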

Purplekeyboard t1_j9bd1jg wrote

Keep in mind, these smaller models are going to be a lot dumber than what you've likely seen in GPT-3.

15

CommunismDoesntWork t1_j9b1qjb wrote

I'm surprised PyTorch doesn't have an option to load models partially on a just-in-time basis yet. That way even an arbitrarily large model could be inferred on.

7

nikola-b t1_j9cqkys wrote

Not sure if this helps, but you can use our hosted FLAN-T5 model at deepinfra.com via an HTTP API. It's free for now. Disclaimer: I work at deepinfra. If you want GPT-Neo or GPT-J, I can deploy those as well.

3

tyras_ t1_j9e9kp0 wrote

Free for now or free for an hour as the pricing tab indicates?

3

nikola-b t1_j9hk5q4 wrote

Free for now; we haven't added the payment workflow yet. In the future you'll be billed only for inference time, so with 1h you should be able to generate lots of tokens. Also, I added EleutherAI/gpt-neo-2.7B and EleutherAI/gpt-j-6B if the OP wants to try them.

3

tyras_ t1_j9pjkcx wrote

I finally got some time and was excited to try it out. I haven't seen many LLMs pretrained on biomedical data available anywhere.

Anyway, while I could log in without a problem, both cURL and deepctl return 401. Now I wonder whether it was cut off, or whether I missed some extra registration or authorization step that wasn't mentioned in the docs.

2

nikola-b t1_j9ujdux wrote

There was an auth bug in the code. Sorry about that. Please try again now.

1

pyonsu2 t1_j9ds6j5 wrote

Depends on what you're trying to do, but just use the OpenAI APIs. Your effort/time is also expensive.

3

Rockingtits t1_j9afl0a wrote

Why not look into distilled models like DistilBERT?

2

Emergency_Apricot_77 t1_j9b68si wrote

They literally asked for LARGE language models

12

xrailgun t1_j9dtp9c wrote

It might not be unreasonable to think OP primarily wants the functionality of current LLMs, and if something can provide that more efficiently (or promises to in the near future), they may want to know about it too.

6

Last-Belt-4010 t1_j9b8gtl wrote

Just a question: does this work with non-NVIDIA GPUs, like Intel Arc and such?

2

Baeocystin t1_j9e6s12 wrote

The tl;dr for all GPU questions is that CUDA is the answer. There are no other even 'kinda' contenders.

I'm not happy about the monopoly, but that's where we're at, and there is nothing on the horizon pointing otherwise, either.

4

AnothaUselessComment t1_j9c9er6 wrote

Yikes, this may be tough.

I know you can try BLOOM (like this blog post did) and let it download overnight, but you may run into problems (I've heard the download takes forever).

https://enjoymachinelearning.com/blog/gpt-3-vs-bloom/

Though I will say, it's probably worth whatever cost you're trying to dodge just to hit an API, even if your hardware is great.

2

Snoo9704 t1_j9e8k2w wrote

I'm a super noob at this, but is there a reason you can't substitute large amounts of VRAM with large amounts of DRAM?

I know RAM bandwidth is important, but does it make that much of a difference if I have 256GB of quad-channel DRAM and only 8GB of VRAM, compared to a more typical 32GB of DRAM and 24GB of VRAM?

2
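A back-of-the-envelope answer to why bandwidth matters so much; both bandwidth figures below are rough assumptions, not measurements:

```python
# Token-by-token generation is largely memory-bandwidth-bound: every token has to
# read (roughly) all the weights once. Assumed figures: ~100 GB/s for quad-channel
# DDR4, ~900 GB/s for a high-end GPU's GDDR6X.
weights_gb = 12    # e.g. a ~6B-parameter model in fp16
ddr4_gbps = 100    # assumed quad-channel DDR4 bandwidth
vram_gbps = 900    # assumed high-end VRAM bandwidth

print(f"CPU RAM floor: ~{weights_gb / ddr4_gbps * 1000:.0f} ms per token")
print(f"VRAM floor:    ~{weights_gb / vram_gbps * 1000:.0f} ms per token")
```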

YinYang-Mills t1_j9dtwjh wrote

Is there a way to do it with single precision?

1

halixness t1_j9e80y1 wrote

So far I have tried BLOOM on Petals (a distributed LLM setup); inference took me around 30s for a single prompt on an 8GB VRAM GPU, but that's not bad!

1
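For reference, a minimal sketch of what running BLOOM over Petals looked like; the import and checkpoint names are recalled from the Petals README of the time and should be treated as assumptions:

```python
# Petals splits the model across volunteer servers; only a small client runs locally.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals")

inputs = tokenizer("A quick test prompt:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```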

marcus_hk t1_j9g5hns wrote

Seems it shouldn't be too difficult to run one stage or layer at a time and cache intermediate results.

1
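A toy sketch of that idea, with made-up file names and plain linear layers standing in for transformer blocks: move one layer's weights to the GPU at a time, run it, and carry the intermediate activation forward.

```python
# Layer-at-a-time execution: only one layer's weights live on the GPU at once.
# File names and layer shapes are hypothetical; a real transformer would also
# need to stream the attention KV cache during generation.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layer_files = [f"layer_{i}.pt" for i in range(4)]  # hypothetical per-layer state dicts

@torch.no_grad()
def run_layerwise(x: torch.Tensor) -> torch.Tensor:
    h = x.to(device)
    for path in layer_files:
        layer = nn.Linear(1024, 1024)            # stand-in for a transformer block
        layer.load_state_dict(torch.load(path))  # load only this layer's weights
        layer.to(device)
        h = layer(h)                             # cache the intermediate result
        layer.to("cpu")                          # free VRAM before the next layer
        del layer
    return h
```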