Submitted by head_robotics t3_1172jrs in MachineLearning

I've been looking into open source large language models to run locally on my machine.

Seems GPT-J and GPT-Neo are out of reach for me because of RAM / VRAM requirements.

What models would be doable with this hardware?:

CPU: AMD Ryzen 7 3700X 8-Core, 3600 MHz
RAM: 32 GB

GPUs:

  1. NVIDIA GeForce RTX 2070 8GB VRAM
  2. NVIDIA Tesla M40 24GB VRAM
220

Comments


Disastrous_Elk_6375 t1_j99ry6s wrote

GPT-NeoX should fit in 24GB of VRAM with 8-bit, for inference.

I managed to run GPT-J 6B on a 3060 w/ 12GB and it takes about 7.2GB of VRAM.
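
Roughly like this with the HF stack, if anyone wants a starting point (needs transformers + accelerate + bitsandbytes installed; model name and prompt are just examples, untested sketch):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # example; swap in GPT-NeoX etc.

tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the weights via bitsandbytes;
# device_map="auto" lets accelerate place them on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# rough check of peak VRAM use during generation
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```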

52

gliptic t1_j99y0cp wrote

RWKV can run on very little VRAM with rwkvstic streaming and 8-bit. I haven't tested streaming, but I expect it's a lot slower. The 7B model sadly takes 8 GB with just 8-bit quantization.

39

ArmagedonAshhole t1_j9a1vq3 wrote

It depends mostly on settings, so no.

A small context like 200-300 tokens could work with 24GB, but then your AI will not remember and connect the dots well, which would make the model worse than a 13B one.

People are working right now on splitting work between GPU (VRAM) and CPU (RAM) in 8-bit mode. I think offloading something like 10% to RAM would make the model work well on a 24GB VRAM card. It would be a bit slower but still usable.

If you want, you can always load the whole model into RAM and run it on the CPU, but it is very slow.
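
Some of that split already exists through accelerate's device_map; rough sketch below (checkpoint name and memory caps are just illustrative, and it shows the fp16 variant since 8-bit plus CPU offload is still being worked on):

```python
import torch
from transformers import AutoModelForCausalLM

# Cap how much of the model goes to the GPU; the remaining layers are
# placed in CPU RAM and executed there (slower, but it fits).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",                 # example checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "28GiB"},   # illustrative limits for a 24GB card
)
print(model.hf_device_map)  # shows which layers landed on GPU vs CPU
```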

12

avocadoughnut t1_j9a64k1 wrote

Yup. I'd recommend using whichever RWKV model fits in fp16/bf16 (apparently 8-bit is 4x slower and lower accuracy). I've been running GPT-J on a 24GB GPU for months (longer contexts are possible using accelerate), and I noticed massive speed increases when using fp16 (or bf16? I don't remember) rather than 8-bit.
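
Rough sketch of the fp16 route (GPT-J as the example; swap the dtype for bf16 on cards that support it):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# fp16 halves the memory vs fp32 and, unlike 8-bit, adds no dequantization overhead
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,   # or torch.bfloat16
).to("cuda")

inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
print(f"~{100 / (time.time() - start):.1f} tokens/sec")  # rough throughput
```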

16

Rockingtits t1_j9afl0a wrote

Why not look into distilled models like DistilBERT?

2

wywywywy t1_j9apjs3 wrote

I had a 3070 with 8GB and I managed to run these locally through KoboldAI.

Meta OPT 2.7B
EleutherAI GPT-Neo 2.7B
BigScience Bloom 1.7B

32

wywywywy t1_j9ar2tk wrote

I did test larger ones, but they didn't run. I can't remember which ones, probably GPT-J. I recently got a 3090, so I can load larger models now.

As for quality, my use case is simple (writing prompts to help with writing stories & articles), nothing sophisticated, and they worked well. Until ChatGPT came along. I use ChatGPT instead now.

6

CommunismDoesntWork t1_j9b1qjb wrote

I'm surprised PyTorch doesn't have an option to load models partially on a just-in-time basis yet. That way even an infinitely large model could be inferred on.
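
accelerate gets part of the way there with CPU/disk offload; something like this (model name and paths are placeholders, untested sketch):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")   # example model
with init_empty_weights():
    # builds the module structure without allocating real weight tensors
    model = AutoModelForCausalLM.from_config(config)

# weights are then loaded shard by shard and dispatched to GPU / CPU / disk;
# offloaded layers are streamed in on demand during the forward pass
model = load_checkpoint_and_dispatch(
    model,
    "path/to/local/checkpoint",              # placeholder path to downloaded weights
    device_map="auto",
    no_split_module_classes=["GPTJBlock"],   # keep each transformer block on one device
    offload_folder="offload",                # spill-over goes here on disk
)
```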

7

wywywywy t1_j9b2kqu wrote

So, not scientific at all, but I've noticed that checkpoint file size × 0.6 is pretty close to the actual VRAM requirement for an LLM.

But you're right it'd be nice to have a table handy.
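
Trivial sketch of that rule of thumb against a local checkpoint folder (the 0.6 factor is just my eyeballed estimate, and the path is a placeholder):

```python
import os

def estimate_vram_gb(checkpoint_dir: str, factor: float = 0.6) -> float:
    """Rough VRAM estimate: total checkpoint size on disk times an empirical factor."""
    total_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(checkpoint_dir)
        for f in files
        if f.endswith((".bin", ".safetensors", ".pt"))
    )
    return total_bytes / 1024**3 * factor

print(f"~{estimate_vram_gb('models/gpt-j-6B'):.1f} GB VRAM")  # placeholder path
```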

11

Last-Belt-4010 t1_j9b8gtl wrote

Just a question: does this work with non-Nvidia GPUs, like Intel Arc and such?

2

catch23 t1_j9b9upb wrote

Could try something like this: https://github.com/Ying1123/FlexGen

This was only released a few hours ago, so there's no way you could have discovered it previously. Basically it makes use of various offloading strategies if your machine has lots of normal CPU memory. The paper authors were able to fit a 175B-parameter model on their lowly 16GB T4 GPU (with a machine with 200GB of normal memory).

56

Purplekeyboard t1_j9bd1jg wrote

Keep in mind, these smaller models are going to be a lot dumber than what you've likely seen in GPT-3.

15

AnothaUselessComment t1_j9c9er6 wrote

Yikes, this may be tough.

I know you can try BLOOM (like this blog post did) and let it download overnight, but you may run into problems (I've heard the download takes forever).

https://enjoymachinelearning.com/blog/gpt-3-vs-bloom/

Though I will say, it's probably worth whatever cost you're trying to dodge just to hit an API, even if your hardware is great.

2

nikola-b t1_j9cqkys wrote

Not sure if this helps, but you can use our hosted flan-t5 model at deepinfra.com via the HTTP API. It's free for now. Disclaimer: I work at deepinfra. If you want GPT-Neo or GPT-J, I can deploy those as well.

3

pyonsu2 t1_j9ds6j5 wrote

Depends on what you're trying to do, but just use the OpenAI APIs. Your effort/time is also expensive.
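
E.g. with the openai Python client, something like this (model name and prompt are just examples):

```python
import openai

openai.api_key = "sk-..."  # placeholder key

response = openai.Completion.create(
    model="text-davinci-003",   # example model
    prompt="Write a short story about a robot learning to paint.",
    max_tokens=200,
)
print(response["choices"][0]["text"])
```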

3

xrailgun t1_j9dtp9c wrote

It might not be unreasonable to think that OP primarily wants the functionality of current LLMs, and if something can provide that more efficiently (or shows promise of doing so in the near future), they may want to know about it too.

6

smallfried t1_j9dtyf7 wrote

That is very interesting!

The paper is not yet on GitHub, but I'm assuming the hardware requirements are, as mentioned, one beefy consumer GPU (3090) and a whole bunch of DRAM (>210GB)?

I've played with OPT-175B, and with a bit of twiddling it can actually generate some Python code :)

This is very exciting, as it gets these models into prosumer-range hardware!

8

catch23 t1_j9dxlze wrote

Their benchmark was done on a 16GB T4, which is anything but beefy. The T4 maxes out at 80W power consumption and was primarily marketed for model inference. The T4 is the cheapest GPU offered by Google Cloud.

6

Baeocystin t1_j9e6s12 wrote

The tl;dr for all GPU questions is that CUDA is the answer. There are no other even 'kinda' contenders.

I'm not happy about the monopoly, but that's where we're at, and there is nothing on the horizon pointing otherwise, either.

4

halixness t1_j9e80y1 wrote

So far I have tried BLOOM via Petals (a distributed LLM); inference took me around 30s for a single prompt on an 8GB VRAM GPU. Not bad!
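
For anyone curious, the client usage is roughly this, going by the Petals docs (checkpoint and class names as of the current Petals release; treat it as a sketch):

```python
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

# The heavy transformer blocks run on volunteer servers in the swarm;
# only the embeddings and your activations stay on the local GPU.
model_name = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = DistributedBloomForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```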

1

Snoo9704 t1_j9e8k2w wrote

I'm a super noob at this, but is there a reason you can't substitute large amounts of VRAM with large amounts of DRAM?

I know RAM bandwidth is important, but does it make that much of a difference if I got 256GB of quad-channel DRAM and only 8GB of VRAM, compared to a more typical 32GB of DRAM and 24GB of VRAM?

2

marcus_hk t1_j9g5hns wrote

Seems it shouldn't be too difficult to run one stage or layer at a time and cache intermediate results.
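
Bare-bones illustration of the idea in plain PyTorch, with a toy stack of layers standing in for transformer blocks (a real implementation would also need to handle the KV cache and attention state):

```python
import torch
import torch.nn as nn

# toy stand-in for a big model: a stack of blocks that doesn't all fit in VRAM
blocks = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)]).to("cpu")

def forward_one_block_at_a_time(x: torch.Tensor) -> torch.Tensor:
    x = x.to("cuda")
    for block in blocks:
        block.to("cuda")   # stream this block's weights into VRAM
        x = block(x)       # keep only the intermediate activation
        block.to("cpu")    # evict the block before loading the next one
    return x.cpu()

out = forward_one_block_at_a_time(torch.randn(1, 1024))
print(out.shape)
```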

1

nikola-b t1_j9hk5q4 wrote

It's free for now; we haven't added the payment workflow yet. In the future, you'll be billed only for inference time, so with 1h you should be able to generate lots of tokens. Also, I added EleutherAI/gpt-neo-2.7B and EleutherAI/gpt-j-6B if the OP wants to try them.

3

tyras_ t1_j9pjkcx wrote

I finally got some time and was excited to try it out. I have not seen many LLMs pretrained on biomedical data available anywhere.

Anyway, while I could log in without a problem, both cURL and deepctl return 401. Now I wonder whether it was cut off, or whether I missed some extra registration or authorization step that wasn't mentioned in the docs.

2