
Cryptheon t1_j1cq8q3 wrote

Hi, I'm a high-performance machine learning consultant working on this. I've run BLOOM on a cluster (not exactly AWS/Azure).

You could, if you have a GPU with enough VRAM for a single layer, run BLOOM on one GPU by loading it one layer at a time; this can be done simply and naively with Hugging Face. I've tested this, for instance, on 4 NVIDIA A100s with 40GB VRAM each (160GB VRAM in total). Out of the box, inference for 50 tokens still took 40 minutes using bf16. If you want to bring this down and make it cost effective, you need at least 8 80GB A100s (640GB VRAM). Int8 will cut the memory requirement roughly in half, but that means sacrificing inference speed due to the nature of the int8 method. On top of that, there are still cluster-level optimizations you'll have to do if you really want to bring inference time down to a few milliseconds per generated token. This is probably how OpenAI does it: they keep models continuously loaded on their GPUs, with highly optimized serving, so we can all use their models en masse.
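For reference, a minimal sketch of that naive offloaded setup with `transformers` + `accelerate` (the model ID and memory budgets below are assumptions, not my exact config; adjust to your hardware):

```python
# Minimal sketch: accelerate's device_map places BLOOM's layers on whatever fits
# on the GPU and offloads the rest to CPU RAM / disk, loading them as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"  # 176B checkpoint on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # bf16, as in the timings above
    device_map="auto",                          # let accelerate place the layers
    max_memory={0: "38GiB", "cpu": "200GiB"},   # assumed budget: one 40GB A100 + CPU RAM
    offload_folder="offload",                   # spill whatever still doesn't fit to disk
    # load_in_8bit=True,                        # int8 (bitsandbytes) roughly halves memory, at a speed cost
)

inputs = tokenizer("BLOOM is", return_tensors="pt").to(0)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```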

Point being, this is not trivial to do and will cost money, expertise and time. Besides, BLOOM is not the best model performance-wise because it's a multilingual model. As others have mentioned, OpenAI's ChatGPT has additionally been trained with RL (PPO) on data we don't have access to.

11

Evoke_App t1_j1cwajo wrote

>run BLOOM on one GPU by loading it one layer at a time; this can be done simply and naively with Hugging Face
>
>I've tested this, for instance, on 4 NVIDIA A100s with 40GB VRAM each (160GB VRAM in total)

Is it possible to load it one layer at a time using 24x32GB V100s as well? And would that save on costs (compared to 8x80GB A100s) without sacrificing too much throughput?

I'd just like to see if this is worth it before delving too deep into it haha.

1

Cryptheon t1_j1g9v0r wrote

With that much VRAM you won't need to load it one layer at a time: 24x32GB V100s (768GB) are enough to hold the whole model and do inference. The main bottlenecks are GPU-to-GPU communication and the speed of the GPUs themselves.
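A rough sketch of how you could shard it across those cards with `accelerate` (the per-card headroom here is an assumption):

```python
# Hypothetical sketch: shard BLOOM across 24 x 32GB V100s instead of offloading layer by layer.
import torch
from transformers import AutoModelForCausalLM

max_memory = {i: "30GiB" for i in range(24)}   # leave some headroom on each 32GB card
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    torch_dtype=torch.float16,                 # V100s don't support bf16, so fp16 here
    device_map="auto",                         # accelerate splits the layers across the 24 GPUs
    max_memory=max_memory,
)
```

Note this naive placement is pipeline-style: only the GPU holding the current layer is busy at any moment, which is part of why GPU-to-GPU communication and per-GPU speed end up dominating.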

In theory you can use a single 16GB+ GPU and load it one layer at a time, but that's far too slow for generation. In my tests, each layer load + forward pass took ~1.2s, and BLOOM (176B) has ~72 layers, so a single token prediction takes roughly 1.5 minutes with this method. That's waaaay too slow.
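Back-of-envelope, using the numbers above:

```python
# Back-of-envelope check of the layer-by-layer estimate above.
seconds_per_layer = 1.2            # measured: load one layer + run its forward pass
num_layers = 72                    # roughly BLOOM's layer count, as stated above
seconds_per_token = seconds_per_layer * num_layers
print(f"~{seconds_per_token:.0f}s per token (~{seconds_per_token / 60:.1f} min)")
# -> ~86s per token (~1.4 min), i.e. the "roughly 1.5 min" quoted above
```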

2