Submitted by _underlines_ t3_zstequ in MachineLearning
Evoke_App t1_j1cwajo wrote
Reply to comment by Cryptheon in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
>run BLOOM on one GPU by running it one layer at a time; this can be done simply and naively using Hugging Face
>
>I've tested this, for instance, using 4x 40 GB NVIDIA A100s (160 GB of VRAM in total)
Is it possible to load it one layer at a time using 24x 32 GB V100s as well? And would that save on costs (compared to 8x 80 GB A100s) without sacrificing too much throughput?
I'd just like to see if this is worth it before delving too deep into it haha.
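For reference, the "one layer at a time" setup from the quoted comment can be approximated with stock Hugging Face tooling by letting Accelerate offload whatever doesn't fit on the GPU. A minimal sketch; the model name, dtype, and memory budgets are illustrative, not the commenter's exact setup:

```python
# Sketch: offloaded BLOOM inference with transformers + Accelerate.
# Layers that don't fit in the GPU budget are kept on CPU/disk and streamed
# in as they are needed -- a naive approximation of layer-at-a-time loading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"   # ~176B params; try bloom-7b1 first for a cheap test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                          # let Accelerate place layers on GPU/CPU/disk
    torch_dtype=torch.bfloat16,
    max_memory={0: "20GiB", "cpu": "200GiB"},   # per-device budgets (example values)
    offload_folder="offload",                   # spill the remainder to disk
)

inputs = tokenizer("When ChatGPT stops being free,", return_tensors="pt").to(0)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```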
Cryptheon t1_j1g9v0r wrote
With enough VRAM you won't need to load it one layer at a time. 24x 32 GB V100s should be enough to hold the whole model and run inference. The main bottlenecks are GPU-to-GPU communication and the raw inference speed of the GPUs.
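With enough cards, the same `device_map="auto"` path simply shards the layers across every visible GPU instead of offloading. A rough sketch, with per-GPU budgets that are only illustrative (fp16 weights for 176B parameters are ~350 GB, so 24x 32 GB leaves headroom):

```python
# Sketch: shard BLOOM across all visible GPUs (e.g. 24 x 32 GB V100s).
# Leave a few GiB per card free for activations and the KV cache.
import torch
from transformers import AutoModelForCausalLM

n_gpus = torch.cuda.device_count()
max_memory = {i: "28GiB" for i in range(n_gpus)}   # ~4 GiB headroom per 32 GB card

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",          # split layers across GPUs; no CPU/disk offload needed
    max_memory=max_memory,
    torch_dtype=torch.float16,  # fp16; V100 has no native bf16
)
print(model.hf_device_map)      # shows which layer landed on which GPU
```

Generation then works the same as in the single-GPU case; each forward pass hops from card to card, which is why inter-GPU bandwidth dominates latency.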
In theory you can use a single 16 GB+ GPU and load the model one layer at a time, but that takes far too long for generation. In my tests, loading plus inference for each layer took ~1.2 s, and BLOOM (176B) has about 70 decoder layers, so a single token prediction takes roughly 1.5 minutes this way. That's waaaay too slow.
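For scale, the same back-of-the-envelope numbers extended to a whole completion (the 100-token length is just an example):

```python
# Back-of-envelope latency for naive layer-by-layer generation.
layers = 70           # BLOOM-176B decoder blocks
per_layer_s = 1.2     # observed load + forward time per layer
tokens = 100          # length of an example completion

per_token_s = layers * per_layer_s
print(f"{per_token_s:.0f} s per token, "
      f"{per_token_s * tokens / 3600:.1f} h for {tokens} tokens")
# -> 84 s per token, 2.3 h for 100 tokens
```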