Submitted by _underlines_ t3_zstequ in MachineLearning
Cryptheon t1_j1g9v0r wrote
Reply to comment by Evoke_App in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
With enough VRAM, you won't need to load it one layer at a time. 24x 32GB V100s should be enough to hold the whole model and run inference. The main bottlenecks are GPU-to-GPU communication and the raw speed of the GPUs during inference.
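For reference, a minimal sketch of what that multi-GPU setup could look like with Hugging Face transformers + accelerate (the `bigscience/bloom` checkpoint name and bfloat16 dtype are just illustrative, and you'd need accelerate installed plus enough aggregate VRAM across the visible GPUs):

```python
# Sketch: shard BLOOM's layers across all available GPUs and run inference.
# device_map="auto" places layers on devices; moving hidden states between
# GPUs is the communication overhead mentioned above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",          # split layers across GPUs automatically
    torch_dtype=torch.bfloat16, # assumption: bf16 weights to halve memory
)

inputs = tokenizer("The main bottleneck is", return_tensors="pt").to(0)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```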
In theory you can use a single 16GB+ GPU and load the model one layer at a time, but generation becomes far too slow. In my tests, loading plus inference for each layer took ~1.2 s, and BLOOM 175B has roughly 72 layers, so predicting just one token takes about 1.5 minutes with this method. That's way too slow.
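The back-of-envelope math, using the numbers above (per-layer time and layer count are the figures from my tests, not exact specs):

```python
# Latency estimate for the one-layer-at-a-time approach on a single GPU.
seconds_per_layer = 1.2   # observed: load one layer from disk + run its forward pass
num_layers = 72           # approximate number of BLOOM 175B transformer blocks
seconds_per_token = seconds_per_layer * num_layers
print(f"~{seconds_per_token:.0f} s per token (~{seconds_per_token / 60:.1f} min)")
# -> ~86 s per token, i.e. roughly 1.5 minutes for a single generated token
```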