
Cryptheon t1_j1g9v0r wrote

With enough VRAM you don't need to load it one layer at a time. 24x 32GB V100s (768 GB total) should be enough to hold the whole model in fp16 and run inference; at that point the main bottlenecks are GPU-to-GPU communication and the raw inference speed of the GPUs.
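
For reference, a minimal sketch of sharding the model across all visible GPUs with the Hugging Face transformers + accelerate stack (`device_map="auto"`). This is just one common way to do multi-GPU inference, not necessarily the exact setup described above:

```python
# Sketch: load BLOOM sharded across every visible GPU and generate a few tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # ~350 GB of fp16 weights

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # fp16 halves memory vs fp32
    device_map="auto",           # accelerate splits layers across the available GPUs
)

inputs = tokenizer("BLOOM is a large language model that", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Even with the weights fully resident, generation speed is then limited by how fast activations can be passed between GPUs and by the per-GPU compute.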

In theory you can use a single 16GB+ GPU and load the model one layer at a time, but that is far too slow for generation. In my tests, loading plus inference took ~1.2 s per layer, and BLOOM 175B has 72-ish layers, so a single token prediction takes roughly 72 × 1.2 s ≈ 1.5 minutes with this method. That's way too slow.
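
To make the pattern concrete, here is a toy sketch of the "one layer at a time" loop: each layer's weights live in their own checkpoint file on disk, get loaded to the GPU, applied, and freed before the next layer. The layer sizes, file names, and timings here are made up for illustration; the ~1.2 s/layer figure above came from the real BLOOM layers, not this toy:

```python
# Toy sketch of layer-by-layer offloaded inference (not the actual BLOOM code).
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden, n_layers = 1024, 8  # tiny stand-ins; the real model is far larger

# Pretend these per-layer checkpoints are the sharded model on disk.
for i in range(n_layers):
    torch.save(nn.Linear(hidden, hidden).state_dict(), f"layer_{i}.pt")

x = torch.randn(1, hidden, device=device)
start = time.time()
for i in range(n_layers):
    layer = nn.Linear(hidden, hidden)
    layer.load_state_dict(torch.load(f"layer_{i}.pt"))  # load this layer's weights from disk
    layer = layer.to(device)
    with torch.no_grad():
        x = layer(x)                                     # run just this layer
    del layer                                            # drop the weights before loading the next layer
    if device == "cuda":
        torch.cuda.empty_cache()
print(f"{(time.time() - start) / n_layers:.3f} s per layer (toy numbers)")
```

The disk-to-GPU transfer dominates each step, which is why the per-token cost scales with the number of layers rather than with compute.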

2