Cryptheon
Cryptheon t1_j1cq8q3 wrote
Hi, I'm a high-performance machine learning consultant working on this sort of thing. I've run BLOOM on a cluster (not exactly AWS/Azure).
You could, if you have a large enough GPU, run BLOOM on one GPU by running it one layer at a time; this can be done simply and naively with Hugging Face. I've tested this, for instance, using 4x 40GB NVIDIA A100s (160GB of VRAM in total): inference for 50 tokens still took 40 minutes out of the box, using bf16. If you want to bring this down and make it cost-effective, you need at least 8x 80GB A100s (640GB of VRAM). Int8 cuts that requirement in half, but it means sacrificing inference speed due to the nature of the int8 method. On top of that, there are still cluster-level optimizations you'd have to do if you really want to bring inference time down to a few milliseconds per generated token. This is probably how OpenAI does it: they keep models continuously loaded on their GPUs, with highly optimized serving, so we can all use their models en masse.
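The naive Hugging Face route looks roughly like this (a minimal sketch, assuming the `bigscience/bloom` checkpoint and accelerate's offloading options; not my exact setup):

```python
# Minimal sketch: let accelerate place BLOOM's layers across GPU/CPU/disk and
# stream them through whatever fits in VRAM. Assumes transformers, accelerate
# (and bitsandbytes for int8) are installed and the weights are downloaded.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,      # bf16 halves memory vs. fp32
    device_map="auto",               # accelerate decides layer placement
    offload_folder="bloom_offload",  # spill layers that don't fit to disk
    # load_in_8bit=True,             # int8 roughly halves memory, at a speed cost
)

inputs = tokenizer("The answer to life is", return_tensors="pt").to(0)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

This is exactly the "out of the box" setup I mean: it works, but the constant shuffling of weights is what makes 50 tokens take tens of minutes.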
Point being, this is not trivial to do and will cost money, expertise and time. Besides, BLOOM is not the best model performance-wise because it's a multilingual model. As others have mentioned, OpenAI's ChatGPT has additionally been trained with RL (PPO) on data we don't have access to.
Cryptheon t1_j136urn wrote
Reply to [D] What GPT-esque model/platform returns peer-reviewed sources with outputs? by EntireInflation8663
Check out Galactica by Meta.
Cryptheon t1_izpfbxl wrote
Reply to [D] A talk about ChatGPT by [deleted]
Yeah, I was actually successful in generating weights for a small NN; see the following: https://www.reddit.com/r/OpenAI/comments/zghtvu/hallucinating_optimized_model_weights_with_chatgpt/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button.
If you frame the prompt in the right way, you can get it to generate a small network for small enough datasets. I have yet to confirm this for slightly more complicated data, though.
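To make the idea concrete, a toy version looks something like this (the weights below are hypothetical placeholders, not the ones ChatGPT actually produced; it's just the mechanics of pasting generated numbers into a tiny net):

```python
# Toy illustration: take model-generated numbers, load them into a tiny
# network, and eyeball whether the outputs look sensible on a trivial dataset.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 2), nn.Tanh(), nn.Linear(2, 1))

# Hypothetical "hallucinated" weights, e.g. copied from a ChatGPT answer.
generated = {
    "0.weight": [[1.0, -1.0], [-1.0, 1.0]],
    "0.bias":   [0.0, 0.0],
    "2.weight": [[1.0, 1.0]],
    "2.bias":   [-0.5],
}
net.load_state_dict({k: torch.tensor(v) for k, v in generated.items()})

x = torch.tensor([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
print(torch.sigmoid(net(x)))  # check the predictions by hand
```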
Cryptheon t1_j1g9v0r wrote
Reply to comment by Evoke_App in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
With enough VRAM you won't need to load it one layer at a time. 24x 32GB V100s should be enough to hold the whole model and do inference; the main bottlenecks are GPU-to-GPU communication and the raw speed of the GPUs.
In theory you can use a single 16GB+ GPU and load it one layer at a time, but generation takes far too long that way. In my tests, loading plus inference for each layer took ~1.2s, and BLOOM-176B has 72-ish layers, so predicting a single token takes roughly 1.5 minutes with this method. That's way too slow.
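For reference, the layer-streaming idea looks roughly like this (hypothetical loader, not my exact code), together with the back-of-the-envelope cost:

```python
# Sketch of "one layer at a time": keep the hidden states on the GPU, stream
# each transformer block's weights in from disk, run it, then free the VRAM.
# Real BLOOM blocks also need an attention mask and ALiBi bias; omitted here.
import torch

NUM_LAYERS = 72            # "72-ish" blocks, per the figures above
SECONDS_PER_LAYER = 1.2    # measured load + forward cost per block

def forward_one_token(hidden_states, block_paths, load_block, device="cuda"):
    """One forward pass, streaming blocks through a single GPU."""
    for path in block_paths:                 # one checkpoint shard per block
        block = load_block(path).to(device, dtype=torch.bfloat16)
        with torch.no_grad():
            hidden_states = block(hidden_states)
        del block
        torch.cuda.empty_cache()             # free VRAM before the next block
    return hidden_states

# Back-of-the-envelope cost per generated token:
print(f"~{NUM_LAYERS * SECONDS_PER_LAYER / 60:.1f} min per token")
```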