Submitted by ChristmasInOct t3_11ium8l in deeplearning
Hey everyone,
First of all, TL;DR at the bottom; I typed more than expected here.
Please excuse the rather naive perspective I have here. I've followed along with great interest, but this is not my industry.
Regardless, I have spent the past 3-4 days falling down a brutally obsessive rabbit hole, and I cannot seem to find this information. I'm assuming I'm just missing context, and whether or not there is a clear answer, I'm trying to build a better understanding of the topic so I can appraise the situation myself.
Really I suppose I have two questions. The first is regarding model parallelization.
I'm assuming this is not generic whatsoever. What is the typical process engineers go through when designing such a pipeline? Specifically with regard to these new LLaMA models, is something like Alpa relevant? DeepSpeed?
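To make it concrete, the closest thing I have to a mental model right now is the naive layer-splitting that Hugging Face transformers + accelerate does with device_map="auto". A minimal sketch, assuming a transformers version with LLaMA support and weights already converted to an HF-compatible checkpoint (the path below is just a placeholder):

```python
# Minimal sketch: naive layer-wise sharding of a causal LM across all
# visible GPUs via accelerate's device_map="auto". This is not pipeline
# or tensor parallelism in the DeepSpeed/Alpa sense, just a baseline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/llama-13b-hf"  # placeholder: HF-converted LLaMA weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # fp16 weights: ~2 bytes per parameter
    device_map="auto",          # split layers across the available GPUs
)

prompt = "Summarize the following purchase order:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

My understanding is that this kind of split only removes the single-card memory ceiling; it doesn't give you the overlapped compute and communication that the real parallelism frameworks exist for, which is exactly the part I can't evaluate myself.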
More importantly, what information should I be seeking to determine this myself?
This roughly segues into my second inquiry.
The reason I'm curious about splitting the model pipeline etc. is that I am potentially interested in standing up a server for this. Although I don't have much of a budget for this build (~$30-40K is the rough top end, but I'd be a lot happier around $20-25K), the money is there if I can genuinely satisfy my use-case.
I work at a small but borderline manic startup working on enterprise software; 90% of the work we're doing is based in the React/Node ecosystem, with some low-level work for backend services and some very interesting database work that I have very little to do with. I am a fullstack engineer who grew up playing with C++ => C#, and somehow ended up spending all of my time r/w'ing JavaScript. Lol. Anyways.
Part of our roadmap, ever since GPT-3 and the Playground were made publicly accessible, involves using these transformer models and their ability to interpret natural language inputs, whether from users directly or from scraped values generated somewhere in a chain of requests/operations.
Seeing GPT-3 in action made me realize that my estimations of this technology had been wildly off. Seeing ChatGPT's uptake, and the APIs becoming available, has me further panicked.
Running our inference through their API has never really been an option for us. I haven't even looked that far into it, but bottom line: the data running through our platform is all back-office, highly sensitive business information, and many of our clients have agreements explicitly restricting the movement of data to or from any cloud services, with Microsoft, Amazon, and Google all specifically mentioned.
Regardless of the reasoning for these contracts, the LLaMA release has had me obsessing over this topic in more detail than before, and over whether I would be able to get this set up privately for our use-case.
To get to the actual second inquiry:
Say I want to throw a budget rig together for this in a server cabinet. Can I parallelize the LLaMA models, with DeepSpeed or one of the standard model parallelization libraries, effectively enough to justify going with 24GB 4090s in the rig?
Is the performance cost low enough to justify taking the extra compute here over 1/3 to 1/2 as many RTX 6000 Adas?
Or should I be grabbing the 48GB Adas?
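For reference, the napkin math I've been doing is just weights-only fp16 memory, something like the sketch below (it ignores activations, KV cache, and especially optimizer state for fine-tuning, so it's only a lower bound):

```python
# Weights-only VRAM estimate at fp16 (2 bytes/param). Ignores activations,
# KV cache, and optimizer state; full fine-tuning with Adam can need
# several times this per parameter, so treat these numbers as a floor.
import math

PARAMS = {"7B": 7e9, "13B": 13e9, "33B": 33e9, "65B": 65e9}  # approx. LLaMA sizes
BYTES_PER_PARAM = 2  # fp16

for name, n in PARAMS.items():
    gb = n * BYTES_PER_PARAM / 1e9
    n_4090 = math.ceil(gb / 24)  # 24 GB per RTX 4090
    n_6000 = math.ceil(gb / 48)  # 48 GB per RTX 6000 Ada
    print(f"LLaMA-{name}: ~{gb:.0f} GB of weights -> "
          f"at least {n_4090}x 4090 or {n_6000}x RTX 6000 Ada")
```

That works out to roughly 130 GB of weights for 65B, so about 6x 4090 versus 3x RTX 6000 Ada just to hold the model before any headroom, which is why the interconnect question matters so much to me.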
Like I said, I apologize for the naivety; I'm really just looking for more information so I can start putting this picture together on my own. It isn't the easiest topic to research, given how quickly things move and the giant gap between conversation depths (gamer || PhD in a lot of the most interesting or niche discussions, with little in between).
Thank you very much for your time.
TL;DR - Any information on LLaMA model parallelization at the moment? Will it be compatible with things like ZeRO or Alpa? How about throwing a rig together right now for fine-tuning and then running inference on the LLaMA models? 48GB RTX 6000 Adas, or 24GB 4090s?
Planning on putting it in a mostly empty 42U cabinet that also houses our primary web server and networking hardware, so if there is a sales pitch for 4090s across multiple nodes here, I do have a massive bias as the kind of nerd who finds that kind of hardware borderline erotic.
Hydro and cooling are not an issue; it's just a question of how to use the budget, how to approach the memory limitations, and how to avoid communication bottlenecks or at least balance them against raw compute.
Thanks again everyone!
karyo t1_jb03jq0 wrote
The first question is kinda difficult. DeepSpeed, ZeRO, and Megatron all play into it. There's a reason somebody recently said there are only ~200 people in the world atm who can pull it off.
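To give a flavor of what "plays into it" means in practice, most of the knobs live in a config like this (a rough DeepSpeed ZeRO-3 sketch with placeholder values, meant to be run via the deepspeed launcher, not a tuned recipe):

```python
# Rough DeepSpeed ZeRO-3 sketch. Stage 3 shards parameters, gradients,
# and optimizer state across GPUs; CPU offload trades host RAM for VRAM.
# The tiny model is just a stand-in for whatever you're fine-tuning.
import torch
import deepspeed

model = torch.nn.Sequential(          # stand-in model, not LLaMA
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,                              # shard params/grads/optimizer state
        "offload_optimizer": {"device": "cpu"},  # push optimizer state to host RAM
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Getting something like that to actually scale across nodes (tensor parallelism on top, interconnect planning, etc.) is where it gets hard.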
For the second question:
4090s just won't cut it. Nvidia fused off P2P this generation, so unless you have an embarrassingly parallel pipeline (which current LLMs aren't), they are not useful. Problem is, the Ada RTX 6000 was also severely restricted P2P-wise.
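If you want to verify the P2P situation on whatever cards you end up testing, PyTorch can report it per GPU pair (and `nvidia-smi topo -m` shows the link topology):

```python
# Print whether peer-to-peer access is available between each GPU pair.
# On cards with P2P fused off, expect "disabled" everywhere.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: P2P {'enabled' if ok else 'disabled'}")
```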
If you're doing LLMs at billion-parameter scale, you've gotta get V100s/A100s/H100s.