Submitted by ChristmasInOct t3_11ium8l in deeplearning

Hey everyone,

First of all: TL;DR at the bottom; I typed more than expected here.

Please excuse the rather naive perspective I have here. I've followed along with great interest, but this is not my industry.

Regardless, I have spent the past 3-4 days falling down a brutally obsessive rabbit hole, and I cannot seem to find this information. I'm assuming I'm just missing context, of course, and whether or not there is a clear answer, I'm trying to get a better understanding of this topic so that I can appraise the situation myself.

Really I suppose I have two questions. The first is regarding model parallelization.

I'm assuming this is not generic whatsoever. What is the typical process engineers follow when designing such a pipeline? Specifically with regard to these new LLaMA models, is something like Alpa relevant? DeepSpeed?

More importantly, what information should I be seeking to determine this myself?

This roughly segues to my second inquiry.

The reason I'm curious about splitting the model pipeline, etc., is that I am potentially interested in standing up a server for this. I don't have much of a budget for this build (~$30-40K is the rough top end, but I'd be a lot happier around $20-25K), but the money is there if I can genuinely satisfy my use case.

I work at a small but borderline manic startup working on enterprise software; 90% of the work we're doing is based in the React/Node ecosystem, with some low-level work for backend services and some very interesting database work that I have very little to do with. I am a full-stack engineer who grew up playing with C++ => C#, and somehow ended up spending all of my time r/w'ing JavaScript. Lol. Anyways.

Part of our roadmap, ever since GPT-3 and the Playground were made publicly accessible, involves using these transformer models and their ability to interpret natural language inputs, whether from user input or from scraped values generated somewhere in a chain of requests/operations.

Seeing GPT-3 in action made me realize that my estimates of this technology had been wildly off. Seeing ChatGPT's uptake, and the APIs becoming available, has me further panicked.

Running our inference through their API has never really been an option for us. I haven't even really looked that far into it, but bottom line the data running through our platform is all back-office, highly sensitive business information, and many have agreements explicitly restricting the movement of data to or from any cloud services, with Microsoft, Amazon, and Google all specifically mentioned.

Regardless of the reasoning behind these contracts, the LLaMA release has had me obsessing over this topic in more detail than before, and over whether I would be able to get this set up privately for our use case.

To get to the actual second inquiry:

Say I want to throw a budget rig together for this in a server cabinet. Can I parallelize the LLaMA models effectively enough, say with DeepSpeed or one of the standard model-parallelism libraries, to justify going with 24GB 4090s in the rig?

Is the performance cost low enough to justify taking the extra compute here over 1/3 to 1/2 as many RTX 6000 Adas?

Or should I be grabbing the 48GB Adas?
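
To make it concrete, the kind of thing I'm picturing is roughly the DeepSpeed inference flow below. I haven't actually run this, and the argument names are my best guess from skimming the docs, so treat it as a sketch rather than a working recipe:

    # Hypothetical sketch: shard a HF-format LLaMA checkpoint across 2 GPUs with
    # DeepSpeed tensor parallelism. Launched with: deepspeed --num_gpus 2 infer.py
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "/models/llama-7b-hf"  # placeholder path to a converted checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

    # mp_size = number of GPUs to shard the model across (e.g. 2x 24GB 4090s)
    engine = deepspeed.init_inference(
        model,
        mp_size=2,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )

    inputs = tokenizer("Summarize this ticket:", return_tensors="pt").to(torch.cuda.current_device())
    output = engine.module.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0]))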

Like I said, I apologize for the naivety; I'm really looking for more information so that I can start to put this picture together on my own. It really isn't the easiest topic to research, given how quickly things seem to move and the giant gap between conversation depths (gamer || PhD in a lot of the most interesting or niche discussions, with little in between).

Thank you very much for your time.

TL;DR - Any information on LLaMA model parallelization at the moment? Will it be compatible with things like ZeRO or Alpa? What about throwing a rig together right now for fine-tuning and then running inference on the LLaMA models: 48GB RTX 6000 Adas, or 24GB 4090s?

Planning on putting it in a mostly empty 42U cabinet that also houses our primary web server and networking hardware, so if there is a sales pitch for 4090s across multiple nodes here, I do have a massive bias as the kind of nerd who finds that kind of hardware borderline erotic.

Hydro and cooling are not an issue; it's just about making good use of the budget, understanding the requirements and approach given the memory limitations, and figuring out how to avoid communication bottlenecks, or at least balance them against raw compute.

Thanks again everyone!

Comments

Appropriate_Ant_4629 t1_jb1rhkh wrote

Take a step back:

  • Start on a cloud -- renting GPUs or TPUs -- with nonsensitive data.

I know you said "but bottom line the data running through our platform is all back-office, highly sensitive business information, and many have agreements explicitly restricting the movement of data to or from any cloud services".

You shouldn't be touching such information during development anyway.

Make or find a non-sensitive dataset of similar scale for development.

Don't buy hardware up front until you have almost the entire data pipeline working well on rented servers. Rent them hourly on any of the big cloud platforms, and you'll quickly be able to quantify most of your hardware requirements: how much RAM you need on GPUs/TPUs, how much RAM you need on CPUs, and how fast a storage layer you'll need.

Only after you have an at-scale dev/QA environment working on a cloud will you have any idea what physical hardware you'd want to buy.

ChristmasInOct OP t1_jb2enle wrote

I really appreciate this response.

I'm not planning on using any of our data or touching the infrastructure yet, but for some reason I never considered using the cloud to determine hardware configuration.

Thanks again. Exactly what I needed!

karyo t1_jb03jq0 wrote

The first question is kinda difficult. DeepSpeed, ZeRO, and Megatron all play into it. There's a reason somebody recently said there are only ~200 people in the world right now who can pull it off.

For the second question ,

4090s just won't cut it. NVIDIA fused off P2P this generation, so unless you have an embarrassingly parallel pipeline (which current LLMs aren't), they are not useful. Problem is, the RTX 6000 Ada was also severely restricted P2P-wise.

If you're doing LLMs at billion-parameter scale, you've gotta get V100s, A100s, or H100s.
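
If you want to sanity-check the P2P situation on whatever box you end up with, PyTorch reports it directly. Rough sketch, assumes 2+ visible GPUs:

    # Check which GPU pairs can do peer-to-peer transfers.
    # On consumer 40-series cards, expect this to come back False.
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: P2P {'yes' if ok else 'no'}")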

ChristmasInOct OP t1_jb2cwwf wrote

Thanks for the response. Do you recall where you read the "only 200 people" bit? I'll take a look around for it as well; seems like the context could have found itself surrounded by interesting conversation.

P2P is not so much of a limitation as long as you can fit the entire model/pipeline into a single card's VRAM though, correct?

So, for example, if you have a 7B-parameter model at FP16 and it's around 14GB, presumably you should be safe with 24GB of VRAM?

Thanks again for your time.

karyo t1_jb2qo4e wrote

For inference? Yes. Look at EleutherAI's Transformer Math page. Also, others are trying out LLaMA right now, so check them out.
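
The back-of-the-envelope version of that math looks something like this (the ~1.2x overhead factor for activations/KV cache is just a commonly quoted rule of thumb, not a measured number):

    # Rough inference VRAM estimate: weights plus a fudge factor for overhead.
    def inference_vram_gb(params_billion, bytes_per_param=2, overhead=1.2):
        weights_gb = params_billion * 1e9 * bytes_per_param / 1024**3
        return weights_gb * overhead

    print(inference_vram_gb(7))   # LLaMA-7B in fp16: ~16 GB -> fits on a 24GB card
    print(inference_vram_gb(13))  # LLaMA-13B in fp16: ~29 GB -> needs 48GB or sharding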

Nerveregenerator t1_jb3milx wrote

I was able to run the 7B on a V100 on Lambda Labs. Didn't try the other ones.

fundamental_entropy t1_jb4bu9u wrote

For the first question, we are moving toward not having to design such pipelines; ideally we will have a library that does the model sharding or parallel computation for us. Look at parallelformers, which worked for some big models (11B) I tried. Why I think this is going to happen: three years back, distributed training was a big black box, and Horovod, PyTorch distributed training, and TPUs were the only solutions. Right now no one designs such pipelines anymore; everyone uses DeepSpeed, which has implementations of all the known techniques (ZeRO, CPU offloading, etc.). So if you are not one of these computation/data engineers, I suggest watching out for such libraries.
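
For a sense of what "the library does the sharding for you" looks like, parallelformers is roughly a one-liner around a Hugging Face model. This is from memory of its README, so double-check the exact arguments, and whether LLaMA specifically is supported yet is an open question:

    # Sketch: automatic model-parallel inference with parallelformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from parallelformers import parallelize

    name = "some-11b-model"  # placeholder model name
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Shards the model across 2 GPUs in fp16; afterwards generate() is called as usual
    # (you may need to move inputs to GPU 0 depending on the version).
    parallelize(model, num_gpus=2, fp16=True)

    inputs = tokenizer("Hello,", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))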
