Submitted by TheButteryNoodle t3_zau0uc in deeplearning

Hey everyone. I'm building a new workstation for both personal and professional use, and I need some help weighing the pros and cons of the GPUs I'm considering, as well as any general advice/recommendations.

Most of my professional work falls within NLP and GNN models; however, I do occasionally dabble in image classifiers and Stable Diffusion as a hobby. The GPUs I'm currently considering are an RTX 6000 Ada, a used/refurbished A100 80GB (PCIe rather than SXM4), or dual 4090s with a power limit (I have a 1300W PSU).

With the RTX 6000 Ada having 48GB of VRAM, it would definitely be nice to load a whole new range of models that I couldn't otherwise (without AWS or model parallelism). But it's harder to justify its expected cost of $7,378-$8,210 when, for an additional $2-3k, you could get a used/refurbished A100 80GB from eBay that provides almost double the VRAM and would likely outperform the new RTX 6000 Ada by a sizeable amount in FP16 and FP32 workloads.
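For context, here's the back-of-envelope VRAM math I'm working from (very rough sketch; assumes FP16 weights and plain Adam, and ignores activations entirely):

```python
# Rough VRAM rule of thumb, purely illustrative.
# Inference: weights alone take params * bytes_per_param.
# Training with Adam: roughly 4x the weights (weights + gradients
# + two optimizer moment buffers), before counting activations.
def min_vram_gb(params_billions: float, bytes_per_param: int = 2,
                training: bool = False) -> float:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * 2 bytes ~= 2 GB
    return weights_gb * 4 if training else weights_gb

print(min_vram_gb(7))                  # ~14 GB just to load a 7B model in FP16
print(min_vram_gb(7, training=True))   # ~56 GB to fine-tune it with Adam
```

By that math, 24GB caps me at roughly a 10-12B-parameter model for FP16 inference before overhead, which is exactly why the 48GB and 80GB cards are tempting.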

However, you could also just get two RTX 4090s for ~$4k, which would likely outperform the RTX 6000 Ada and be comparable to the A100 80GB in FP16 and FP32 workloads. The catch is that I would need to switch to a custom water-cooling setup, as my current case won't fit two 4090s with their massive heatsinks (I'm assuming that change would cost in the range of $1.5-2k). Furthermore, I would likely need to power-limit the GPUs on a 1300W PSU. The VRAM, at 24GB per card, would likely cover all of my professional use cases, but it would prevent me from loading larger models without resorting to model parallelism, which can be painful.

I also like to play games casually. While this is not a major factor, it would be nice not to have to maintain two different rigs, since the A100 can't really run games.

So, with all that being said: does it make sense to go for two 4090s at ~$4k, plus a water-cooling setup at ~$1.5k, for ~$5.5k total? Or go for an RTX 6000 Ada at ~$7.5-8k, which would likely have less compute than two 4090s but make it easier to load larger models to experiment with? Or just go for the endgame with an A100 80GB at ~$10k, but have a separate rig to maintain for games?

I do use AWS for work model training as well, but given the recent AWS bills, my company has offered to pay a portion of the cost of a new workstation. I will still be paying most of it, but I want to use the opportunity for a personal PC upgrade too. Any model training on AWS that isn't for work would obviously be billed to me (hence the interest in a card with more VRAM).

What do you all think makes the most sense here?

18

Comments


computing_professor t1_iynllw4 wrote

I'm far from an expert, but remember that the 4090s are powerful yet won't pool memory. I'm actually looking into a lighter setup than yours: either an A6000 or, more likely, 2x 3090s with NVLink so I can get access to 48GB of VRAM. While the 4090 is much faster, you won't have access to as much VRAM. But if you can make do with 24GB and/or can parallelize your model, 2x 4090s would be awesome.

edit: Just re-read your post and saw that I'd missed your mention of parallelizing. Still, if you can manage it, 2x 4090s seem incredibly fast. I would do that if it were me, but I don't care much about computer vision.

4

TheButteryNoodle OP t1_iynr0g4 wrote

Hey there! Thanks for the response! I'm a bit of a novice when it comes to how NVLink works, but wouldn't you still need model parallelization to fit a model over 24GB on 2x 3090s connected via NVLink? I thought they would still show up as two separate devices, just like 2x 4090s; the benefit being that the NVLink bridge connects the two GPUs directly instead of going over PCIe. Not too knowledgeable about this, so please feel free to correct me if I'm wrong!

1

Dexamph t1_iyo1ryt wrote

Looked into this last night, and yeah, NVLink works the way you described; the marketing is misleading. There's no contiguous memory pool, just a faster interconnect, so model parallelisation might scale a bit better, but you still have to implement it yourself. I also saw an example where some PyTorch GPT-2 models scaled horrifically in training across multiple PCIe V100s and 3090s without NVLink, so that's a caveat for dual 4090s, which don't support NVLink at all.
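If you want to see it for yourself, here's a quick sanity check in PyTorch (just a sketch; assumes a 2-GPU box):

```python
import torch

# Even with an NVLink bridge installed, PyTorch still enumerates two
# separate devices; NVLink only enables/accelerates peer-to-peer copies.
print(torch.cuda.device_count())                # 2, not one big 48GB device
print(torch.cuda.can_device_access_peer(0, 1))  # True if P2P between GPUs 0 and 1 works
```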

The RTX 6000 Ada lets you skip model sharding so that’s factored into the price. You lose the extra GPU so you have less throughput though.

You might be able to get away with leaving the 4090s at the stock 450W power limit since it seems the 3090/3090Ti transient spikes have been fixed.
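If you do end up wanting to cap them, it's a one-liner per card, `sudo nvidia-smi -i <idx> -pl 350`, or the same thing via NVML from Python (sketch; assumes the nvidia-ml-py package, setting limits needs root, and 350 W is just an example value):

```python
import pynvml  # pip install nvidia-ml-py

# Cap every GPU's board power limit via NVML (values are in milliwatts).
pynvml.nvmlInit()
for idx in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
    watts = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000
    print(f"GPU {idx}: current limit {watts:.0f} W")
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 350_000)  # example: 350 W
pynvml.nvmlShutdown()
```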

I'm a bit skeptical about the refurb A100: how would warranty work if it died one day? And did you consider how you'd cool it? It sounds like you have a standard desktop case, while A100s were designed for rack-mount servers with screaming-loud fans, hence the passive heatsink. Are you putting thoughts and prayers into the little blower-fan kits on eBay for e-wasted Teslas being up to the task?

3

TheButteryNoodle OP t1_iyo7url wrote

Right. Model parallelization was one of my concerns with any type of dual GPU setup as it can be a hassle at times and isn't always suitable for all models/use cases.

As for the A100, the general plan was to purchase a card that still has Nvidia's manufacturer warranty active (albeit that may be tough at that price point). If there's any kind of extended warranty I could purchase, whether from Nvidia or a reputable third party, I'd definitely look into it. In general, if the A100 were the route I took, there would be some level of protection purchased, even if it costs a little more.

As for the cooling, you're right: that is another pain point to consider. My current case is a Fractal Design Torrent, with two 180mm fans in the front, three 140mm fans at the bottom, and a 120mm exhaust fan at the back. I would hope that these fans, alongside an initial blower-fan setup, would provide sufficient airflow. If they don't, I would likely move to custom water cooling.

What I'm not sure about, though, is how close the RTX 6000 Ada's performance comes to an A100's. If the difference isn't ridiculous for FP16 and FP32, then it would likely make sense to lean toward the 6000. There's also the 6000's FP8 performance, with CUDA 12 right around the corner.

2

Dexamph t1_iyocn1i wrote

I doubt the Torrent's fans will do much if the blower isn't enough, because they were designed around a front-to-back airflow path; a passive A100 heatsink needs much, much higher static pressure to force air through it. We run V100s in Dell R740s on the local cluster, and the fans have to scream to get the GPUs their needed airflow. So you might want to factor the cost of a custom water-cooling loop into the A100 figure in case things go south, plus the spare gaming rig, at which point the true cost difference vs. the RTX 6000 Ada isn't so close anymore.

I don't know how the RTX 6000 Ada will really perform vs. the A100 either, because I haven't seen the FP8 Transformer Engine in action. Maybe it'll skirt around its halved memory bandwidth and land close to the A100, but the A100 delivers its performance today, using today's code.

3

TheButteryNoodle OP t1_iyy01wy wrote

Good point. I guess I'll just have to wait and see what the 6000's performance looks like. I think the decision is likely going to be the 4090s, though. Thanks again for the insight!

1

computing_professor t1_iyo97p0 wrote

So this means you can't access 48GB of VRAM with a pair of 3090s and NVLink in TF and PyTorch? I could have sworn I'd seen that it's possible. Not a deal breaker for me, but a bummer to be sure. I'll likely end up with an A6000 instead, then, which isn't as fast but has that sweet VRAM.

2

Dexamph t1_iyoebd1 wrote

You certainly can if you put the time and effort into model parallelisation; you just don't get the seamless single big memory pool that I and many others were expecting, where larger models that won't fit on one GPU run with no code changes or debugging. Notice how most published NVLink benchmarks only test data-parallel training, because that's the straightforward case?
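To make it concrete, here's roughly what the "implement it yourself" part looks like in PyTorch; a minimal sketch, nothing like a real pipeline-parallel setup:

```python
import torch
import torch.nn as nn

# Hand-rolled model parallelism: half the network on each GPU, with the
# hop between them written out explicitly in forward(). NVLink doesn't
# remove this code; it only makes the inter-GPU copy cheaper.
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # explicit GPU 0 -> GPU 1 transfer

model = TwoGPUNet()
out = model(torch.randn(8, 4096))           # labels/loss must live on cuda:1 too
```

And note that this naive version leaves one GPU idle at any given moment, which is part of why the scaling numbers can look so ugly.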

3

computing_professor t1_iyokex9 wrote

Huh. If it requires parallelization, then why is the 3090 singled out as the one consumer GeForce card capable of memory pooling? It just seems weird. What exactly is memory pooling, then, that the 3090 is capable of? I'm clearly confused.

edit: I did find this from Puget that says

> For example, a system with 2x GeForce RTX 3090 GPUs would have 48GB of total VRAM

So it's possible to pool memory with a pair of 3090s. But I'm not sure how it's done in practice.

0

DingWrong t1_iyq0nr0 wrote

Big models get sharded, and a chunk is loaded onto each GPU. There are a lot of frameworks ready to do this, since the big NLP models can't fit on a single GPU. Alpa even shards a model across different machines.
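With Hugging Face, for example, it's basically one argument (sketch; assumes transformers + accelerate installed, and the checkpoint name is just an example):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" shards the layers across all visible GPUs (spilling
# to CPU/disk if it has to) and inserts the transfers for you. Still two
# separate devices underneath; the framework just hides the plumbing.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",   # example model; swap in whatever you're loading
    device_map="auto",
    torch_dtype="auto",
)
print(model.hf_device_map)   # shows which layers landed on which device
```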

3

computing_professor t1_iyqaku8 wrote

Thanks. So it really isn't the same as how the Quadro cards share VRAM. That's really confusing.

1

Dexamph t1_izd1dy7 wrote

This is deadass wrong; that Puget statement was in the context of system memory, nothing to do with pooling:

> How much RAM does machine learning and AI need?
>
> The first rule of thumb is to have at least double the amount of CPU memory as there is total GPU memory in the system. For example, a system with 2x GeForce RTX 3090 GPUs would have 48GB of total VRAM – so the system should be configured with 128GB (96GB would be double, but 128GB is usually the closest configurable amount).

1

LetMeGuessYourAlts t1_iyruft9 wrote

Do you know: are there any Nvidia GPUs at a decent price/performance point that can pool memory? Every avenue I've looked down seems to suggest that nothing a hobbyist could afford offers a large amount of memory without resorting to old workstation GPUs with relatively slow processors. Is the best bet a single 3090 if memory is the priority?

1

Dexamph t1_izd0gyf wrote

Technically they all can, because it relies on software; NVLink just reduces the performance penalty of going between GPUs. There is no free lunch here, so you'd damn well better know what you're doing so you don't get stung, like this guy, by speculative bullshit pushed by people who never actually had to make it work.

With that out of the way, it doesn't get any better than ex-mining 3090s that start at ~$600. Don't bother with anything older: if your problem requires model parallelisation, then your time and effort are probably worth more than the pittance you'd save trying to get some old 2080 Tis or 2070 Supers to keep up.
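And if you want to know what your actual interconnect is worth before betting on it, here's a crude micro-benchmark (sketch; assumes 2 GPUs and enough free VRAM for a ~1 GiB tensor):

```python
import torch

# Time a GPU 0 -> GPU 1 copy to estimate effective inter-GPU bandwidth;
# NVLink/P2P vs plain PCIe shows up directly in this number.
assert torch.cuda.device_count() >= 2
x = torch.empty(1024, 1024, 256, device="cuda:0")  # 2^28 floats = 1 GiB
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize("cuda:0")
start.record()
y = x.to("cuda:1")
end.record()
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
seconds = start.elapsed_time(end) / 1000   # elapsed_time() returns milliseconds
print(f"{x.numel() * 4 / seconds / 1e9:.1f} GB/s effective")
```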

1

computing_professor t1_iynwyu2 wrote

I think 2x 3090s will pool memory with NVLink, but not be treated as a single card. I think it depends on the software you're using. I'm pretty sure PyTorch and TensorFlow can take advantage of memory pooling, but the 3090 is the last GeForce card that allows it. I hope somebody comes into this thread with examples of how to use it, because I can't seem to find any online.

1

suflaj t1_iyodcdi wrote

2x 4090s are the most cost-efficient option if you have model parallelism working, for CV. For other tasks, or for vision transformers, they're probably a bad choice because of the low inter-GPU bandwidth.

The RTX A6000 will be better for deployment. If you're only planning on training your own stuff, this is a non-factor. Note that it has similar or even lower memory bandwidth than a 4090, so there is little benefit besides power consumption, non-FP32 performance, and a bigger chunk of RAM.

So honestly, it comes down to whether you want a local or a cloud setup. Personally, I'd go for 1x 4090 and do the rest on cloud compute. If there is something you can't run on one 4090, cloud A100 compute will be both more money- and time-efficient.

3

TheButteryNoodle OP t1_iyy15ut wrote

Good points. I'd have to agree with you that the 4090s definitely do seem to be the most cost-efficient.

1

ShinyBike t1_j26jemn wrote

Having owned a 4090 and used many A100s, I can safely say that the 4090 is far faster than an A100.

1

suflaj t1_j2841op wrote

You must've had some poorly optimized models then: even the 40GB A100 is roughly 2.0-2.1x faster than a 3090, while a 4090 is at most 1.9x, and on average about 1.5x, faster than a 3090, according to various DL benchmarks.

1

mosalreddit t1_izr087o wrote

What mobo and case do you have to put 2 4090s in?

1

TheButteryNoodle OP t1_izskz4u wrote

Haven't purchased the motherboard yet, but the case would be a Fractal Design Torrent. To get two 4090s to fit, you'd need to go custom liquid cooling to get rid of their massive heatsinks.

2

mosalreddit t1_izsyu2s wrote

Looking forward to seeing it when it's done. Please do share pictures!

1

TheButteryNoodle OP t1_izt1ogh wrote

Will do!

1

computing_professor t1_izzi31s wrote

I am also interested! I'm going in circles trying to decide, and I think 2x 4090s would be best for me, too. Though I'm more likely to have it built at Micro Center to save myself the stress.

1

TheButteryNoodle OP t1_izzlzck wrote

Best of luck! 4090s at MSRP have been a challenge to find. Hopefully, supply will get better in Q1 2023.

1

computing_professor t1_izznxpa wrote

I may do better going through a vendor, honestly. System76 doesn't do dual 4090s, but I think Exxact does.

1

mosalreddit t1_izwf74u wrote

What do you think of the MSI Suprim or Gigabyte WaterForce in terms of performance/quality? Both are 2-slot GPUs with liquid cooling.

1

TheButteryNoodle OP t1_izzldtv wrote

I think their performance is good! My concern would be finding places for all the radiators so that the AIO pumps can do their job effectively. It may also cramp the case if you decide to water-cool your CPU with another AIO.

2