Submitted by one_eyed_sphinx t3_10x50us in deeplearning

I need to build a home server that will use dual RTX 4090 cards. All of the motherboards support a PCIe configuration of x8/x8 for dual GPUs (x16 for a single GPU).

I know data transfer is usually not a bottleneck, but if I need to serve a lot of models, loading/unloading each model can take significant time even in x16 mode.

Are there any recommendations for which CPU to get (if I don't want a server-grade CPU like a Threadripper)?

2

Comments


suflaj t1_j7s57dy wrote

At the moment a 7950X in eco mode combined with a ROG Strix X670E seems to be the best combo.

Running at x8 on PCIe Gen 4 doesn't really matter; according to benchmarks the performance difference is a few percent. Loading will still take a long time at x16 because it's pretty much the same speed. It won't get significantly faster with a different mobo either; you're limited by the GPU itself, not the interface.

3

OSeady t1_j7so2j2 wrote

Just go AMD and get the fastest cpu your budget allows.

3

ThomasBudd93 t1_j7teaca wrote

Last time I checked there were problems with running a training job on two or more RTX 4090s. Have these problems been resolved? There were some posts about it in the PyTorch and NVIDIA forums.

2

one_eyed_sphinx OP t1_j7tzoiq wrote

>eco

So this is the fine point I want to understand: what I'm trying to optimize with this build is data transfer time, i.e. how long it takes to load a model from RAM to VRAM. If I have 10 models that each need 16 GB of VRAM to run, they have to share resources, so I want to "memory hot swap" the models on each incoming request (I don't know if there's a proper term for it; the closest I found was "bin packing"). So data transfer is somewhat critical from my point of view, and as I understand it, only the PCIe speed is the bottleneck here. Correct me if I'm wrong.
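Roughly what I have in mind, as a minimal PyTorch sketch (the registry, model names, and sizes are made up, just to illustrate the swap):

    import torch
    import torch.nn as nn

    # Hypothetical registry: every model lives in host RAM, and at most one
    # is resident in VRAM at a time.
    models = {name: nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).eval()
              for name in ("model_a", "model_b")}
    resident = None  # name of the model currently on the GPU

    def get_model(name: str) -> nn.Module:
        """Swap the requested model into VRAM, evicting the previous one."""
        global resident
        if resident != name:
            if resident is not None:
                models[resident].to("cpu")   # evict: VRAM -> RAM
                torch.cuda.empty_cache()     # release the cached blocks
            models[name].to("cuda")          # load: RAM -> VRAM over PCIe
            resident = name
        return models[name]

    # On an incoming request:
    with torch.no_grad():
        out = get_model("model_a")(torch.randn(1, 4096, device="cuda"))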

1

ThomasBudd93 t1_j7u0ttw wrote

Someone else directed me here:

https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/12

I didn't read the whole thread, and there is more in the NVIDIA forum (links can be found in the PyTorch forum thread above). At least a few weeks ago it looked like multi-GPU training on RTX 4090s doesn't fully work, whereas it does on the RTX 6000 Ada. Not sure if this is intended or just a bug. I called a company here in Germany and they even stopped selling multi-RTX-4090 deep learning computers because of this. I asked them about the multi-GPU benchmark I saw from Lambda Labs and they replied that they reproduced it, but saw that the training only produced NaNs. This is all I know. If you find out more, could you share it here? :) Thanks!
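EDIT: One workaround I've seen suggested for this (I haven't verified it myself, so treat it as a sketch only) is disabling NCCL peer-to-peer before setting up DDP, since the 4090 apparently lacks P2P support:

    import os

    # Unverified workaround: force NCCL to stage transfers through host memory
    # instead of peer-to-peer, which the RTX 4090 reportedly doesn't support.
    os.environ["NCCL_P2P_DISABLE"] = "1"

    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # the usual DDP setup follows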

2

suflaj t1_j7u2qyt wrote

You want eco mode so it runs cooler and more efficiently. As I said, the bottleneck is in the GPU, specifically its memory bandwidth, not in whatever the CPU can transfer. Modern CPUs can easily handle 3 high-end GPUs at the same time, not just 2.

PCIe speed has not been a bottleneck for several years, and will probably never be a bottleneck again with this form factor of GPUs. The GPU MEMORY is the bottleneck nowadays.

EDIT: And as someone else has said, yeah, you can use fast NVMe drives as swap to avoid loading from a slower disk. There used to be Optane for this kind of stuff, but well, that's dead.

2

allanmeter t1_j7u6egg wrote

You will need to be looking at EPYC or, at a minimum, Threadripper. I would highly encourage ECC memory if possible.

Assuming you have a handle on your data-vs-model distribution strategy, you will need fast and ample RAM to help with data loading/offloading, as you have correctly pointed out.

If you're in North America, there are plenty of choices available to you; elsewhere in the world you will have to seek combinations out selectively, as stock is always an issue.

1

allanmeter t1_j7u6v50 wrote

Yes, the RAM-to-VRAM transfer is not as crazily important as you think. We previously hit this issue with the 3000 series as well, and as a result we supplemented with a full TB of RAM, but it still was not enough. Some models are incredibly greedy.

If you are on Linux, which is highly encouraged, also look to optimise your storage tier for swap memory, which is similar to pagefiles in Windows. You can define and mount extended swap disks, which you can trick out with multi-TB NVMe drives. It's not the same performance as RAM, but it's a last-step optimisation before you need to consider going to Quadro.

2

one_eyed_sphinx OP t1_j7yqh5v wrote

>NVMe

Yeah, the GPU memory is a horrible bottleneck. I am trying to find ways around it, but it doesn't seem there are many best practices for it. Is there a way to use pinned memory for faster model data transfer?
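For context, this is roughly how I was planning to measure it (a rough sketch; the ~2 GB tensor is an arbitrary stand-in for a model's weights):

    import time
    import torch

    # Compare a pageable vs. a pinned host-to-device copy of ~2 GB of fp32.
    size = (512, 1024, 1024)
    pageable = torch.empty(size)
    pinned = torch.empty(size, pin_memory=True)

    def h2d_seconds(src: torch.Tensor) -> float:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        src.to("cuda", non_blocking=True)
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    print(f"pageable: {h2d_seconds(pageable):.3f}s")
    print(f"pinned:   {h2d_seconds(pinned):.3f}s")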

1

suflaj t1_j7yr906 wrote

If GPU memory is the bottleneck, then there is nothing you can viably do about that. If your GPU can't load memory any faster, then you will need more rigs and GPUs if you want to speed up the loading in parallel.

Or you could try to quantize your models into something smaller that can fit in the memory, but then we're talking model surgery, not hardware.
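The cheapest version of that, if your models tolerate it, is just casting to half precision before loading; a sketch with a toy model, not a recommendation for any particular architecture:

    import torch
    import torch.nn as nn

    # Toy stand-in for a served model.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).eval()
    fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

    # fp16 halves the weight bytes, so both the VRAM footprint and the
    # RAM -> VRAM copy shrink by ~2x. Proper int8 quantization goes further,
    # but that's the model surgery part.
    model = model.half().to("cuda")
    fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

    print(f"fp32: {fp32_bytes / 1e6:.0f} MB -> fp16: {fp16_bytes / 1e6:.0f} MB")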

2

allanmeter t1_j7yt39v wrote

Threadripper and EPYC, purely to maximise your access to L3 cache as well. Yes, lanes and cores are important too. TR and EPYC really are well-engineered chips for handling sustained compute or memory-optimised workloads.

Some models use multiple GPUs with a strategy that copies the data to each GPU, and then there are models that segment their layers across GPUs and minimise copies of data. Hence, have a look at the distribution strategies being used and how the models support them. Some models even use the CPU as a collation point to merge split datasets and weights; I've rarely seen these perform well, as they're usually highly optimised with deep layers.
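The layer-segmenting case, in its most naive form, looks something like this (just a sketch for a dual-GPU box; the layer sizes are made up):

    import torch
    import torch.nn as nn

    # First half of the layers on GPU 0, second half on GPU 1; activations
    # cross the PCIe link between the two halves on every forward pass.
    class SplitModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.part0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

        def forward(self, x):
            x = self.part0(x.to("cuda:0"))
            return self.part1(x.to("cuda:1"))

    out = SplitModel()(torch.randn(2, 4096))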

Lastly, there's no real golden ratio between RAM, VRAM, and swap; let the OS handle it, provide as much as you can, and bias towards random IOPS as the measure.

Also, please keep an eye on nvidia-smi; use watch -n 1 nvidia-smi to keep an eye on power draw, utilisation, and temperature. You might go the exotic route and explore water cooling; otherwise make sure there is ample room to get cool air flowing through.

Best of luck, keep at it.

2

allanmeter t1_j7ytp7i wrote

This is really good advice! Preprocessing input data for both training and inference is the best route to efficiency. Don't feed it crazily large multidimensional datasets; try to break them up, and have a look at whether you can use old-fashioned methods like windowing and downsampling.

Also, the model's parameter type is important. If you're running fp64 then you will struggle versus a model that's just int8. If you have mixed-precision weights, then you really need to think about looking at AWS SageMaker and getting a pipeline going.
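For a sense of scale, the bytes per parameter by dtype; the same model is 8x more data to store and move in fp64 than in int8 (trivial sketch):

    import torch

    # Bytes per element for common parameter dtypes.
    for dtype in (torch.float64, torch.float32, torch.float16, torch.int8):
        print(dtype, torch.tensor(0, dtype=dtype).element_size(), "byte(s)/param")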

To OP: maybe you can share a little context on what models you're looking to run? Or some context on the input data.

1