Submitted by one_eyed_sphinx t3_10x50us in deeplearning

I need to build a home server that will use dual RTX 4090 cards. All of the motherboards support a PCIe configuration of x8/x8 for dual GPUs (x16 for a single GPU).

I know data transfer is usually not a bottleneck, but if I need to serve a lot of models, loading/unloading each model can take significant time even in x16 mode.

Are there any recommendations for which CPU to get (if I don't want a server-grade CPU like a Threadripper)?

2

Comments


suflaj t1_j7s57dy wrote

At the moment a 7950X in eco mode combined with a ROG Strix X670E seems to be the best combo.

Running at x8 on PCIe Gen 4 doesn't really matter; according to benchmarks the performance difference is a few percent. Loading will still take a long time at x16 because it's pretty much the same speed. It won't get significantly faster with a different mobo either; you're limited by the GPU itself, not the interface.

3

OSeady t1_j7so2j2 wrote

Just go AMD and get the fastest cpu your budget allows.

3

ThomasBudd93 t1_j7teaca wrote

Last time I checked there were problems with running a training job on two or more RTX 4090s. Have these problems been resolved? There were some posts about it in the PyTorch and NVIDIA forums.

2

one_eyed_sphinx OP t1_j7tzoiq wrote

>eco

So this is the fine point I want to understand: what I'm trying to optimize with this build is data transfer time, i.e. how long it takes to load a model from RAM to VRAM. If I have 10 models that each need 16 GB of VRAM to run, they have to share resources, so I want to "memory hot swap" the models on each incoming request (I don't know if there's a proper term for it; the closest I found was "bin packing"). So data transfer is somewhat critical from my point of view, and as I understand it, only the PCIe speed is the bottleneck here. Correct me if I'm wrong.
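Roughly what I have in mind, as a minimal PyTorch sketch (the registry, model names, and sizes are made up, just to illustrate the swap):

    import torch
    import torch.nn as nn

    # Hypothetical registry: every model lives in host RAM, and at most one
    # is resident in VRAM at a time.
    models = {name: nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).eval()
              for name in ("model_a", "model_b")}
    resident = None  # name of the model currently on the GPU

    def get_model(name: str) -> nn.Module:
        """Swap the requested model into VRAM, evicting the previous one."""
        global resident
        if resident != name:
            if resident is not None:
                models[resident].to("cpu")   # evict: VRAM -> RAM
                torch.cuda.empty_cache()     # release the cached blocks
            models[name].to("cuda")          # load: RAM -> VRAM over PCIe
            resident = name
        return models[name]

    # On an incoming request:
    with torch.no_grad():
        out = get_model("model_a")(torch.randn(1, 4096, device="cuda"))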

1

ThomasBudd93 t1_j7u0ttw wrote

Someone else directed me here:

https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/12

I didn't read the whole thread, and there is more in the NVIDIA forum (links can be found in the PyTorch forum thread above). At least a few weeks ago it looked like multi-GPU training on RTX 4090s doesn't fully work, whereas it does on the RTX 6000 Ada. Not sure if this is intended or just a bug. I called a company here in Germany and they even stopped selling multi-RTX-4090 deep learning computers because of this. I asked them about the multi-GPU benchmark I saw from Lambda Labs and they replied that they reproduced it, but saw that the training only produced NaNs. This is all I know. If you find out more, could you share it here? :) Thanks!
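EDIT: One workaround I've seen suggested for this (I haven't verified it myself, so treat it as a sketch only) is disabling NCCL peer-to-peer before setting up DDP, since the 4090 apparently lacks P2P support:

    import os

    # Unverified workaround: force NCCL to stage transfers through host memory
    # instead of peer-to-peer, which the RTX 4090 reportedly doesn't support.
    os.environ["NCCL_P2P_DISABLE"] = "1"

    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # the usual DDP setup follows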

2

suflaj t1_j7u2qyt wrote

You want eco mode so it runs cooler and more efficiently. As I said, the bottleneck is in the GPU, specifically its memory bandwidth, not in whatever the CPU can transfer. Modern CPUs can easily handle 3 high-end GPUs at the same time, not just 2.

PCIe speed has not been a bottleneck for several years, and will probably never be a bottleneck again with this form factor of GPUs. The GPU MEMORY is the bottleneck nowadays.

EDIT: And as someone else has said, yeah, you can use fast NVMe drives as swap to avoid loading from a slower disk. There used to be Optane for this kind of stuff, but well, that's dead.

2

allanmeter t1_j7u6egg wrote

You will need to be looking at EPYC or, at a minimum, Threadripper. I would highly encourage ECC memory if possible.

Assuming you have a handle on your data-vs-model distribution strategy, you will need fast and ample RAM to help with data loading/offloading, as you have correctly pointed out.

If you're in North America, there are plenty of choices available to you; elsewhere in the world you will have to seek combinations out selectively, as stock is always an issue.

1

allanmeter t1_j7u6v50 wrote

Yes, the RAM-to-VRAM transfer is not as crazily important as you think. We previously hit this issue with the 3000 series as well, and as a result we supplemented with a full TB of RAM, but it still was not enough. Some models are incredibly greedy.

If you are on Linux, which is highly encouraged, also look to optimise your storage tier for swap memory, which is similar to pagefiles in Windows. You can define and mount extended swap disks, which you can trick out with multi-TB NVMe drives. It's not the same performance as RAM, but it's a last-step optimisation before you need to consider going to Quadro.

2

one_eyed_sphinx OP t1_j7yqh5v wrote

>NVMe

Yeah, the GPU memory is a horrible bottleneck. I am trying to find ways around it, but it doesn't seem there are many best practices for it. Is there a way to use pinned memory for faster model data transfer?
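For context, this is roughly how I was planning to measure it (a rough sketch; the ~2 GB tensor is an arbitrary stand-in for a model's weights):

    import time
    import torch

    # Compare a pageable vs. a pinned host-to-device copy of ~2 GB of fp32.
    size = (512, 1024, 1024)
    pageable = torch.empty(size)
    pinned = torch.empty(size, pin_memory=True)

    def h2d_seconds(src: torch.Tensor) -> float:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        src.to("cuda", non_blocking=True)
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    print(f"pageable: {h2d_seconds(pageable):.3f}s")
    print(f"pinned:   {h2d_seconds(pinned):.3f}s")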

1

suflaj t1_j7yr906 wrote

If GPU memory is the bottleneck, then there is nothing you can viably do about that. If your GPU can't load memory any faster, then you will need more rigs and GPUs if you want to speed up the loading in parallel.

Or you could try to quantize your models into something smaller that can fit in the memory, but then we're talking model surgery, not hardware.
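The cheapest version of that, if your models tolerate it, is just casting to half precision before loading; a sketch with a toy model, not a recommendation for any particular architecture:

    import torch
    import torch.nn as nn

    # Toy stand-in for a served model.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).eval()
    fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

    # fp16 halves the weight bytes, so both the VRAM footprint and the
    # RAM -> VRAM copy shrink by ~2x. Proper int8 quantization goes further,
    # but that's the model surgery part.
    model = model.half().to("cuda")
    fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

    print(f"fp32: {fp32_bytes / 1e6:.0f} MB -> fp16: {fp16_bytes / 1e6:.0f} MB")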

2

allanmeter t1_j7yt39v wrote

Threadripper and EPYC, purely to maximise your access to L3 cache as well. Yes, lanes and cores are important too. TR and EPYC really are well-engineered chips for handling sustained compute or memory-optimised workloads.

Some models use multiple GPUs with a strategy that copies the data to each GPU, and then there are models that segment their layers across GPUs and minimise copies of data. Hence, have a look at the distribution strategies being used and how the models support them. Some models even use the CPU as a collation point to merge split datasets and weights; I've rarely seen these perform well, as they're usually highly optimised with deep layers.
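The layer-segmenting case, in its most naive form, looks something like this (just a sketch for a dual-GPU box; the layer sizes are made up):

    import torch
    import torch.nn as nn

    # First half of the layers on GPU 0, second half on GPU 1; activations
    # cross the PCIe link between the two halves on every forward pass.
    class SplitModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.part0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

        def forward(self, x):
            x = self.part0(x.to("cuda:0"))
            return self.part1(x.to("cuda:1"))

    out = SplitModel()(torch.randn(2, 4096))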

Lastly, there's no real golden ratio between RAM, VRAM, and swap; let the OS handle it, provide as much as you can, and bias towards random IOPS as the measure.

Also, please keep an eye on nvidia-smi; use watch -n 1 nvidia-smi to keep an eye on power draw, utilisation, and temperature. You might go the exotic route and explore water cooling; otherwise make sure there is ample room to get cool air flowing through.

Best of luck, keep at it.

2

allanmeter t1_j7ytp7i wrote

This is really good advice! Preprocessing input data for both training and inference is the best route to efficiency. Don't feed it crazily large multidimensional datasets; try to break them up, and have a look at whether you can use old-fashioned methods like windowing and downsampling.

Also, the model's parameter type is important. If you're running fp64 then you will struggle versus a model that's just int8. If you have mixed-precision weights, then you really need to think about looking at AWS SageMaker and getting a pipeline going.
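For a sense of scale, the bytes per parameter by dtype; the same model is 8x more data to store and move in fp64 than in int8 (trivial sketch):

    import torch

    # Bytes per element for common parameter dtypes.
    for dtype in (torch.float64, torch.float32, torch.float16, torch.int8):
        print(dtype, torch.tensor(0, dtype=dtype).element_size(), "byte(s)/param")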

To OP: maybe you can share a little context on what models you're looking to run? Or some context on the input data.

1