allanmeter t1_j7u6egg wrote on February 9, 2023 at 1:00 PM

You will need to be looking at Epyc or at a minimum Threadripper. I would highly encourage ECC memory if possible.

Assuming you have a handle on data vs model distribution strategy, you will need fast and ample RAM to help with data loading/offloading as you have correctly pointed out.

If you are in North America, plenty of choices are available to you; elsewhere in the world you will have to seek out combinations selectively, as stock is always an issue.

one_eyed_sphinx OP t1_j7ypyu3 wrote on February 10, 2023 at 10:27 AM

Threadripper at a minimum? Are you saying this because of the number of lanes or the number of cores?
Can you elaborate more on "Assuming you have a handle on data vs model distribution strategy"?

allanmeter t1_j7yt39v wrote on February 10, 2023 at 11:09 AM

Threadripper and Epyc mainly to maximise your access to L3 cache. Yes, lanes and cores are important too. TR and Epyc really are well-engineered chips for sustained compute-heavy or memory-heavy workloads.

Some models use multiple GPUs with a strategy that copies the data to each GPU, while others segment the layers across GPUs and minimise copies of data. Hence have a look at the distribution strategies being used, and how the models support them. Some models even use the CPU as a collation point to merge split datasets and weights. I’ve rarely seen these models perform well; they’re usually highly optimised, with deep layers.
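
To make the difference concrete, here is a rough back-of-envelope sketch in Python. It only counts weight memory per GPU, and it assumes an even split of layers; the function name, the strategy labels and the 7-billion-parameter example are made up for illustration, and real frameworks will also need room for activations, gradients and optimizer state.

```python
def per_gpu_weight_gib(n_params, bytes_per_param, n_gpus, strategy):
    """Rough weight memory per GPU in GiB.

    Weights only: activations, gradients and optimizer state are ignored.
    The strategy labels are illustrative, not any framework's API.
    """
    total_bytes = n_params * bytes_per_param
    if strategy == "data_parallel":
        # Every GPU holds a full copy of the model; only the batch is split.
        per_gpu_bytes = total_bytes
    elif strategy == "layer_split":
        # Layers are segmented across GPUs (assumed to divide evenly).
        per_gpu_bytes = total_bytes / n_gpus
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return per_gpu_bytes / 2**30


# Hypothetical example: 7 billion parameters in 16-bit precision on 4 GPUs.
for name in ("data_parallel", "layer_split"):
    gib = per_gpu_weight_gib(7_000_000_000, 2, 4, name)
    print(f"{name}: {gib:.1f} GiB of weights per GPU")
```

The point is just that a copy-everything strategy needs the whole model to fit on each card, while a layer split trades that for more traffic between cards, which is where lanes and host RAM start to matter.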

Lastly, there’s no real golden ratio for RAM, VRAM and swap. Let the OS handle it, provide as much as you can, and bias towards random IOPS as the measure.

Also please keep an eye on nvidia-smi: run watch -n 1 nvidia-smi to track power draw, utilisation and temperature. You might go the exotic route and explore water cooling; otherwise make sure there is ample room to get cool air flowing through.
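
If you would rather log the numbers than stare at a terminal, something along these lines could work. It is a minimal sketch that shells out to nvidia-smi with its CSV query mode and reads utilisation, temperature and power draw; the helper names and the 80 °C warning threshold are my own assumptions, so pick a limit that suits your cards.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = "index,utilization.gpu,temperature.gpu,power.draw"


def _to_float(text):
    """Return a float, or None when the driver reports something like [N/A]."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    index, util, temp, power = [part.strip() for part in line.split(",")]
    return {
        "index": int(index),
        "util_pct": _to_float(util),
        "temp_c": _to_float(temp),
        "power_w": _to_float(power),
    }


def read_gpus():
    """Query every GPU once; requires nvidia-smi on the PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


def is_too_hot(reading, limit_c=80.0):
    """True when a temperature was reported and exceeds the assumed limit."""
    return reading["temp_c"] is not None and reading["temp_c"] > limit_c
```

Calling read_gpus() in a loop with a one-second sleep gives you roughly what the watch command shows, and is_too_hot() can be used to flag a card that needs better airflow.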

Best of luck, keep at it.