Submitted by soupstock123 t3_106zlpz in deeplearning

Hi there, I'm building a machine for deep learning and ML work, and I wanted some critique/advice on my build. The target is 4x 3090s, but right now I'm just trying to decide on the CPU and the motherboard. There are a few other options I considered, and these were my thoughts on each. Let me know if there's a flaw in my thinking.

  1. AMD Threadripper 3000:
  • expensive chip and mobo
  • end of life already and prices still haven't gone down much on these lol
  • 64 PCIe 4.0 lanes, so definitely enough lanes
  2. Intel i9-10980XE and an X299 motherboard:
  • 48 PCIe 3.0 lanes, enough for 4 GPUs
  • kinda old, and a slight premium for the X299 chipset

In the end, I decided to do this build: https://ca.pcpartpicker.com/list/Vmyvtn https://ca.pcpartpicker.com/list/vGkhwc

~~- AMD Ryzen 9 7950X:

  • 16 PCIe 5.0 lanes
  • with 4 GPUs that's 4 PCIe 5.0 lanes per GPU~~

I'm wondering what your opinion is on this build. Yes, there are only 16 lanes, but they're PCIe 5.0, and 4 lanes of PCIe 5.0 equal 8 lanes of PCIe 4.0 in bandwidth, so in theory it should be fine, right? For the case, I'm planning on just using a mining rig frame and putting everything on it for now. Future plans would be to waterblock everything and get a nice case.
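Here's the rough lane math I'm basing that on (a back-of-envelope sketch; the per-lane rates are approximate, and since the 3090 is a PCIe 4.0 card it would actually link at gen 4 speeds even in a gen 5 slot):

```python
# Approximate usable bandwidth per lane, in GB/s, after encoding overhead.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def per_gpu_bandwidth(total_lanes: int, num_gpus: int, gen: int):
    lanes = total_lanes // num_gpus
    return lanes, lanes * GBPS_PER_LANE[gen]

# Compare the configurations discussed in this thread, split across 4 GPUs.
for gen, total in [(5, 16), (4, 16), (3, 48)]:
    lanes, bw = per_gpu_bandwidth(total, 4, gen)
    print(f"PCIe {gen}.0, {total} lanes total -> x{lanes} per GPU, ~{bw:.1f} GB/s each")
```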

Edit: After reviewing some of the comments, I've decided to get a Threadripper 3960X and an ASRock TRX40 Creator for the mobo.

Also, a question about RAM speed: the ASRock Creator supports DDR4 speeds up to 4666, but is there a need to go that high? I'm planning on 128GB of RAM, and higher speeds are definitely more expensive. Is there a sweet spot of cost/performance, or does RAM speed not even matter for deep learning?

Some things I learned: check the bifurcation/division of lanes on the PCIe slots on the mobo; even if the processor has enough lanes, the mobo might not split them in a useful way.
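Related: once it's built, a quick way to sanity-check what link each card actually negotiated (a minimal sketch assuming the nvidia-ml-py / pynvml package; double-check the function names against your installed version):

```python
# Query the PCIe link width and generation each GPU actually negotiated,
# to verify how the motherboard really split the lanes.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    print(f"GPU {i}: PCIe gen {gen} x{width}")
pynvml.nvmlShutdown()
```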

7

Comments


hjups22 t1_j3k2kei wrote

What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?

Or, if you are doing inference, what types of models do you intend to run?

The configuration you suggested is really only good for training / inferencing many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also, don't forget about system RAM... depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and DeepSpeed requires a lot more than that (upwards of 4x) - I would probably go with at least 128GB for the setup you described.
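For 4x 3090 that rule of thumb works out roughly like this (just a ballpark sketch, not exact numbers):

```python
# Ballpark system-RAM sizing from the ~1.5x / ~4x rules of thumb above.
num_gpus = 4
vram_per_gpu_gb = 24                      # RTX 3090
total_vram = num_gpus * vram_per_gpu_gb   # 96 GB
print(f"Total VRAM:      {total_vram} GB")
print(f"~1.5x rule:      {1.5 * total_vram:.0f} GB system RAM")
print(f"DeepSpeed (~4x): {4.0 * total_vram:.0f} GB system RAM")
```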

6

VinnyVeritas t1_j3l04w8 wrote

Each time someone asks this question, someone repeats this misinformed answer.

This is incorrect; NVLink doesn't make much difference.

https://www.pugetsystems.com/labs/hpc/rtx-2080ti-with-nvlink-tensorflow-performance-includes-comparison-with-gtx-1080ti-rtx-2070-2080-2080ti-and-titan-v-1267/

3

hjups22 t1_j3l1l6n wrote

That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which has a significant advantage in speed compared to the older GPUs. I'm not aware of any benchmarks that explicitly tested this though.

Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, since you're going to spend more time in communication setup than in transfer.
But for the bigger models, you'd better bet it matters!
Although to be fair, my original statement is based on a node with 4x A6000 GPUs configured with pair-wise NVLink. When you jump from 2 paired GPUs to 4 GPUs with batch parallelism, the training time (for big models - ones which barely fit on the 3090) will only increase by about 20% rather than the expected 80%.
It's possible that the same scaling won't be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x system allocated a full 16 lanes to each GPU via dual sockets.

Note that this is why I asked about the type of training being done, since if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.

4

qiltb t1_j3l9suz wrote

that also depends on input image size though...

1

hjups22 t1_j3lk4e5 wrote

Could you elaborate on what you mean by that?
The advantage of NVLink is gradient / weight communication, which is independent of image size.

3

qiltb t1_j3mmcca wrote

Sorry, I was referring specifically to your last paragraph (that it's quick for small models).

1

hjups22 t1_j3nqeim wrote

Then I agree. If you are doing ResNet inference on 8K images, then it will probably be quite slow. However 8K segmentation will probably be even slower (the point of comparison that I was thinking of).
Also, when you get to large images, I suspect the PCIe will become a bottleneck (sending data to the GPUs), which will not be helped by the setup described by the OP.

1

qiltb t1_j3l9q8s wrote

Well, even in the most basic tasks - like plain ResNet-100 classification training - using NVLink makes a huge difference.

1

VinnyVeritas t1_j3ng2u9 wrote

Do you have some numbers or a link? All the benchmarks I've seen point to the contrary. I'm happy to update my opinion if things have changed and there's data to support it.

1

soupstock123 OP t1_j3l2srl wrote

Right now mostly CNNs, RNNs, and playing around with style transfers with GANs. Future plans include running computer vision models trained on videos and testing inferencing, but still researching how demanding that would be.

1

hjups22 t1_j3l3ln2 wrote

Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. Although, I would still recommend parallel training rather than trying to link them together (4 GPUs means you can run 4 experiments in parallel - or 8 if you double up on a single GPU).
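For example, a minimal sketch of the one-experiment-per-GPU approach (train.py and its --config flag are placeholders for whatever training script you actually run):

```python
# Pin each run to its own card via CUDA_VISIBLE_DEVICES and launch the
# experiments as independent processes, one per GPU.
import os
import subprocess

configs = ["exp_a.yaml", "exp_b.yaml", "exp_c.yaml", "exp_d.yaml"]
procs = []
for gpu_id, cfg in enumerate(configs):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(["python", "train.py", "--config", cfg], env=env))

for p in procs:
    p.wait()
```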

Regarding RAM speed, it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (reduced the RAM speed so that I could increase the capacity), the speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month long experiments, an extra day is irrelevant).

2

emanresuymsseug t1_j3kdzbv wrote

> - with 4 GPUs that's 4 PCIe 5.0 lanes per GPU

With the Asus PRIME B650M-A AX you are looking at 16 lanes for 1 GPU and 1 lane each for the other 3 GPUs.

PCIEX16_2, PCIEX16_3 and PCIEX16_4 slots are electrically connected in x1 mode.

Bifurcation is only supported via PCIEX16_1 slot.

4

soupstock123 OP t1_j3kzq2x wrote

Damn, 1 lane each for the other 3 isn't enough for my needs. The bifurcation kinda sucks here. Thanks for the advice.

1

rikonaka t1_j3jucus wrote

I think the Threadripper 5990X is better, and for the motherboard you can use a Supermicro server board. I have a computer with an AMD 5950X and a 3090 Ti on an X570 motherboard; if you're running 4x 3090, it's best to use a server motherboard for stability and performance. 😉

2

rikonaka t1_j3jvvne wrote

I read your shopping list and there are two problems: the motherboard and the power supply. One 3090 draws 350 watts, so four of them draw 1400 watts, which means your power supply should be at least 2000 watts (the exact calculation can be done once the CPU is decided). The problem with the motherboard is that the B650 doesn't support four 3090s - it only has two usable GPU slots. 😉

1

soupstock123 OP t1_j3jx9ko wrote

Yeah, there's no way to add two PSUs on PCPartPicker, so that's meant to be 2 of the 1000W ones.

The B650 supports 4. It has enough slots. The physical blocking isn't an issue because I'm going to be using GPU risers to fit the 4 GPUs.

To respond to your first comment, the Threadripper is also very expensive, and I'm waiting until Sept 2023 when the Threadripper 7000 series comes out and drops prices on the older Threadrippers.

1

rikonaka t1_j3jyzos wrote

Well, I'm not sure how feasible two power supplies in one host is 😂, and the motherboard is still a problem. I can't comment on the stability of connecting four 3090s over riser cables (because I haven't done it myself). I think you should consider your plan carefully - the cost of trial and error is not low.

1

qiltb t1_j3kjvki wrote

It actually works very well with an ADD2PSU connector (I used something like 5 PSUs for one 14x 3090 rig). He should actually think more about getting a 1600W HIGH QUALITY PSU.

The Corsair RM series IS NOT SUITABLE for the workload you are looking at. Preferably use the AXi series, or HXi if you really want to cheap out. We are talking about really abusing these PSUs. The AX1600i is still unmatched for this use case.

1

soupstock123 OP t1_j3l0lhk wrote

Thanks for the advice. Can you elaborate on why the Corsair RM series is not suitable for the workload? My rationale was that because it's an open-air mining frame instead of a case, I wanted the RM series, which is supposedly quieter.

1

qiltb t1_j3l9hll wrote

Under full load, the AXi series is basically silent. But the main reason is that the RM series isn't high enough quality to actually sustain that load (even higher-grade PSUs like the EVGA P2 series have problems with the infamous 3090 under DL loads). Also, take a look at my longer comment on this post.

1

soupstock123 OP t1_j3mteu8 wrote

Hmm, the AX1600i might not be enough for me. This is my new build: https://ca.pcpartpicker.com/user/sixartsdragon/saved/DCh6Q7 and I'm looking at 1821W, so realistically I need a ~2000W PSU. I've chosen a Super Flower Leadex 2000 for now. What do you think?

1

qiltb t1_j3o7ull wrote

I actually assumed you would be running 2 PSUs. For the fewest problems, buy 2x AX1600i; for a cheaper option, buy 2x AX1200i. A single PSU is actually the worst case, but yeah, you can try with a single SFL 2000.

1

soupstock123 OP t1_j3of5oq wrote

What do you mean by "fewest problems", and why is a single PSU the worst case?

1

Volhn t1_j3kx4ut wrote

Just get a single 3090 and the 7950X or a 13th-gen Intel, then spend the rest of what you would have spent on renting bigger GPUs in the cloud.

1

soupstock123 OP t1_j3l0fmt wrote

I plan on using this frequently, and compared to even renting a similar configuration online, I would break even after a year.

1

VinnyVeritas t1_j3l0gqt wrote

I don't know if that's going to work well with only 16 PCIe lanes; everyone I've seen building 4-GPU machines uses CPUs that have 48 or 64 PCIe lanes.

Also, you'll need a lot of watts to power that monster, not to mention a 10-20% margin if you don't want to fry the PSU.

1

soupstock123 OP t1_j3l0q8f wrote

Yeah, that's basically what I've discovered too. The mobo with only 16 PCIe lanes isn't going to work out, so I've changed my build to a Threadripper. Any advice or suggestions for a PSU that can handle the workload?

1

VinnyVeritas t1_j3nh3g4 wrote

I suppose one PSU will take care of the motherboard + CPU + some of the GPUs, and the other one will take care of the remaining GPUs.

So if you get 4x 3090, that's 350W x 4 = 1400W just for the GPUs, plus ~300W for the CPU, plus power for the rest of the components, drives, etc. So let's say we round that up to 2000W, then add at least a 10% margin - that's 2200W total.

So maybe a 1600W PSU for the mobo and some GPUs, and another 1000W or more for the remaining GPUs. Note: if you go with 3090 Tis, it's more like 450-500W per card, so you have to do the math.

Or if you want to future-proof, just put in two 1600W PSUs; then you can swap your 3090s for 4090s later and not worry about upgrading PSUs.
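Rough sketch of that math (the GPU figures are nominal board power and real DL loads can spike above them; the CPU and "other" numbers are guesses):

```python
# Rough PSU budget following the estimate above; all figures approximate.
gpu_watts = {"3090": 350, "3090 Ti": 450, "4090": 450}
num_gpus = 4
cpu_watts = 300    # high-end Threadripper under load, roughly
other_watts = 300  # drives, fans, risers, motherboard - generous rounding
margin = 0.10      # headroom so the PSUs aren't pinned at 100%

for card, w in gpu_watts.items():
    total = num_gpus * w + cpu_watts + other_watts
    print(f"4x {card}: ~{total} W load, ~{total * (1 + margin):.0f} W with margin")
```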

1

soupstock123 OP t1_j3nmsho wrote

I'm seeing the argument for two 1600W PSUs. It's fine with the mining rig frame, but it's basically confirming to me that this is never going to fit in a case lol.

1

VinnyVeritas t1_j3rrzvr wrote

Actually, I've been sort of looking at ML computers (kind of browsing and dreaming that one day I'd have one, but it's always going to be beyond my means and needs anyway). Anyway, they can put two PSUs in one box. Obviously these are made by companies, so the total cost is two or three times the cost of the parts alone (i.e. building it yourself would be 2-3x cheaper), but it could inspire you when picking your parts: https://bizon-tech.com/amd-ryzen-threadripper-up-to-64-cores-workstation-pc

https://shop.lambdalabs.com/gpu-workstations/vector/customize

1

Final-Rush759 t1_j3o261b wrote

Buy 2× 4090

1

soupstock123 OP t1_j3o2jk3 wrote

Performance is only about 1.9x the 3090 for deep learning and the price is more than double - just bad value rn.

1

Final-Rush759 t1_j3o2zd3 wrote

You save a lot on electricity costs. Much better value if you plan to use it a lot. It's also much easier to manage 2 cards than 4.

1

soupstock123 OP t1_j3o42j4 wrote

For sure that's the upgrade path in the future, but rn my electricity is free so it's not too much of an issue.

1

Final-Rush759 t1_j3o39ds wrote

The 3090 is not a great card: it runs at high temperatures, it's noisy, and its VRAM temperatures get excessively high.

1

soupstock123 OP t1_j3o4t0l wrote

It's a budgeting issue, if I could do 4 4090s, I would.

1