hjups22 t1_j3nqeim wrote

Then I agree. If you are doing ResNet inference on 8K images, it will probably be quite slow. However, 8K segmentation will probably be even slower (that was the point of comparison I had in mind).
Also, once you get to large images, I suspect PCIe will become a bottleneck (sending data to the GPUs), which the setup described by the OP will not help with.

1

hjups22 t1_j3l3ln2 wrote

Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. That said, I would still recommend parallel training rather than trying to link the GPUs together (4 GPUs means you can run 4 experiments in parallel, or 8 if you double up on a single GPU).
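If it helps, one way to run independent experiments in parallel is to pin each process to its own card via `CUDA_VISIBLE_DEVICES`. A minimal sketch (the `train.py` script and the config names are placeholders, not anything from this thread):

```python
import os
import subprocess
import sys

# Hypothetical experiment configs -- one independent training run per GPU.
configs = ["exp_a.yaml", "exp_b.yaml", "exp_c.yaml", "exp_d.yaml"]

procs = []
for gpu_id, cfg in enumerate(configs):
    # Each process sees only one GPU, so the runs never contend for a device.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    procs.append(subprocess.Popen(
        [sys.executable, "train.py", "--config", cfg],
        env=env,
    ))

for p in procs:
    p.wait()  # block until all runs finish
```

Doubling up on a GPU is just two configs mapped to the same `gpu_id` (assuming there is enough free VRAM for both).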

Regarding RAM speed, it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (I reduced the RAM speed in order to increase the capacity); the speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month-long experiments, an extra day is irrelevant).

2

hjups22 t1_j3l1l6n wrote

That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which has a significant advantage in speed compared to the older GPUs. I'm not aware of any benchmarks that explicitly tested this though.

Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, since you'll spend more time in communication setup than in the transfer itself.
But for the bigger models, you can bet it matters!
Although to be fair, my original statement is based on a node with 4x A6000 GPUs in a pair-wise NVLink configuration. When you jump from 2 paired GPUs to 4 GPUs with batch parallelism, the training time (for big models, ones which barely fit on a 3090) only increases by about 20% rather than the expected 80%.
It's possible that the same scaling will not be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x system allocated a full 16 lanes to each GPU via dual sockets.
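As a back-of-the-envelope check on those numbers (my own arithmetic, using the ~20% vs ~80% figures above and reading them as per-step time increases):

```python
# Effective speedup when scaling from 2 to 4 GPUs with batch parallelism,
# given a per-step slowdown from communication overhead.
ideal_speedup = 4 / 2        # 2.0x if communication were free

observed_overhead = 0.20     # ~20% slower steps (NVLink-paired 4x A6000 node)
naive_overhead = 0.80        # ~80% slower steps (the "expected" worst case)

observed_speedup = ideal_speedup / (1 + observed_overhead)
naive_speedup = ideal_speedup / (1 + naive_overhead)
print(f"observed: {observed_speedup:.2f}x, naive: {naive_speedup:.2f}x")
# → observed: 1.67x, naive: 1.11x
```

In other words, with the faster interconnect the second pair of GPUs is still clearly worth adding; with the naive overhead it barely is.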

Note that this is why I asked about the type of training being done, since if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.

4

hjups22 t1_j3k2kei wrote

What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?

Or if you are doing inference, what types of models do you intend to run?

The configuration you suggested is really only good for training or running inference on many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also, don't forget about system RAM... depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and DeepSpeed requires a lot more than that (upwards of 4x). I would probably go with at least 128GB for the setup you described.
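To put rough numbers on those rules of thumb (assuming 4x 24GB cards, which is a guess at the setup, not something stated in this thread):

```python
# System-RAM sizing from the ~1.5x and ~4x rules of thumb above.
vram_per_gpu_gb = 24                          # assumed 24GB cards (3090-class)
num_gpus = 4
total_vram_gb = vram_per_gpu_gb * num_gpus    # 96 GB total VRAM

plain_ram_gb = 1.5 * total_vram_gb        # worst case if you fill all the VRAM
deepspeed_ram_gb = 4.0 * total_vram_gb    # DeepSpeed offloading needs far more
print(plain_ram_gb, deepspeed_ram_gb)
# → 144.0 384.0
```

The ~1.5x figure only bites if the models actually fill the VRAM, which is presumably why 128GB is given as a floor rather than a hard requirement.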

6