
VinnyVeritas t1_j3l04w8 wrote


hjups22 t1_j3l1l6n wrote

That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which gives it a significant speed advantage over the older GPUs. I'm not aware of any benchmarks that explicitly tested this though.

Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, since you'll spend more time in communication setup than in the transfer itself.
But for the bigger models, you'd better bet it matters!
Although to be fair, my original statement is based on a node with 4x A6000 GPUs, configured in a pair-wise NVLink configuration. When you jump from 2 paired GPUs to 4 GPUs with batch-parallelism, the training time (for big models - ones that barely fit on a 3090) only increases by about 20% rather than the expected 80%.
It's possible that the same scaling won't be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x A6000 system allocated a full 16 PCIe lanes to each GPU via dual sockets.
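The scaling behavior described above can be sanity-checked with a toy cost model for data-parallel training. Everything below (compute time, gradient volume, bandwidth figures) is an illustrative assumption, not a measurement from the A6000 node:

```python
# Toy model of data-parallel step time: compute shrinks with GPU count,
# while the gradient all-reduce cost depends on interconnect bandwidth.
# All constants are illustrative assumptions, not benchmarks.

def step_time(n_gpus, compute_s=1.0, grad_gb=2.0, bw_gb_s=50.0):
    """Estimated seconds per step for a fixed global batch.

    compute_s: single-GPU compute time for the whole batch
    grad_gb:   gradient volume exchanged per step (GB)
    bw_gb_s:   effective interconnect bandwidth (GB/s)
    """
    if n_gpus == 1:
        return compute_s
    # A ring all-reduce moves ~2*(n-1)/n of the gradient volume per GPU.
    comm_s = 2 * (n_gpus - 1) / n_gpus * grad_gb / bw_gb_s
    return compute_s / n_gpus + comm_s

for n in (1, 2, 4):
    print(n, "GPUs:", round(step_time(n), 3), "s/step")
```

Plugging in a lower bandwidth (i.e. no NVLink, PCIe only) makes the communication term dominate for large gradient volumes, which is the effect being discussed here.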

Note that this is why I asked about the type of training being done: if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 trains quickly anyway and won't really benefit that much from multiple GPUs in the grand scheme of things.


qiltb t1_j3l9suz wrote

that also depends on input image size though...


hjups22 t1_j3lk4e5 wrote

Could you elaborate on what you mean by that?
The advantage of NVLink is gradient / weight communication, which is independent of image size.
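This can be made concrete with some arithmetic: the gradient traffic per step is fixed by the parameter count, while the input data volume grows with resolution. The ~25.6M parameter count for ResNet-50 is the commonly cited figure; batch size and dtype below are assumptions for illustration:

```python
# Gradient volume per all-reduce depends only on parameter count,
# not on the input image size. ResNet-50 has ~25.6M parameters.
params = 25_600_000
grad_bytes = params * 4  # fp32 gradients -> ~102 MB per step
print(f"gradients per step: {grad_bytes / 1e6:.1f} MB")

# Input volume, by contrast, scales with resolution and batch size.
def batch_bytes(h, w, batch=32, channels=3, dtype_bytes=4):
    """Raw fp32 bytes for one batch of images (illustrative)."""
    return h * w * channels * dtype_bytes * batch

print(f"batch of 224x224 images: {batch_bytes(224, 224) / 1e6:.1f} MB")
print(f"batch of 8K (7680x4320) images: {batch_bytes(7680, 4320) / 1e9:.2f} GB")
```

So the NVLink-side traffic (gradients) is the same whether you train on 224px or 8K inputs; only the host-to-GPU input transfer changes.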


qiltb t1_j3mmcca wrote

Sorry, I was referring specifically to your last paragraph (that training is quick for small models)


hjups22 t1_j3nqeim wrote

Then I agree. If you are doing ResNet inference on 8K images, it will probably be quite slow. However, 8K segmentation will probably be even slower (the point of comparison I had in mind).
Also, when you get to large images, I suspect the PCIe will become a bottleneck (sending data to the GPUs), which will not be helped by the setup described by the OP.
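As a rough back-of-envelope check on that PCIe concern: the ~32 GB/s figure is the usual peak spec for PCIe 4.0 x16; the raw fp32 frame assumption is for illustration only (real pipelines would transfer compressed or uint8 data):

```python
# Upper bound on raw fp32 8K frames one PCIe 4.0 x16 link can
# deliver per second, ignoring all other overhead.
h, w, channels, dtype_bytes = 7680, 4320, 3, 4
frame_bytes = h * w * channels * dtype_bytes   # ~398 MB per image
pcie_bw = 32e9                                 # ~32 GB/s peak, PCIe 4.0 x16
print(f"max raw frames/s over one link: {pcie_bw / frame_bytes:.1f}")
```

At well under 100 raw frames per second of headroom, shared across however many GPUs sit behind the link, the host-to-device transfer can plausibly become the limiting factor before the GPUs do.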


qiltb t1_j3l9q8s wrote

Well, even in the most basic tasks - like plain resnet100 classification training - NVLink makes a huge difference.


VinnyVeritas t1_j3ng2u9 wrote

Do you have some numbers or a link? All the benchmarks I've seen point to the contrary. I'm happy to update my opinion if things have changed and there's data to support it.
