
ThomasBudd93 t1_j7teaca wrote

Last time I checked there were problems with running a training job on two or more RTX 4090s. Have these problems been resolved? There were some posts in the PyTorch and NVIDIA forums.

2

one_eyed_sphinx OP t1_j7tzqmj wrote

Can you find me a reference?

1

ThomasBudd93 t1_j7u0ttw wrote

Someone else directed me here:

https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/12

I didn't read the whole thread, and there is more in the NVIDIA forum (links can be found in the PyTorch forum thread above). At least a few weeks ago it looked like multi-GPU training on RTX 4090s doesn't fully work, whereas it does on the RTX 6000 Ada. Not sure if this is intended or just a bug. I called a company here in Germany and they have even stopped selling multi-RTX-4090 deep learning computers because of this. I asked them about the multi-GPU benchmark I saw from Lambda Labs, and they replied that they reproduced it but saw that the training only produced NaNs. This is all I know. If you find out more, could you share it here? :) Thanks!
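If you want to check this on your own machine, here is a minimal sketch of a two-GPU DDP run that just watches whether the loss stays finite. It assumes a single node with two 4090s, a recent PyTorch build with the NCCL backend, and arbitrary placeholder values for the model size, port, and step count; it is not the exact script from the linked thread.

```python
# Minimal DDP NaN sanity check (sketch, not the script from the forum thread).
# Assumes: single machine, 2 GPUs, NCCL backend, recent PyTorch + CUDA build.
# Model, tensor sizes, port, and step count are arbitrary placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Tiny model wrapped in DDP so gradients are all-reduced across the GPUs.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(20):
        x = torch.randn(64, 1024, device=rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # NCCL all-reduce of gradients happens here
        opt.step()
        if rank == 0:
            # If the 4090 multi-GPU issue hits, "finite" flips to False.
            print(f"step {step}: loss={loss.item():.4f} "
                  f"finite={torch.isfinite(loss).item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If the loss prints as NaN (finite=False) after a few steps while the same script runs fine on a single GPU, that matches the behaviour described in the PyTorch forum thread.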

2

one_eyed_sphinx OP t1_j7yr8st wrote

Some people seem to connect it to the AMD processors and motherboards. Do you think that's the reason?
NVIDIA is known to downgrade its gaming GPUs so people will buy the professional ones.

1