ThomasBudd93 t1_j7u0ttw wrote

Someone else directed me here:

https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/12

I didn't read the whole thread, and there is more in the NVIDIA forums (links can be found in the PyTorch forum thread above). At least as of a few weeks ago, it looked like multi-GPU training doesn't fully work on RTX 4090s, while it does on the RTX 6000 Ada. Not sure if this is intended or just a bug. I called a company here in Germany and they have even stopped selling multi-RTX-4090 deep learning machines because of this. I asked them about the multi-GPU benchmark I saw from Lambda Labs, and they replied that they reproduced it, but saw that the training only produced NaNs. This is all I know. If you find out more, could you share it here? :) Thanks!
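For anyone who wants to test their own machine: a minimal DDP sketch (the model and shapes are made up, only the non-finite-loss check matters), launched with `torchrun --nproc_per_node=2 repro.py`. If I remember correctly, disabling peer-to-peer transfers with `NCCL_P2P_DISABLE=1` came up as a workaround in the linked threads.

```python
# Hypothetical repro sketch: watch for non-finite losses during
# multi-GPU training. Launch with: torchrun --nproc_per_node=2 repro.py
# (reportedly, NCCL_P2P_DISABLE=1 worked around the 4090 issue)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # torchrun sets the env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(256, 256).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 256, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradient all-reduce over NCCL happens here
        opt.step()
        if not torch.isfinite(loss):
            print(f"rank {dist.get_rank()}: non-finite loss at step {step}")
            break

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```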

2

ThomasBudd93 t1_ixgzy0h wrote

I've built my entire library in PyTorch and it was amazing. I don't regret it at all. If you have a long-term project or want to learn more about coding and DL implementations, it is a good choice, I think.

For context: I knew the project I would be working on for the next few years of my PhD. I looked at the current SOTA implementations and didn't like certain aspects / saw a lot of work ahead if I were to stick with them. I read the code a lot and spent a single day just thinking about my design before writing a line of code myself. I learnt a lot in this time. If you do that, you will get the chance to learn and think about all the little technical details hidden in current frameworks.

After finishing this, I was able to adapt my code quickly to new ideas I wanted to try. Also, switching from messy IPython notebooks to my library made my work more reproducible (roughly the kind of structure sketched below).
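To make that concrete, here is a hypothetical skeleton (all names invented, not my actual library) of the config-in, artifacts-out pattern that replaces notebook code:

```python
# Hypothetical training-library skeleton: every run is driven by a
# config object, and the exact config is saved next to the weights,
# so any result can be reproduced from its run directory.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

import torch

@dataclass
class TrainConfig:
    lr: float = 1e-3
    epochs: int = 10
    out_dir: str = "runs/exp1"

class Trainer:
    def __init__(self, model: torch.nn.Module, cfg: TrainConfig):
        self.model = model
        self.cfg = cfg
        self.opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)

    def fit(self, loader):
        out = Path(self.cfg.out_dir)
        out.mkdir(parents=True, exist_ok=True)
        # Persist the exact config next to the weights.
        (out / "config.json").write_text(json.dumps(asdict(self.cfg)))
        for epoch in range(self.cfg.epochs):
            for x, y in loader:
                loss = torch.nn.functional.mse_loss(self.model(x), y)
                self.opt.zero_grad()
                loss.backward()
                self.opt.step()
        torch.save(self.model.state_dict(), out / "weights.pt")
```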

This was my experience, but it will differ from case to case. I would say: if you have a long-term project, are starting your career, and have the time, definitely do it! You will learn a lot and have a codebase that you understand from head to toe and can rely on. Otherwise I would reconsider.

Hope that helps!

6

ThomasBudd93 t1_is5fwt4 wrote

We also have to wait for the improvements from fp8 to kick in. NVIDIA recently published a paper demonstrating that it is feasible to train with fp8, and the new tensor cores are compatible with that format; the software just isn't there yet.
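For reference, the software layer NVIDIA is building for this is Transformer Engine. A minimal sketch of its fp8 API, assuming the documented `fp8_autocast` usage (exact recipe arguments may differ across versions, and it needs Hopper/Ada hardware):

```python
# Sketch of fp8 training with NVIDIA Transformer Engine.
# Illustrative only; recipe arguments may vary by TE version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

model = te.Linear(1024, 1024).cuda()  # TE module with fp8-capable kernels
inp = torch.randn(32, 1024, device="cuda")

# HYBRID = E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.sum().backward()  # backward also runs through fp8 kernels
```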

8