
qiltb t1_j3rtr6a wrote

Be sure to check the logs (dmesg, for starters). Many A100s on AWS, for example, suffer from memory corruption, which leads to severe performance degradation. Also check temps.

A single A100 (even the least capable one, 400W with 40GB) should be roughly on the level of a 3090 Ti.

You also need to check memory usage (if it's right at the limit, like 78.9/80 GB, there's a problem somewhere). Also don't rule out driver issues.

Those are some common headaches when setting up remote GPU instances for DL...
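The memory and temperature checks above can be scripted. Here is a minimal sketch, assuming `nvidia-smi` is on the PATH; the `parse_smi_line` helper and the 98% "near limit" threshold are my own illustrative choices, not anything from the thread.

```python
import subprocess

def parse_smi_line(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    Expected fields: memory.used (MiB), memory.total (MiB), temperature.gpu (C).
    """
    used, total, temp = (float(x) for x in csv_line.split(","))
    return {
        "mem_used_mib": used,
        "mem_total_mib": total,
        "temp_c": temp,
        # Flag GPUs sitting right at the memory limit (threshold is an assumption).
        "mem_near_limit": used / total > 0.98,
    }

def check_gpus():
    # Query memory usage and temperature for every visible GPU.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_smi_line(line) for line in out.strip().splitlines()]
```

Running `check_gpus()` in a loop alongside training (or just watching `nvidia-smi dmon`) makes it easy to spot a card that is thermally throttling or pinned at its memory limit.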


Infamous_Age_7731 OP t1_j3rwor0 wrote

Thanks for the input. I just did sudo dmesg --follow and then ran my model, and I don't see any errors. It just reports that it loaded the UVM driver...

The memory usage is reasonable unless, of course, I push it close to the limit (e.g., with a larger batch size).

And what are the "temps"?


qiltb t1_j3uaop0 wrote

Sorry, temperatures of the GPU, CPU, etc.


Infamous_Age_7731 OP t1_j3xlhxv wrote

Oh yeah, gotcha. They seem fine. The GPU on the cloud instance, for example, is around 60°C.
