
qiltb t1_j3rtr6a wrote

Be sure to check the logs (dmesg, for starters). Many A100s on AWS, for example, suffer from memory corruption, which leads to severe performance degradation. Also check temps.

A single A100 (even the least capable one, 400W with 40GB) should be roughly on the level of a 3090 Ti.

You also need to check memory usage (if it's right at the limit, like 78.9/80 GB, there's a problem somewhere). Also don't rule out driver issues.

Those are some common headaches when setting up remote GPU instances for DL...
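The memory and temperature checks above can be scripted. Here is a minimal sketch, assuming `nvidia-smi` is on the PATH; the `parse_smi_line` helper and the 98% "near limit" threshold are my own illustrative choices, not anything from the thread.

```python
import subprocess

def parse_smi_line(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    Expected fields: memory.used (MiB), memory.total (MiB), temperature.gpu (C).
    """
    used, total, temp = (float(x) for x in csv_line.split(","))
    return {
        "mem_used_mib": used,
        "mem_total_mib": total,
        "temp_c": temp,
        # Flag GPUs sitting right at the memory limit (threshold is an assumption).
        "mem_near_limit": used / total > 0.98,
    }

def check_gpus():
    # Query memory usage and temperature for every visible GPU.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_smi_line(line) for line in out.strip().splitlines()]
```

Running `check_gpus()` in a loop alongside training (or just watching `nvidia-smi dmon`) makes it easy to spot a card that is thermally throttling or pinned at its memory limit.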


Infamous_Age_7731 OP t1_j3rwor0 wrote

Thanks for the input. I just did sudo dmesg --follow and then ran my model, and I don't see any errors. It just reports that it loaded the UVM driver...

The memory usage is reasonable unless, of course, I push it close to the limit (e.g., with a larger batch size).

And what are the "temps"?


qiltb t1_j3uaop0 wrote

Sorry, temperatures of the GPU, CPU, etc.


Infamous_Age_7731 OP t1_j3xlhxv wrote

Oh yeah, gotcha. They seem fine. The GPU on the cloud instance, for example, is around 60°C.
