Submitted by Infamous_Age_7731 t3_107pcux in deeplearning

I am training a DL model locally and on a VM from a private vendor. Locally I have an RTX 3080 Ti (12 GB); on the cloud, for the extra memory, I am using an Ampere A100 (80 GB).

I had the feeling that the VM's GPU was a bit slow.

So I used the exact same hyperparameters (i.e., batch size, etc.) and again noticed that the local RTX 3080 Ti is much faster than the A100. When I measured it, it was 2-3x faster.
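
A minimal sketch of the kind of apples-to-apples timing I mean (assuming PyTorch, with a stand-in ResNet-50 and a synthetic batch rather than my actual model):

```python
# Minimal timing sketch: a stand-in ResNet-50 and a synthetic batch, not the actual
# training script, just to compare raw iterations/second on the two machines.
import time
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(64, 3, 224, 224, device="cuda")   # fixed synthetic batch
y = torch.randint(0, 1000, (64,), device="cuda")

def step():
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

for _ in range(10):        # warm-up iterations
    step()
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    step()
torch.cuda.synchronize()
print(f"{100 / (time.time() - start):.1f} it/s")
```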

Is that because of the card? (Is it because it's a server GPU? I also saw that the 80 GB is actually 2x 40 GB connected with NVLink; could that be it?) Or is it standard practice for VM companies to throttle the GPU?

5

Comments


susoulup t1_j3npxch wrote

I'm not that far into deep learning as far as comparing cloud GPUs with local ones goes. I have benchmarked one locally, and I'm not sure I did it properly, so I don't really have any advice. My question is whether ECC buffering plays a factor in how the data is processed and stored? I thought that was one of the advantages of using a workstation GPU, but I could be way off.

2

BellyDancerUrgot t1_j3nq5pn wrote

There would be a 5-8% overhead for the same GPU in a VM vs. bare metal. The A100 is significantly faster for ML workloads than a 3090, IIRC, so it's probably something related to how it's set up in your case. Also, if you are doing distributed learning, try using a single GPU instead; MPI might be adding overhead in your compute node.
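
Something like this (a minimal sketch, assuming PyTorch, which OP hasn't confirmed) pins the run to a single GPU so any multi-GPU/MPI overhead is out of the picture:

```python
# Expose only one GPU before CUDA is initialized, so the framework cannot
# silently go distributed. Assumes PyTorch; adjust for your framework.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must be set before importing torch

import torch

device = torch.device("cuda:0")
print(torch.cuda.device_count())           # should print 1
print(torch.cuda.get_device_name(device))  # confirm which card you actually got
```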

2

agentfuzzy999 t1_j3pbk38 wrote

I have trained locally and in the cloud on a variety of cards and server architectures. Depending on what model you are training, it could be for a huge variety of reasons, but if you can fit the model on a 3080, you really aren't going to be taking advantage of the A100's huge memory. The higher clock speed of the 3080 might simply suit this model and parameter set better.

4

ivan_kudryavtsev t1_j3pssyp wrote

Why so? GPUs are passed to VMs in pass-through mode, so no significant performance pitfalls should occur. I recommend the OP look at CPU %steal and at nvidia-smi (maybe it is a 1/7 MIG slice of an A100, not a full GPU). Run single- and multi-threaded sysbench to compare CPU and RAM. Also, your local machine may come out ahead on PCIe generation or dedicated bandwidth if the cloud provider uses a not-well-balanced custom build.
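
A rough way to run those checks in one go (a sketch in Python; assumes a Linux VM with nvidia-smi and sysbench installed):

```python
# Wraps the suggested checks: full GPU vs. MIG slice, CPU steal time, and
# single-/multi-threaded CPU and memory benchmarks to compare against the local box.
import subprocess

def run(cmd):
    print(f"$ {cmd}")
    print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)

run("nvidia-smi -L")                      # a MIG slice shows up as "MIG ..." here
run("top -bn1 | head -n 5")               # the 'st' (steal) column should stay near 0
run("sysbench cpu --threads=1 run")       # single-threaded CPU
run("sysbench cpu --threads=8 run")       # multi-threaded CPU
run("sysbench memory --threads=1 run")    # RAM throughput
```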

2

ivan_kudryavtsev t1_j3q0c02 wrote

>Oh I thought maybe he is going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter and gather all the gradients per batch every epoch.

It looks like no; they speculated about the internal design of A100.

1

No_Cryptographer9806 t1_j3q38ya wrote

You might want to check what storage you have on your VM. Slow storage could be a bottleneck. I would suggest getting a high-speed SSD.
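
For example, a quick-and-dirty sequential read test (a sketch; the path and sizes are placeholders, and ideally you'd drop the page cache first):

```python
# Write ~1 GiB of incompressible data, then time reading it back to compare the
# VM's storage against the local SSD. Without dropping the page cache first
# (sync; echo 3 > /proc/sys/vm/drop_caches, needs root) this will overestimate speed.
import os
import time

path = "/tmp/io_test.bin"            # placeholder: put it on the volume under test
chunk = os.urandom(64 * 1024**2)     # 64 MiB chunk
n_chunks = 16                        # ~1 GiB total

with open(path, "wb") as f:
    for _ in range(n_chunks):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())

start = time.time()
with open(path, "rb") as f:
    while f.read(len(chunk)):
        pass
print(f"sequential read: {n_chunks * 64 / (time.time() - start):.0f} MiB/s")
```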

You should also use nvidia-smi or nvtop to monitor GPU usage.

1

Infamous_Age_7731 OP t1_j3qy6qv wrote

> Run single- and multi-threaded sysbench to compare CPU and RAM.

Thanks a lot for your input! I checked the CPU %steal and it seems fine, ranging from 0.0 to 0.1 st. I don't think it's a MIG slice, since nvidia-smi shows the full 80 GB of memory at my disposal (unless they do some trickery). I ran a series of `sysbench` tests and found that the VM's CPU is slightly worse for single-thread performance, but what is more striking is the RAM speed. For 1 or 8 threads, the write is 0.8x slower and the read is 1.5x slower. The RAM speed drop seems to mirror the iterations-per-second drop when I train the model. I guess this might be the culprit.
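
To see whether the slower RAM actually translates into slower training, I can also compare host-to-GPU copy bandwidth on both machines; a small sketch (assuming PyTorch):

```python
# Measure host-to-device copy bandwidth for pageable and pinned host memory;
# if the VM is much slower here too, the RAM/PCIe path is the likely culprit.
import time
import torch

x = torch.empty(1024, 1024, 256)   # ~1 GiB of float32 in host RAM
x_pinned = x.pin_memory()          # page-locked copy of the same tensor

def h2d_bandwidth(t):
    torch.cuda.synchronize()
    start = time.time()
    t.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return t.numel() * t.element_size() / (time.time() - start) / 1024**3

print(f"pageable H2D: {h2d_bandwidth(x):.1f} GiB/s")
print(f"pinned   H2D: {h2d_bandwidth(x_pinned):.1f} GiB/s")
```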

2

qiltb t1_j3rtr6a wrote

Be sure to check the logs (i.e., dmesg for starters). Many A100s on AWS, for example, suffer from memory corruption, which leads to severe degradation in performance. Also check temps.

A single A100 (even the least capable one, 400 W with 40 GB) should be more on the level of a 3090 Ti.

You also need to check memory usage (if it's right at the limit, like 78.9/80 GB, there's a problem somewhere). Also, don't rule out drivers.
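
A simple way to watch temps, utilization, and memory together while a job runs (a sketch assuming the pynvml bindings are installed):

```python
# Poll temperature, GPU utilization, and memory of GPU 0 once per second.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(60):
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"temp={temp}C  util={util.gpu}%  "
          f"mem={mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```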

Those are some common headaches when setting up remote GPU instances for DL...

1

Infamous_Age_7731 OP t1_j3rwor0 wrote

Thanks for the input. I just ran `sudo dmesg --follow`, then ran my model, and I don't see any errors. It just reports that it loaded the UVM driver...

The memory usage is reasonable unless, of course, I push it close to the limit (e.g., with the batch size).

And what are the "temps"?

1

GPUaccelerated t1_j60zhrx wrote

It's simply because the 3080 Ti is actually a faster GPU than the A100. The reason the A100 exists is to fit large models without having to parallelize across multiple cards. *For most cases.*

1