Submitted by Infamous_Age_7731 t3_107pcux in deeplearning
I am training a DL model both locally and on a VM from a private vendor. Locally I have an RTX 3080 Ti (12 GB); on the cloud, for the extra memory, I am using an Ampere A100 (80 GB).
I had the feeling that the VM GPU was a bit slow.
So I reran with the exact same hyperparameters (e.g. batch size) and again noticed that the local RTX 3080 Ti was much faster than the A100. When I measured it, it was roughly 2-3x faster (the kind of timing I did is sketched below).
Is that down to the card itself, either because it's a server GPU, or because (as I read) the 80 GB is actually 2x 40 GB connected with NVLink? Or is it standard practice for VM companies to throttle the GPU?
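For reference, a minimal PyTorch sketch of the kind of per-step timing I mean; the model, batch size, and step count below are dummy stand-ins, not my actual setup:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# Dummy stand-ins for the real model and batch (hypothetical sizes).
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)

# Warm up so lazy init and autotuning don't skew the numbers.
for _ in range(10):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

# Time a fixed number of steps; synchronize so GPU work is actually counted.
torch.cuda.synchronize()
start = time.perf_counter()
steps = 100
for _ in range(steps):
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.1f} steps/s")
```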
agentfuzzy999 t1_j3pbk38 wrote
I have trained locally and in the cloud on a variety of cards and server architectures. Depending on what model you are training, it could be down to a huge variety of reasons, but if you can fit the model on a 3080 you really aren't going to be taking advantage of the A100's huge memory. The higher clock speed of the 3080 might simply suit this model and parameter set better.
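If you want to rule out a clock or power cap on the VM, a quick sanity check is to compare what each machine reports. A minimal sketch, assuming PyTorch, device index 0, and the nvidia-smi CLI being on PATH:

```python
import subprocess
import torch

# Reported device specs as seen by PyTorch.
props = torch.cuda.get_device_properties(0)
print(props.name, props.total_memory // 2**20, "MiB,", props.multi_processor_count, "SMs")

# Live SM clock vs. its maximum, plus utilization and power cap, via nvidia-smi.
print(subprocess.run(
    ["nvidia-smi",
     "--query-gpu=clocks.sm,clocks.max.sm,utilization.gpu,power.draw,power.limit",
     "--format=csv"],
    capture_output=True, text=True).stdout)
```

Running this on both the local box and the VM during training makes it easy to see whether the A100's clocks or power limit are being held well below their maximums.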