Submitted by Infamous_Age_7731 t3_107pcux in deeplearning
I am training a DL model both locally and on a VM from a private vendor. Locally I have an RTX 3080 Ti (12 GB); on the cloud, for more memory, I am using an Ampere A100 (80 GB).
I had the feeling that the VM GPU was a bit slow, so I ran the exact same setup with the exact same hyper-params (e.g. batch size) on both, and again the local RTX 3080 Ti was much faster than the A100. When I actually measured it, it was roughly 2-3x faster.
Is that because of the card itself? (It's a server GPU, and I saw that the 80 GB is actually 2x40 GB connected with NVLink; could that be it?) Or is it standard practice for VM companies to throttle the GPU?
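For reference, this is roughly the kind of timing comparison I mean. It's a minimal sketch assuming PyTorch, with a placeholder model, a fixed batch size, and synthetic data kept on the GPU so data loading can't hide a difference between the two cards:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# Placeholder model and batch size; swap in whatever you actually train.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 224 * 224, 10),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data already on the GPU, so disk/CPU speed is out of the picture.
x = torch.randn(32, 3, 224, 224, device=device)
y = torch.randint(0, 10, (32,), device=device)

# Warm-up so allocation and cuDNN autotuning don't skew the timing.
for _ in range(10):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

torch.cuda.synchronize()
t0 = time.time()
steps = 100
for _ in range(steps):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
torch.cuda.synchronize()

print(f"{steps / (time.time() - t0):.1f} steps/sec on {torch.cuda.get_device_name(0)}")
```

Running the same script on both machines should make it clear whether the gap is in the GPU itself or somewhere else in the pipeline.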
susoulup t1_j3npxch wrote
I'm not far enough into deep learning to be using cloud GPUs as well as local ones. I have benchmarked one locally, and I'm not sure I did it properly, so I don't really have any advice. My question is whether ECC memory plays a factor in how the data is processed and stored? I thought that was one of the advantages of using a workstation GPU, but I could be way off.
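If it helps, here's a rough way to check whether ECC is even enabled on each card. It's just a sketch that assumes nvidia-smi is on the PATH, and I'm not certain the query field name is identical on every driver version:

```python
import subprocess

# Ask nvidia-smi for the card name and current ECC mode.
# ECC is usually on by default for datacenter cards (A100) and unavailable
# on GeForce cards (3080 Ti), so the two machines will likely differ here.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,ecc.mode.current", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```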