Submitted by Infamous_Age_7731 t3_107pcux in deeplearning
ivan_kudryavtsev t1_j3pssyp wrote
Reply to comment by BellyDancerUrgot in Cloud VM GPU is much slower than my local GPU by Infamous_Age_7731
Why so? GPUs are passed to VMs in pass-through mode, so no significant performance pitfalls should occur. I recommend OP look at CPU %steal and at nvidia-smi (maybe it is a 1/7 MIG shard of an A100, not a full GPU), and run single- and multi-threaded sysbench to compare CPU and RAM. Also, your own hardware may outperform the cloud on PCI-E generation or dedicated bandwidth if the provider uses a poorly balanced custom build.
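A minimal sketch of those checks (assuming a Linux guest with sysbench, sysstat, and the NVIDIA driver installed; exact option names may differ by version):

```bash
# CPU steal time: the %steal / "st" column should stay near zero
mpstat 1 5

# Is the GPU a full A100 or a MIG slice? MIG devices show up as e.g. "MIG 1g.10gb"
nvidia-smi -L
nvidia-smi --query-gpu=name,memory.total --format=csv

# CPU benchmark, single- and multi-threaded
sysbench cpu --cpu-max-prime=20000 --threads=1 run
sysbench cpu --cpu-max-prime=20000 --threads=8 run
```

Running the same commands locally and on the VM and comparing the events/sec and latency figures makes the gap easy to quantify.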
Infamous_Age_7731 OP t1_j3qy6qv wrote
> single- and multi-threaded sysbench to compare CPU and RAM
Thanks a lot for your input! I checked the CPU %steal and it seems fine, ranging from 0.0 to 0.1 st. I also don't think it's a shard, since nvidia-smi shows the full 80 GB of memory at my disposal (unless they do some trickery). I ran a series of `sysbench` tests and found that the VM's CPU is slightly worse in single-threaded performance, but what is more striking is the RAM speed: with 1 or 8 threads, writes are about 0.8x slower and reads about 1.5x slower. The drop in RAM speed seems to mirror the drop in iterations per second when I train the model, so I guess this might be the culprit.
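For anyone who wants to reproduce that comparison, something along these lines should do it (a sketch; the sysbench memory defaults for block size and total size may need tuning):

```bash
# RAM write bandwidth, 1 and 8 threads
sysbench memory --memory-oper=write --threads=1 run
sysbench memory --memory-oper=write --threads=8 run

# RAM read bandwidth, 1 and 8 threads
sysbench memory --memory-oper=read --threads=1 run
sysbench memory --memory-oper=read --threads=8 run
```

The MiB/sec figure in the output is the number to compare between the local machine and the VM.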
qiltb t1_j3rtytt wrote
That doesn't sound odd to me though, servers probably use much slower ECC RAM...
Infamous_Age_7731 OP t1_j3rwt7o wrote
I see, so you reckon this shouldn't be causing the issue.
ivan_kudryavtsev t1_j4c5702 wrote
RAM performance may also be affected by the Meltdown and Spectre patches.
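On Linux, the active mitigations can be inspected directly (a quick check, assuming a kernel recent enough to expose this sysfs directory):

```bash
# Lists each known CPU vulnerability and the mitigation currently applied
grep . /sys/devices/system/cpu/vulnerabilities/*
```

If the VM shows mitigations that the local machine has disabled, that could explain part of the memory-bandwidth gap.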
BellyDancerUrgot t1_j3pw2m2 wrote
Oh I thought maybe he is going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter and gather all the gradients per batch every epoch.
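If one did want to quantify that communication overhead on a 2-GPU machine, NVIDIA's nccl-tests suite is a common tool (a sketch; assumes https://github.com/NVIDIA/nccl-tests has been cloned and built with make):

```bash
# Benchmark the all-reduce collective used for gradient averaging,
# sweeping message sizes from 8 B to 128 MB across 2 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

The reported bus bandwidth is a rough upper bound on how quickly gradients can be exchanged each step.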
ivan_kudryavtsev t1_j3q0c02 wrote
>Oh I thought maybe he is going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter and gather all the gradients per batch every epoch.
It looks like no; they were speculating about the internal design of the A100.
Infamous_Age_7731 OP t1_j3qrkhp wrote
Yes indeed, I am not doing anything in parallel. I use them separately and I wanted to compare their internal design as you said.