Submitted by Infamous_Age_7731 t3_107pcux in deeplearning
ivan_kudryavtsev t1_j3pssyp wrote
Reply to comment by BellyDancerUrgot in Cloud VM GPU is much slower than my local GPU by Infamous_Age_7731
Why so? GPUs are passed to VMs in pass-through mode, so no significant performance pitfalls should occur. I recommend OP look at CPU %steal and at nvidia-smi (maybe it is a 1/7 MIG shard of an A100, not a full GPU), and run single- and multi-threaded sysbench to compare CPU and RAM. Also, your own hardware may outperform the cloud on PCI-E generation or dedicated bandwidth if the provider uses a poorly balanced custom build.
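A minimal sketch of those checks (assuming a Linux guest with sysbench, sysstat, and the NVIDIA driver installed; exact option names may differ by version):

```bash
# CPU steal time: the %steal / "st" column should stay near zero
mpstat 1 5

# Is the GPU a full A100 or a MIG slice? MIG devices show up as e.g. "MIG 1g.10gb"
nvidia-smi -L
nvidia-smi --query-gpu=name,memory.total --format=csv

# CPU benchmark, single- and multi-threaded
sysbench cpu --cpu-max-prime=20000 --threads=1 run
sysbench cpu --cpu-max-prime=20000 --threads=8 run
```

Running the same commands locally and on the VM and comparing the events/sec and latency figures makes the gap easy to quantify.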
Infamous_Age_7731 OP t1_j3qy6qv wrote
> single- and multi-threaded sysbench to compare CPU and RAM
Thanks a lot for your input! I checked the CPU %steal and it seems fine, ranging from 0.0 to 0.1 st. I also don't think it's a shard, since nvidia-smi shows the full 80 GB of memory at my disposal (unless they do some trickery). I ran a series of `sysbench` tests and found that the VM's CPU is slightly worse in single-threaded performance, but what is more striking is the RAM speed: with 1 or 8 threads, writes are about 0.8x slower and reads about 1.5x slower. The drop in RAM speed seems to mirror the drop in iterations per second when I train the model, so I guess this might be the culprit.
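For anyone who wants to reproduce that comparison, something along these lines should do it (a sketch; the sysbench memory defaults for block size and total size may need tuning):

```bash
# RAM write bandwidth, 1 and 8 threads
sysbench memory --memory-oper=write --threads=1 run
sysbench memory --memory-oper=write --threads=8 run

# RAM read bandwidth, 1 and 8 threads
sysbench memory --memory-oper=read --threads=1 run
sysbench memory --memory-oper=read --threads=8 run
```

The MiB/sec figure in the output is the number to compare between the local machine and the VM.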
qiltb t1_j3rtytt wrote
That doesn't sound odd to me though, servers probably use much slower ECC RAM...
Infamous_Age_7731 OP t1_j3rwt7o wrote
I see, so you reckon this shouldn't be causing the issue.
ivan_kudryavtsev t1_j4c5702 wrote
RAM performance may also be affected by the Meltdown and Spectre patches.
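On Linux, the active mitigations can be inspected directly (a quick check, assuming a kernel recent enough to expose this sysfs directory):

```bash
# Lists each known CPU vulnerability and the mitigation currently applied
grep . /sys/devices/system/cpu/vulnerabilities/*
```

If the VM shows mitigations that the local machine has disabled, that could explain part of the memory-bandwidth gap.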
BellyDancerUrgot t1_j3pw2m2 wrote
Oh I thought maybe he is going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter and gather all the gradients per batch every epoch.
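If one did want to quantify that communication overhead on a 2-GPU machine, NVIDIA's nccl-tests suite is a common tool (a sketch; assumes https://github.com/NVIDIA/nccl-tests has been cloned and built with make):

```bash
# Benchmark the all-reduce collective used for gradient averaging,
# sweeping message sizes from 8 B to 128 MB across 2 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

The reported bus bandwidth is a rough upper bound on how quickly gradients can be exchanged each step.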
ivan_kudryavtsev t1_j3q0c02 wrote
>Oh I thought maybe he is going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter and gather all the gradients per batch every epoch.
It looks like no; they were speculating about the internal design of the A100.
Infamous_Age_7731 OP t1_j3qrkhp wrote
Yes indeed, I am not doing anything in parallel. I use them separately and I wanted to compare their internal design as you said.