Submitted by Infamous_Age_7731 t3_107pcux in deeplearning
BellyDancerUrgot t1_j3nq5pn wrote
There would be a 5-8% overhead for the same GPU in a VM vs. bare-metal comparison. The A100 is significantly faster for ML workloads than a 3090 iirc, so it's probably something related to how it's set up in your case. Also, try using a single GPU instead of distributed learning if that's what you're doing. MPI might be adding overhead in your compute node.
Infamous_Age_7731 OP t1_j3nqwgy wrote
I see, thanks! In that case, I might be asking the vendor more questions.
ivan_kudryavtsev t1_j3pssyp wrote
Why so? GPUs are passed to VMs in pass-through mode, so no significant performance pitfalls should occur. I recommend OP look at CPU %steal and nvidia-smi (maybe it is a 1/7 A100 shard, not a full GPU). Run single- and multi-threaded sysbench to compare CPU and RAM. Also, your own hardware may win on PCI-E generation or dedicated bandwidth if the cloud provider uses a not-well-balanced custom build.
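Something like this covers those checks in one go (a rough sketch, assuming a Linux VM with sysbench, sysstat for mpstat, and the NVIDIA driver installed; exact flags may vary by version):

```python
import subprocess

def run(cmd):
    """Run a shell command and print its output."""
    print(f"$ {cmd}")
    print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)

# 1. Is the GPU a full A100 or a MIG slice? MIG instances are listed separately by -L.
run("nvidia-smi -L")

# 2. CPU steal time (hypervisor taking cycles away from the VM): look at the %steal column.
run("mpstat 1 5")

# 3. Single- vs multi-threaded CPU benchmark.
run("sysbench cpu --threads=1 --time=10 run")
run("sysbench cpu --threads=8 --time=10 run")

# 4. Memory write and read bandwidth.
run("sysbench memory --threads=1 --memory-oper=write run")
run("sysbench memory --threads=1 --memory-oper=read run")
```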
Infamous_Age_7731 OP t1_j3qy6qv wrote
> Run single- and multi-threaded sysbench to compare CPU and RAM
Thanks a lot for your input! I checked the CPU %steal and it seems optimal, ranging from 0.0 to 0.1 st. I also don't think it's a shard, since nvidia-smi shows the full 80 GB of memory at my disposal (unless they do some trickery). I ran a series of `sysbench` tests and found that the VM's CPU is slightly worse for single-thread performance, but what is more astounding is the RAM speed: for 1 or 8 threads the write is 0.8x slower and the read is 1.5x slower. The RAM speed drop seems to mirror the iterations-per-second drop when I train the model. I guess this might be the culprit.
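For anyone wanting to reproduce this, a rough NumPy-only bandwidth check run inside the VM and on the local machine should show the same relative gap (numbers depend on the NumPy build and allocator, so they are only meaningful as a ratio between the two machines):

```python
import time
import numpy as np

N = 1 << 28                      # ~268M float32 values, about 1 GiB
src = np.ones(N, dtype=np.float32)
dst = np.empty_like(src)

t0 = time.perf_counter()
np.copyto(dst, src)              # streams ~1 GiB read + ~1 GiB write
t1 = time.perf_counter()
gib_moved = 2 * src.nbytes / 2**30
print(f"copy bandwidth: {gib_moved / (t1 - t0):.1f} GiB/s")

t0 = time.perf_counter()
total = src.sum()                # read-dominated pass over the array
t1 = time.perf_counter()
print(f"read bandwidth: {src.nbytes / 2**30 / (t1 - t0):.1f} GiB/s")
```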
qiltb t1_j3rtytt wrote
That doesn't sound weird to me though, servers probably use much slower ECC RAM...
Infamous_Age_7731 OP t1_j3rwt7o wrote
I see, so you reckon this shouldn't be causing the issue.
ivan_kudryavtsev t1_j4c5702 wrote
RAM performance may also be affected by Meltdown and Spectre patches.
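On Linux you can see which mitigations the kernel has enabled with a quick look at sysfs (the path below exists on reasonably recent kernels; it may be absent on older ones):

```python
from pathlib import Path

# Each file reports whether the CPU is vulnerable and which mitigation is active.
vuln_dir = Path("/sys/devices/system/cpu/vulnerabilities")
for f in sorted(vuln_dir.glob("*")):
    print(f"{f.name}: {f.read_text().strip()}")
```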
BellyDancerUrgot t1_j3pw2m2 wrote
Oh, I thought maybe he was going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead, simply because it has to replicate, scatter, and gather all the gradients for every batch of every epoch.
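For context, this is roughly the extra work a data-parallel setup does per batch. The sketch below uses a PyTorch all-reduce; an MPI backend may implement it as scatter/gather instead, but the cost scales with the total gradient size either way (it assumes an already-initialized process group):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce every gradient across ranks and average them."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# Inside the training loop, after loss.backward() and before optimizer.step():
#   average_gradients(model)
# Libraries like DistributedDataParallel or Horovod do the same thing, just
# overlapped with the backward pass to hide part of the communication cost.
```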
ivan_kudryavtsev t1_j3q0c02 wrote
>Oh, I thought maybe he was going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead, simply because it has to replicate, scatter, and gather all the gradients for every batch of every epoch.
It looks like not; they were speculating about the internal design of the A100.
Infamous_Age_7731 OP t1_j3qrkhp wrote
Yes indeed, I am not doing anything in parallel. I use them separately, and I wanted to compare their internal designs, as you said.