
BellyDancerUrgot t1_j3nq5pn wrote

There would be a 5-8% overhead for the same GPU in a bare VM vs. physical comparison. The A100 is significantly faster for ML workloads than a 3090, iirc, so it's probably something related to how it's set up in your case. Also, try using a single GPU instead of distributed learning if you are. MPI might be adding overhead in your compute node.

2

Infamous_Age_7731 OP t1_j3nqwgy wrote

I see, thanks! In that case, I might be asking the vendor more questions.

2

ivan_kudryavtsev t1_j3pssyp wrote

Why so? GPUs are passed to the VM in pass-through mode, so no significant performance pitfalls should happen. I recommend OP look at CPU %steal and nvidia-smi (maybe it is a 1/7 shard of an A100, not a full GPU). Run single- and multi-threaded sysbench to compare CPU and RAM. Also, your own hardware may come out ahead on PCI-E generation or dedicated bandwidth if the cloud provider uses a not-well-balanced custom build.
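
A minimal sketch of those checks, assuming `sysbench` and `nvidia-smi` are installed on the VM (the flags are one common way to invoke them, not necessarily OP's exact setup):

```python
# Sketch of the diagnostics suggested above: GPU identity, CPU %steal,
# and single-/multi-threaded sysbench for CPU and RAM.
import subprocess

def run(cmd):
    """Run a shell command and print its output."""
    print(f"$ {cmd}")
    print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)

# Is the instance a full A100 or a sliced one? Check the reported name and memory.
run("nvidia-smi --query-gpu=name,memory.total --format=csv")

# CPU %steal shows up in the 'st' column of vmstat.
run("vmstat 1 3")

# Single- and multi-threaded CPU benchmark.
run("sysbench cpu --threads=1 run")
run("sysbench cpu --threads=8 run")

# Memory write and read bandwidth.
run("sysbench memory --threads=8 --memory-oper=write run")
run("sysbench memory --threads=8 --memory-oper=read run")
```

Running the same script on the bare-metal 3090 box and on the VM makes the two environments directly comparable.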

2

Infamous_Age_7731 OP t1_j3qy6qv wrote

> multi-threaded sysbench to compare CPU and RAM

Thanks a lot for your input! I checked the CPU %steal and it seems fine, ranging from 0.0 to 0.1 st. I also don't think it's a shard, since nvidia-smi shows the full 80 GB of memory at my disposal (unless they do some trickery). I then ran a series of `sysbench` tests and found that the VM's CPU is slightly worse in single-thread performance, but what is more astounding is the RAM speed: for 1 or 8 threads, the write is 0.8x slower and the read is 1.5x slower. The RAM speed drop seems to mirror the iterations-per-second drop when I train the model, so I guess this might be the culprit.
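
As a quick cross-check of that RAM gap, independent of sysbench, something like this numpy copy test can be run on both machines (the 2 GiB buffer size is an arbitrary choice for illustration):

```python
# Rough memory-bandwidth check: time a large streaming copy with numpy.
import time
import numpy as np

N = 2 * 1024**3 // 8            # number of float64 elements in ~2 GiB
src = np.ones(N)                # np.ones touches the pages, so they are really allocated
dst = np.empty(N)

t0 = time.perf_counter()
dst[:] = src                    # streaming copy: reads src, writes dst
elapsed = time.perf_counter() - t0

gib_moved = 2 * N * 8 / 1024**3    # bytes read + bytes written
print(f"copy bandwidth: {gib_moved / elapsed:.1f} GiB/s")
```

If the VM-to-bare-metal ratio here matches the sysbench numbers, RAM bandwidth really is the bottleneck rather than the GPU itself.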

2

qiltb t1_j3rtytt wrote

That doesn't sound weird to me though, servers probably use much slower ECC RAM...

1

BellyDancerUrgot t1_j3pw2m2 wrote

Oh, I thought maybe he was going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter, and gather all the gradients for every batch of every epoch.
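
A minimal sketch of that per-batch gradient synchronization, with a numpy array standing in for the model's gradients (illustrative only, not OP's training code):

```python
# Per-batch gradient averaging as done in MPI-based data parallelism.
# Run with e.g.: mpirun -n 2 python allreduce_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

local_grads = np.random.rand(25_000_000).astype(np.float32)  # ~100 MB of "gradients"
summed = np.empty_like(local_grads)

# Every rank sends its gradients and receives the element-wise sum. This
# collective (plus any host<->device copies around it) runs once per batch,
# which is exactly the overhead described above.
comm.Allreduce(local_grads, summed, op=MPI.SUM)
avg_grads = summed / comm.Get_size()

if comm.Get_rank() == 0:
    print(f"averaged {local_grads.nbytes / 1e6:.0f} MB of gradients "
          f"across {comm.Get_size()} ranks")
```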

1

ivan_kudryavtsev t1_j3q0c02 wrote

> Oh, I thought maybe he was going for distributed learning since he has access to 2 GPUs. In that case MPI has some overhead simply because it has to replicate, scatter, and gather all the gradients for every batch of every epoch.

It looks like no; they were speculating about the internal design of the A100.

1

Infamous_Age_7731 OP t1_j3qrkhp wrote

Yes indeed, I am not doing anything in parallel. I use them separately and I wanted to compare their internal design as you said.

1