Submitted by alexnasla t3_yikumt in MachineLearning
Kon-kkk t1_iuj7nyz wrote
- What framework?
- What kind of network/model?
- Try to reduce CPU-GPU data transfers during training.

Try Nsight Systems to profile one iteration (both forward and backward) and see whether there are long idle gaps between GPU kernels. Idle gaps mean GPU utilization is low and many operations are running on the CPU side. If you are using TensorFlow, you can enable XLA to accelerate training; I believe PyTorch has a similar DL compiler. You can also enable AMP (automatic mixed precision / fp16) to accelerate training.
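For reference, Nsight Systems can wrap a training script with something like `nsys profile python train.py`, and in PyTorch AMP is enabled via `torch.cuda.amp`. A minimal AMP training-loop sketch — the model, optimizer, and data below are placeholders, not the OP's code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; substitute your own model, optimizer, and loader.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(64, 128), torch.randint(0, 10, (64,))) for _ in range(100)]

scaler = torch.cuda.amp.GradScaler()
for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()          # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
    scaler.update()
```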
alexnasla OP t1_iuj8se6 wrote
Oh my bad!
- PyTorch
- It's 4 sequential layers: Dense + Conv1d + LSTM + Dense (see the sketch below)
- Hmm, any resources you know of that I can check out to learn more about doing that?
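For context, a hypothetical PyTorch sketch of such a Dense + Conv1d + LSTM + Dense stack — all layer sizes and shapes here are assumptions, not the OP's actual configuration:

```python
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    # Hypothetical Dense -> Conv1d -> LSTM -> Dense stack; sizes are assumed.
    def __init__(self, in_features=32, hidden=64, out_features=10):
        super().__init__()
        self.fc_in = nn.Linear(in_features, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc_out = nn.Linear(hidden, out_features)

    def forward(self, x):                    # x: (batch, seq, in_features)
        h = self.fc_in(x)                    # (batch, seq, hidden)
        # Conv1d expects (batch, channels, seq), so transpose around it
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)                  # (batch, seq, hidden)
        return self.fc_out(h[:, -1])         # last time step -> (batch, out_features)
```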
BlazeObsidian t1_iujbbdu wrote
Are you sure your model is running on the GPU? See https://towardsdatascience.com/pytorch-switching-to-the-gpu-a7c0b21e8a99, or if you can see GPU utilisation it might be simpler to verify that way.
If you are not explicitly moving your model to the GPU, I think it's running on the CPU. Also, how long is it taking? Do you have a specific time that you compared the performance with?
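A minimal check, with a placeholder model standing in for the real one:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)      # hypothetical placeholder model
model = model.to("cuda")        # explicitly move parameters to the GPU

# Both the parameters and the input batch must live on the GPU.
print(torch.cuda.is_available())          # is a CUDA device visible at all?
print(next(model.parameters()).device)    # expect: cuda:0
x = torch.randn(32, 128, device="cuda")
print(model(x).device)                    # expect: cuda:0
```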
alexnasla OP t1_iujbukx wrote
I'm pretty sure it's running on the GPU. I don't remember what the GPU utilization was though, I'll take a look when I get a chance.
The test that I mentioned ran for 8 hours.
K-o-s-l-s t1_iujldkh wrote
What are you using to log and monitor your jobs? Knowing CPU, RAM, and GPU utilisation will make this a lot easier to understand.
I agree with the poster above; no appreciable speed-up switching between a K80 and an A100 makes me suspect that the GPU is not being utilised at all.
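For a quick look without a full monitoring stack, `watch -n 1 nvidia-smi` works from a shell; from inside the training process, a sketch like this can report the same numbers (`torch.cuda.utilization` depends on the `pynvml` package being installed):

```python
import torch

# Quick utilization/memory snapshot from inside the process.
if torch.cuda.is_available():
    print(f"GPU util:       {torch.cuda.utilization(0)}%")
    print(f"VRAM allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"VRAM reserved:  {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
```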
alexnasla OP t1_iujn3mm wrote
Ok, so what I did was actually max out the input buffers to the most the GPU can handle without crashing. So basically fully saturating the VRAM.
JustOneAvailableName t1_iujqrr1 wrote
> It's 4 sequential layers: Dense + Conv1d + LSTM + Dense
I think this is not enough to saturate the A100. Try 10x-ing the batch size by just repeating the data. It's useless for training, but it should increase GPU utilization without increasing disk utilization. Handy to confirm the bottleneck.
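A sketch of that diagnostic, with hypothetical tensor shapes:

```python
import torch

# Hypothetical batch: (batch, seq, features) inputs and per-sample labels.
x = torch.randn(64, 100, 32, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

# Repeat along the batch dimension: 64 -> 640. This inflates GPU work
# without touching the disk/data pipeline, so if utilization still stays
# low, the bottleneck is likely not the input pipeline.
x10 = x.repeat(10, 1, 1)
y10 = y.repeat(10)
```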