Submitted by alexnasla t3_yikumt in MachineLearning
Hey there,
So I'm trying to figure out how to significantly speed up my training (aiming for roughly 10x) and I'm trying to work out what's going on here. I'm using PyTorch as the framework, with four sequential layers: Dense + Conv1d + LSTM + Dense. I have a batch size of 80,000, and running it on a K80 vs. an A100 I only saw about a 14% increase in performance: in the same time frame, the K80 completed about 1,400 epochs and the A100 about 1,600. To me this suggests that what I'm doing is NOT bound by the GPU at all, since the hardware alone should have accounted for something like a 30x increase in performance, yeah? I don't think RAM is the issue; the A100 has 80GB of HBM2 VRAM, more than I ever use. So if it's not GPU power and not RAM, is it CPU or storage?
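In case it helps, here's a rough sketch of the model stack in PyTorch (the sizes below are just placeholders, not my actual hyperparameters):

```python
import torch
import torch.nn as nn

# Placeholder sizes -- not the real hyperparameters, just the shape of the stack.
SEQ_LEN, N_FEATURES, HIDDEN = 100, 32, 64

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense_in = nn.Linear(N_FEATURES, HIDDEN)           # Dense
        self.conv = nn.Conv1d(HIDDEN, HIDDEN, kernel_size=3,
                              padding=1)                         # Conv1d over the time axis
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)    # LSTM
        self.dense_out = nn.Linear(HIDDEN, 1)                    # Dense

    def forward(self, x):                   # x: (batch, seq_len, n_features)
        x = torch.relu(self.dense_in(x))    # (batch, seq_len, hidden)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # Conv1d wants channels first
        x, _ = self.lstm(x)                 # (batch, seq_len, hidden)
        return self.dense_out(x[:, -1])     # predict from the last time step

model = Net()
print(model(torch.randn(8, SEQ_LEN, N_FEATURES)).shape)  # torch.Size([8, 1])
```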
It seems like I need to parallelize the training in order to get the speed I'm looking for?
Anyone have any insight?
fnbr t1_iuj8h11 wrote
Have you profiled your code? That would be the first thing I would do.
What sort of GPU utilization are you getting?
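Since you're on PyTorch, something like the built-in profiler plus watching `nvidia-smi` while training should show where the time is going. A rough sketch (the model, loss and loader here are dummies just so the snippet runs on its own; swap in your real ones):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Dummy stand-ins so the snippet is self-contained -- use your real model, loss and loader.
model = nn.Linear(32, 1).cuda()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1)),
    batch_size=1_000,
)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (x, y) in enumerate(train_loader):
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step >= 5:  # a handful of steps is enough for a profile
            break

# If most of the time sits in data loading / CPU ops rather than CUDA kernels,
# the GPU isn't your bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```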
It's likely you're bottlenecked by the input data pipeline; for supervised learning, that's often the case.
I'm happy to offer suggestions for feeding data in if you're using Tensorflow/JAX.
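For example, in TensorFlow the usual first step is to parallelize the preprocessing and prefetch batches in the tf.data pipeline, so the CPU prepares the next batch while the GPU is busy training. A minimal sketch with a made-up dataset and parse function:

```python
import tensorflow as tf

# Made-up pipeline -- the point is num_parallel_calls + prefetch,
# which overlap data preparation with the training steps.
def parse_fn(x):
    return tf.cast(x, tf.float32) / 255.0, tf.zeros([])

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.zeros([10_000, 32], dtype=tf.uint8))
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(1_000)
    .prefetch(tf.data.AUTOTUNE)  # keep the next batches ready while the GPU trains
)

for x, y in dataset.take(2):
    print(x.shape, y.shape)
```

If you stay on PyTorch, the DataLoader has analogous knobs (num_workers, pin_memory, prefetch_factor) worth tuning for the same reason.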