Submitted by alexnasla t3_yikumt in MachineLearning
Hey there,
So I'm trying to figure out how to significantly speed up my training (aiming for roughly 10x) and I'm trying to work out what's going on here. I'm using PyTorch as the framework, with four sequential layers: Dense + Conv1d + LSTM + Dense. I have a batch size of 80,000, and running it on a K80 vs. an A100 I only saw about a 14% increase in performance: in the same time frame, the K80 completed about 1,400 epochs and the A100 about 1,600. To me this suggests that what I'm doing is NOT bound by the GPU at all, since the hardware alone should have accounted for something like a 30x increase in performance, yeah? I don't think RAM is the issue; the A100 has 80GB of HBM2 VRAM, more than I ever use. So if it's not GPU power and not RAM, is it CPU or storage?
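In case it helps, here's a rough sketch of the model stack in PyTorch (the sizes below are just placeholders, not my actual hyperparameters):

```python
import torch
import torch.nn as nn

# Placeholder sizes -- not the real hyperparameters, just the shape of the stack.
SEQ_LEN, N_FEATURES, HIDDEN = 100, 32, 64

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense_in = nn.Linear(N_FEATURES, HIDDEN)           # Dense
        self.conv = nn.Conv1d(HIDDEN, HIDDEN, kernel_size=3,
                              padding=1)                         # Conv1d over the time axis
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)    # LSTM
        self.dense_out = nn.Linear(HIDDEN, 1)                    # Dense

    def forward(self, x):                   # x: (batch, seq_len, n_features)
        x = torch.relu(self.dense_in(x))    # (batch, seq_len, hidden)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # Conv1d wants channels first
        x, _ = self.lstm(x)                 # (batch, seq_len, hidden)
        return self.dense_out(x[:, -1])     # predict from the last time step

model = Net()
print(model(torch.randn(8, SEQ_LEN, N_FEATURES)).shape)  # torch.Size([8, 1])
```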
It seems like I need to parallelize the training in order to get the speed I'm looking for?
Anyone have any insight?
fnbr t1_iuj8h11 wrote
Have you profiled your code? That would be the first thing I would do.
What sort of GPU utilization are you getting?
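Since you're on PyTorch, something like the built-in profiler plus watching `nvidia-smi` while training should show where the time is going. A rough sketch (the model, loss and loader here are dummies just so the snippet runs on its own; swap in your real ones):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Dummy stand-ins so the snippet is self-contained -- use your real model, loss and loader.
model = nn.Linear(32, 1).cuda()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1)),
    batch_size=1_000,
)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (x, y) in enumerate(train_loader):
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step >= 5:  # a handful of steps is enough for a profile
            break

# If most of the time sits in data loading / CPU ops rather than CUDA kernels,
# the GPU isn't your bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```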
It's likely you're bottlenecked by the input data pipeline; for supervised learning, that's often the case.
I'm happy to offer suggestions for feeding data in if you're using Tensorflow/JAX.
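For example, in TensorFlow the usual first step is to parallelize the preprocessing and prefetch batches in the tf.data pipeline, so the CPU prepares the next batch while the GPU is busy training. A minimal sketch with a made-up dataset and parse function:

```python
import tensorflow as tf

# Made-up pipeline -- the point is num_parallel_calls + prefetch,
# which overlap data preparation with the training steps.
def parse_fn(x):
    return tf.cast(x, tf.float32) / 255.0, tf.zeros([])

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.zeros([10_000, 32], dtype=tf.uint8))
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(1_000)
    .prefetch(tf.data.AUTOTUNE)  # keep the next batches ready while the GPU trains
)

for x, y in dataset.take(2):
    print(x.shape, y.shape)
```

If you stay on PyTorch, the DataLoader has analogous knobs (num_workers, pin_memory, prefetch_factor) worth tuning for the same reason.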