Submitted by IdeaEnough443 t3_zg6s6d in MachineLearning

I have a big dataset that I would like to train on, so my thought is to do distributed training. I am currently setting up MultiWorkerMirroredStrategy on TensorFlow and I find it hard to use, even with https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor

https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy

So I was wondering: are there other recommended ways of doing NN training when you have a big dataset?
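
For reference, this is roughly what I'm trying to set up (a simplified sketch only; it assumes TF_CONFIG is already set on every worker, and the model/data here are placeholders):

```python
import tensorflow as tf

# Stable name in recent TF; tf.distribute.experimental.MultiWorkerMirroredStrategy
# in older versions. Reads the cluster spec and task index from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    # Placeholder pipeline; the real one would stream from sharded files.
    x = tf.random.normal([1024, 32])
    y = tf.random.uniform([1024], maxval=2, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(64).repeat()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(make_dataset(), steps_per_epoch=100, epochs=5)
```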

0

Comments


f_max t1_izhg9j7 wrote

If you have more than one GPU and your model is small enough to fit on one GPU, distributed data parallel is the go-to. Basically multiple model instances training, with gradients synchronized at the end of each batch. PyTorch has it integrated, and probably so does TF.
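
A minimal sketch of the PyTorch version (DistributedDataParallel); it assumes you launch with something like `torchrun --nproc_per_node=<num_gpus> train.py`, and the model/data are stand-ins:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(32, 2).cuda(local_rank)     # stand-in model
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset)               # each rank sees a distinct shard
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):
    sampler.set_epoch(epoch)                        # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        opt.zero_grad()
        loss_fn(model(x), y).backward()             # gradients all-reduced across ranks here
        opt.step()

dist.destroy_process_group()
```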

2

linearmodality t1_izgkb2p wrote

How big is your dataset? The right answer will depend on the size.

1

IdeaEnough443 OP t1_izgr37h wrote

Greater than 700 GB, potentially 10 TB scale. It won't fit in a single machine's memory.

1

ab3rratic t1_izgrfxy wrote

Mini-batch gradient descent (the usual method) does not require the entire dataset to fit into memory -- only the current batch.
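
For example, with a tf.data pipeline that streams records from disk, only the current batch plus a small prefetch buffer is ever materialized (a sketch; the file pattern and feature spec here are made up):

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("/data/train/*.tfrecord")

def parse(example_proto):
    spec = {
        "features": tf.io.FixedLenFeature([32], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(example_proto, spec)
    return parsed["features"], parsed["label"]

dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)                 # shuffle buffer, not the whole dataset
    .batch(256)                      # only this much data is held per step
    .prefetch(tf.data.AUTOTUNE)
)
```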

1

IdeaEnough443 OP t1_izgwvp5 wrote

But wouldn't the training process be slower than with parallelization? Is mini-batch gradient descent the industry standard for handling large datasets in NN training?

1

PassionatePossum t1_izi24ow wrote

You can still parallelize using mini-batch gradient descent. If, for example, you use the MirroredStrategy in TensorFlow, you split up each batch between multiple GPUs. The only downside is that this approach doesn’t scale well if you want to train on more than one machine, since the model needs to be synced after each iteration.
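
Roughly like this (a single-machine, multi-GPU sketch; model, loss, and batch size are placeholders):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # uses all visible GPUs by default
print("replicas:", strategy.num_replicas_in_sync)

# The global batch is split across replicas; scale it with the GPU count.
global_batch = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# train_ds would be a tf.data pipeline batched with `global_batch`:
# model.fit(train_ds, epochs=10)
```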

But you should think long and hard about whether training on multiple machines is really necessary, since that brings a whole new set of problems. 700 GB is not that large; we do that all the time. I don’t know what kind of model you are trying to train, but we have a GPU server with 8 GPUs and I’ve never felt the need to go beyond the normal MirroredStrategy for parallelization. And should you run into the problem that you cannot fit the data onto the machine where you are training: load it over the network.

You just need to make sure that your input pipeline supports that efficiently. Shard your dataset so you can have many concurrent I/O operations.
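
Something along these lines (paths are illustrative, parsing omitted; the shard step only matters once you go multi-worker):

```python
import tensorflow as tf

def make_dataset(num_workers=1, worker_index=0, batch_size=256):
    files = tf.data.Dataset.list_files("/data/train/shard-*.tfrecord", shuffle=True)
    files = files.shard(num_workers, worker_index)   # each worker reads a disjoint file subset
    ds = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=16,                             # read 16 files concurrently
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # .map(parse, ...) would go here
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
```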

And in case scaling across machines really is important to you, may I suggest you look into Horovod?
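
The basic Keras integration looks roughly like this (launched with e.g. `horovodrun -np 8 python train.py`; model, learning rate, and dataset are placeholders):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# model.fit(train_ds.shard(hvd.size(), hvd.rank()), callbacks=callbacks, epochs=10)
```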

2

SwordOfVarjo t1_izgx533 wrote

It's the industry standard for NN training, period. Your dataset isn't that big; just train on one machine.

1

IdeaEnough443 OP t1_izgyjq8 wrote

Our dataset takes close to a day to finish training. If we have 5x the data, that won't work for our application; that's why we are trying to see if distributed training would help lower training time.

1