
PassionatePossum t1_izi24ow wrote

You can still parallelize with batch gradient descent. If you use, for example, the MirroredStrategy in TensorFlow, the batch is split across multiple GPUs. The only downside is that this approach doesn't scale well if you want to train on more than one machine, since the model needs to be synced after each iteration.
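For reference, a minimal sketch of what that looks like with Keras (the model and the random dataset here are just placeholders, not anything specific to your problem):

```python
import tensorflow as tf

# Single-machine, multi-GPU data parallelism with MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The global batch is split evenly across the replicas (GPUs).
global_batch_size = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    # Variables created inside the scope are mirrored on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Placeholder dataset; replace with your real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int32))
).batch(global_batch_size)

model.fit(dataset, epochs=2)
```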

But you should think long and hard about whether training on multiple machines is really necessary, since that brings a whole new set of problems. 700 GB is not that large; we work with datasets of that size all the time. I don't know what kind of model you are trying to train, but we have a GPU server with 8 GPUs and I've never felt the need to go beyond the normal MirroredStrategy for parallelization. And should you run into the problem that you cannot fit the data onto the machine you are training on: load it over the network.

You just need to make sure that your input pipeline supports that efficiently. Shard your dataset so that you can run many concurrent I/O operations.
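Something along these lines with tf.data, assuming TFRecord shards; the file pattern and feature spec are made up, so adapt them to your data:

```python
import tensorflow as tf

# List the shard files and read several of them concurrently.
files = tf.data.Dataset.list_files("/data/train/shard-*.tfrecord", shuffle=True)

dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=16,                       # read 16 shards in parallel
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,                   # trade strict ordering for throughput
)

def parse(example_proto):
    # Hypothetical feature spec -- replace with whatever your records contain.
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(example_proto, features)
    image = tf.image.resize(tf.io.decode_jpeg(parsed["image"], channels=3), [224, 224])
    return image, parsed["label"]

dataset = (
    dataset
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)
```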

And in case scaling really is important to you, may I suggest you look into Horovod?
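The usual Horovod/Keras pattern looks roughly like this (the model and hyperparameters are placeholders, and you'd launch it with horovodrun or mpirun rather than plain python):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers via allreduce.
opt = tf.keras.optimizers.Adam(1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=opt,
)

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Each worker should read its own shard of the data, then:
# model.fit(dataset, callbacks=callbacks, epochs=10)
```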
