
f_max t1_izhg9j7 wrote

If you have more than 1 GPU and your model is small enough to fit on a single GPU, distributed data parallel (DDP) is the go-to. Basically multiple model instances training in parallel, with gradients synchronized at the end of each batch. PyTorch has it built in, and TF probably does too.
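A minimal sketch of what that looks like with PyTorch's `DistributedDataParallel` (the model, data, and hyperparameters here are placeholders, not from the original comment):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process (one per GPU)
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each process holds its own full copy of the model on its own GPU
    model = nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Real training would use a DataLoader with DistributedSampler
        # so each rank sees a different shard of the data
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are all-reduced across processes here
        optimizer.step()  # every rank applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=<num_gpus> train.py`, each process trains its own replica and the backward pass keeps the copies in sync.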

2