BitShin

BitShin t1_ivsjvcr wrote

In addition to providing a smoother descent, computing the average gradient over many samples is very parallelizable. So on a modern GPU, taking the gradient at a single point vs. 50 points is not 50 times more expensive. In fact, if your model is small enough, the two could take roughly the same amount of time.
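To make the parallelism point concrete, here is a rough sketch in NumPy (not a GPU benchmark). It assumes a toy linear model with squared-error loss, and the function names, batch size of 50, and feature size of 8 are all made up for illustration. The per-sample gradients don't depend on each other, so the averaged gradient collapses into a couple of matrix products over the whole batch, which is exactly the kind of work parallel hardware handles well.

```python
import numpy as np

def per_sample_grad(w, x, y):
    # Gradient of 0.5 * (w @ x - y)**2 with respect to w, for one sample.
    return (w @ x - y) * x

def batch_grad_loop(w, X, Y):
    # Average gradient computed one sample at a time (sequential).
    return np.mean([per_sample_grad(w, x, y) for x, y in zip(X, Y)], axis=0)

def batch_grad_vectorized(w, X, Y):
    # Same average, written as matrix products over the whole batch.
    residuals = X @ w - Y            # shape: (batch,)
    return X.T @ residuals / len(Y)  # shape: (features,)

# Hypothetical toy data: 50 samples, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = rng.normal(size=50)
w = rng.normal(size=8)

g_loop = batch_grad_loop(w, X, Y)
g_vec = batch_grad_vectorized(w, X, Y)
```

Both functions return the same gradient up to floating-point error. The vectorized one does it without a Python-level loop, so the batch is handed to the underlying linear algebra routines in one go. How much wall-clock time that saves depends on the hardware and the model size.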
