Comments

ClearlyCylindrical t1_iqna0cr wrote

If it were possible to do full batch all the time, minibatches would likely still be used. The stochasticity created by minibatch gradient descent generally improves a model's generalisation performance.

26

Ephemeral_Epoch t1_iqnscns wrote

Seems like you could approximate a minibatch with a full batch + noise? Maybe there's a better noising procedure when using full batch gradients.
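Something like the following sketch, maybe (PyTorch-style; the function name, the isotropic Gaussian noise, and the (1/b - 1/N) scaling are illustrative assumptions on my part, not an established recipe):

    import torch

    def noisy_full_batch_step(model, loss_fn, X, y, lr=0.1, batch_size=64):
        # Full-batch gradient over the entire dataset (X, y).
        model.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()

        # Crudely mimic minibatch noise: the variance of a size-b minibatch
        # gradient scales roughly like (1/b - 1/N), so inject isotropic
        # Gaussian noise of that magnitude into each parameter's gradient.
        N = X.shape[0]
        noise_scale = (1.0 / batch_size - 1.0 / N) ** 0.5
        with torch.no_grad():
            for p in model.parameters():
                p.grad.add_(noise_scale * torch.randn_like(p.grad))
                p.add_(p.grad, alpha=-lr)  # plain gradient-descent update
        return loss.item()

The catch, as the reply below points out, is that real minibatch noise has a data-dependent covariance, which isotropic Gaussian noise does not capture.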

5

SNAPscientist t1_iqr3sej wrote

Capturing the distribution characteristics of high-dimensional data is very hard. In fact, if we could do that well, we might be able to use classic Bayesian techniques for many NN problems, which would be more principled and interpretable. Any noise one would end up adding by hand is unlikely to introduce the kind of stochasticity that sampling on real data (using minibatches or similar procedures) does. Getting the distribution wrong would likely mean poor generalization.

2

fasttosmile t1_iqrel80 wrote

This is wrong, see: https://www.youtube.com/watch?v=kcVWAKf7UAg

The real reason is that it's just faster to train on smaller batches (because the steps are quicker).

2

ClearlyCylindrical t1_iqrmrxz wrote

Yes, that too, although my explanation wasn't incorrect; it was just incomplete, right?

1

fasttosmile t1_iqrolwa wrote

For a while there was a belief that the stochasticity was key to good performance (one paper from 2016 supports that hypothesis). Your framing makes it sound like that is still the case - you suggest no other reason for not doing full-batch descent - and I think it's important to point out that it isn't.

1

UnusualClimberBear t1_iqna2rw wrote

The full gradient does not work well for NNs. Plus, Adam keeps a coarse estimate of the curvature, so it is closer to a second-order method, even if you can find some functions where that estimate is not good.
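For reference, the standard Adam update (with \hat{m}_t, \hat{v}_t the bias-corrected first and second moment estimates of the gradient g_t):

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

The division by \sqrt{\hat{v}_t} acts as a diagonal preconditioner, which is the coarse curvature estimate referred to above.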

8

dasayan05 t1_iqnltrg wrote

Mini-batches are not here just for memory limitations. They inject noise into the optimization, which helps escape local minima and explore the loss landscape.

3

029187 OP t1_iqp6s79 wrote

what if, as another poster said, we did full batch but also injected noise into it?

1

dasayan05 t1_iqp8hnf wrote

Possible, but what is the advantage of that? Even if we did find a way to explicitly noise the data/gradient, we are still better off with mini-batches, as they offer lower memory consumption.

2

029187 OP t1_iqrinm2 wrote

If it's only as good, then it has no benefit. But if it ends up being better, then it is useful for situations where we have enough memory.

https://arxiv.org/abs/2103.17182

In this paper, the authors claim they might have found interesting ways to make it better.

1

Red-Portal t1_iqpczjb wrote

People have tried it, and so far no one has been able to achieve the same effect. It's still somewhat of an open research problem.

1

gdahl t1_iqpf8j8 wrote

Adam is more likely to outperform steepest descent (full batch GD) in the full batch setting than it is to outperform SGD at batch size 1.

2

suflaj t1_iqns8dp wrote

To add to what others have said, you would still likely want mini-batches to better track progress. Even if we had infinite memory, there would still be a limit to how fast you can process information (even at physical extremes), so you would not be able to do these operations instantly. Unless there were significant drawbacks to using minibatches, you'd probably prefer minibatches, with seconds or minutes per update, over a hanging loop that updates only every X hours.

1

Creepy-Tackle-944 t1_iqow79w wrote

Hard to answer. A few years ago my answer would have been a resounding "hell no"; back in those days a batch size of 64 was considered large.

Today, training configurations of top-performing models commonly use batches in the ballpark of 4096 images, which I never thought I would see. This kind of shows that batch size does not really exist in a vacuum but rather coexists with other parameters. For efficiency, doing everything in one batch would be desirable since everything is in RAM. However, actually doing so would require coming up with an entirely new set of hyperparameters.

Also, gradient accumulation is a thing, and you can theoretically run a whole training epoch as a single batch without going OOM, but nobody has found that to be effective yet.
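For anyone unfamiliar, a minimal sketch of that idea in PyTorch-style code (assuming a standard DataLoader and optimizer; this is just the mechanics, not a claim that it works well):

    def full_epoch_as_one_update(model, loss_fn, optimizer, loader):
        # Gradient accumulation: stream small chunks through the model, but
        # take a single optimizer step per epoch, so the effective batch is
        # the whole epoch while only one chunk is ever in memory.
        optimizer.zero_grad()
        n_chunks = len(loader)
        for X, y in loader:
            # Scale so the accumulated gradient is the mean over the epoch.
            loss = loss_fn(model(X), y) / n_chunks
            loss.backward()  # gradients accumulate in each p.grad
        optimizer.step()     # one parameter update for the entire epoch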

1

crrrr30 t1_iqpciu1 wrote

I feel like, with that much memory available, testing scaling laws would be a better research direction than testing full batch.

1

Cheap_Meeting t1_iqq8oku wrote

Adding to other answers: even if you had enough memory, it would still be computationally inefficient. There is a diminishing return from increasing batch size in terms of how much the loss improves each step.
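A rough way to quantify that, borrowing the functional form used in the large-batch-training / gradient-noise-scale literature (symbols are the usual ones from that line of work, not from this thread):

    \Delta L(B) \approx \frac{\Delta L_{\max}}{1 + B_{\text{noise}} / B}

Once the batch size B is well past the noise scale B_noise, doubling it barely increases the per-step loss improvement while doubling the compute per step.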

1