mfarahmand98 t1_j8r61tr wrote
Reply to comment by Kitchen_Tower2800 in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Care to elaborate?
MustachedSpud t1_j8sacz8 wrote
They might be thinking in a different direction than me, but in most cases the majority of memory use during training does not come from the model weights or optimizer state. It comes from storing all the activations for the training batch. If you think about a CNN, each filter gets applied across the whole image, so you end up with many more activations than filter weights. So the memory savings from a lighter optimizer have very limited benefit.
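A rough back-of-envelope sketch of that point in Python (the layer and batch sizes are made up for illustration, not taken from the thread or the paper):

```python
# Hypothetical 3x3 conv layer, 64 -> 64 channels, on a 224x224 feature map.
batch_size = 32
c_in, c_out, k = 64, 64, 3
h, w = 224, 224

weights = c_in * c_out * k * k            # parameters in the layer
adam_state = 2 * weights                  # Adam keeps ~2 extra values per weight
activations = batch_size * c_out * h * w  # outputs stored for the backward pass

print(f"weights:         {weights:>12,}")     # ~3.7e4
print(f"optimizer state: {adam_state:>12,}")  # ~7.4e4
print(f"activations:     {activations:>12,}") # ~1.0e8
```

With numbers like these, cutting the optimizer state in half barely moves total memory, because the stored activations are orders of magnitude larger.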
ChuckSeven t1_j8svm1b wrote
Those are way less. For every vector of activations you usually have that squared in weights, times 2 or 3 depending on how many momentum values you keep.
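For the fully connected case being described here, the same kind of estimate does flip the other way (sizes again made up for illustration):

```python
# Hypothetical dense layer: weights scale as d^2, activations as batch * d.
batch_size = 32
d = 4096                        # hidden size

weights = d * d                 # ~1.7e7
adam_state = 2 * weights        # first + second moment buffers
activations = batch_size * d    # ~1.3e5

print(f"weights:         {weights:>12,}")
print(f"optimizer state: {adam_state:>12,}")
print(f"activations:     {activations:>12,}")
```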
MustachedSpud t1_j8t25bb wrote
Not true in any case with convolution, attention, or recurrence, which cover most modern applications. In all of those, the activation count grows with how often the weights are reused as well as with the batch size. Activations dominate optimizer memory usage unless you use a tiny batch size.
That's why checkpointing can be useful. This paper does a solid job covering memory usage: https://scholar.google.com/scholar?q=low+memory+neural+network+training+checkpoint&hl=en&as_sdt=0&as_vis=1&oi=scholart#d=gs_qabs&t=1676575377350&u=%23p%3DOLSwmmdygaoJ
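For reference, a minimal sketch of what activation checkpointing looks like in PyTorch via `torch.utils.checkpoint.checkpoint_sequential` (the model and segment count are arbitrary placeholders, not from the linked paper): intermediate activations inside each segment are dropped after the forward pass and recomputed during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack of 16 blocks; only segment-boundary activations are kept in memory.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])
x = torch.randn(32, 1024, requires_grad=True)

out = checkpoint_sequential(model, 4, x)  # split into 4 checkpointed segments
out.sum().backward()                      # inner activations recomputed here
```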
ChuckSeven t1_j8t5r5m wrote
Yeah, it depends. Even just the batch size makes a difference. But for really big models, I'd assume the number of weights far outweighs the number of activations.
MustachedSpud t1_j8t65fh wrote
Yeah, it's very configuration dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.