Submitted by ExponentialCookie t3_1138jpp in MachineLearning



Seems interesting. A snippet from the arXiv page:

>Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks.
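
In code, the update it describes is tiny. A rough sketch of one step following the paper's pseudocode (variable names and defaults here are mine, not the paper's):

```python
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step for a single tensor (sketch of the paper's pseudocode)."""
    # decoupled weight decay, as in AdamW
    param.mul_(1 - lr * weight_decay)
    # update direction: sign of an interpolation between the momentum and the current gradient
    update = momentum.mul(beta1).add(grad, alpha=1 - beta1).sign_()
    param.add_(update, alpha=-lr)
    # the only state kept is this single momentum buffer (Adam keeps two)
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
```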

Links

arXiv: https://arxiv.org/abs/2302.06675

Code Implementation: https://github.com/lucidrains/lion-pytorch
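
If I'm reading that repo's README right, it drops in roughly like this (exact signature and defaults may differ, check the repo):

```python
import torch
from torch import nn
from lion_pytorch import Lion  # pip install lion-pytorch

model = nn.Linear(10, 1)
# the README suggests a smaller lr than Adam's and a correspondingly larger weight decay
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```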

43

Comments


currentscurrents t1_j8op44d wrote

Does it though? There was a reproducibility survey recently that found that many optimizers claiming better performance did not in fact work for anything other than the tasks tested in their papers.

Essentially they were doing hyperparameter tuning - just the hyperparameter was the optimizer design itself.

64

Seankala t1_j8r2317 wrote

> ...just the hyperparameter was the optimizer design itself.

Probably one of the best things I've read today lol. Reminds me of when old colleagues of mine would have lists of different PyTorch optimizers and just loop through them.
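
Something like this, literally (toy sketch on a throwaway regression problem; everything here is made up for illustration):

```python
import torch
from torch import nn

# Toy "optimizer sweep": the optimizer itself as a hyperparameter.
X, y = torch.randn(256, 10), torch.randn(256, 1)
candidates = {
    "SGD": lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9),
    "Adam": lambda p: torch.optim.Adam(p, lr=1e-3),
    "AdamW": lambda p: torch.optim.AdamW(p, lr=1e-3, weight_decay=1e-2),
}
for name, make_opt in candidates.items():
    model = nn.Linear(10, 1)
    opt = make_opt(model.parameters())
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    print(name, loss.item())
```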

18

Competitive_Dog_6639 t1_j8qa7em wrote

ML acronyms are getting out of hand, just use any letter from any of the words I guess...

41

bernhard-lehner t1_j8r9z7j wrote

I would have named it "Eve", as she came after Adam (if you are into these stories)

14

MadScientist-1214 t1_j8ox26g wrote

Better than AdamW if (a) the model is a transformer and (b) not a lot of augmentations are used. Otherwise, the improvements are not that large. I doubt this optimizer works well with regular CNNs like EfficientNet or ConvNeXt.

21

CoderHD t1_j989j2g wrote

In my limited testing on a UNet-like CNN, it doesn't even come close to Adam's performance, sadly. With that said, I might be doing something wrong.

3

zdss t1_j8osnth wrote

I've just skimmed the paper, but this is a confusing result. I can see a simpler optimizer paying off at a similar compute budget, since you can run more iterations, but they claim it's also better on a per-iteration basis across the entire learning task. There's not a lot going on in this algorithm, so where is the magic coming from?

It's kind of hard to believe that while people were experimenting with all these more complex optimizers no one tried something this simple and saw that it had state-of-the-art results.

10

Kitchen_Tower2800 t1_j8qhsuy wrote

"It is more memory-efficient than Adam as it only keeps track of the momentum."

While this is technically true, is this a joke?

8

mfarahmand98 t1_j8r61tr wrote

Care to elaborate?

2

MustachedSpud t1_j8sacz8 wrote

They might be thinking in a different direction than me, but in most cases the majority of memory use during training does not come from the model weights or optimizer state. It comes from storing all the activations of the training batch. If you think about a CNN, each filter gets used across the whole image, so you end up with many more activations than filters. So optimizer memory savings have very limited benefit.
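
Quick back-of-the-envelope for a single conv layer (illustrative numbers only):

```python
# One 3x3 conv layer, 64 -> 64 channels, 224x224 feature map, batch size 32
params = 64 * 64 * 3 * 3           # ~37K weights
activations = 32 * 64 * 224 * 224  # ~103M output values stored for backward
print(activations / params)        # activations outnumber weights by ~2800x here
```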

3

ChuckSeven t1_j8svm1b wrote

Those are way smaller. For every vector of activations you usually have roughly that number squared in weights, times 2 or 3 depending on how many momentum buffers you keep.

1

MustachedSpud t1_j8t25bb wrote

Not true in any case with convolution, attention, or recurrence, which covers most modern applications. In all of these the activation count grows with how often the weights are reused as well as with batch size, so activations dominate optimizer memory usage unless you use a tiny batch size.

That's why checkpointing can be useful. This paper does a solid job covering memory usage: https://scholar.google.com/scholar?q=low+memory+neural+network+training+checkpoint&hl=en&as_sdt=0&as_vis=1&oi=scholart#d=gs_qabs&t=1676575377350&u=%23p%3DOLSwmmdygaoJ
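
For reference, activation checkpointing in PyTorch looks roughly like this (sketch):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(32, 512, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward,
# trading extra compute for memory.
y = checkpoint(block, x)
y.sum().backward()
```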

2

ChuckSeven t1_j8t5r5m wrote

Yeah, it depends. Even just the batch size makes a difference. But for really big models, I'd assume the number of weights far outweighs the number of activations.
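
Rough numbers for a single big dense layer, just to show how it flips either way (toy arithmetic):

```python
# One 8192x8192 dense layer in a transformer block
weights = 8192 * 8192              # ~67M parameters (plus ~2x that in Adam state)
acts_vectors = 32 * 8192           # batch 32, single vectors: ~0.26M activations
acts_sequence = 32 * 2048 * 8192   # batch 32, sequence length 2048: ~537M activations
print(weights, acts_vectors, acts_sequence)
```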

3

MustachedSpud t1_j8t65fh wrote

Yeah, it's very configuration dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.

1

Jean-Porte t1_j8oswiy wrote

I'm waiting for DeBERTa GLUE/SuperGLUE results; it's weird that they picked T5 for that.

3

Downchuck t1_j8qk6r0 wrote

u/ExponentialCookie - In the Code Implementation link, lucidrains writes about reproducibility issues and tuning, both issues brought up in these comments.

2

CyberDainz t1_j8weqb8 wrote

So technically this is a binary optimizer that updates each weight by either -1 or +1 multiplied by the learning rate. It should be tested with "Learning Rate Dropout", i.e. a 30% chance to apply the ±1 update, otherwise no update.
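
Per tensor, something like this (sketch of what I mean, not tested):

```python
import torch

def masked_sign_step(param, update_sign, lr, p_update=0.3):
    # apply the +/-1 sign update to a random ~30% of coordinates, leave the rest untouched
    mask = torch.bernoulli(torch.full_like(update_sign, p_update))
    param.add_(update_sign * mask, alpha=-lr)
```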

2

andreichiffa t1_j95d303 wrote

I really think we need something in between conference papers and arXiv, just to evaluate how reproducible/sane a paper is without judging whether it is important.

Because at this stage I genuinely can't tell if this is a press release, a report in paper form, or an actual paper.

2

Red-Portal t1_j8p86z3 wrote

Do learned optimizer people seriously believe this is the direction we should be going?

1

LeanderKu t1_j8vxlqa wrote

I think learned optimizers have potential, but this is disappointing. Nothing revolutionary in there: sign-based optimizers already exist, and this is just a slightly different take. I see learned optimizers as a way of getting unintuitive results, but this could have been thrown together by some grad student. Random, but not surprising.

1