Submitted by ExponentialCookie t3_1138jpp in MachineLearning
Seems interesting. A snippet from the arXiv abstract:
>Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks.
Links
arXiv: https://arxiv.org/abs/2302.06675
Code Implementation: https://github.com/lucidrains/lion-pytorch
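For reference, here is a minimal sketch of the sign-momentum update the abstract describes, roughly following the pseudocode in the paper. The function name `lion_step` and the default hyperparameter values are placeholders of my own, not the authors' tuned settings; see lucidrains' repo above for a maintained implementation.

```python
# Minimal sketch of a Lion-style update: a single momentum buffer per parameter,
# and an update whose magnitude is the same for every coordinate via sign().
# Default hyperparameters here are illustrative placeholders only.
import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # Interpolate the gradient with the stored momentum, then take the sign,
    # so each parameter moves by exactly lr (times +/-1) plus weight decay.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    param.mul_(1 - lr * weight_decay)   # decoupled weight decay
    param.add_(update, alpha=-lr)       # sign-based step
    # Only this one buffer persists between steps, hence the memory savings vs. Adam.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)


# Toy usage on a single tensor
p, m = torch.randn(4), torch.zeros(4)
g = torch.randn(4)
lion_step(p, g, m)
print(p, m)
```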
currentscurrents t1_j8op44d wrote
Does it though? There was a reproducibility survey recently that found that many optimizers claiming better performance did not, in fact, generalize beyond the tasks tested in their own papers.
Essentially they were doing hyperparameter tuning; the hyperparameter just happened to be the optimizer design itself.