MustachedSpud
MustachedSpud t1_j8t25bb wrote
Reply to comment by ChuckSeven in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Not true in any case with convolution, attention, or recurrence, which covers most modern applications. In all of these, the activation count grows with how often the weights are reused as well as with the batch size, so the activations dominate memory usage over the optimizer state unless you use a tiny batch size.
That's why checkpointing can be useful. This paper does a solid job covering memory usage: https://scholar.google.com/scholar?q=low+memory+neural+network+training+checkpoint&hl=en&as_sdt=0&as_vis=1&oi=scholart#d=gs_qabs&t=1676575377350&u=%23p%3DOLSwmmdygaoJ
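To make the checkpointing point concrete, here's a back-of-envelope sketch of the classic sqrt(n) checkpointing scheme (store one checkpoint per segment, recompute one segment at a time during backward). The counts are illustrative only; real implementations differ in constants and in what exactly they retain.

```python
import math

def peak_stored_activations(n_layers, checkpointing=False):
    """Peak number of per-layer activation tensors held at once.

    Plain backprop keeps every layer's activation alive until backward.
    sqrt(n) checkpointing keeps ~sqrt(n) checkpoints, plus one segment
    of ~sqrt(n) activations recomputed on the fly during backward.
    """
    if not checkpointing:
        return n_layers
    seg = math.isqrt(n_layers)  # segment length ~ sqrt(n)
    return seg + seg            # checkpoints + one recomputed segment

print(peak_stored_activations(100))        # 100 activations live at peak
print(peak_stored_activations(100, True))  # 20 activations live at peak
```

So for a 100-layer network you hold ~20 activations at peak instead of 100, at the cost of roughly one extra forward pass of compute.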
MustachedSpud t1_j8sacz8 wrote
Reply to comment by mfarahmand98 in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
They might be thinking in a different direction than me, but in most cases the majority of memory use during training comes not from the model weights or optimizer state, but from tracking all the activations of the training batch. If you think about a CNN, each filter gets used across the whole image, so you end up with many more activations than filters. So optimizer memory savings have very limited benefit.
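A quick arithmetic sketch of that CNN point, with hypothetical layer sizes: count the values stored for one conv layer's weights (plus Adam's two moment buffers) versus the activations its output occupies for a batch.

```python
def conv_param_count(in_ch, out_ch, k):
    """Weight (+ bias) count of a k x k convolution layer."""
    return out_ch * (in_ch * k * k + 1)

def conv_activation_count(out_ch, h, w, batch):
    """Output activations stored for the backward pass."""
    return batch * out_ch * h * w

# Example: a 3x3 conv, 64 -> 64 channels, on 224x224 feature maps, batch 32
params = conv_param_count(64, 64, 3)            # ~37k values
adam_state = 2 * params                         # Adam keeps two moments per weight
acts = conv_activation_count(64, 224, 224, 32)  # ~103M values

print(acts // (params + adam_state))  # activations outnumber weights + state ~900x
```

Even with Adam's extra state included, the activations for this layer are hundreds of times larger, which is why shrinking the optimizer state barely moves total memory at realistic batch sizes.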
MustachedSpud t1_j8s9pid wrote
Reply to comment by bernhard-lehner in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Wait that's so much better
MustachedSpud t1_j35gn03 wrote
Reply to comment by horselover_f4t in [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
Are you trolling me or something? YOU are the person I responded to. YOU brought up the vanilla version, in a response to someone else who was talking about the zero version. The zero version is most relevant here because it learns from scratch, without human knowledge.
MustachedSpud t1_j34qzvu wrote
Reply to comment by horselover_f4t in [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
The person two comments up was talking about the zero version. The thread is about how AI can surpass humans, and the point is that it already can if it has a way to improve without human data.
MustachedSpud t1_j31glqv wrote
Reply to comment by horselover_f4t in [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
The zero in AlphaZero means it starts with no human knowledge. They figured out that this approach eventually becomes stronger than the base AlphaGo strategy.
MustachedSpud t1_iun145l wrote
Those sound sensible to me. Only other data that I can think of using would be engagement rate, like how many comments/likes per view.
MustachedSpud t1_j8t65fh wrote
Reply to comment by ChuckSeven in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Yeah, it's very configuration dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.