LahmacunBear OP t1_j3mrexi wrote
Reply to comment by SatoshiNotMe in [R] Learning Learning-Rates: SteDy Optimizer by LahmacunBear
Mine’s in TensorFlow 2.11; I’m sure writing a PyTorch version wouldn’t be hard. The algorithm only adds three lines, as shown in my paper. Happy to share my code, though?
LahmacunBear OP t1_j3l5ub2 wrote
Reply to comment by resented_ape in [R] Learning Learning-Rates: SteDy Optimizer by LahmacunBear
Oh damn, that paper almost does exactly what I do. Huh. Oh well. The implementation is slightly different, though: I, in contrast, use both grads from the same timestep and keep an accumulated Ct.
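For context, the idea being compared here (online learning-rate adaptation via the dot product of *consecutive* gradients, as in hypergradient descent) can be sketched in a few lines. This is a minimal illustrative sketch on a toy quadratic, not the SteDy optimizer itself; the hyperparameter values and loss are assumptions for the example.

```python
# Sketch of hypergradient-descent-style learning-rate adaptation
# (illustrative only; NOT the SteDy algorithm, which instead uses
# two grads from the same timestep and an accumulated Ct).

def grad(theta):
    # Gradient of a toy quadratic loss f(theta) = theta^2
    return 2.0 * theta

theta = 5.0    # initial parameter
alpha = 0.05   # initial learning rate (assumed value)
beta = 0.001   # hypergradient step size (assumed value)
g_prev = 0.0   # gradient from the previous timestep

for _ in range(100):
    g = grad(theta)
    # Grow alpha while consecutive gradients agree in direction,
    # shrink it when they oppose (i.e. when we overshoot).
    alpha = alpha + beta * g * g_prev
    theta = theta - alpha * g
    g_prev = g
```

The learning rate self-tunes: positive dot products between successive gradients raise it, oscillation lowers it, which is exactly the coupling across timesteps that the comment above contrasts with using two grads from the same step.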
LahmacunBear t1_jdo7k0w wrote
Reply to [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Here’s a thought: the original GPT-3, at 175B parameters with the best techniques of the time thrown at it, performed as it did. Add ChatGPT’s training tricks and suddenly the same size performs orders of magnitude better. I doubt current LLMs are fully efficient; just as with GPT-3 to 3.5, we can keep getting much better results at the same size, and therefore current-level results from much smaller models.