
Optimal-Asshole t1_j41nlj5 wrote

Here’s a paper which uses gradient descent to train the model, gradient descent to tune the hyperparameters of that gradient descent, and so forth. The hyperparameters of the topmost meta-layer matter less and less as you add meta-depth, i.e. as you add more meta-“layers”.

https://arxiv.org/abs/1909.13371
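(Not code from the linked paper, but a minimal sketch of the one-level hypergradient-descent update it builds on, i.e. Baydin et al.'s rule for adapting the learning rate by gradient descent on the learning rate itself. The toy quadratic objective, step sizes, and variable names are illustrative; the paper's idea is essentially to stack this construction recursively with automatic differentiation, so each step size is in turn tuned by the level above it.)

```python
import numpy as np

# Toy objective: f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
def grad_f(theta):
    return theta

rng = np.random.default_rng(0)
theta = rng.standard_normal(5)   # parameters trained by plain gradient descent
alpha = 1e-3                     # learning rate, adapted online by its own gradient step
beta = 1e-4                      # hyper-learning-rate (left fixed at this top level)

prev_grad = np.zeros_like(theta)
for t in range(500):
    g = grad_f(theta)
    # Hypergradient step: d f(theta_t) / d alpha = -(g . prev_grad),
    # so one gradient-descent step on alpha adds beta * (g . prev_grad).
    alpha += beta * np.dot(g, prev_grad)
    # Ordinary gradient-descent step on the parameters.
    theta -= alpha * g
    prev_grad = g

print("final loss:", 0.5 * np.dot(theta, theta), "adapted alpha:", alpha)
```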

3

Decadz OP t1_j41udg4 wrote

Thanks for the recommendation! I was unaware of this follow-up work, which naturally extends Baydin et al.'s original work [1]. Categorically, I would consider this paper to be more about meta-optimization (theory), similar to [2, 3]. I was looking for more applied meta-optimization work.

[1] Baydin, A. G., et al. (2018). Online learning rate adaptation with hypergradient descent. ICLR.

[2] Maclaurin, D., et al. (2015). Gradient-based hyperparameter optimization through reversible learning. ICML.

[3] Lorraine, J., et al. (2020). Optimizing millions of hyperparameters by implicit differentiation. AISTATS.

2