
Optimal-Asshole t1_j41nlj5 wrote

Here’s a paper which uses gradient descent to train the model, gradient descent to tune the hyperparameters of that gradient descent, and so forth. The hyperparameters of the topmost meta-layer matter less and less as you add meta-depth, i.e. as you add more meta-“layers”.

https://arxiv.org/abs/1909.13371
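(Not code from the linked paper, but a minimal sketch of the one-level hypergradient-descent update it builds on, i.e. Baydin et al.'s rule for adapting the learning rate by gradient descent on the learning rate itself. The toy quadratic objective, step sizes, and variable names are illustrative; the paper's idea is essentially to stack this construction recursively with automatic differentiation, so each step size is in turn tuned by the level above it.)

```python
import numpy as np

# Toy objective: f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
def grad_f(theta):
    return theta

rng = np.random.default_rng(0)
theta = rng.standard_normal(5)   # parameters trained by plain gradient descent
alpha = 1e-3                     # learning rate, adapted online by its own gradient step
beta = 1e-4                      # hyper-learning-rate (left fixed at this top level)

prev_grad = np.zeros_like(theta)
for t in range(500):
    g = grad_f(theta)
    # Hypergradient step: d f(theta_t) / d alpha = -(g . prev_grad),
    # so one gradient-descent step on alpha adds beta * (g . prev_grad).
    alpha += beta * np.dot(g, prev_grad)
    # Ordinary gradient-descent step on the parameters.
    theta -= alpha * g
    prev_grad = g

print("final loss:", 0.5 * np.dot(theta, theta), "adapted alpha:", alpha)
```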

3

Decadz OP t1_j41udg4 wrote

Thanks for the recommendation! I was unaware of this follow-up work, which naturally extends Baydin et al.'s original work [1]. Categorically, I would consider this paper to be more about meta-optimization (theory), similar to [2, 3]. I was looking for more applied meta-optimization work.

[1] Baydin, A. G., et al. (2018). Online learning rate adaptation with hypergradient descent. ICLR.

[2] Maclaurin, D., et al. (2015). Gradient-based hyperparameter optimization through reversible learning. ICML.

[3] Lorraine, J., et al. (2020). Optimizing millions of hyperparameters by implicit differentiation. AISTATS.

2