Submitted by Decadz t3_109xqw3 in MachineLearning

There are quite a few papers on optimisation-based meta-learning approaches for learning parameter initialisations (e.g. MAML and its derivatives) [1, 2], and there are also many papers on learning optimisers [3].

Question: Are there any papers which combine the two?

I am aware of some papers, such as [4, 5], which achieve this indirectly/implicitly to some extent, but I am wondering whether there are other papers I am not aware of, or ones that do this explicitly. Thanks in advance.
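
For concreteness, here is a rough sketch of the kind of combination I mean, loosely in the spirit of Meta-SGD [4]: the outer loop meta-learns both the initialisation and a simple learned "optimizer" (here just per-parameter learning rates). Everything below (toy task distribution, model, hyperparameters) is made up purely for illustration, not taken from any of the papers.

```python
# Jointly meta-learning an initialisation and per-parameter learning rates
# on toy 1-D linear regression tasks. Illustrative sketch only.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    w, b = params
    return jnp.mean((w * x + b - y) ** 2)

def adapt(params, alphas, x_tr, y_tr):
    # One inner-loop step with *learned* per-parameter learning rates.
    grads = jax.grad(loss)(params, x_tr, y_tr)
    return jax.tree_util.tree_map(lambda p, a, g: p - a * g, params, alphas, grads)

def meta_loss(params, alphas, task):
    x_tr, y_tr, x_val, y_val = task
    adapted = adapt(params, alphas, x_tr, y_tr)
    return loss(adapted, x_val, y_val)

def sample_task(key):
    k1, k2 = jax.random.split(key)
    slope = jax.random.uniform(k1, (), minval=-2.0, maxval=2.0)
    x = jax.random.normal(k2, (10,))
    y = slope * x
    return x[:5], y[:5], x[5:], y[5:]

params = (jnp.array(0.0), jnp.array(0.0))   # meta-learned initialisation
alphas = (jnp.array(0.1), jnp.array(0.1))   # meta-learned per-parameter step sizes
meta_lr, key = 0.01, jax.random.PRNGKey(0)

for step in range(300):
    key, sub = jax.random.split(key)
    task = sample_task(sub)
    # Outer loop: differentiate through the inner step w.r.t. both quantities.
    g_params, g_alphas = jax.grad(meta_loss, argnums=(0, 1))(params, alphas, task)
    params = jax.tree_util.tree_map(lambda p, g: p - meta_lr * g, params, g_params)
    alphas = jax.tree_util.tree_map(lambda a, g: a - meta_lr * g, alphas, g_alphas)
```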

---

[1] Finn, C., et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.

[2] Nichol, A., et al. (2018). On first-order meta-learning algorithms.

[3] Andrychowicz, M., et al. (2016). Learning to learn by gradient descent by gradient descent. NIPS.

[4] Li, Z., et al. (2017). Meta-SGD: Learning to learn quickly for few-shot learning.

[5] Ravi, S., & Larochelle, H. (2016). Optimization as a model for few-shot learning. ICLR.

13

Comments


Jumbofive t1_j41lc4f wrote

Is this close to what you are looking for? https://arxiv.org/abs/2211.09760

5

Decadz OP t1_j41ry5u wrote

Thanks for reminding me of this work! This is an interesting paper; however, as far as I am aware, their method only considers the problem of meta-learning an optimizer and not the parameter initialization.

3

Jumbofive t1_j41thj4 wrote

Is there anything to be gained from doing both at once, as opposed to using the two techniques separately within the same system? I guess that's what you are trying to find out lol

3

Decadz OP t1_j41wx24 wrote

Yes, as you said, that's what I'm trying to find out! It will be interesting to know whether you can combine the two approaches into one technique, or have two separate approaches being used in one system.

I was just enquiring to make sure I'm not going to spend time reinventing the wheel haha. It will also be interesting to have some insight into how the two approaches interact, and whether the benefits stack or overlap.

2

Optimal-Asshole t1_j41nlj5 wrote

Here's a paper which uses gradient descent to train the meta-layer, gradient descent to train the hyperparameters of that gradient descent, and so forth. The hyperparameters of the topmost meta-layer matter less and less as you add meta-depth, i.e. as you add more meta-"layers".

https://arxiv.org/abs/1909.13371
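
For intuition, here is a minimal one-level sketch of the underlying hypergradient-descent idea (my own toy example, not the paper's code). The linked paper stacks this construction, so beta gets its own hyper-learning-rate, and so on.

```python
# One-level hypergradient descent on a toy quadratic: the learning rate alpha
# is itself updated by gradient descent, using the fact that for a plain SGD
# step theta_t = theta_{t-1} - alpha * g_{t-1}, we have
# d f(theta_t) / d alpha = -g_t . g_{t-1}.
import jax
import jax.numpy as jnp

def f(theta):                      # toy objective: a simple quadratic bowl
    return jnp.sum((theta - 3.0) ** 2)

grad_f = jax.grad(f)

theta = jnp.zeros(5)
alpha, beta = 1e-3, 1e-4           # learning rate and its own learning rate
prev_g = jnp.zeros_like(theta)

for t in range(200):
    g = grad_f(theta)
    alpha = alpha + beta * jnp.dot(g, prev_g)   # hypergradient step on alpha
    theta = theta - alpha * g                   # ordinary SGD step on theta
    prev_g = g
```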

3

Decadz OP t1_j41udg4 wrote

Thanks for the recommendation! I was unaware of this follow-up work, which naturally extends Baydin et al.'s original work [1]. Categorically, I would consider this paper to be more about meta-optimization (theory), similar to [2, 3]. I was looking for more applied meta-optimization work.

[1] Baydin, A. G., et al. (2017). Online learning rate adaptation with hypergradient descent.

[2] Maclaurin, D., et al. (2015). Gradient-based hyperparameter optimization through reversible learning. ICML.

[3] Lorraine, J., et al. (2020). Optimizing millions of hyperparameters by implicit differentiation. AISTATS.

2

thchang-opt t1_j41pklt wrote

Is this along the lines of what you were thinking?

https://arxiv.org/abs/2202.00665

3

Decadz OP t1_j41w1zt wrote

Thanks for the suggestion! Brandon Amos has many great pieces of research. The linked paper is quite long, so I will need to give it a more complete reading at a later date to be sure. At a glance, though, this tutorial is about meta-optimization theory, as opposed to what I was originally asking for, which is the application of meta-optimization techniques to learning parameter initialisations + optimizers.

3

thchang-opt t1_j42820e wrote

I see. Well, for what it's worth, here is what I can remember concerning applications:

I believe one of the motivating applications he was looking at was a control problem, where the current world state was one input and the optimization's solution was the optimal control/action to take from that state, according to some fixed problem dynamics. So he was discussing predicting the next solutions in a sequence of (I think convex) optimization problems for real-time control.
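
Something in the spirit of the following toy sketch (my own construction, not his code), where a simple predictor is trained to output near-optimal solutions of a per-state convex objective directly, so that no optimization has to be run at decision time:

```python
# Amortized optimization toy example: train u = W x to approximately minimise
# a convex per-state objective, by differentiating the objective itself.
import jax
import jax.numpy as jnp

A = jnp.array([[1.0, 0.5], [0.0, 1.0]])     # made-up fixed "problem dynamics"
lam = 0.1

def cost(u, x):
    # Convex per-state objective: track the state x, with a control-effort penalty.
    return jnp.sum((A @ u - x) ** 2) + lam * jnp.sum(u ** 2)

def amortized_loss(W, xs):
    us = xs @ W.T                            # u = W x, the amortized "solver"
    return jnp.mean(jax.vmap(cost)(us, xs))  # average objective of predicted solutions

key = jax.random.PRNGKey(0)
W = jnp.zeros((2, 2))
for step in range(500):
    key, sub = jax.random.split(key)
    xs = jax.random.normal(sub, (64, 2))     # a batch of sampled world states
    W = W - 0.05 * jax.grad(amortized_loss)(W, xs)
```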

2

Decadz OP t1_j433b12 wrote

Great, thanks for the summary!

1

BrisklyBrusque t1_j43dsux wrote

You might enjoy “Well-Tuned Simple Nets Excel on Tabular Data”

https://arxiv.org/abs/2106.11189

The authors wrote a routine that leverages BOHB (Bayesian Optimization and Hyperband) to search an enormous space of possible neural network architectures. They allowed the routine to select different regularization techniques, including ensemble-style techniques like dropout and snapshot ensembles, and others that render the choice of parameter initialization less critical. However, the authors used the same optimizer (AdamW) in all experiments.
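
To give a flavour of the Hyperband half of that search, here is a rough successive-halving sketch with a made-up configuration space and a placeholder evaluation function (not the paper's actual space or code); BOHB additionally replaces the random sampling below with a model-based (Bayesian) proposal step.

```python
# Successive halving over a hypothetical space of regularization settings.
import random

def sample_config():
    # Hypothetical regularization choices, loosely inspired by the paper's space.
    return {
        "dropout": random.choice([0.0, 0.2, 0.5]),
        "weight_decay": 10 ** random.uniform(-5, -2),
        "snapshot_ensemble": random.choice([True, False]),
    }

def evaluate(config, budget_epochs):
    # Placeholder: train a fixed network with AdamW for `budget_epochs` epochs
    # under `config` and return its validation error. Random stand-in here.
    return random.random() / (budget_epochs ** 0.5)

def successive_halving(n_configs=27, min_budget=1, eta=3, rounds=3):
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    for _ in range(rounds):
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0])                    # keep the best fraction
        configs = [c for _, c in scores[: max(1, len(scores) // eta)]]
        budget *= eta                                      # give survivors more epochs
    return configs[0]

best = successive_halving()
```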

Not exactly what you are looking for but hopefully interesting.

3

Decadz OP t1_j45h9oj wrote

Thanks for the suggestion, I’ll take a read!

1