Submitted by Decadz t3_109xqw3 in MachineLearning

There are quite a few papers on optimisation-based meta-learning approaches for learning parameter initialisations (e.g. MAML and its derivatives) [1, 2], and there are also many papers on learning optimisers [3].

Question: Are there any papers which combine the two?

I am aware of some papers, such as [4, 5], which achieve this indirectly/implicitly to some extent, but I am wondering whether there are other papers I am not aware of, or ones that do this explicitly. Thanks in advance.
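
For concreteness, here is a rough sketch of the kind of combination I mean, loosely in the spirit of Meta-SGD [4]: the outer loop meta-learns both the initialisation and a simple learned "optimizer" (here just per-parameter learning rates). Everything below (toy task distribution, model, hyperparameters) is made up purely for illustration, not taken from any of the papers.

```python
# Jointly meta-learning an initialisation and per-parameter learning rates
# on toy 1-D linear regression tasks. Illustrative sketch only.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    w, b = params
    return jnp.mean((w * x + b - y) ** 2)

def adapt(params, alphas, x_tr, y_tr):
    # One inner-loop step with *learned* per-parameter learning rates.
    grads = jax.grad(loss)(params, x_tr, y_tr)
    return jax.tree_util.tree_map(lambda p, a, g: p - a * g, params, alphas, grads)

def meta_loss(params, alphas, task):
    x_tr, y_tr, x_val, y_val = task
    adapted = adapt(params, alphas, x_tr, y_tr)
    return loss(adapted, x_val, y_val)

def sample_task(key):
    k1, k2 = jax.random.split(key)
    slope = jax.random.uniform(k1, (), minval=-2.0, maxval=2.0)
    x = jax.random.normal(k2, (10,))
    y = slope * x
    return x[:5], y[:5], x[5:], y[5:]

params = (jnp.array(0.0), jnp.array(0.0))   # meta-learned initialisation
alphas = (jnp.array(0.1), jnp.array(0.1))   # meta-learned per-parameter step sizes
meta_lr, key = 0.01, jax.random.PRNGKey(0)

for step in range(300):
    key, sub = jax.random.split(key)
    task = sample_task(sub)
    # Outer loop: differentiate through the inner step w.r.t. both quantities.
    g_params, g_alphas = jax.grad(meta_loss, argnums=(0, 1))(params, alphas, task)
    params = jax.tree_util.tree_map(lambda p, g: p - meta_lr * g, params, g_params)
    alphas = jax.tree_util.tree_map(lambda a, g: a - meta_lr * g, alphas, g_alphas)
```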

---

[1] Finn, C., et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.

[2] Nichol, A., et al. (2018). On first-order meta-learning algorithms.

[3] Andrychowicz, M., et al. (2016). Learning to learn by gradient descent by gradient descent. NIPS.

[4] Li, Z., et al. (2017). Meta-SGD: Learning to learn quickly for few-shot learning.

[5] Ravi, S., & Larochelle, H. (2016). Optimization as a model for few-shot learning. ICLR.

13

Comments


Jumbofive t1_j41lc4f wrote

Is this close to what you are looking for? https://arxiv.org/abs/2211.09760

5

Decadz OP t1_j41ry5u wrote

Thanks for reminding me of this work! This is an interesting paper; however, as far as I am aware, their method only considers the problem of meta-learning an optimizer and not the parameter initialization.

3

Jumbofive t1_j41thj4 wrote

Is there anything to be gained from doing both at once, as opposed to using the two techniques separately within the same system? I guess that's what you are trying to find out lol

3

Decadz OP t1_j41wx24 wrote

Yes, as you said, that's what I'm trying to find out! It will be interesting to know whether you can combine the two approaches into one technique, or have two separate approaches being used in one system.

I was just enquiring to make sure I'm not going to spend time reinventing the wheel haha. It will also be interesting to have some insight into how the two approaches interact, and whether the benefits stack or overlap.

2

Optimal-Asshole t1_j41nlj5 wrote

Here's a paper which uses gradient descent to train the meta-layer, gradient descent to train the hyperparameters of that gradient descent, and so forth. The hyperparameters of the topmost meta-layer matter less and less as you add meta-depth, i.e. as you add more meta-"layers".

https://arxiv.org/abs/1909.13371
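
For intuition, here is a minimal one-level sketch of the underlying hypergradient-descent idea (my own toy example, not the paper's code). The linked paper stacks this construction, so beta gets its own hyper-learning-rate, and so on.

```python
# One-level hypergradient descent on a toy quadratic: the learning rate alpha
# is itself updated by gradient descent, using the fact that for a plain SGD
# step theta_t = theta_{t-1} - alpha * g_{t-1}, we have
# d f(theta_t) / d alpha = -g_t . g_{t-1}.
import jax
import jax.numpy as jnp

def f(theta):                      # toy objective: a simple quadratic bowl
    return jnp.sum((theta - 3.0) ** 2)

grad_f = jax.grad(f)

theta = jnp.zeros(5)
alpha, beta = 1e-3, 1e-4           # learning rate and its own learning rate
prev_g = jnp.zeros_like(theta)

for t in range(200):
    g = grad_f(theta)
    alpha = alpha + beta * jnp.dot(g, prev_g)   # hypergradient step on alpha
    theta = theta - alpha * g                   # ordinary SGD step on theta
    prev_g = g
```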

3

Decadz OP t1_j41udg4 wrote

Thanks for the recommendation! I was unaware of this follow-up work, which naturally extends Baydin et al.'s original work [1]. Categorically, I would consider this paper to be more about meta-optimization (theory), similar to [2, 3]. I was looking for more applied meta-optimization work.

[1] Baydin, A. G., et al. (2017). Online learning rate adaptation with hypergradient descent.

[2] Maclaurin, D., et al. (2015). Gradient-based hyperparameter optimization through reversible learning. ICML.

[3] Lorraine, J., et al. (2020). Optimizing millions of hyperparameters by implicit differentiation. AISTATS.

2

thchang-opt t1_j41pklt wrote

Is this along the lines of what you were thinking?

https://arxiv.org/abs/2202.00665

3

Decadz OP t1_j41w1zt wrote

Thanks for the suggestion! Brandon Amos has many great pieces of research. The linked paper is quite long, so I will need to give it a more complete reading at a later date to be sure. At a glance, though, this tutorial is about meta-optimization theory, as opposed to what I was originally asking for, which is the application of meta-optimization techniques to learning parameter initialisations + optimizers.

3

thchang-opt t1_j42820e wrote

I see. Well, for what it's worth, here is what I can remember concerning applications:

I believe one of the motivating applications he was looking at was a control problem, where the current world state was one input and the optimization's solution was the optimal control/action to take from that state, according to some fixed problem dynamics. So he was discussing predicting the next solutions in a sequence of (I think convex) optimization problems for real-time control.
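
Something in the spirit of the following toy sketch (my own construction, not his code), where a simple predictor is trained to output near-optimal solutions of a per-state convex objective directly, so that no optimization has to be run at decision time:

```python
# Amortized optimization toy example: train u = W x to approximately minimise
# a convex per-state objective, by differentiating the objective itself.
import jax
import jax.numpy as jnp

A = jnp.array([[1.0, 0.5], [0.0, 1.0]])     # made-up fixed "problem dynamics"
lam = 0.1

def cost(u, x):
    # Convex per-state objective: track the state x, with a control-effort penalty.
    return jnp.sum((A @ u - x) ** 2) + lam * jnp.sum(u ** 2)

def amortized_loss(W, xs):
    us = xs @ W.T                            # u = W x, the amortized "solver"
    return jnp.mean(jax.vmap(cost)(us, xs))  # average objective of predicted solutions

key = jax.random.PRNGKey(0)
W = jnp.zeros((2, 2))
for step in range(500):
    key, sub = jax.random.split(key)
    xs = jax.random.normal(sub, (64, 2))     # a batch of sampled world states
    W = W - 0.05 * jax.grad(amortized_loss)(W, xs)
```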

2

Decadz OP t1_j433b12 wrote

Great, thanks for the summary!

1

BrisklyBrusque t1_j43dsux wrote

You might enjoy “Well-Tuned Simple Nets Excel on Tabular Data”

https://arxiv.org/abs/2106.11189

The authors wrote a routine that leverages BOHB (Bayesian Optimization and Hyperband) to search an enormous space of possible neural network architectures. They allowed the routine to select different regularization techniques, including ensemble-style techniques like dropout and snapshot ensembles, and others that render the choice of parameter initialization less critical. However, the authors used the same optimizer (AdamW) in all experiments.
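
To give a flavour of the Hyperband half of that search, here is a rough successive-halving sketch with a made-up configuration space and a placeholder evaluation function (not the paper's actual space or code); BOHB additionally replaces the random sampling below with a model-based (Bayesian) proposal step.

```python
# Successive halving over a hypothetical space of regularization settings.
import random

def sample_config():
    # Hypothetical regularization choices, loosely inspired by the paper's space.
    return {
        "dropout": random.choice([0.0, 0.2, 0.5]),
        "weight_decay": 10 ** random.uniform(-5, -2),
        "snapshot_ensemble": random.choice([True, False]),
    }

def evaluate(config, budget_epochs):
    # Placeholder: train a fixed network with AdamW for `budget_epochs` epochs
    # under `config` and return its validation error. Random stand-in here.
    return random.random() / (budget_epochs ** 0.5)

def successive_halving(n_configs=27, min_budget=1, eta=3, rounds=3):
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    for _ in range(rounds):
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0])                    # keep the best fraction
        configs = [c for _, c in scores[: max(1, len(scores) // eta)]]
        budget *= eta                                      # give survivors more epochs
    return configs[0]

best = successive_halving()
```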

Not exactly what you are looking for but hopefully interesting.

3

Decadz OP t1_j45h9oj wrote

Thanks for the suggestion, I’ll take a read!

1