Submitted by jarekduda t3_zb7xjb in MachineLearning

I am working on a 2nd-order optimizer with a Hessian estimator from online MLE linear regression of gradients, mostly updating 4 exponential moving averages: of theta, g, theta*g, and theta^2. Here is a simple 2D Beale function example; after 30 steps it reaches ~50x smaller values than momentum: https://github.com/JarekDuda/SGD-OGR-Hessian-estimator/raw/main/OGR%20beale.png
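In code, the core of one step looks roughly like this 1D sketch (my simplified illustration - constants are not tuned; the full derivation is in the arXiv paper linked below):

```python
def ogr_1d_sketch(grad, theta, lr=0.5, beta=0.9, eps=1e-4, steps=30):
    """Rough 1D sketch: estimate the 2nd derivative as the slope of an exponentially
    weighted linear regression of gradients g on positions theta, then take a
    damped Newton-like step."""
    m_t = m_g = m_tg = m_tt = 0.0    # EMAs of theta, g, theta*g, theta^2
    w = 0.0                          # EMA normalizer (bias correction)
    for _ in range(steps):
        g = grad(theta)
        w    = beta * w    + (1 - beta)
        m_t  = beta * m_t  + (1 - beta) * theta
        m_g  = beta * m_g  + (1 - beta) * g
        m_tg = beta * m_tg + (1 - beta) * theta * g
        m_tt = beta * m_tt + (1 - beta) * theta * theta
        var_t  = m_tt / w - (m_t / w) ** 2           # weighted variance of theta
        cov_tg = m_tg / w - (m_t / w) * (m_g / w)    # weighted covariance of theta and g
        if var_t > eps:
            H = cov_tg / var_t                       # regression slope ~ curvature
            theta -= lr * g / max(abs(H), eps)       # |H| guards against negative curvature
        else:
            theta -= lr * g                          # warm-up: plain gradient step
    return theta

# e.g. ogr_1d_sketch(lambda x: 2.0 * (x - 3.0), theta=0.0) quickly approaches the minimum at 3
```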

I wanted to propose a discussion about various 2nd-order approaches using only gradients. I am aware of: conjugate gradients, quasi-Newton (especially L-BFGS), and Gauss-Newton.

Any others? Which one seems the most practical to extend to NN training?

How to transform them to high dimension? I thought about building a 2nd-order model on a continuously updated, locally interesting subspace of e.g. 10 dimensions (e.g. from online PCA of gradients), and still using e.g. momentum in the remaining directions.
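A rough sketch of one such hybrid step (names and shapes here are hypothetical; maintaining the basis Q and the subspace Hessian estimate is a separate question):

```python
import numpy as np

def hybrid_step(theta, g, Q, H_sub, m, lr=1e-3, beta=0.9, eps=1e-4):
    """One hybrid step: Newton-like update inside a low-dimensional subspace,
    momentum in the orthogonal complement.  Q is d x k with orthonormal columns
    (e.g. from online PCA of gradients), H_sub is the k x k Hessian estimate
    restricted to that subspace, m is the momentum buffer."""
    g_sub  = Q.T @ g                 # gradient component inside the subspace
    g_perp = g - Q @ g_sub           # component in the remaining directions
    # Newton-like step in the subspace, with |eigenvalue| as a crude safeguard
    lam, V = np.linalg.eigh(H_sub)
    step_sub = V @ ((V.T @ g_sub) / np.maximum(np.abs(lam), eps))
    # plain momentum outside of the subspace
    m = beta * m + g_perp
    theta = theta - Q @ step_sub - lr * m
    return theta, m
```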

How to optimally use such an estimated Hessian - especially, how to handle very low and negative eigenvalues? (e.g. take absolute values, or divide and cut above some threshold)
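For example, two simple safeguards along these lines (rough sketches, not tuned):

```python
import numpy as np

def safeguarded_directions(H, g, eps=1e-4):
    """Two ways to turn an indefinite Hessian estimate H into a usable step direction."""
    lam, V = np.linalg.eigh(H)
    g_eig = V.T @ g
    # (a) "abs": divide by |eigenvalue| (saddle-free Newton style), floored at eps
    d_abs = V @ (g_eig / np.maximum(np.abs(lam), eps))
    # (b) "divide & cut": divide by the eigenvalue, but replace near-zero ones by eps,
    #     which caps the multiplier at 1/eps; tiny or slightly negative curvature then
    #     falls back to a gradient-like direction
    safe_lam = np.where(np.abs(lam) > eps, lam, eps)
    d_cut = V @ (g_eig / safe_lam)
    return d_abs, d_cut
```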

Slides gathering the various approaches (any interesting ones missing?): https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf

Derivation of this OGR Hessian estimator: https://arxiv.org/pdf/1901.11457


Comments


Red-Portal t1_iyprcgo wrote

Isn't what you implemented more or less a variant of BFGS? Stochastic BFGS is very well known to not work well on deep neural networks.


jarekduda OP t1_iypsbet wrote

Indeed, BFGS seems the closest to my approach (OGR), but it is relatively costly: it needs many matrix products per step, it uses only a few gradients, and it gives them all the same weight.

In contrast, OGR is literally online linear regression of gradients: each step updates 4 averages and e.g. performs an eigendecomposition (which can be done more cheaply). It uses exponentially decaying weights - focusing on the local situation while still using all previous gradients - and it should also be compatible with slow evolution of the considered locally interesting subspace.
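In the 1D case the estimate is just the weighted regression slope (informal notation - bars denote the exponential moving averages; the precise multivariate, symmetrized estimator is derived in the arXiv paper):

$$\hat{H} \;\approx\; \frac{\overline{\theta g} - \bar{\theta}\,\bar{g}}{\overline{\theta^{2}} - \bar{\theta}^{2}}$$

In $d$ dimensions the same regression uses the averaged outer products $\overline{g\,\theta^{\top}}$ and $\overline{\theta\,\theta^{\top}}$ instead.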


SufficientStautistic t1_iyq8qsw wrote

Levenberg-Marquardt


jarekduda OP t1_iyq9dj9 wrote

It is regularized Gauss-Newton, which I find generally quite suspicious: it approximates the Hessian with a positive-definite matrix ... for an extremely non-convex function.

How does it change the landscape of extrema?

Is it used for NN training? K-FAC uses a somewhat related positive-definite approximation based on Fisher information.
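For reference, the standard Levenberg-Marquardt step for a least-squares objective 0.5*||r(theta)||^2 looks like this (generic sketch, not tied to any particular library):

```python
import numpy as np

def lm_step(J, r, lam):
    """One Levenberg-Marquardt step: solve (J^T J + lam*I) delta = -J^T r.
    J^T J is the positive semi-definite Gauss-Newton Hessian approximation;
    lam interpolates between Gauss-Newton (lam -> 0) and small
    gradient-descent-like steps (lam large)."""
    JTJ = J.T @ J
    return np.linalg.solve(JTJ + lam * np.eye(JTJ.shape[0]), -J.T @ r)
```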


serge_cell t1_iyv2zag wrote

3D localization/registration/reconstruction are traditional areas of use for regularized Gauss-Newton, and all are highly non-convex. The trick is to start in a nearly-convex area, sometimes after several tries, and/or to convexify with regularizers and/or sensor fusion.

K-FAC seems stable enough but is quite complex to implement. It is identical to a low-dimensional-blocks approximation of Gauss-Newton; the Fisher information is only decoration.


asingov t1_iz491sb wrote

The NeurIPS paper "Regularised Non-linear Acceleration" might be relevant.
