Submitted by Dartagnjan t3_10ee9kp in MachineLearning
Given a function in some function space, I have literature results (universal-approximation-type theorems) which say that the function can theoretically be approximated by a neural network of a certain complexity: so many layers, of such a width, with a specific given activation function.
OK, so theoretically there exists a set of weights and biases that yields a good approximation of my function.
Now the question is: given an optimization method, for example stochastic gradient descent, how do I know that I will actually reach this minimum, or get near enough to it, within so many training steps, or even at all?
I attended a talk last year in which a speaker claimed that, due to the way stochastic gradient descent works, some minima may never be reachable from certain initialization states, no matter how long one trains. Unfortunately I cannot find the paper/theorem he was referring to.
I am interested in results related to this question.
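To make the worry concrete, here is a toy sketch (plain Python; the loss function and all names are illustrative, not from the talk) of plain gradient descent on a one-dimensional non-convex loss. Depending only on initialization, it settles into either the global minimum or a strictly worse local one:

```python
# Toy non-convex loss with two minima:
# a global one near x = -1.03 and a shallower local one near x = 0.96.
def loss(x):
    return x**4 - 2 * x**2 + 0.3 * x

def grad(x):
    # Derivative of the loss above, computed by hand.
    return 4 * x**3 - 4 * x + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_left = gradient_descent(-1.5)   # ends near the global minimum, x ~ -1.03
x_right = gradient_descent(1.5)   # ends near the local minimum,  x ~  0.96
print(x_left, loss(x_left))       # lower loss
print(x_right, loss(x_right))     # stuck at higher loss, no matter how long we run
```

This is only the deterministic caricature; SGD's gradient noise can help escape shallow basins, but it does not restore any general guarantee.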
SetentaeBolg t1_j4qimm0 wrote
There are mathematical proofs of convergence for a single perceptron learning a linearly separable classification (the perceptron convergence theorem). For more realistic modern neural nets, however, I don't believe there are any proofs guaranteeing general convergence, because convergence is not actually guaranteed: for the reason you point out, you can't be certain gradient descent will find the "right" minimum.
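For reference, here is a minimal sketch of that single-perceptron case (illustrative data and names, not from the original comment): on linearly separable data, the perceptron learning rule provably stops making mistakes after finitely many updates (Novikoff's theorem), regardless of initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable 2-D data: labels come from a fixed hyperplane.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = np.sign(X @ true_w)

w = np.zeros(2)      # perceptron weights, initialized at zero
updates = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:   # misclassified (or on the boundary)
            w += yi * xi         # perceptron update rule
            updates += 1
            converged = False

print(f"converged after {updates} updates, w = {w}")
```

The guarantee hinges entirely on linear separability, which is exactly the structure that general non-convex deep-net losses lack.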