Submitted by Dartagnjan t3_10ee9kp in MachineLearning

Given a function in some space, I have literature results that say the function can theoretically be approximated by a Neural Network of a certain complexity: so many layers, of such a width, with a specific given activation function.

OK, so theoretically, there is a set of weights and biases that will result in a pretty good approximation of my function.

Now the question is: given an optimization method, for example stochastic gradient descent, how do I know that I will actually reach this minimum, or get near enough to it, within so many training steps, or even at all?

I attended a talk last year in which a speaker claimed that, due to the way stochastic gradient descent works, some minima may never be reachable from some initialization states, no matter how long one trains. Unfortunately I cannot find the paper/theorem he was referring to.

I am interested in results related to this question.

11

Comments


SetentaeBolg t1_j4qimm0 wrote

There are mathematical proofs of convergence for a single perceptron learning a linearly separable classification, but for more realistic modern neural nets, I don't believe there are any proofs guaranteeing general convergence, because I don't think convergence is actually guaranteed. For the reason you point out, you can't be certain gradient descent will find the "right" minima.
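For intuition, here is a minimal sketch (my own toy example, with an arbitrary function and learning rate) showing that plain gradient descent lands in different minima depending only on its initialization:

```python
# Non-convex objective: global minimum near x = -1.30,
# shallower local minimum near x = +1.13.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for x0 in (-2.0, 2.0):
    x_star = gradient_descent(x0)
    print(f"init {x0:+.1f} -> x = {x_star:+.3f}, f(x) = {f(x_star):+.3f}")

# Approximate output:
# init -2.0 -> x = -1.301, f(x) = -3.514   (global minimum)
# init +2.0 -> x = +1.131, f(x) = -1.070   (stuck in a worse local minimum)
```

The minimum near x = +1.13 is perfectly stable under gradient descent, so no amount of further training from that side ever reaches the global minimum.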

8

Dartagnjan OP t1_j4qs8zx wrote

Thanks for confirming my suspicions. Do you happen to have a reference for the case where the optimization method inhibits convergence to some better set of minima?

3

[deleted] t1_j4qu0xg wrote

> I attended a talk last year in which one speaker claimed that due to the way stochastic gradient descent works, it could be that some minimums are never reachable from some initialization states no matter how long one trains. Unfortunately I cannot find what paper/theorem he was referring to.

One example of a NN failing to learn is constant initialization. It is especially easy to see with zero initialization.

https://medium.com/@safrin1128/weight-initialization-in-neural-network-inspired-by-andrew-ng-e0066dc4a566
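
To make the zero-initialization failure concrete, here is a minimal NumPy sketch (toy shapes and data of my own choosing, with manual backprop for a one-hidden-layer tanh network under squared error):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                # toy inputs
y = rng.normal(size=(64, 1))                # toy regression targets

# One tanh hidden layer, every parameter initialized to zero.
W1, b1 = np.zeros((3, 4)), np.zeros(4)
W2, b2 = np.zeros((4, 1)), np.zeros(1)

# Forward pass.
h = np.tanh(X @ W1 + b1)                    # h == 0 everywhere
pred = h @ W2 + b2
err = pred - y                              # residual for squared error

# Backward pass.
dW2 = h.T @ err / len(X)                    # zero, because h is all zeros
dh = err @ W2.T                             # zero, because W2 is all zeros
dW1 = X.T @ (dh * (1 - h**2)) / len(X)      # zero, because dh is all zeros

print(np.abs(dW1).max(), np.abs(dW2).max())  # 0.0 0.0
```

Only the output bias gets a nonzero gradient, so no matter how long you train, the weights never move away from zero. With a nonzero constant initialization the weight gradients are no longer all zero, but every hidden unit receives an identical gradient, so the units never differentiate from one another.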


As for a general framework: if you're not familiar with it already, see the [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem). Particular papers discuss convergence rates, e.g. https://proceedings.neurips.cc/paper/2020/file/2000f6325dfc4fc3201fc45ed01c7a5d-Paper.pdf

I think it would be a function of your particular problem. I've seen this examined from the perspective of learning frameworks as well, such as PAC learning; reading into that may answer some of your specific questions. I'm not aware of any general result, outside of some comments on bounding generalization error based on data segmentation.

At first blush, your question about knowing you'll reach the minimum in K steps feels halting-problem-ish. So I'd have to think about it later to convince myself fully.

5

jostmey t1_j4vmqcs wrote

A deep neural network trained by backpropagation will converge to a local minimum (more precisely, a stationary point) under gradient descent or even stochastic gradient descent, given standard assumptions such as a sufficiently small or decaying learning rate. However, there are many components added to a deep neural network, like dropout and batch normalization, which as far as I know do not come with convergence guarantees.
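
One way to see why dropout complicates the classical analysis: with dropout active, the loss at a fixed set of weights is itself a random variable, so the usual deterministic descent argument doesn't directly apply. A quick PyTorch sketch (arbitrary toy architecture and dropout rate of my own choosing):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes activations each forward pass
    nn.Linear(32, 1),
)
x, y = torch.randn(8, 10), torch.randn(8, 1)
loss_fn = nn.MSELoss()

net.train()   # dropout active: two evaluations at the SAME weights differ
print(loss_fn(net(x), y).item(), loss_fn(net(x), y).item())

net.eval()    # dropout disabled: the loss is deterministic again
print(loss_fn(net(x), y).item(), loss_fn(net(x), y).item())
```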

There are no guarantees about finding a global minimum, though.

1