Submitted by Dartagnjan t3_10ee9kp in MachineLearning
Given a function in some function space, I have literature results (universal-approximation-type theorems) which say that the function can theoretically be approximated by a neural network of a certain complexity: so many layers, of such a width, with a specific given activation function.
OK, so theoretically there exists a set of weights and biases that yields a good approximation of my function.
Now the question is: given an optimization method, for example stochastic gradient descent, how do I know that I will actually reach this minimum, or get near enough to it, within so many training steps, or even at all?
I attended a talk last year in which a speaker claimed that, due to the way stochastic gradient descent works, some minima may never be reachable from certain initialization states, no matter how long one trains. Unfortunately I cannot find the paper/theorem he was referring to.
I am interested in results related to this question.
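To make the worry concrete, here is a toy sketch (plain Python; the loss function and all names are illustrative, not from the talk) of plain gradient descent on a one-dimensional non-convex loss. Depending only on initialization, it settles into either the global minimum or a strictly worse local one:

```python
# Toy non-convex loss with two minima:
# a global one near x = -1.03 and a shallower local one near x = 0.96.
def loss(x):
    return x**4 - 2 * x**2 + 0.3 * x

def grad(x):
    # Derivative of the loss above, computed by hand.
    return 4 * x**3 - 4 * x + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_left = gradient_descent(-1.5)   # ends near the global minimum, x ~ -1.03
x_right = gradient_descent(1.5)   # ends near the local minimum,  x ~  0.96
print(x_left, loss(x_left))       # lower loss
print(x_right, loss(x_right))     # stuck at higher loss, no matter how long we run
```

This is only the deterministic caricature; SGD's gradient noise can help escape shallow basins, but it does not restore any general guarantee.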
SetentaeBolg t1_j4qimm0 wrote
There are mathematical proofs of convergence for a single perceptron learning a linearly separable classification (the perceptron convergence theorem). For more realistic modern neural nets, however, I don't believe there are any proofs guaranteeing general convergence, because convergence is not actually guaranteed: for the reason you point out, you can't be certain gradient descent will find the "right" minimum.
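For reference, here is a minimal sketch of that single-perceptron case (illustrative data and names, not from the original comment): on linearly separable data, the perceptron learning rule provably stops making mistakes after finitely many updates (Novikoff's theorem), regardless of initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable 2-D data: labels come from a fixed hyperplane.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = np.sign(X @ true_w)

w = np.zeros(2)      # perceptron weights, initialized at zero
updates = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:   # misclassified (or on the boundary)
            w += yi * xi         # perceptron update rule
            updates += 1
            converged = False

print(f"converged after {updates} updates, w = {w}")
```

The guarantee hinges entirely on linear separability, which is exactly the structure that general non-convex deep-net losses lack.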