Submitted by alex_lite_21 t3_10kjhhb in MachineLearning
[removed]
If you have Y = A B X, then is M = A B full rank? If not, then they're not even equivalent.
You can represent any `m x n` matrix with the product of some `m x k` matrix with a `k x n` matrix, so long as k >= min(m, n). If k is less than that, you're basically adding regularization.
Imagine you have some optimal M in Y = M X. Then if A and B are the right shape (big enough in the k dimension), they can represent that M. If they aren't big enough, then they can't learn that M. If the optimal M doesn't actually need a zillion degrees of freedom, then having a small k bakes that restriction into the model, which would be regularization.
Look up linear bottlenecks.
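A minimal numpy sketch of that rank restriction (shapes picked arbitrarily for illustration): with k < min(m, n), the product `A @ B` can never be full rank, which is the bottleneck acting as regularization.

```python
import numpy as np

# Illustrative only: k < min(m, n) makes the factorization a bottleneck.
rng = np.random.default_rng(0)
m, n, k = 8, 6, 3

A = rng.normal(size=(m, k))
B = rng.normal(size=(k, n))
M = A @ B  # m x n, but rank is at most k

print(np.linalg.matrix_rank(M))  # 3, not the full min(m, n) = 6
```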
Dropout is not strictly a linear function (for any one random mask it happens to be, but the mask changes every step), and for p > 0 it will almost surely stop the two layers from collapsing into one, so yeah, that probably made the difference.
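A quick sketch of that point (made-up shapes, inverted-dropout scaling assumed): with a dropout mask between the two linear layers, the training-time forward pass is no longer the same map as the collapsed product.

```python
import numpy as np

# Toy example: dropout mask between two linear maps vs. the collapsed matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))   # first linear layer
B = rng.normal(size=(3, 5))   # second linear layer
x = rng.normal(size=(4,))

p = 0.5
mask = rng.binomial(1, 1 - p, size=5) / (1 - p)  # inverted dropout

y_train = B @ (mask * (A @ x))  # training-time pass with dropout in between
y_single = (B @ A) @ x          # what a single collapsed layer computes
print(np.allclose(y_train, y_single))  # almost surely False for p > 0
```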
Sorry to be skeptical but I don't think this is really why your one run was better than the other. I think you also changed something else inadvertently.
Technically, and I emphasize the technically, the set of functions representable by a neural network requires only one layer. However, there is little guarantee that you can feasibly find the proper configuration or train the network accurately.
By adding another layer, you can reduce the training burden by spreading it across layers. The extra dropout also allows more regularization.
This is the part of deep learning where it's less science and more, "eh, sounds like it works."
What do you mean by "function represented by a neural network"? If you are hinting in the direction of universal approximation, then yes, you can approximate any continuous function arbitrarily closely with a single layer, sigmoid activation and infinite width. But similarly, there exist results showing a comparable statement for a width-limited and "infinite depth" network (the required depth is not actually infinite, but it depends on the function you want to approximate and is afaik unbounded over the space of continuous functions). In practice, we are far away from either infinite width or depth, so specific configurations can matter.
>I was in the understanding that two contiguous linear layers in a NN would be no better than only one linear layer.
This is correct: In terms of the functions they can represent, two consecutive linear layers are algebraically equivalent to one linear layer.
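In numpy terms (arbitrary shapes), the equivalence is just matrix multiplication being associative:

```python
import numpy as np

# Two stacked linear layers compute the same map as their matrix product.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=(8,))

print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True
```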
LetterRip t1_j5ratja wrote
They learn faster/more easily. You can collapse them down to a single layer after training.
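A rough PyTorch sketch of that collapse (layer sizes made up, assuming no activation or dropout between the two at inference): fuse the trained weights and biases into one `nn.Linear`.

```python
import torch
import torch.nn as nn

# y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
lin1 = nn.Linear(8, 16)
lin2 = nn.Linear(16, 4)

fused = nn.Linear(8, 4)
with torch.no_grad():
    fused.weight.copy_(lin2.weight @ lin1.weight)
    fused.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(1, 8)
print(torch.allclose(lin2(lin1(x)), fused(x), atol=1e-6))  # True
```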