Comments

LetterRip t1_j5ratja wrote

They learn faster/more easily. You can collapse them down to a single layer after training.
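
For example, here's a minimal sketch of that collapse, assuming PyTorch and two linear layers with nothing (no activation, no dropout) between them; the layer sizes are made up:

```python
import torch
import torch.nn as nn

lin1 = nn.Linear(64, 128)   # first linear layer (trained weights assumed)
lin2 = nn.Linear(128, 10)   # second linear layer

# Fuse: y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
fused = nn.Linear(64, 10)
with torch.no_grad():
    fused.weight.copy_(lin2.weight @ lin1.weight)
    fused.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(5, 64)
print(torch.allclose(lin2(lin1(x)), fused(x), atol=1e-5))  # True
```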

3

HateRedditCantQuitit t1_j5r2kt5 wrote

If you have `Y = A B X`, is `M = A B` full rank? If not, then the two layers aren't even equivalent to a single unconstrained linear layer.
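
A quick numeric illustration of that point (the shapes here are made up):

```python
import numpy as np

A = np.random.randn(100, 16)   # m x k
B = np.random.randn(16, 100)   # k x n
M = A @ B                      # 100 x 100, but rank(M) <= 16

print(np.linalg.matrix_rank(M))  # 16, so M can't be an arbitrary 100 x 100 map
```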

2

[deleted] t1_j5r3opg wrote

[deleted]

1

HateRedditCantQuitit t1_j5r5f69 wrote

You can represent any `m x n` matrix as the product of an `m x k` matrix and a `k x n` matrix, as long as `k >= min(m, n)`. If `k` is smaller than that, you're basically adding regularization.

Imagine you have some optimal `M` in `Y = M X`. If `A` and `B` are the right shape (big enough in the `k` dimension), they can represent that `M`. If they aren't big enough, they can't learn that `M`. And if the optimal `M` doesn't actually need a zillion degrees of freedom, then a small `k` bakes that restriction into the model, which acts as regularization.

Look up linear bottlenecks.
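
A minimal sketch of that kind of bottleneck, assuming PyTorch (the sizes `m`, `n`, `k` are made up):

```python
import torch.nn as nn

m, n, k = 512, 512, 32   # k < min(m, n), so the composite map is rank-limited

full = nn.Linear(n, m, bias=False)   # one unconstrained map: m*n parameters
bottleneck = nn.Sequential(          # two layers whose product has rank <= k
    nn.Linear(n, k, bias=False),     # B: k x n
    nn.Linear(k, m, bias=False),     # A: m x k
)
# bottleneck uses k*(m + n) parameters instead of m*n and can only represent
# maps of rank <= k, which bakes the restriction into the model.
```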

3

suflaj t1_j5r4u61 wrote

Dropout is not strictly a linear function (it can be, by chance), and chances are it will add non-linearity for p > 0, so yeah, that probably made the difference.
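
A tiny illustration of that, assuming PyTorch and made-up sizes; in train mode the random mask changes on every call, so Linear -> Dropout -> Linear is not one fixed linear map:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.Dropout(p=0.5), nn.Linear(64, 16))
x = torch.randn(1, 16)

model.train()
print(torch.allclose(model(x), model(x)))  # False: a different mask each call

model.eval()
print(torch.allclose(model(x), model(x)))  # True: dropout is a no-op at eval time
```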

2

tornado28 t1_j5rdebv wrote

Sorry to be skeptical, but I don't think this is really why one of your runs was better than the other. I think you also inadvertently changed something else.

2

gunshoes t1_j5r241t wrote

Technically, and I emphasize the technically, the set of functions representable by a neural network requires only one layer. However, there is little guarantee that you can feasibly find the proper configuration or train the network to it accurately.

By adding another layer, you can reduce the training burden by spreading it across layers. The extra dropout also adds more regularization.

This is the part of deep learning where it's less science and more, "eh, sounds like it works."

1

arg_max t1_j5r8qe6 wrote

What do you mean by "function represented by a neural network"? If you are hinting at universal approximation, then yes, you can approximate any continuous function arbitrarily closely with a single layer, a sigmoid activation, and infinite width. But similarly, there are results showing you can achieve a comparable statement with a width-limited, "infinite-depth" network (the required depth is not actually infinite, but it depends on the function you want to approximate and is, afaik, unbounded over the space of continuous functions). In practice, we are far away from either infinite width or infinite depth, so specific configurations can matter.

1

PredictorX1 t1_j5rb8gp wrote

>I was in the understanding that two contiguous linear layers in a NN would be no better than only one linear layer.

This is correct: In terms of the functions they can represent, two consecutive linear layers are algebraically equivalent to one linear layer.
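
Concretely, writing the two layers as `x -> W1 x + b1` and `h -> W2 h + b2` (with no activation in between), their composition is `W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)`, i.e. a single linear layer with weight `W2 W1` and bias `W2 b1 + b2`.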

1