tysam_and_co t1_jb3i6eq wrote

Right, right, right, though I don't see how dropout introduces bias into the network. Sure, we're subsampling the network in general, but overall the information integrated with respect to a minibatch should be less on the whole due to gradient noise, right? So the bias should be lower and, as a result, we have more uncertainty; more steps then means more integration time, of course, and on we go from there toward that elusive less-biased estimator.

I guess the sticking point is _how_ they're saying that dropout induces bias. I feel like fitting quickly in a non-regularized setting has more bias by default, because I believe the 0-centered noise should end up diluting the loss signal. I think. Right? I find this all very strange.
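
A minimal sketch of the "0-centered noise" intuition, assuming inverted dropout (kept units rescaled by 1/(1 - p)), which the thread does not specify. The names `p_drop`, `x`, and `n_samples` are made up for illustration. It only checks that the perturbation dropout adds to a fixed activation vector averages out to roughly zero; it says nothing about what happens after nonlinearities or over the course of training, which is where the disagreement about bias actually sits.

```python
import numpy as np

rng = np.random.default_rng(0)

p_drop = 0.5                      # hypothetical dropout probability
x = np.array([1.0, -2.0, 3.0])    # a fixed activation vector (illustrative)
n_samples = 100_000

# Inverted dropout: zero each unit with probability p_drop,
# then rescale the survivors by 1 / (1 - p_drop).
masks = rng.random((n_samples, x.size)) >= p_drop
dropped = masks * x / (1.0 - p_drop)

# The noise dropout injects is the difference from the clean activations.
noise = dropped - x
mean_noise = noise.mean(axis=0)   # should be close to zero per unit
noise_std = noise.std(axis=0)     # but the spread is far from zero

print("mean of injected noise:", mean_noise)
print("std of injected noise: ", noise_std)
```

So at the level of a single layer's activations the noise is zero-mean but has substantial spread, which matches the "diluting the signal" picture. Whether the resulting estimator is biased is a separate question about the function being fit.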


Hiitstyty t1_jb4wjnj wrote

It helps to think of the bias-variance trade-off in terms of the hypothesis space. Dropout trains a different subnetwork at every iteration. The hypothesis space of the full network always contains (and is larger than) the hypothesis space of any subnetwork, because the full network has greater expressive capacity. Thus, the full network cannot be any more biased than any subnetwork. However, any subnetwork will have reduced variance because of its relatively smaller hypothesis space. Thus, dropout helps because its reduction in variance more than offsets its increase in bias. However, as the dropout proportion is set increasingly higher, eventually the bias will be too great to overcome.
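
The nested-hypothesis-space argument can be sketched numerically with a toy stand-in. This is not dropout training: it assumes plain linear least squares, where the "full network" uses all features and the "subnetwork" uses only a subset of them, so the smaller hypothesis space is contained in the larger one. The function `bias_variance` and all the constants are hypothetical, chosen only for illustration.

```python
import numpy as np

def bias_variance(X, f_true, cols, noise_std, n_trials, rng):
    """Estimate squared bias and variance of least squares restricted to `cols`."""
    Xs = X[:, cols]
    preds = np.empty((n_trials, X.shape[0]))
    for t in range(n_trials):
        # Fresh noisy targets each trial; the design matrix stays fixed.
        y = f_true + noise_std * rng.standard_normal(X.shape[0])
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        preds[t] = Xs @ w
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # error of the average fit
    variance = np.mean(preds.var(axis=0))          # spread of fits across trials
    return bias_sq, variance

rng = np.random.default_rng(0)
n, d = 40, 8
X = rng.standard_normal((n, d))
f_true = X @ np.ones(d)           # the true function uses every feature

full = list(range(d))             # stand-in for the full network
sub = [0, 1, 2]                   # stand-in for a subnetwork (subset of features)

bias_full, var_full = bias_variance(X, f_true, full, 1.0, 500, rng)
bias_sub, var_sub = bias_variance(X, f_true, sub, 1.0, 500, rng)

print(f"full: bias^2={bias_full:.4f}  variance={var_full:.4f}")
print(f"sub:  bias^2={bias_sub:.4f}  variance={var_sub:.4f}")
```

In this toy setting the restricted model shows higher squared bias and lower variance than the full one, which is the trade-off described above. How closely that carries over to dropout in a deep network, where the subnetworks share weights and change every step, is not something this sketch settles.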