Submitted by radi-cho t3_11izjc1 in MachineLearning
alterframe t1_jcn87ue wrote
Do you have any explanation for why, in Figure 9, the training loss decreases more slowly with early dropout? The previous sections argue that reducing the variance of the mini-batch gradients allows the model to travel a longer distance in parameter space (Figure 1 from the post). It seems that this is not reflected in the value of the loss.
The loss catches up very quickly after dropout is turned off, but I'm still curious about this behavior.
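For context, "early dropout" here means dropout is kept active only for the first part of training and then disabled for the rest. Below is a minimal sketch of that schedule in PyTorch; the model, the fake data, and the cutoff `EARLY_DROPOUT_ITERS` are illustrative assumptions, not the paper's released code or tuned values.

```python
import torch
import torch.nn as nn

# Toy model with one dropout layer; architecture is a placeholder.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # active only during the early phase
    nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Fake data in place of a real DataLoader, just to keep the sketch runnable.
loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(100)]

EARLY_DROPOUT_ITERS = 1000  # assumed cutoff; in practice this is tuned per setup


def set_dropout(model: nn.Module, p: float) -> None:
    # Setting p = 0.0 effectively disables dropout while keeping the layer in place.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p


step = 0
for epoch in range(10):
    for x, y in loader:
        # Turn dropout off once the early phase ends.
        set_dropout(model, 0.1 if step < EARLY_DROPOUT_ITERS else 0.0)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        step += 1
```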