Submitted by radi-cho t3_11izjc1 in MachineLearning
cztomsik t1_jb995yy wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
And maybe also related to lr decay?
Another interesting thing is random sampling - at least at the start of training it seems to help when training causal LMs.
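For concreteness, here's a minimal sketch of what "early dropout" could look like next to a decaying lr schedule - dropout that is only active for the first chunk of training. This is not the paper's exact recipe; the module, the `early_steps` cutoff, and the sizes are made up for illustration.

```python
# Sketch (illustrative only): dropout applied only during the first
# `early_steps` updates, loosely analogous to a decaying lr schedule.
import torch
import torch.nn as nn

class EarlyDropoutMLP(nn.Module):
    def __init__(self, dim=128, p=0.1, early_steps=1000):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(p)
        self.early_steps = early_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Dropout is active only in the early phase of training.
        if self.training and self.step.item() < self.early_steps:
            h = self.dropout(h)
        if self.training:
            self.step += 1
        return self.fc2(h)
```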
alterframe t1_jb9i70h wrote
Interesting. With many probabilistic approaches, where we have some intermediate variables in a graph like X -> Z -> Y, we need to introduce sampling on Z to prevent mode collapse. Then we also decay the entropy of this sampler with temperature.
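One concrete way to do this (just an assumed example, not necessarily what the commenter had in mind) is a Gumbel-softmax relaxation of a discrete Z with a temperature that is annealed over training; the `sample_z` helper and the schedule values below are placeholders.

```python
# Sketch: sample a discrete latent Z with a temperature annealed over
# training (Gumbel-softmax). High early temperature -> high-entropy
# sampler (exploration); low late temperature -> nearly deterministic.
import torch
import torch.nn.functional as F

def sample_z(logits, step, total_steps, t_start=5.0, t_end=0.5):
    frac = min(step / total_steps, 1.0)
    tau = t_start + frac * (t_end - t_start)   # linear temperature decay
    return F.gumbel_softmax(logits, tau=tau, hard=False)

logits = torch.randn(4, 16)                    # encoder output for X -> Z
z_early = sample_z(logits, step=0, total_steps=10_000)
z_late = sample_z(logits, step=10_000, total_steps=10_000)
```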
This is quite similar to this early dropout idea, because there we also have a sampling process that effectively works only at the beginning of training. However, in those other scenarios we would rather attribute it to something like exploration vs. exploitation.
If we had an agent that almost immediately assigned very high probability to bad initial actions, it might never be able to find a proper solution. On a loss landscape, in the worst case, we can also end up in a local minimum very early on, so we use a higher lr at the beginning to make that less likely.
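The "higher lr at the beginning" part is just a standard decaying schedule; a small sketch with plain SGD and cosine decay (all numbers illustrative, nothing tuned):

```python
# Sketch: large initial lr that decays over training (cosine schedule).
import torch

model = torch.nn.Linear(128, 128)
opt = torch.optim.SGD(model.parameters(), lr=1e-1)   # large early lr
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000, eta_min=1e-4)

for step in range(1_000):
    # ...forward pass, loss.backward() would go here...
    opt.step()                      # placeholder update so the example runs
    sched.step()                    # lr decays from 1e-1 towards 1e-4
    if step % 250 == 0:
        print(step, sched.get_last_lr()[0])
```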
Maybe in general random sampling could be safer than using a higher lr? A high lr can still fail for some models. If, by analogy, we use it just to boost early exploration, then maybe randomness could be a good alternative. That would kind of counter all the claims based on analysis of convex functions...
cztomsik t1_jbgexxt wrote
Another interesting idea might be to start training with a smaller context len (and a bigger batch size, together with random sampling) - a rough sketch of such a schedule is below.
If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. And it's also good if we have a lot of variance in these early stages.
So it makes some sense; BERT-style MLM is also very similar to what people do when learning languages :)
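A toy version of that context-length curriculum might look like the stage table below - the boundaries and sizes are made-up illustrative numbers, not anything from the paper.

```python
# Sketch: grow the context length (and shrink the batch) as training progresses.
def curriculum_stage(step):
    # returns (context_len, batch_size) for the current training phase
    if step < 2_000:
        return 128, 256
    elif step < 10_000:
        return 512, 64
    else:
        return 2048, 16

for step in (0, 5_000, 20_000):
    ctx, bs = curriculum_stage(step)
    print(f"step {step}: context_len={ctx}, batch_size={bs}")
```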