
alterframe t1_jb6ye5w wrote

Has anyone noticed this with weight decay too?

For example here: GIST

It's as if a larger weight decay provides regularization, which leads to slower training as we would expect, but setting a small weight decay makes training even faster than using no decay at all. I wonder if it may be related.
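Roughly the kind of comparison I mean, as a minimal made-up sketch (not the actual gist): same model, data, and init, only the weight_decay value changes.

```python
import torch
import torch.nn as nn

def train_with_weight_decay(wd, steps=500, seed=0):
    # Toy model and synthetic data, purely illustrative.
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    x = torch.randn(1024, 32)
    y = x.sum(dim=1, keepdim=True)  # simple synthetic target
    losses = []
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Only weight_decay differs between runs; the 0.0 / 0.01 / 0.1 values are arbitrary.
curves = {wd: train_with_weight_decay(wd) for wd in (0.0, 0.01, 0.1)}
for wd, losses in curves.items():
    print(f"weight_decay={wd}: final loss {losses[-1]:.4f}")
```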

1

cztomsik t1_jb995yy wrote

And maybe also related to lr decay?

Another interesting thing is random sampling - at least at the start of training, it seems to help when training causal LMs.

1

alterframe t1_jb9i70h wrote

Interesting. With many probabilistic approaches, where we have an intermediate variable Z in a graph like X -> Z -> Y, we need to introduce sampling on Z to prevent mode collapse. Then we also decay the entropy of this sampler with a temperature schedule.
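Roughly what I mean, as a minimal sketch (a Gumbel-softmax sampler with a linearly annealed temperature; the shapes and schedule are made up, not from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(16, 8)   # X -> logits over 8 latent categories (Z)
decoder = nn.Linear(8, 4)    # Z -> Y

def forward(x, step, total_steps, tau_start=2.0, tau_end=0.1):
    # Anneal temperature: high tau early gives high-entropy, exploratory samples;
    # low tau late gives near-deterministic (low-entropy) samples.
    frac = step / total_steps
    tau = tau_start + (tau_end - tau_start) * frac
    logits = encoder(x)
    z = F.gumbel_softmax(logits, tau=tau, hard=False)  # sampled soft one-hot over Z
    return decoder(z)

x = torch.randn(32, 16)
y_early = forward(x, step=0, total_steps=10_000)      # noisy Z samples
y_late = forward(x, step=9_999, total_steps=10_000)   # almost deterministic Z
```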

This is quite similar to the early dropout idea, because there we also have a sampling process that is effectively active only at the beginning of training. However, in those other scenarios, we usually attribute the benefit to something like exploration vs. exploitation.

If an agent almost immediately assigns very high probability to bad initial actions, it may never be able to find a proper solution. On a loss landscape, in the worst case we can likewise end up in a local minimum very early on, so we use a higher lr at the beginning to make that less likely.
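For reference, a minimal sketch of that "high lr early, decay later" schedule (the model, optimizer, and numbers are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.5)  # intentionally high initial lr
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000, eta_min=0.01)

for step in range(1000):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()  # lr decays from 0.5 toward 0.01 over training
    if step % 250 == 0:
        print(step, sched.get_last_lr()[0])
```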

Maybe, in general, random sampling could be safer than using a higher lr? A high lr can still make training fail for some models. If, by analogy, we use it just to boost early exploration, then maybe randomness could be a good alternative. That would kind of counter all the claims based on analysis of convex functions...

2

cztomsik t1_jbgexxt wrote

Another interesting idea might be to start training with a smaller context len (and a bigger batch size, together with random sampling).
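A minimal sketch of such a schedule (all numbers are made up): grow the context length over training while shrinking the batch size, so tokens per step stay roughly constant.

```python
def curriculum(step, total_steps, min_ctx=128, max_ctx=2048, tokens_per_step=262_144):
    # Linearly grow context length; derive batch size from a fixed token budget.
    frac = min(step / total_steps, 1.0)
    ctx_len = int(min_ctx + (max_ctx - min_ctx) * frac)
    batch_size = max(tokens_per_step // ctx_len, 1)
    return ctx_len, batch_size

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, curriculum(step, total_steps=10_000))
```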

If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. And it's also good to have a lot of variance at these early stages.

So it makes some sense; BERT-style MLM is also very similar to what people do when learning languages :)

1