cztomsik t1_jbgexxt wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Another interesting idea might be to start training with a smaller context length (and a bigger batch size, together with random sampling).
If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. And it's also good if we have a lot of variance in these early stages.
So it makes some sense; BERT's MLM objective is also very similar to what people do when learning languages :)
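Something like this, as a rough sketch (PyTorch here; the linear length schedule, the fixed token budget, and the helper names are my own assumptions, not anything from the paper):

```python
import torch

def curriculum_schedule(step, total_steps,
                        min_len=64, max_len=1024,
                        tokens_per_batch=65_536):
    """Grow the context length linearly over training and shrink the
    batch size so the number of tokens per step stays roughly constant."""
    frac = min(step / total_steps, 1.0)
    ctx_len = int(min_len + frac * (max_len - min_len))
    batch_size = max(tokens_per_batch // ctx_len, 1)
    return ctx_len, batch_size

def sample_batch(corpus_ids, ctx_len, batch_size):
    """Randomly sample contiguous windows from a flat 1-D tensor of
    token ids (the "random sampling" part of the idea)."""
    starts = torch.randint(0, corpus_ids.numel() - ctx_len, (batch_size,)).tolist()
    return torch.stack([corpus_ids[s : s + ctx_len] for s in starts])

# toy run on a fake corpus of 1M random token ids
corpus = torch.randint(0, 50_000, (1_000_000,))
for step in range(0, 10_000, 2_500):
    ctx_len, bs = curriculum_schedule(step, total_steps=10_000)
    batch = sample_batch(corpus, ctx_len, bs)
    print(step, batch.shape)  # e.g. torch.Size([1024, 64]) early: short and wide
```

The point of scaling the two together is that the tokens-per-step budget stays roughly constant, so the early short-context phase sees many more independent samples per step, which is where the extra variance comes from.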