cztomsik t1_jbgexxt wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Another interesting idea might be to start training with a smaller context length (and a bigger batch size, together with random sampling).
If you think about it, people also learn noun-verb pairs first, then move on to sentences, and then to longer paragraphs/articles, etc. And it's also good if we have a lot of variance in these early stages.
So it makes some sense; BERT's MLM objective is also very similar to what people do when learning languages :)
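Something like this, as a rough sketch (PyTorch here; the linear length schedule, the fixed token budget, and the helper names are my own assumptions, not anything from the paper):

```python
import torch

def curriculum_schedule(step, total_steps,
                        min_len=64, max_len=1024,
                        tokens_per_batch=65_536):
    """Grow the context length linearly over training and shrink the
    batch size so the number of tokens per step stays roughly constant."""
    frac = min(step / total_steps, 1.0)
    ctx_len = int(min_len + frac * (max_len - min_len))
    batch_size = max(tokens_per_batch // ctx_len, 1)
    return ctx_len, batch_size

def sample_batch(corpus_ids, ctx_len, batch_size):
    """Randomly sample contiguous windows from a flat 1-D tensor of
    token ids (the "random sampling" part of the idea)."""
    starts = torch.randint(0, corpus_ids.numel() - ctx_len, (batch_size,)).tolist()
    return torch.stack([corpus_ids[s : s + ctx_len] for s in starts])

# toy run on a fake corpus of 1M random token ids
corpus = torch.randint(0, 50_000, (1_000_000,))
for step in range(0, 10_000, 2_500):
    ctx_len, bs = curriculum_schedule(step, total_steps=10_000)
    batch = sample_batch(corpus, ctx_len, bs)
    print(step, batch.shape)  # e.g. torch.Size([1024, 64]) early: short and wide
```

The point of scaling the two together is that the tokens-per-step budget stays roughly constant, so the early short-context phase sees many more independent samples per step, which is where the extra variance comes from.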