TheInfelicitousDandy t1_ir5syrj wrote
Reply to comment by jeankaddour in [R] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging by rlresearcher
It's not really about suitability; it just doesn't follow the standard evaluation set-ups. MLM models are generally trained on some version of BookCorpus plus a Wikipedia dump (and each model tends to use its own version, which already makes comparisons hard), so RoBERTa is really meant to be trained on much more data. That training recipe only uses Wiki103 because it is smallish and easily available. 1) Training on a smaller dataset risks introducing issues like overfitting and poorly tuned hyper-parameters, which matter a lot for an optimization paper. 2) It also means I can't easily compare the numbers to previous work, despite knowing the literature well; I'm pretty sure a PPL of 4 is really high for a bidirectional model (even though it would be very low for an autoregressive model). 3) PPL isn't even well-defined for a bidirectional model, since it doesn't assign a valid probability to a sequence.
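To make point 3 concrete: what people usually report for bidirectional models instead is a pseudo-perplexity, where you mask one position at a time and score the original token. A rough sketch of the idea, assuming HuggingFace transformers and using bert-base-uncased purely as an example:

```python
# Sketch of pseudo-perplexity for a masked LM: mask each position in turn
# and score the original token. Needs one forward pass per token, so it is
# much more expensive than ordinary PPL and still not directly comparable
# to autoregressive perplexity numbers.
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nll, n_scored = 0.0, 0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[ids[i]].item()
        n_scored += 1
    return math.exp(nll / n_scored)

print(pseudo_perplexity("The quick brown fox jumps over the lazy dog."))
```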
I guess this caught my eye (and I don't mean to be rude) because it feels like something someone would do when they don't really know the language modelling literature but just want to use it as a test environment for optimization. The easy solution would be to use the adaptive-inputs LM on Wiki103, which is in fairseq and is a standard model/dataset combination with reproducible training scripts.
jeankaddour t1_ir66cik wrote
Thank you very much. This is extremely useful feedback and I appreciate the time you spent writing it! I will look into using the adaptive-inputs LM on Wiki103 next time. I suspect BookCorpus + a wiki dump won't fit in my computational budget, but I might try. Your guess that I'm new to the LM literature and only want to use it as a testbed for optimization is right :) therefore, again, thanks for sharing your insights!
jeankaddour t1_irdvad7 wrote
Thanks again for this feedback. I haven't trained on a different dataset yet, but in the meantime I replaced all BERT perplexity numbers/plots with the MLM losses. The updated paper went up on arXiv today.
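For anyone curious, by MLM loss I mean the standard masked-LM cross-entropy over masked positions, roughly along the lines of this sketch (assuming HuggingFace transformers; the model name and 15% masking probability are illustrative, not necessarily the exact setup used in the paper):

```python
# Minimal sketch of the BERT-style MLM loss (cross-entropy over randomly
# masked positions), reported instead of perplexity. Model name and masking
# probability are illustrative only.
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

texts = ["Weight averaging can speed up training.",
         "Perplexity is ill-defined for MLMs."]
batch = collator([tokenizer(t) for t in texts])  # masks tokens, builds labels

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
print("MLM loss (mean cross-entropy over masked positions):", out.loss.item())
```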