
TheInfelicitousDandy t1_ixi0q8a wrote

Also, not to mention the entire process that happens after finding the revolutionary idea, where you have to squint at it hard enough to be able to say 'this would be applicable to this other domain or task if we just change a few things', and then write the 8 pages of explanation arguing that the changes are not novel but direct, banal outcomes of the original work (when interpreting the original work with the necessary squint).

19

TheInfelicitousDandy t1_is0ajet wrote

As far as I know that version doesn't give comparable PPL.

Someone else saying the same https://github.com/salesforce/awd-lstm-lm/issues/86#issuecomment-453266265

A major issue here (and with other reproductions) is people claiming they have a reproduction because they can run the code without errors, without ever actually getting the same results.

1

TheInfelicitousDandy t1_irsfw1a wrote

I've tried to reimplement AWD-LSTM in PyTorch > 1.0 and have never been able to get close to the original results. I've also seen other people try and fail. I'm pretty sure it has to do with the weight dropout they used.
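For context, the weight dropout in question is DropConnect applied to the LSTM's hidden-to-hidden weight matrix. A minimal sketch of the trick (my paraphrase, not the exact salesforce/awd-lstm-lm code) looks roughly like this; the fragile part is that the raw weight is stashed under a new name and a dropped-out copy is written back on every forward pass, which interacts badly with how newer PyTorch manages the RNN's flattened weight buffers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDrop(nn.Module):
    """DropConnect on named weights of a wrapped RNN (sketch).

    The listed weights are removed from the wrapped module's parameters,
    kept as '<name>_raw', and a dropped-out copy is assigned back before
    each forward. Newer PyTorch flattens RNN weights internally, which is
    one plausible reason this trick stopped reproducing the original PPL.
    """

    def __init__(self, module, weights, dropout=0.5):
        super().__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        for name in weights:
            w = getattr(module, name)
            # remove the original parameter and keep a raw copy
            del module._parameters[name]
            module.register_parameter(name + "_raw", nn.Parameter(w.data))

    def forward(self, *args):
        for name in self.weights:
            raw = getattr(self.module, name + "_raw")
            # dropped-out view of the weight; identity in eval mode
            w = F.dropout(raw, p=self.dropout, training=self.training)
            setattr(self.module, name, w)
        return self.module(*args)

# tiny usage example on CPU
lstm = nn.LSTM(4, 4)
wd = WeightDrop(lstm, ["weight_hh_l0"], dropout=0.5)
x = torch.randn(3, 2, 4)  # (seq_len, batch, input_size)
out, _ = wd(x)
```

Even when this runs without errors on recent PyTorch, that says nothing about matching the paper's numbers, which is exactly the reproduction trap mentioned above.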

If anyone knows of a PyTorch > 1.0 version that achieves the same PPL on PTB/WikiText-2, I'd very much like to know.

3

TheInfelicitousDandy t1_ir5syrj wrote

It's not really about being suitable, but rather that it doesn't follow standard evaluation set-ups. In general, MLM models are trained on some version of Book Corpus + a Wikipedia dump (and each model tends to use its own version, which makes comparisons hard). As such, RoBERTa is really meant to be trained on much more data; that training recipe only uses Wiki103 because it is smallish and easily available. 1) By training on a smaller dataset, you risk introducing a bunch of issues like overfitting and poorly tuned hyper-parameters, which are kind of important for an optimization paper. 2) It also means I can't easily compare it to previous work, despite knowing the literature well. I'm pretty sure a PPL of 4 is really high for a bidirectional model (even if it's very low for an autoregressive model). 3) PPL isn't even well-defined for a bidirectional model, since it doesn't form a valid probability over a sequence.

I guess this caught my eye (and I don't mean to be rude) because it feels like something someone would do when they don't really know the language modelling literature but just want to use it as a test environment for optimization. The easy solution would be to use the adaptive-inputs LM on Wiki103, which is in fairseq and is a standard model/dataset combination with reproducible training scripts.
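On point 3, a toy example makes the issue concrete. Take a made-up joint distribution over a two-token "sequence": the autoregressive chain-rule factorization recovers a valid distribution, while the product of each-token-given-the-rest conditionals (which is what a bidirectional model effectively scores) doesn't sum to one, so there is no perplexity to speak of:

```python
# Made-up joint distribution p(x1, x2) over two binary tokens.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def p_x1(a):  # marginal p(x1 = a)
    return sum(joint[(a, b)] for b in (0, 1))

def p_x2(b):  # marginal p(x2 = b)
    return sum(joint[(a, b)] for a in (0, 1))

def p_x2_given_x1(b, a):
    return joint[(a, b)] / p_x1(a)

def p_x1_given_x2(a, b):
    return joint[(a, b)] / p_x2(b)

# Autoregressive score: p(x1) * p(x2 | x1) -- chain rule, recovers the joint.
ar = {(a, b): p_x1(a) * p_x2_given_x1(b, a) for (a, b) in joint}

# Bidirectional-style score: each token conditioned on all the others.
mlm = {(a, b): p_x1_given_x2(a, b) * p_x2_given_x1(b, a) for (a, b) in joint}

print(sum(ar.values()))   # ~1.0: a valid distribution over sequences
print(sum(mlm.values()))  # ~1.17: not a distribution, so PPL is ill-defined
```

That's why people report pseudo-perplexity-style numbers for masked models instead, and why those numbers aren't comparable to autoregressive PPL.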

3