
Dropkickmurph512 t1_j9sxa1j wrote

Agreed about the overparameterized models, but learning the noise definitely doesn't help. The noise mostly comes from measurement error, quantization, and other sources that are not in the vector space of the signals you care about. This is why early stopping can be useful and actually acts as a regularizer. If you want a good example, look into the denoising properties of deep image prior: it can remove noise by training on a single noisy image and stopping before the image is learned completely.
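A minimal sketch of that effect in PyTorch; the tiny ConvNet, image, and step count here are illustrative stand-ins, not the actual deep image prior architecture:

```python
import torch
import torch.nn as nn

# Deep-image-prior-style setup: fit a small ConvNet to a single noisy image,
# starting from a fixed random input. The network reproduces the clean
# structure long before it memorizes the noise, so stopping early denoises.
noisy = torch.rand(1, 3, 64, 64)   # placeholder for the single noisy image
z = torch.randn(1, 3, 64, 64)      # fixed random input

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):            # stop well before the fit is complete
    opt.zero_grad()
    loss = ((net(z) - noisy) ** 2).mean()  # fit the noisy image directly
    loss.backward()
    opt.step()

denoised = net(z).detach()         # early output approximates the clean image
```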


suflaj t1_j9sxn6g wrote

You say it doesn't help, yet double descent says otherwise. You do not early stop transformer models the way you do with other models, outside of maybe finetuning on a similar task.
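For context, "the way you do with other models" means patience-style early stopping on a validation set, roughly like this toy sketch (the model and data are placeholders; the point is the stopping rule):

```python
import torch
import torch.nn as nn

# Toy model and synthetic data, purely to illustrate the stopping rule.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x_train, y_train = torch.randn(256, 10), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

best_val, patience, bad_epochs = float("inf"), 5, 0
best_state = None
for epoch in range(1000):
    opt.zero_grad()
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        val = ((model(x_val) - y_val) ** 2).mean().item()
    if val < best_val:
        best_val, bad_epochs = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation stopped improving: stop early
            break
model.load_state_dict(best_state)  # keep the best validation checkpoint
```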

But for pretraining, no way. Big transformers are trained by setting some hyperparameters and then checking in on the run the next day. If the model has learned something, you keep going; if it diverged, you load the last good checkpoint, change the hyperparameters, and train from there.
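A rough sketch of that checkpoint-and-retry loop, again with a toy model and deliberately aggressive learning rate standing in for a real pretraining run:

```python
import torch
import torch.nn as nn

# Toy stand-in for the "train, check, maybe roll back" workflow.
model = nn.Linear(10, 1)
lr = 1.0  # deliberately aggressive so divergence is plausible
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

last_good = {k: v.clone() for k, v in model.state_dict().items()}
for day in range(5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    diverged = False
    for x, y in data:  # one "day" of training
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
        if not torch.isfinite(loss):
            diverged = True
            break
    if diverged:
        model.load_state_dict(last_good)  # roll back to last good checkpoint
        lr /= 10                          # change hyperparameters and retry
    else:
        last_good = {k: v.clone() for k, v in model.state_dict().items()}
```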

Early stopping would imply that you're confident your hyperparameters are good and that you have a general idea of how long training will take and how much the model can learn. For big transformers, neither is the case.
