
suflaj t1_j9sxn6g wrote

You say it doesn't help, yet double descent suggests otherwise. You don't early-stop transformer models the way you do other models, outside of maybe finetuning on a similar task.

But for pretraining, no way. Big transformers are trained by setting some hyperparameters and then checking on the run the next day. If the model learned something, you keep going; if it diverged, you load the last good checkpoint, change the hyperparameters, and resume from there.
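A minimal sketch of that checkpoint-and-resume loop, assuming a PyTorch model. The names here (`train_one_interval`, the checkpoint path, the divergence threshold, the halving of the learning rate) are illustrative placeholders, not anyone's actual recipe:

```python
import math
import torch

CKPT_PATH = "last_good.pt"  # placeholder path

def diverged(loss: float) -> bool:
    # Crude divergence check: NaN/inf loss or loss blowing up.
    return math.isnan(loss) or math.isinf(loss) or loss > 1e3

def babysit(model, optimizer, train_one_interval, total_steps, interval):
    """Run training in intervals; on divergence, reload the last good
    checkpoint and change a hyperparameter (here: cut the LR) before
    continuing. `train_one_interval` is a user-supplied callable that
    runs `interval` steps and returns the latest loss."""
    step = 0
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT_PATH)
    while step < total_steps:
        loss = train_one_interval(model, optimizer, interval)
        if diverged(loss):
            # Roll back to the last good state.
            state = torch.load(CKPT_PATH)
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optim"])
            step = state["step"]
            # "Change the hyperparameters" -- e.g. halve the LR.
            for group in optimizer.param_groups:
                group["lr"] *= 0.5
        else:
            step += interval
            torch.save({"model": model.state_dict(),
                        "optim": optimizer.state_dict(),
                        "step": step}, CKPT_PATH)
    return model
```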

Early stopping would imply that you're confident your hyperparameters are good and that you have a general idea of how long training will take and how much the model can learn. For big transformers, neither is the case.
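For contrast, this is the kind of generic early-stopping-with-patience loop being argued against here; it presumes you trust your hyperparameters and roughly know the training horizon. `train_epoch` and `validate` are hypothetical user-supplied callables:

```python
def train_with_early_stopping(model, train_epoch, validate,
                              max_epochs=100, patience=5):
    """Stop once validation loss hasn't improved for `patience`
    consecutive epochs, then restore the best weights seen."""
    best_val = float("inf")
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_epoch(model)
        val_loss = validate(model)
        if val_loss < best_val:
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stop

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```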

0