Submitted by begooboi t3_119zmpd in deeplearning
suflaj t1_j9puhj0 wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
Overfit on what? These models are too small to truly overfit on their datasets.
They overfit on noise, which seems to be one reason for their good performance, so it's actually something you want: learning what the noise looks like helps generalization. Once the model starts figuring out the noise, it can generalize beyond what the data would usually allow.
EDIT: Also, a larger model is more easily internally restructured. Overparametrized models are sort of like very big rooms: it's easier to rearrange the same furniture in a larger room than in a smaller one.
levand t1_j9qe7ev wrote
> These models are too small to truly overfit on their datasets.
I thought we were talking about 175 billion parameters, literally some of the biggest models in existence? Although it is true that at some point models get big enough that they become less prone to overfitting (and it's not clear why): https://openai.com/blog/deep-double-descent/
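For a rough feel of what that double-descent curve looks like, here's a toy sketch with random-feature regression in numpy (the ReLU-feature setup and all sizes are made-up illustrations, not anything from the linked post):

```python
# Toy double-descent sketch: random ReLU features + minimum-norm least squares.
# Every size here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 80, 2000, 10
w_true = rng.normal(size=d)

def make_data(n):
    x = rng.normal(size=(n, d))
    y = x @ w_true + 0.5 * rng.normal(size=n)   # linear signal + label noise
    return x, y

x_train, y_train = make_data(n_train)
x_test, y_test = make_data(n_test)

def test_error(n_features):
    # Random ReLU features: model capacity grows with n_features.
    proj = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi_train = np.maximum(x_train @ proj, 0.0)
    phi_test = np.maximum(x_test @ proj, 0.0)
    # Minimum-norm least-squares fit (no explicit regularization).
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    return np.mean((phi_test @ coef - y_test) ** 2)

for m in [10, 40, 80, 160, 640, 2560]:
    print(f"{m:5d} features -> test MSE {test_error(m):.3f}")
```

The test error typically spikes near the interpolation threshold (number of features roughly equal to the number of training samples) and then drops again as the model gets heavily overparametrized, which is the double-descent shape the post describes.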
suflaj t1_j9qectx wrote
That is peanuts for the datasets they are trained on. We're talking datasets on the order of terabytes, and the model doesn't usually even iterate over more than 10% of that. So you can't really overfit the model unless you're dealing with duplicates, because you never even go through the whole dataset.
Even if the model had 1 trillion parameters and iterated over the whole dataset, it would be too small for the number of relations contained within a dataset of 1 trillion+ bytes. AND THAT'S IF THEY WERE LINEAR, which they are (usually) NOT.
So there is a large overhead in needing multiple sets of parameters to define just one type of relation. Not to mention that some of these models are trained on data pairs, which means the SQUARE of that number of relations. We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
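To put rough numbers on that argument (every figure below is a back-of-envelope assumption, not a measured one):

```python
# Back-of-envelope version of the argument above. All numbers are rough
# assumptions chosen for illustration.
params = 175e9                      # e.g. a GPT-3-sized model
dataset_bytes = 1e12                # a ~1 TB text corpus
bytes_per_token = 4                 # very rough average
tokens = dataset_bytes / bytes_per_token

fraction_seen = 0.10                # only a slice of the corpus gets iterated over
tokens_seen = tokens * fraction_seen

print(f"tokens in corpus:      {tokens:.2e}")
print(f"tokens actually seen:  {tokens_seen:.2e}")
print(f"params per seen token: {params / tokens_seen:.2f}")

# Counting pairwise relations between items (the "square" mentioned above)
# makes the gap explode:
pairwise = tokens ** 2
print(f"pairwise relations:    {pairwise:.2e}")
print(f"params per pair:       {params / pairwise:.2e}")
```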
OnceReturned t1_j9rb03o wrote
>We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
Solutions for what, exactly? Memorizing the entire internet (or entire training set, but still)?
Dropkickmurph512 t1_j9sxa1j wrote
Agree about the overparametrized models, but learning the noise definitely doesn't help. The noise is mostly from measurement error, quantization, and other stuff that is not in the vector space of the signals you care about. That is why early stopping can be useful and actually acts as a regularizer. If you want a good example, look into the denoising properties of deep image prior: it can remove noise by training on a single image and stopping before it learns the image completely.
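Here's a minimal sketch of that kind of experiment in PyTorch (the tiny network, synthetic "image", and hyperparameters are illustrative stand-ins, not the original deep image prior setup):

```python
# Deep-image-prior-style sketch: fit a network to ONE noisy image from a fixed
# random input; stopping early tends to recover the clean structure before the
# network starts reproducing the noise.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "clean" image and its noisy observation (64x64 grayscale).
clean = torch.zeros(1, 1, 64, 64)
clean[:, :, 16:48, 16:48] = 1.0                # a simple square as the signal
noisy = clean + 0.3 * torch.randn_like(clean)  # measurement noise

net = nn.Sequential(
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
z = torch.randn(1, 8, 64, 64)                  # fixed random input code
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1, 1501):
    opt.zero_grad()
    out = net(z)
    loss = ((out - noisy) ** 2).mean()          # fit the *noisy* target
    loss.backward()
    opt.step()
    if step % 300 == 0:
        # Error against the clean image usually drops, then rises again once
        # the network starts memorizing the noise -- hence early stopping.
        mse_clean = ((out.detach() - clean) ** 2).mean()
        print(f"step {step}: loss vs noisy {loss.item():.4f}, "
              f"MSE vs clean {mse_clean.item():.4f}")
```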
suflaj t1_j9sxn6g wrote
You say it doesn't help, yet double descent says otherwise. You do not early stop transformer models the way you do with other models, outside of maybe finetuning on a similar task.
But pretraining - no way. Big transformers are trained by setting some hyperparameters and then checking on the run the next day. If the model learned something, you keep going; if it diverged, you load the last good checkpoint, change the hyperparameters, and train with that.
Early stopping would imply that you're confident your hyperparameters are good and that you have a general idea of how long training will take and how much the model can learn. For big transformers, neither is the case.
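Roughly, the workflow looks like this (a toy PyTorch sketch; the stand-in linear model, the divergence check, and the lr-halving policy are my own illustrative choices, not how any particular lab does it):

```python
# "Check on it, roll back if it diverged" loop, with a toy model standing in
# for a big transformer.
import copy
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)                        # stand-in for a large transformer
lr = 1.0                                        # deliberately aggressive start
best_state = copy.deepcopy(model.state_dict())  # "last good checkpoint"

def train_for_a_while(model, lr, steps=200):
    """One 'overnight' chunk of training on a toy regression task."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = torch.tensor(float("nan"))
    for _ in range(steps):
        x = torch.randn(32, 16)
        y = x.sum(dim=1, keepdim=True)
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for day in range(5):
    loss = train_for_a_while(model, lr)
    diverged = not math.isfinite(loss) or loss > 1e3
    if diverged:
        # Load the last good checkpoint and retry with gentler hyperparameters.
        model.load_state_dict(best_state)
        lr *= 0.5
        print(f"day {day}: diverged, rolled back, new lr={lr:.3f}")
    else:
        best_state = copy.deepcopy(model.state_dict())
        print(f"day {day}: loss={loss:.4f}, checkpoint kept")
```

The point is that there is no validation-based stopping criterion in the loop at all, just a keep-or-roll-back decision each time you check in.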
junetwentyfirst2020 t1_j9rghm4 wrote
You’re being very loose with the word noise here.
suflaj t1_j9swm0z wrote
I'm not sure what you mean. I'm using the usual definition of noise.