Submitted by begooboi t3_119zmpd in deeplearning

We know that a 175-billion-parameter GPT model generates better text than a 1-billion-parameter GPT model. In CNNs we know that deeper models learn more complex feature maps, which makes them better image learners. Is there any such theory that explains the performance of big transformers?

7

Comments

Appropriate_Ant_4629 t1_j9p4z0e wrote

A bigger array holds more information than a smaller one.

^(You'd need to refine your question. It's obvious that a bigger model could outperform a smaller one -- simply by noticing that it could be made identical to the smaller one by just setting the rest of its weights to zero. For every single one of those extra weights, if there's any value better than zero, the larger model would be better.)
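
A minimal numpy sketch of that zero-padding argument, with arbitrary toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 toy inputs of dimension 8

W_small = rng.normal(size=(8, 16))   # "small model": 16 hidden units
W_big = np.zeros((8, 64))            # "big model": 64 hidden units...
W_big[:, :16] = W_small              # ...but only the first 16 are non-zero

h_small = np.maximum(x @ W_small, 0)  # ReLU features of the small model
h_big = np.maximum(x @ W_big, 0)      # ReLU features of the big model

# The extra units output exactly zero, so any readout that ignores them
# (or weights them by zero) reproduces the small model's predictions --
# and each extra weight is free to move away from zero if that helps.
assert np.allclose(h_big[:, :16], h_small)
assert np.allclose(h_big[:, 16:], 0)
```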

8

Dropkickmurph512 t1_j9pnws1 wrote

NTK (neural tangent kernel) theory kind of looks into this, but for a more general case. The math gets wild, though. The real answer is that no one knows the real reason.
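
For reference, the object NTK theory studies can be written down for a toy network in a few lines; this is just the finite-width, empirical version (the architecture and sizes here are arbitrary toy choices):

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(4, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
x = torch.randn(8, 4)  # 8 toy inputs of dimension 4

def flat_grad(scalar_out):
    """Gradient of one scalar network output w.r.t. all parameters, flattened."""
    grads = torch.autograd.grad(scalar_out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Jacobian of the outputs w.r.t. the parameters, one row per input.
jac = torch.stack([flat_grad(net(x[i:i + 1]).squeeze()) for i in range(len(x))])
# Empirical NTK: kernel entry (i, j) is the dot product of parameter gradients
# at inputs i and j. NTK theory studies this object in the infinite-width limit.
ntk = jac @ jac.T
print(ntk.shape)  # (8, 8)
```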

1

suflaj t1_j9puhj0 wrote

Overfit on what? These models are too small to truly overfit on their datasets.

They overfit on noise, which seems to be one reason for their good performance, so it's something you want. That is not a bad thing: learning what the noise is helps generalization. Once the model starts figuring out what is noise, it can generalize beyond what the data would otherwise allow.

EDIT: Also, a larger model is more easily internally restructured. Overparametrized models are sort of like very big rooms. It's easier to rearrange the same furniture in a larger room than it is in a smaller one.

2

artsybashev t1_j9puq9o wrote

It is in a way the same phenomenon. If you think about the information in images, overfitting would mean learning even the noise patterns in the images. If your training data does not have enough real information to fill the model's capacity, the model will start to learn the noise and overfit to your data.

3

suflaj t1_j9qectx wrote

That is peanuts for the datasets they are trained on. We're talking datasets on the order of terabytes, and the model doesn't usually even iterate over more than 10% of that. So you can't even overfit a model unless you're dealing with duplicates, because you will never even go through the whole dataset.

Even if the model had 1 trillion parameters and iterated over the whole dataset, it would be too small for the number of relations contained within a dataset of 1 trillion+ bytes. AND THAT'S IF THEY WERE LINEAR, which they are (usually) NOT.

So there is a large overhead in needing multiple sets of parameters to define just one type of relation. Not to mention that some of these models are trained on data pairs, which means the SQUARE of that number of relations. We're talking about a physically impossible number of parameters here, which would require solutions radically different from simple matrix multiplication and nonlinear activations.
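
Rough back-of-the-envelope arithmetic for that claim (the exact numbers are placeholders, only the orders of magnitude matter):

```python
params = 1e12                 # hypothetical 1-trillion-parameter model
corpus_bytes = 1e12           # roughly a 1 TB training corpus

pairwise_relations = corpus_bytes ** 2   # training on data pairs squares the count
print(f"parameters:          {params:.0e}")
print(f"pairwise relations:  {pairwise_relations:.0e}")
print(f"relations per param: {pairwise_relations / params:.0e}")
# Even in this best case, ~1e12 relations would have to share every single
# parameter, so one-parameter-per-relation memorization is physically impossible.
```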

5

OnceReturned t1_j9rb03o wrote

>We're talking about a physically impossible number of parameters here, which would require solutions radically different from simple matrix multiplication and nonlinear activations.

Solutions for what, exactly? Memorizing the entire internet (or the entire training set, but still)?

1

Dropkickmurph512 t1_j9sxa1j wrote

Agree about the overparametrized models, but learning the noise definitely doesn't help. The noise is mostly from measurement error/quantization and other stuff that is not in the vector space of the signals you care about. That is why early stopping can be useful and actually acts as a regularizer. If you want a good example, look into the denoising properties of the deep image prior. It can remove noise by training on a single image and stopping before it learns the image completely.
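
A minimal sketch of that deep-image-prior effect (toy architecture and hyperparameters, not the paper's): fit a net to reproduce one noisy image from a fixed random input, and watch the error against the clean image -- it drops before the net starts fitting the noise, which is why stopping early denoises.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
clean = torch.rand(1, 1, 32, 32)               # stand-in "clean" image
noisy = clean + 0.2 * torch.randn_like(clean)  # add measurement noise
z = torch.randn(1, 8, 32, 32)                  # fixed random input code

net = torch.nn.Sequential(
    torch.nn.Conv2d(8, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1, 2001):
    opt.zero_grad()
    out = net(z)
    loss = F.mse_loss(out, noisy)   # we only ever fit the NOISY target
    loss.backward()
    opt.step()
    if step % 250 == 0:
        # Error vs the clean image typically falls first and rises later,
        # once the net starts reproducing the noise -- hence early stopping.
        err = F.mse_loss(out.detach(), clean)
        print(f"step {step:4d}  loss-to-noisy {loss.item():.4f}  error-to-clean {err.item():.4f}")
```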

1

suflaj t1_j9sxn6g wrote

You say it doesn't help, yet double descent says otherwise. You do not early stop transformer models the way you do with other models, outside of maybe finetuning on a similar task.

But for pretraining - no way. Big transformers are trained by setting some hyperparameters and then checking on the run the next day. If the model learned something, you keep going; if it diverged, you load the last good checkpoint, change the hyperparameters, and train with that.

Early stopping would imply that you're confident your hyperparameters are good and that you have a general idea of how long training will take and how much it can learn. For big transformers, neither is the case.
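
To make that workflow concrete, here is a toy simulation of it (the loss curve and thresholds are made up, only the control flow matters): train in chunks, and on divergence roll back to the last good checkpoint with new hyperparameters instead of stopping early.

```python
import math
import random

random.seed(0)

def train_chunk(step, lr):
    """Toy stand-in for a chunk of pretraining: returns (new_step, loss).
    Too-high learning rates occasionally 'diverge' in this simulation."""
    loss = 5.0 / math.log(step + 2) + random.gauss(0.0, 0.1)
    if lr > 1e-3 and random.random() < 0.3:
        loss = float("nan")                  # simulated divergence
    return step + 1000, loss

step, lr = 0, 3e-3
last_good_step = step                        # stands in for a saved checkpoint

while step < 20_000:
    step, loss = train_chunk(step, lr)
    if math.isnan(loss) or loss > 10.0:
        step = last_good_step                # load the last good checkpoint
        lr *= 0.5                            # change the hyperparameters and retry
    else:
        last_good_step = step                # run looks healthy; checkpoint it
```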

0

AnDaoLe t1_j9ul0cf wrote

There are a bunch of papers showing that large neural networks are actually just memorizing data as well.

1