
levand t1_j9qe7ev wrote

> These models are too small to truly overfit on their datasets.

I thought we were talking about 175 billion parameters, literally some of the biggest models in existence? Although it is true that at some point models get big enough that they become less prone to overfitting (and it's not clear why): https://openai.com/blog/deep-double-descent/

1

suflaj t1_j9qectx wrote

That is peanuts compared to the datasets they are trained on. We're talking about datasets on the order of terabytes, and the model usually doesn't even iterate over more than 10% of that. So you can't really overfit the model unless you're dealing with duplicates, because you will never go through the whole dataset.

Even if the model had 1 trillion parameters and iterated over the whole dataset, it would still be too small for the number of relations contained within a dataset of 1 trillion+ bytes. AND THAT'S IF THOSE RELATIONS WERE LINEAR, which they (usually) are NOT.
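
To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The corpus size is an illustrative assumption, not the actual GPT-3 training configuration; the point is just the ratio of raw data to weights.

```python
# Back-of-the-envelope comparison. Both figures are illustrative
# assumptions, not the actual GPT-3 training setup.
params = 175e9          # ~175 billion parameters, as mentioned upthread
dataset_bytes = 1e12    # assume a training corpus on the order of 1 TB

print(f"parameters:          {params:.2e}")
print(f"dataset bytes:       {dataset_bytes:.2e}")
print(f"bytes per parameter: {dataset_bytes / params:.1f}")  # ~5.7 bytes of raw data per weight
```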

So there is a large overhead: you need multiple sets of parameters to define even one type of relation. Not to mention that some of these models are trained on data pairs, which means the SQUARE of that number of relations. We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
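
A minimal sketch of the "square of that number" point, with a made-up example count (an assumption, purely for illustration), just to show how quickly pairwise relations outgrow any plausible parameter budget:

```python
# Illustrative only: counting pairwise relations among training examples.
# The example count is an assumption; the takeaway is the quadratic growth.
n_examples = 300_000_000                      # assume ~300M training examples
n_pairs = n_examples * (n_examples - 1) // 2  # unordered pairs among them

print(f"examples:           {n_examples:.1e}")
print(f"pairwise relations: {n_pairs:.1e}")   # ~4.5e16, dwarfing ~1.75e11 parameters
```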

5

OnceReturned t1_j9rb03o wrote

>We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.

Solutions for what, exactly? Memorizing the entire internet (or the entire training set, but still)?

1