
rahuldave t1_iz5l8ee wrote

Each model can individually overfit to the training set. For example, imagine fitting 30 data points with a 30th-order polynomial, or anything with roughly 30 parameters. You will overfit because the model is too complex for the amount of data: the overfitting is directly related to the data size, and it arose because you chose too complex a model.
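Here's a minimal sketch of that first kind of overfitting (my own illustration, not from the comment): fit 30 noisy points with a modest polynomial and then with a ~30-parameter one, and compare training error to error on fresh points.

```python
# Sketch: 30 points, then a polynomial with ~30 coefficients.
# Training error collapses to ~0 while held-out error blows up.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = np.sin(3 * x_train) + rng.normal(0, 0.1, 30)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test)

for degree in (3, 29):  # degree 29 => 30 coefficients, one per data point
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.2f}")
```

The degree-29 fit interpolates the training points almost exactly, but its error on new points is orders of magnitude worse than the low-degree fit.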

In a sense you can think of a more complex model as having more wiggles, or more ways to achieve a given value. And if you want to disambiguate these more complex ways from just a little data, you can't help but particularize to that data.

But the same problem happens on the validation set. Suppose I have 1000 grid points in hyperparameter space to compare, but just a little bit of data, say again 30 points. You should feel a sense of discomfort: an idiosyncratic choice of 30 points may well give you the "wrong" answer, wrong in the sense of generalizing poorly.

So the first kind of overfitting, the one we do hyperparameter optimization on a validation set to avoid, happens on the training set. The second kind happens on the validation set, or on any set you compare many, many models on. This happens a lot on the public leaderboard in Kaggle, especially if you didn't create your own validation set in advance.

(One way, btw, to think of this: if I try enough combinations of hyperparameters, one of them will look good on the data I have, and this is far more likely if the data is small, because I don't have to go through so many combinations. See the little simulation below.)
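To make that concrete, here's a tiny simulation (my own construction, with made-up numbers): 1000 "models" that are all equally good in truth, each scored on a 30-point validation set. The best-looking one appears far better than it really is, purely from the luck of the draw.

```python
# Sketch: selection overfitting on a small validation set.
# Every model has the same true accuracy (0.70); the spread we see across
# 30 validation points is pure sampling noise, and picking the max exploits it.
import numpy as np

rng = np.random.default_rng(1)
n_models, n_val, n_fresh = 1000, 30, 100_000

true_acc = 0.70
val_scores = rng.binomial(n_val, true_acc, size=n_models) / n_val

best = np.argmax(val_scores)
fresh_score = rng.binomial(n_fresh, true_acc) / n_fresh  # the "private leaderboard"

print(f"best model's validation accuracy: {val_scores[best]:.3f}")  # often 0.85+
print(f"same model on fresh data:         {fresh_score:.3f}")       # ~0.70
```

With a bigger validation set the noise shrinks and the gap between "winner on validation" and "performance on fresh data" mostly disappears, which is exactly the point about data size.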
