
Visual-Arm-7375 OP t1_iz2fh6e wrote

Thank you very much for the answer!

One question: what do you mean by estimates in this context? Hyperparameters?

>But if you are comparing 3-4 estimates of the error on the test set to choose the best model class this is not a large comparison, and so the test set is "not so contaminated" by this comparison, and can be used for other purposes.

Could you explain this another way, please? I'm not sure I'm understanding it :(


rahuldave t1_iz4tnr9 wrote

Sure! My point is that the number of comparisons you make on a set affects the amount of overfitting you will encounter. Let's look at the sets: (a) training: you are comparing ALL the model parameters from TONS of models on this set, effectively infinitely many because of the calculus-driven optimization process. (b) validation: you are comparing far fewer here, maybe a 10x10x10 hyperparameter grid, so the overfitting potential is less. (c) test: maybe only the best-fit random forest against the best-fit gradient boosting, so 2 comparisons, and even less overfitting.

But how much? Well, that depends on your amount of data. The less data you have, the more likely you are to overfit to a given set. This is the same reason we use cross-validation for smaller datasets, while in the neural-net or recommendation space with tons of data, we only use a single validation set. And those sets are huge, maybe 200,000 images or a similar number of data points about customers. So you don't overfit too much even if you compared 1000 points on a hyperparameter grid.
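The two regimes described above can be sketched side by side in sklearn (the dataset and model here are illustrative choices, not from the thread): hold out a test set first, then either cross-validate the remaining data (small-data regime) or carve out one fixed validation set (big-data regime).

```python
# Sketch of the two validation strategies, assuming sklearn's digits dataset
# and logistic regression purely for illustration.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)

# Hold out a test set first; it is only touched for the final 2-3 comparisons.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000)

# Small-data regime: cross-validate on the remaining data.
cv_scores = cross_val_score(model, X_trainval, y_trainval, cv=5)

# Big-data regime: a single train/validation split is enough.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
val_score = model.fit(X_train, y_train).score(X_val, y_val)

print(len(cv_scores), round(val_score, 2))
```

Either way, the test set stays untouched until the very end.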

So the point is that you will always overfit somewhat on the validation set, and extremely little on the test set. If you have very little data, you want this extra test set. I know, it's a curse: less data, and I am asking you to split it even more. But think of it like this: with less data to train on, your training process will pick a more conservative model (less max depth of trees, for example). So it's not all bad.

But if you have lots of data and a large validation set, you can be a bit of a cowboy: pick your hyperparameters and choose the best model amongst model classes on the validation set...


Visual-Arm-7375 OP t1_iz5inba wrote

Thanks for the answer! I don't understand the separation you are making between training and validation. Didn't we have a train/test split, with CV applied to the train set? The validation sets would be one fold at each CV iteration. What am I not understanding here?


rahuldave t1_iz5lmbz wrote

You don't always cross-validate! Yes, sometimes after the train-test split you will use something like GridSearchCV in sklearn to cross-validate. But think of having to do 5-fold cross-validation for a large NN model that takes 10 days to train... you've now spent 50 days! So there you take the remaining training set after the test set was left out (if you left a test set out) and split it into a smaller training set and a validation set.
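For the cheap-model case, sklearn's GridSearchCV handles the cross-validated hyperparameter search mentioned above; a minimal sketch (the classifier and grid are illustrative):

```python
# Sketch: GridSearchCV cross-validates every grid point on the training set,
# so each of the 4 max_depth values is fit 5 times here.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [2, 3, 5, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

For a model that takes days per fit, you would skip this and score each grid point once on a held-out validation split instead.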


Visual-Arm-7375 OP t1_iz5iyf3 wrote

And why does the overfitting depend on the number of comparisons? Isn't overfitting something related to each model separately?


rahuldave t1_iz5l8ee wrote

Each model can individually overfit the training set. For example, imagine 30 data points fit via a 30th-order polynomial, or anything with 30 parameters. You will overfit because you are using too complex a model. Here the overfitting is directly related to the data size, and it came about because you chose too complex a model.
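The 30-points-vs-30-parameters example above is easy to demonstrate numerically (the sine-plus-noise data is an illustrative assumption): a degree-29 polynomial drives the training error toward zero but generalizes far worse than a simple cubic.

```python
# Sketch: fit 30 noisy points with a 29th-order polynomial (one parameter per
# point) versus a cubic, and compare errors on fresh in-between points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)

overfit = np.polynomial.Polynomial.fit(x, y, deg=29)  # memorizes the noise
simple = np.polynomial.Polynomial.fit(x, y, deg=3)

train_err_over = np.mean((overfit(x) - y) ** 2)
train_err_simple = np.mean((simple(x) - y) ** 2)

# Fresh points between the training points: the complex fit wiggles wildly.
x_new = np.linspace(0.05, 0.95, 200)
test_err_over = np.mean((overfit(x_new) - np.sin(2 * np.pi * x_new)) ** 2)
test_err_simple = np.mean((simple(x_new) - np.sin(2 * np.pi * x_new)) ** 2)

print(train_err_over < train_err_simple, test_err_over > test_err_simple)
```

Lower training error, higher error on new data: that gap is the overfitting.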

In a sense you can think of a more complex model as having more wiggles, or more ways to achieve a given value. If you try to disambiguate these more complex ways from just a little data, you can't help but particularize to that data.

But the same problem happens on the validation set. Suppose I have 1000 grid points in hyperparameter space to compare, but just a little bit of data, say again 30 points. You should feel a sense of discomfort: an idiosyncratic choice of 30 points may well give you the "wrong" answer, wrong in the sense of generalizing poorly.

So the first kind of overfitting, the one we do hyperparameter optimization on the validation set to avoid, happens on the training set. But the second kind happens on the validation set, or on any set you compare many, many models on. This happens a lot on the public leaderboard in Kaggle, especially if you didn't create your own validation set in advance...

(One way, btw, to think of this: if I try enough combinations of parameters, one of them will look good on the data I have, and this is far more likely if the data is smaller, because I don't have to go through so many combinations...)
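The parenthetical above can be simulated directly (the numbers are illustrative): score 1000 predictors that guess completely at random on a tiny 30-point validation set, and the best of them looks far better than chance despite having no real skill.

```python
# Sketch: the best of 1000 random guessers on 30 binary labels looks "good"
# purely by chance -- this is validation-set overfitting in miniature.
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=30)          # 30 binary validation labels

best_acc = 0.0
for _ in range(1000):
    preds = rng.integers(0, 2, size=30)      # a "model" with zero skill
    best_acc = max(best_acc, (preds == y_val).mean())

print(best_acc)  # well above the 0.5 a skill-free model deserves
```

With 30,000 validation points instead of 30, the best random guesser would stay very close to 0.5, which is exactly the thread's point about data size.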
