killver t1_iz03u95 wrote

1

Visual-Arm-7375 OP t1_iz04glk wrote

1

killver t1_iz04uyh wrote

Look, I am not going to read through a random blog right now. Either you believe me and try to think it through critically, or you have already made up your mind, in which case you should not have asked.

I will add a final remark.

If you make another decision (whether it generalizes well or not) on your holdout test dataset, you are basically just making another decision on it. If it does not generalize, what do you do next? You change your hyperparameters so that it works better on this test set?

How is that different from making this decision on your validation data?

The terms validation data and test data are mixed up a lot in the literature. In principle, the test dataset as you define it is just another validation dataset. And you can be more robust by simply using multiple validation datasets, which is what k-fold does. You do not need this extra test dataset.

If you feel better doing it, go ahead. It is not "wrong", just not necessary, and you lose training data.
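To make the "multiple validation datasets" point concrete, here is a minimal sketch of k-fold cross-validation in plain Python. The helper names (`k_fold_indices`, `cross_val_score`, `fit_threshold`, `accuracy`) and the toy 1-D dataset are assumptions made up for illustration; in practice you would more likely reach for a library implementation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(fit, score, X, y, k=5, seed=0):
    """Average the validation score over k different train/validation splits."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Hypothetical toy model: a threshold halfway between the two class means.
def fit_threshold(X, y):
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, X, y):
    preds = [1 if x > threshold else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Toy data (assumption): label is 1 when x >= 10.
X = list(range(20))
y = [1 if x >= 10 else 0 for x in X]
mean_score = cross_val_score(fit_threshold, accuracy, X, y, k=5)
print(f"mean validation accuracy over 5 folds: {mean_score:.2f}")
```

Every sample is used for validation exactly once and for training k-1 times, so no data is permanently set aside.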

1

Visual-Arm-7375 OP t1_iz065iq wrote

I don't have a clear opinion. I'm trying to learn, I'm proposing a situation, and you're not listening. You are evaluating the performance of the model with the same accuracy score you used to select the hyperparameters. That does not make sense.
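The concern here can be illustrated with a small simulation. This is only a sketch under made-up assumptions: every "candidate model" is a coin-flip predictor with a true accuracy of 0.5, and the function name and sample sizes are hypothetical. Picking the candidate with the best validation score and then reporting that same score overstates how well it does on fresh data.

```python
import random

def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def simulate_selection_bias(n_candidates=200, n_val=50, n_test=5000, seed=0):
    """Select the best of many random predictors on a validation set,
    then score that same predictor on an untouched test set."""
    rng = random.Random(seed)
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]

    best_val_acc, best_test_acc = -1.0, None
    for _ in range(n_candidates):
        # Each candidate guesses at random, so its true accuracy is 0.5.
        val_preds = [rng.randint(0, 1) for _ in range(n_val)]
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        val_acc = accuracy(val_preds, val_labels)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_test_acc = accuracy(test_preds, test_labels)
    return best_val_acc, best_test_acc

val_acc, test_acc = simulate_selection_bias()
print(f"validation accuracy of the selected model: {val_acc:.3f}")
print(f"accuracy of the same model on unseen data:  {test_acc:.3f}")
```

The gap between the two numbers comes purely from selecting on the validation set. It grows with the number of candidates tried and shrinks as the validation set gets larger.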

Anyway, thank you for your help, really appreciate it.

1

killver t1_iz06mz6 wrote

Maybe that is the source of your confusion: reporting a raw accuracy score is a different task from finding and selecting hyperparameters or models. Your original post asked about model comparison.

Anyway, I suggest you take a look at how research papers do it, and also browse through Kaggle solutions. People usually run cross-validation locally, and the actual production data serves as the test set (e.g. ImageNet, the Kaggle leaderboard, business production data).

1

rahuldave t1_iz4u6o4 wrote

Many Kaggle competitions have public and private leaderboards. You are strongly advised to hold out your own validation set from the training data they give you, and to use it to choose your best model before comparing on the public leaderboard. There are times people have overfit to the public leaderboard, but this can be checked with adversarial validation and the like. If you like this kind of stuff, both Abhishek Thakur's and Konrad Banachewicz's books are really nice...

0