killver

killver t1_iz0bw96 wrote

> But because of the hyperparameter optimization on them, the actual errors (like MSE) you calculate will be too optimistic.

This is the only argument for a separate test dataset that works for me: you can make a less biased statement about accuracy. But I can promise you that no practitioner or researcher will set this test dataset aside and never make a decision on it, even if only subconsciously - which again biases it.

I think the better strategy is to focus on not making overly optimistic statements based on k-fold validation scores - for example, by not doing automatic early stopping or automatic learning rate scheduling on the folds. The goal is to only select hyperparameters that are optimal across all folds, rather than separately optimal per fold.
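
A rough sketch of what I mean (assuming scikit-learn and a toy Ridge model as stand-ins for your actual setup): each candidate hyperparameter value is judged by its mean score over all folds, and one value wins overall, not one per fold.

```python
# Sketch: pick the hyperparameter that is best on average across ALL folds,
# instead of tuning a separate value per fold. Toy Ridge regression stands in
# for the real model.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # One score per fold; the candidate is judged on the mean across folds.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    results[alpha] = scores.mean()

best_alpha = max(results, key=results.get)
print(f"best alpha across all folds: {best_alpha}")
```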

2

killver t1_iz06mz6 wrote

Maybe that's the source of your confusion: reporting a raw accuracy score vs. finding and selecting hyperparameters/models. Your original post asked about model comparison.

Anyway, I suggest you take a look at how research papers do it, and also browse through Kaggle solutions. People almost always do local cross validation, and the actual production data is the test set (e.g. ImageNet, the Kaggle leaderboard, business production data, etc.).

1

killver t1_iz04uyh wrote

Look - I will not read through a random blog now. Either you believe me and try to think it through critically, or you have already made up your mind anyway, in which case you should not have asked.

I will add a final remark.

If you make another decision (whether it generalizes well or not) based on your holdout test dataset, you are simply making yet another decision on it. If it does not generalize, what do you do next? Change your hyperparameters so that it works better on this test set?

How is that different from making this decision on your validation data?

The terms validation and test data are mixed up a lot in the literature. In principle, the test dataset as you define it is just another validation dataset. And you can be more robust by simply using multiple validation datasets, which is exactly what k-fold does. You do not need this extra test dataset.

If you feel better doing it, go ahead. It is not "wrong" - just not necessary, and you lose training data.
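
For what it's worth, the per-fold spread already gives you the robustness people want from an extra test set. A minimal sketch, again assuming scikit-learn and toy data:

```python
# Sketch: each fold is its own holdout, so k-fold already gives you several
# "test-like" estimates; report their mean and spread instead of a single number.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy per fold: {np.round(scores, 3)}")
print(f"mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}")
```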

1

killver t1_iz02xc3 wrote

Another question: how can hyperparameters overfit on validation data if it is a proper holdout set?

By your definition, if you make the decision on another local test holdout, the setting is exactly the same - no difference. And if you do not make a decision on this test dataset, why do you need it at all?

The important thing is that your split is not leaky and represents the unseen test data well.
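
Two common sources of leakage, sketched below with scikit-learn (assumed here for illustration): fitting preprocessing on all data before splitting, and letting rows from the same group (user, patient, session) land on both sides of a split.

```python
# Sketch of a non-leaky setup:
# 1) preprocessing is fitted inside each fold via a Pipeline, never on the full data;
# 2) GroupKFold keeps all rows of one group (e.g. one user) in the same fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
groups = np.random.RandomState(0).randint(0, 100, size=len(y))  # e.g. user IDs

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(scores.mean())
```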

1

killver t1_iz02ql9 wrote

I think you are misunderstanding it. Each validation fold is always a separate holdout dataset, so when you evaluate your model on it, you are not training on it. Why would it be a problem to train on that fold when evaluating another validation holdout?

Actually, your point 5 is also what you can do at the end for the production model, to make use of all the data.

The main goal of cross validation is to find hyperparameters that make your model generalize well.

If you take a look at papers or Kaggle, you will never find someone keeping both validation and test data locally. The test data is usually the real production data, or the data you compare models on. You make decisions on your local cross validation to find a model that generalizes well to unseen test data (data that is not in your current possession).
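
Concretely, the end of that workflow might look like this (a sketch, assuming a scikit-learn style setup and hyperparameters already chosen via cross validation):

```python
# Sketch: hyperparameters were chosen on local cross validation; the production
# model is then refitted on ALL available training data with those settings.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)

best_params = {"alpha": 1.0}       # assumed to come from the CV selection
final_model = Ridge(**best_params)
final_model.fit(X, y)              # train on everything; the real "test" is production

# predictions = final_model.predict(X_production)  # unseen production data
```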

1

killver t1_iz01lwc wrote

Well, you already answered it yourself. Why would you need a separate test dataset? It is just another validation dataset, and you already have five of those in the case of 5-fold cross validation.

The only important thing is that you optimize your hyperparameters so that they are best across all folds.

The real test data is your future production data, where you apply your predictions.

1

killver t1_ixiah49 wrote

Thanks a lot for all these replies. I have one more question if you do not mind: sometimes I have Hugging Face models as the backbone in my model definitions. How would I go about applying the transformer-based quantization only to the backbone? Usually these tools are called on the full model, but if my full model is already in ONNX format, that is complicated.
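
For context, the kind of thing I am imagining (a rough sketch, untested; the model name and file names are placeholders): export the backbone to its own ONNX file and run dynamic quantization only on that file, leaving the rest of the model in full precision.

```python
# Rough sketch (untested): export only the Hugging Face backbone to ONNX and
# apply onnxruntime dynamic quantization to that file alone.
# "bert-base-uncased" and the file names are placeholders.
import torch
from transformers import AutoModel
from onnxruntime.quantization import quantize_dynamic, QuantType

# return_dict=False so the export sees plain tuple outputs.
backbone = AutoModel.from_pretrained("bert-base-uncased", return_dict=False).eval()

dummy_ids = torch.randint(0, 1000, (1, 16))
dummy_mask = torch.ones_like(dummy_ids)

# Export just the backbone, not the full custom model.
torch.onnx.export(
    backbone,
    (dummy_ids, dummy_mask),
    "backbone.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# Quantize only the backbone graph; the head(s) stay untouched.
quantize_dynamic("backbone.onnx", "backbone.int8.onnx", weight_type=QuantType.QInt8)
```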

1

killver t1_ixhqi87 wrote

Thanks for the reply. Yeah, ONNX and OpenVINO are already promising, but quantization on top makes the accuracy awful, and it actually gets even slower - maybe I am doing something wrong. I also had no luck with the optimum library, which honestly has very poor documentation and API, and is a bit too tailored to using the transformers library out of the box.

1