suflaj t1_ir2yl5r wrote on October 4, 2022 at 11:06 PM

Because early stopping performance is not indicative of final performance, even moreso when using Adam(W)

I don't know why I have to repeat this, early stopping is analogous to fixating a hyperparameter value to a constant. It doesn't matter if you stop at N steps, or at a plateau or at an accuracy threshold. You can do it, but then it's not a thorough search.

You can consider it thorough if the number of steps is comparable to the number of steps you will do to actually train the model. You can consider it thorough even if you slightly increase the number of training steps for the final model as effects related to overparametrization take a long time to converge.

As long as your training steps increase is not as long as the time it takes for side effects related to overparametrization to converge, your results will be representative of the actual final training run. If they are longer, it's again anyone's game, only this time it's potentially even more dependent on initialization than any step before (but in reality those effects are not yet understood enough to conclude aynthing relevant here).

Personally if accounting for the side effects of overparametrization, I would not do hyperparameter tuning at all - instead I would just retrain from scratch several times on "good" hyperparameters for as long as it takes and play around with weight averaging schemes.

aWildTinoAppears t1_irdceof wrote on October 7, 2022 at 4:30 AM

>early stopping performance is not indicative of final performance [...] early stopping is analogous to fixating a hyperparameter value to a constant

These statements aren't true. The whole point of checkpointing on your target validation metric and using a large enough early stopping patience is that it's a very reasonable proxy for final or peak performance. Google's Vizier and other blackbox hparam search methods are built with this as a core underlying assumption.

suflaj t1_irdvhm4 wrote on October 7, 2022 at 9:03 AM

You should probably read up on double descent and the lottery ticket hypothesis. Google engineers have been wrong plenty of times in their "hypotheses". Furthermore, you're referring to a system from 2017, so, 2.5 eternities ago, when these phenomenon were not even known.

Also, what does reasonable mean? I would argue that it highly depends on other hyperparameters, the architecture of a model and data, and as such isn't generally applicable. It's about as reasonable as assuming 3e-4 is a good learning rate, but there are plenty of counterexamples where the network doesn't converge on it and as such cannot be considered reasonable generally.

aWildTinoAppears t1_irg015e wrote on October 7, 2022 at 8:24 PM

i'm a researcher at google ai and use vizier daily, as does everyone in brain. i've also published work extending the lottery ticket hypothesis (which is from march 2018, so, only 2 eternities ago??). optimal convergence of iteratively pruned networks happens at or before the original max number of steps, so if using best checkpointing and early stopping, the max training step will rarely be hit by lottery tickets. this doesn't support your claims at all.

reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion". you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?

suflaj t1_irg1qkg wrote on October 7, 2022 at 8:38 PM

> reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion".

First of, how would you know there is a max number of steps (before even knowing what they are)? There is previous work on certain architectures which can give you an idea on what a good number of steps is but:

there is no theoretical guarantee that is the optimum
there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

So this statement is ill-defined and in the best case an opinion.

> you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?

Yes. And it is at the same time incomplete search and requires way more than a guess to determine how it should be done and to what extent. Generally there is a fair amount of counterexamples where we didn't know it was suboptimal until it was proven otherwise, most famously with double descent. This is something you can't just ignore and a clear and very common example where any kind of early stopping and checkpointing will fail to find even a reasonable enough configuration.

I feel like as a researcher, you shouldn't double down on things you should know are neither conclusive nor have strong fundamentals behind them. Good enough does not cut it in a general case, especially given that modern models do not even go over the whole dataset in an epoch, and as such might be broken on any one of your checkpoints.

And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

aWildTinoAppears t1_irg5akq wrote on October 7, 2022 at 9:06 PM

> First of, how would you know there is a max number of steps (before even knowing what they are)?

Set an arbitrarily high max steps and early stopping patience, run a hyperparamter sweep, and look at validation performance on tensorboard to make sure everything converges/overfits prior to hitting max steps or the early stopping criteria. The point here is that max training steps does not need to be a tuned hyperparamter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.

> there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

> And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

I feel like you are missing the core point, which is that early stopping+checkpoint is literally how we capture final performance. We are running 0(100) workers in parallel for a total of 0(1000) model configurations selected by vizier (ie bayesian optimization of hparam choices based on past candidate performance). I never said that models are "developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters." I said that this statement is incorrect "early stopping performance is not indicative of final performance".

suflaj t1_irg8l9l wrote on October 7, 2022 at 9:32 PM

> The point here is that max training steps does not need to be a tuned hyperparamter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.

My point is that you CANNOT guarantee your model is done learning. Again, I will say it, please don't ignore it if you wish to discuss this further: double descent (or overparametrization side-effects in general). Also, there are training setups where you cannot even process the whole dataset and are basically gambling that the dev set you chose is as representative as whatever the model will be seeing in training. You can both overshoot and undershoot. This is not only about the number of steps, but the batch size and learning rate schedules.

> Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

I was not saying that. What I was saying that even other hyperparameters might be wildly wrong. I take it you have worked with Adam-based optimizers. They generally do not care about hyperparameters in the training period they are most effective with, but other incorrect hyperparameters might have more severe consequences you will simply not be exploring if you early stop. In the modern era, if you have a budget for hyperparameter optimization, you check for a number of steps well beyond what you intend to train, so early stopping has no place outside of very old models, 3+ eternities old. Those are nowadays a special case, given the sheer size of modern models.

> I said that this statement is incorrect "early stopping performance is not indicative of final performance".

And in doing so you have ignored a very prevalent counterexample, double descent. It is not rare (anymore), it is not made up, it is well documented, just poorly understood.

aWildTinoAppears t1_irj3m1p wrote on October 8, 2022 at 4:36 PM

> My point is that you CANNOT guarantee your model is done learning

Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

I've ignored it because the papers you are citing aren't claiming exactly what you hope they are:

> Further, we show at least one setting where model-wise double descent can still occur even with optimal early stopping (ResNets on CIFAR-100 with no label noise, see Figure 19). *We have not observed settings where more data hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude this from occurring. We leave fully understanding the optimal early stopping behavior of double descent as an important open question for future work.*

They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

> 3+ eternities

Moving goal posts again, also dd is from eoy 2019.

> Again, I will say it, please don't ignore it if you wish to discuss this further

I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

suflaj t1_irj50vf wrote on October 8, 2022 at 4:47 PM

> Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

Great. Now notice we are speaking of theory. In practice in DL trial and error is usually better than formally analyzing or optimizing something.

> They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

Great. One thing to notice - you are making claims that early stopping is good enough. I am making claims that because of double descent and not understanding it fully, you cannot make such claims. Those are just guesses, and not even well-informed ones.

To make such claims, the prerequisite would be to first prove (without a reasonable doubt) that your model does not exhibit overparametrization side-effects. This would mean that instead of early stopoing, you run it for way more than you intend to. THEN you can do these checkpointing optimizations, if it turns out you don't have to worry about it.

But usually it is just enough to get it working well enough instead of formally optimizing the hyperparameters, because whatever optimization you do, it cannot account for unseen data. My point is not that this is better, it's that whatever you do you are guessing, and might as well take cheaper guesses if you're not interested in it being very robust.

> Moving goal posts again, also dd is from eoy 2019.

What do you mean moving goal posts again? 3 eternities refers to 6 years ago, i.e. 2016. That is the last time models were small enough for double descent to be basically undetectable, since Attention is All You Need was released in June 2017 and worked on for quite some time then. Double descent was formally described in 2019, yes. But the phenomena it describes happened way before, and in my experience, transformers were the first to exhibit it in pretraining. Maybe it was even more than 3+ eternities ago that we had models that experienced double descent, I have not been doing DL seriously for that long.

> I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

You might have gotten the wrong person for the job, as we mostly do engineering, but love that you put in the effort to try and stalk me :)

Given that this has become personal, rather than sticking to the topic, I will not respond anymore either.

[D] How do you go about hyperparameter tuning when network takes a long time to train?

suflaj t1_ir26f5c wrote on October 4, 2022 at 7:59 PM

oeparsons t1_ir4bv9p wrote on October 5, 2022 at 6:26 AM

Doppe1g4nger t1_ir2xuso wrote on October 4, 2022 at 11:01 PM