
suflaj t1_irg1qkg wrote

> reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion".

First off, how would you know there is a max number of steps (before even knowing what they are)? There is previous work on certain architectures which can give you an idea of what a good number of steps is, but:

  • there is no theoretical guarantee that is the optimum
  • there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

So this statement is ill-defined and in the best case an opinion.

> you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?

Yes. And it is at the same time an incomplete search, and it requires way more than a guess to determine how it should be done and to what extent. Generally there is a fair number of counterexamples where we didn't know it was suboptimal until it was proven otherwise, most famously with double descent. This is something you can't just ignore, and it is a clear and very common example where any kind of early stopping and checkpointing will fail to find even a reasonable enough configuration.

I feel like as a researcher, you shouldn't double down on things you should know are neither conclusive nor have strong fundamentals behind them. Good enough does not cut it in a general case, especially given that modern models do not even go over the whole dataset in an epoch, and as such might be broken on any one of your checkpoints.

And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

−1

aWildTinoAppears t1_irg5akq wrote

> First off, how would you know there is a max number of steps (before even knowing what they are)?

Set an arbitrarily high max steps and early stopping patience, run a hyperparameter sweep, and look at validation performance on TensorBoard to make sure everything converges/overfits prior to hitting max steps or the early stopping criteria. The point here is that max training steps does not need to be a tuned hyperparameter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.
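To make the setup above concrete, here is a minimal sketch of early stopping with best checkpointing. It is not anyone's actual training code: `val_losses` is a hypothetical stand-in for the validation losses a real training loop would produce, and "saving a checkpoint" is reduced to remembering the best step.

```python
def train_with_early_stopping(val_losses, max_steps, patience):
    """Return (best_step, best_loss, stopped_step).

    val_losses: hypothetical sequence of validation losses, one per evaluation.
    max_steps:  deliberately large cap that should rarely be reached.
    patience:   evaluations without improvement tolerated before stopping.
    """
    best_loss = float("inf")
    best_step = -1
    stopped_step = -1
    for step, loss in enumerate(val_losses):
        if step >= max_steps:
            break
        stopped_step = step
        if loss < best_loss:
            # A real loop would write a checkpoint here.
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            # No improvement for `patience` evaluations: stop training.
            break
    return best_step, best_loss, stopped_step


# Made-up curve that improves, then overfits.
curve = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
result = train_with_early_stopping(curve, max_steps=1000, patience=3)
```

On this made-up curve the run stops a few evaluations after the minimum, and the checkpoint that gets reported is the one from the minimum rather than from the last step.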

> there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".
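As a rough illustration of why a sweep can only ever sample that range, here is a sketch that draws a finite number of learning rates log-uniformly between 1e-7 and 10. The function name, the seed, and the choice of plain random search are assumptions for the example; it is not the search procedure used in any particular paper.

```python
import math
import random


def sample_learning_rates(n, low=1e-7, high=10.0, seed=0):
    """Draw n learning rates log-uniformly from [low, high] (illustrative only)."""
    rng = random.Random(seed)
    log_low, log_high = math.log10(low), math.log10(high)
    # Sampling the exponent uniformly spreads candidates evenly across decades.
    return [10 ** rng.uniform(log_low, log_high) for _ in range(n)]


candidates = sample_learning_rates(20)
```

Twenty candidates cover eight orders of magnitude very thinly, which is the sense in which the result of a sweep is "best seen so far" rather than optimal.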

> And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

I feel like you are missing the core point, which is that early stopping+checkpoint is literally how we capture final performance. We are running O(100) workers in parallel for a total of O(1000) model configurations selected by Vizier (i.e., Bayesian optimization of hparam choices based on past candidate performance). I never said that models are "developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters." I said that this statement is incorrect: "early stopping performance is not indicative of final performance".

2

suflaj t1_irg8l9l wrote

> The point here is that max training steps does not need to be a tuned hyperparameter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.

My point is that you CANNOT guarantee your model is done learning. Again, I will say it, please don't ignore it if you wish to discuss this further: double descent (or overparametrization side-effects in general). Also, there are training setups where you cannot even process the whole dataset and are basically gambling that the dev set you chose is as representative as whatever the model will be seeing in training. You can both overshoot and undershoot. This is not only about the number of steps, but the batch size and learning rate schedules.

> Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

I was not saying that. What I was saying is that even other hyperparameters might be wildly wrong. I take it you have worked with Adam-based optimizers. They generally do not care about hyperparameters in the training period they are most effective with, but other incorrect hyperparameters might have more severe consequences you will simply not be exploring if you early stop. In the modern era, if you have a budget for hyperparameter optimization, you check for a number of steps well beyond what you intend to train, so early stopping has no place outside of very old models, 3+ eternities old. Those are nowadays a special case, given the sheer size of modern models.

> I said that this statement is incorrect "early stopping performance is not indicative of final performance".

And in doing so you have ignored a very prevalent counterexample, double descent. It is not rare (anymore), it is not made up, it is well documented, just poorly understood.

0

aWildTinoAppears t1_irj3m1p wrote

> My point is that you CANNOT guarantee your model is done learning

Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

I've ignored it because the papers you are citing aren't claiming exactly what you hope they are:

> Further, we show at least one setting where model-wise double descent can still occur even with optimal early stopping (ResNets on CIFAR-100 with no label noise, see Figure 19). *We have not observed settings where more data hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude this from occurring. We leave fully understanding the optimal early stopping behavior of double descent as an important open question for future work.*

They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

> 3+ eternities

Moving goal posts again, also dd is from eoy 2019.

> Again, I will say it, please don't ignore it if you wish to discuss this further

I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

1

suflaj t1_irj50vf wrote

> Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

Great. Now notice we are speaking of theory. In practice, in DL, trial and error is usually better than formally analyzing or optimizing something.

> They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

Great. One thing to notice - you are making claims that early stopping is good enough. I am making claims that because of double descent and not understanding it fully, you cannot make such claims. Those are just guesses, and not even well-informed ones.

To make such claims, the prerequisite would be to first prove (beyond a reasonable doubt) that your model does not exhibit overparametrization side-effects. This would mean that instead of early stopping, you run it for way more than you intend to. THEN you can do these checkpointing optimizations, if it turns out you don't have to worry about it.
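A toy sketch of that check, assuming a validation curve with a second descent. The numbers in `SYNTHETIC_CURVE` are invented for illustration and do not come from any experiment; the point is only how the comparison between a patience-based stop and a much longer run would be made.

```python
def best_loss_with_patience(val_losses, patience):
    """Best validation loss seen before a patience-based stop triggers."""
    best_loss, best_step = float("inf"), -1
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            break  # early stop: later evaluations are never observed
    return best_loss


# Invented curve: first descent, a bump, then a second and deeper descent.
SYNTHETIC_CURVE = [1.0, 0.7, 0.5, 0.6, 0.7, 0.8, 0.75, 0.6, 0.45, 0.35, 0.3, 0.3]

early_best = best_loss_with_patience(SYNTHETIC_CURVE, patience=2)
long_run_best = min(SYNTHETIC_CURVE)  # what running far past the stop reveals
missed_by_stopping = early_best > long_run_best
```

With a short patience the stop fires on the bump and never sees the second descent; a patience long enough to ride out the bump recovers it. Which case applies to a real model is exactly what the longer run is meant to establish.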

But usually it is just enough to get it working well enough instead of formally optimizing the hyperparameters, because whatever optimization you do, it cannot account for unseen data. My point is not that this is better, it's that whatever you do you are guessing, and might as well take cheaper guesses if you're not interested in it being very robust.

> Moving goal posts again, also dd is from eoy 2019.

What do you mean, moving goal posts again? 3 eternities refers to 6 years ago, i.e. 2016. That is the last time models were small enough for double descent to be basically undetectable, since Attention Is All You Need was released in June 2017 and had been worked on for quite some time before then. Double descent was formally described in 2019, yes. But the phenomenon it describes happened way before, and in my experience, transformers were the first to exhibit it in pretraining. Maybe it was even more than 3+ eternities ago that we had models that experienced double descent; I have not been doing DL seriously for that long.

> I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

You might have gotten the wrong person for the job, as we mostly do engineering, but love that you put in the effort to try and stalk me :)

Given that this has become personal, rather than sticking to the topic, I will not respond anymore either.

1