aWildTinoAppears t1_irj3m1p wrote

> My point is that you CANNOT guarantee your model is done learning

Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold at which the scientific contribution is worth sharing and that they have done an extensive hparam search to reach that point.

I've ignored it because the papers you are citing aren't claiming exactly what you hope they are:

> Further, we show at least one setting where model-wise double descent can still occur even with optimal early stopping (ResNets on CIFAR-100 with no label noise, see Figure 19). *We have not observed settings where more data hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude this from occurring. We leave fully understanding the optimal early stopping behavior of double descent as an important open question for future work.*

They literally say they sometimes see it, that more data doesn't hurt, and that they aren't making any strong claims around it because it deserves more work.

> 3+ eternities

Moving the goalposts again; also, double descent is from the end of 2019.

> Again, I will say it, please don't ignore it if you wish to discuss this further

I won't be responding here again, but I encourage you and RealNetworks to publish some peer-reviewed research supporting the claims you're making in this thread.

1

aWildTinoAppears t1_irg5akq wrote

> First of, how would you know there is a max number of steps (before even knowing what they are)?

Set an arbitrarily high max steps and early stopping patience, run a hyperparameter sweep, and look at validation performance in TensorBoard to make sure everything converges/overfits prior to hitting max steps or the early stopping criteria. The point here is that max training steps does not need to be a tuned hyperparameter under this experimental setup: you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is never better than the early stopping performance, because of the best-checkpointing strategy.
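As a concrete sketch of that setup (a minimal illustration only; `train_steps`, `evaluate`, and `save_checkpoint` are hypothetical placeholders rather than any specific library's API, and `model`/`val_split` are assumed to exist):

```python
import math

# Assumption: MAX_STEPS is set far beyond where any configuration converges,
# so overfitting/patience ends training, never the step budget.
MAX_STEPS = 1_000_000
EVAL_EVERY = 1_000   # run validation every N training steps
PATIENCE = 20        # evals without improvement before stopping

best_metric = -math.inf
evals_since_best = 0
step = 0

while step < MAX_STEPS:
    train_steps(model, num_steps=EVAL_EVERY)  # hypothetical: advance training
    step += EVAL_EVERY
    metric = evaluate(model, val_split)       # hypothetical: target validation metric
    if metric > best_metric:
        best_metric = metric
        save_checkpoint(model, tag="best")    # reported performance comes from this checkpoint
        evals_since_best = 0
    else:
        evals_since_best += 1
        if evals_since_best >= PATIENCE:
            break                             # clearly past the best checkpoint; stop
```

Because the reported number is read from the best checkpoint, the last-step weights can only tie or lose to it.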

> there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an extensive search".
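For illustration only, a sketch of how that range actually gets covered in practice, sampling log-uniformly rather than enumerating every value:

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 candidate learning rates drawn log-uniformly from [1e-7, 10]. Exhaustively
# covering the continuous interval is impossible, so the sweep reports the best
# candidate seen so far, not a guaranteed optimum.
candidate_lrs = 10.0 ** rng.uniform(-7.0, 1.0, size=50)
print(sorted(candidate_lrs))
```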

> And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

I feel like you are missing the core point, which is that early stopping + best checkpointing is literally how we capture final performance. We are running O(100) workers in parallel for a total of O(1000) model configurations selected by Vizier (i.e., Bayesian optimization of hparam choices based on past candidate performance). I never said that models are "developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters." I said that the statement "early stopping performance is not indicative of final performance" is incorrect.
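A rough sketch of that loop, using Optuna here as a stand-in for Vizier-style Bayesian suggestion; `train_and_eval` is a hypothetical helper that returns the validation metric at the best early-stopped checkpoint:

```python
import optuna

def objective(trial):
    # One trial = one model configuration; the value returned is the metric at
    # the best checkpoint, not whatever the model looks like at its last step.
    lr = trial.suggest_float("learning_rate", 1e-7, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    return train_and_eval(lr=lr, batch_size=batch_size)  # hypothetical placeholder

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1000)  # O(1000) configurations; trials can run on parallel workers
print(study.best_params, study.best_value)
```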

2

aWildTinoAppears t1_irg015e wrote

I'm a researcher at Google AI and use Vizier daily, as does everyone in Brain. I've also published work extending the lottery ticket hypothesis (which is from March 2018, so, only 2 eternities ago??). Optimal convergence of iteratively pruned networks happens at or before the original max number of steps, so if using best checkpointing and early stopping, the max training step will rarely be hit by lottery tickets. This doesn't support your claims at all.
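For concreteness, a minimal sketch of the iterative pruning loop I'm describing, assuming a PyTorch-style model; every helper here (`build_model`, `all_ones_mask`, `apply_mask`, `train_with_early_stopping`, `prune_smallest_weights`) is a hypothetical placeholder, not a real library API:

```python
import copy

model = build_model(seed=0)                          # hypothetical: fresh network
initial_weights = copy.deepcopy(model.state_dict())  # weights at init, to rewind to
mask = all_ones_mask(model)                          # start with nothing pruned

for pruning_round in range(5):
    model.load_state_dict(initial_weights)           # rewind surviving weights to their init values
    apply_mask(model, mask)                          # zero out previously pruned weights
    best_ckpt = train_with_early_stopping(model)     # converges at or before the original max steps
    mask = prune_smallest_weights(best_ckpt, mask, fraction=0.2)  # drop 20% of remaining weights
```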

Reasonable here means "good enough relative to a network that trains for the max number of steps without hitting the early stopping criterion". You realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc., right?

0

aWildTinoAppears t1_irdceof wrote

> early stopping performance is not indicative of final performance [...] early stopping is analogous to fixating a hyperparameter value to a constant

These statements aren't true. The whole point of checkpointing on your target validation metric and using a large enough early stopping patience is that it's a very reasonable proxy for final or peak performance. Google's Vizier and other blackbox hparam search methods are built with this as a core underlying assumption.

1