Submitted by twocupv60 t3_xvem36 in MachineLearning

My network takes about 24 hours to train. I have 2 hyperparameters to tune, and assuming each parameter is worth trying at roughly 6 values (one per order of magnitude), I would have to run my network 36 times to cover the grid and find the best combination. That would take me over a month! This seems quite long.

I see a lot of papers doing hyperparameter tuning. Do they have smaller networks that can train faster? Is some trick used to speed up the search process?

87

Comments

RandomIsAMyth t1_ir0hh8v wrote

Smaller networks are one way to go, indeed. Use a similar architecture but much smaller, small enough that you can get a result in ~1 hour. Then you can distribute the sweep using Weights & Biases or another similar framework.

2

neu_jose t1_ir0hurf wrote

I would tune on a smaller version of your model.

8

neato5000 t1_ir0rkr8 wrote

You do not need to train to completion to discard hyperparameter settings that will not perform well. In general, early relative performance is a good predictor of final performance, so if, within the early stages of training, a certain hp vector is performing worse than its peers, kill it and start training with the next hp vector.

This is roughly the logic behind population-based training (PBT).
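
A minimal sketch of that early-discard loop (plain successive halving rather than full PBT), with a toy stand-in for the real train/validate step:

```python
import itertools
import math
import random

# Toy stand-in for "train one more epoch, then measure validation loss".
# Replace with your real train/validate step; here the loss is just a noisy
# function of the hyperparameters that improves with more epochs.
def noisy_val_loss(lr, wd, epoch):
    base = abs(math.log10(lr) + 3) + 100 * wd   # pretend lr=1e-3, wd=0 is best
    return base / math.sqrt(epoch) + random.gauss(0.0, 0.05)

# Candidate hyperparameter vectors (learning rate x weight decay).
candidates = list(itertools.product([1e-4, 1e-3, 1e-2], [0.0, 1e-4, 1e-2]))

# After every epoch, rank the candidates and keep only the better half.
epoch = 0
while len(candidates) > 1 and epoch < 20:
    epoch += 1
    ranked = sorted(candidates, key=lambda hp: noisy_val_loss(*hp, epoch))
    candidates = ranked[: max(1, len(ranked) // 2)]

print(f"surviving hyperparameters after {epoch} epochs: {candidates[0]}")
```

Proper PBT would also mutate the hyperparameters of the survivors between rounds, but the early-discard logic is the same.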

52

boggog t1_ir0xat9 wrote

You can try Hyperband and only go to 5 or 10 epochs, hoping that the hyperparameters that look better at low epoch counts are also better at convergence. You might also try to optimize the hyperparameters on less data.
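
If you don't want to wire that up yourself, Optuna's HyperbandPruner (assuming Optuna is available) does the low-epoch triage for you; the training loop below is just a toy placeholder:

```python
import optuna


def objective(trial):
    # Two hyperparameters searched on a log scale, as in the original post.
    lr = trial.suggest_float("lr", 1e-6, 1e0, log=True)
    wd = trial.suggest_float("weight_decay", 1e-8, 1e-2, log=True)

    # Toy stand-in for a training loop: report an intermediate validation
    # metric after every "epoch" so the pruner can stop bad trials early.
    score = 0.0
    for epoch in range(10):
        score = -((lr - 1e-3) ** 2 + (wd - 1e-4) ** 2) + 0.01 * epoch  # fake metric
        trial.report(score, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score


study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=10),
)
study.optimize(objective, n_trials=30)
print(study.best_params)
```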

9

XtremePocket t1_ir0xpjw wrote

μTransfer offers a (sort of) theoretically guaranteed way of transferring the optimal hyperparameters of scaled-down versions of a model to the full-size one. I haven't tried it in practice, but maybe give it a try?

3

caedin8 t1_ir0z49w wrote

Hyperparameter tuning should be a last step. It's not really necessary for 99% of production workloads; it's mostly for getting results publishable in papers.

I'd avoid it if possible and just go with reasonable hyperparameters. If you hit a point where you can't get any better without tuning, decide what you actually need: if you are trying to publish and need more accuracy, bite the bullet and hold the paper until the search finishes; if it's a business case, work out whether the extra revenue from the extra accuracy offsets the cost of the extra compute.

3

ButthurtFeminists t1_ir0zlg1 wrote

I'm surprised this one hasn't been mentioned already.

Long training could be due to model complexity and/or dataset size. So if it's difficult to downscale your model, you can use a subset of your dataset instead. For example, say I'm training a ResNet-152 model on ImageNet: if I wanted to reduce tuning time, I could sample a subset of ImageNet (maybe 1/10 the size), tune hyperparameters on that, and then test the best hyperparameters on the full dataset.
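
With a PyTorch ImageFolder-style dataset, carving out a fixed random tenth could look like this (the path is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Placeholder path; assumes an ImageNet-style class-per-folder layout.
full_train = datasets.ImageFolder("path/to/imagenet/train",
                                  transform=transforms.ToTensor())

# Fixed seed so every hyperparameter trial sees the exact same 1/10 subset.
g = torch.Generator().manual_seed(0)
indices = torch.randperm(len(full_train), generator=g)[: len(full_train) // 10]
small_train = Subset(full_train, indices.tolist())

loader = DataLoader(small_train, batch_size=256, shuffle=True, num_workers=8)
# Tune on `loader`, then retrain the best configuration on `full_train`.
```

A purely random subset keeps ImageNet's class balance roughly intact; for heavily skewed datasets you'd want stratified sampling instead.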

101

caedin8 t1_ir0zmpw wrote

I'll add that the value of machine learning is the dynamic nature of the solution. In a production situation, retraining quickly every day with weaker hyperparameters will most likely lead to higher total applied accuracy than retraining once a month with hyperparameter tuning. If the tuned solution really is better, then the problem space is very static, and you might want to rethink your ML approach.

4

techlos t1_ir131zp wrote

Two things you can do are early stopping and using a subset of your dataset.

In my experience, hyperparams that have the best convergence at 3~5 epochs will generalize to pretty good convergence on a full training run. It won't guarantee the best performance, but if you're on a budget it's a great compromise.

14

HennesTD t1_ir1b0gg wrote

I don't quite get the idea behind training on a smaller subset of the data, although it might be just that it doesn't work in my case.

In my specific case I tried training an ASR model on LibriSpeech. Training it on 1/10th of the LibriSpeech 360h data gave me pretty much the exact same loss curve in the first hours of training, and no better HP setting that I could have spotted earlier. It gets through more epochs in that time, yes, but seeing a real difference between the curves of two HP settings took basically the same wall-clock time.

2

bphase t1_ir1dktk wrote

Wouldn't it be more beneficial to just perform 1/10 of the steps or epochs? No need to use a subset of the data, just train for less time. The end result is that you won't get the best performance anyway.

14

suflaj t1_ir1ejki wrote

You should probably try to reduce your dataset size first and then tune hyperparameters with that.

What I would do is start with randomly sampled 100 samples. Train fully with that. Then double it for the same hyperparameters and see how the performance changes. You want to stop when the performance no longer changes significantly after doubling the data.

How much is "significantly"? Well, I would personally stop when doubling the data no longer halves the test error. But that criterion is arbitrary, so ymmv, and you should adjust it based on how fast the error improves. Think of what performance would be acceptable to an average person who is neither stupid nor informed enough to know your model could be much better. You just need enough data for your hyperparameter comparisons to be representative.
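
A sketch of that doubling loop, with a toy error curve standing in for "train from scratch on n random samples and measure test error":

```python
import random

# Toy stand-in: error halves with each doubling for a while, then hits a floor.
# Replace with a real "train on n samples, return test error" function.
def train_and_eval(n_samples):
    return max(0.4 * (100 / n_samples), 0.03) + random.gauss(0.0, 0.002)

n = 100
prev_err = train_and_eval(n)
while True:
    n *= 2
    err = train_and_eval(n)
    print(f"{n:>6} samples -> test error {err:.3f}")
    # Stop once doubling the data no longer (roughly) halves the test error.
    if err > 0.5 * prev_err:
        break
    prev_err = err

print(f"tune hyperparameters on ~{n} samples, train the final model on everything")
```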

If you do not know how to tune that, then try clustering your data strictly. E.g., if you have text, you could split it into word 2-grams, use MinHashes, and then set the threshold for a cluster at 1% similarity. This will give you very few clusters, from which you can pick one representative each and use those as your dev/tuning set.
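
A rough sketch of that clustering step using the datasketch library (assuming it's available); the threshold and the greedy cluster assignment are just illustrative:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # Hash the word 2-grams of a document into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for a, b in zip(words, words[1:]):
        m.update(f"{a} {b}".encode("utf8"))
    return m

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat sat on a mat",
    "d3": "completely different text about hyperparameters",
}
signatures = {k: minhash_of(v) for k, v in docs.items()}

# Very permissive threshold (1% estimated Jaccard similarity) -> few, broad clusters.
lsh = MinHashLSH(threshold=0.01, num_perm=128)
for k, m in signatures.items():
    lsh.insert(k, m)

# Greedy clustering: each unassigned doc pulls in everything the LSH index returns.
seen, representatives = set(), []
for k, m in signatures.items():
    if k in seen:
        continue
    cluster = [d for d in lsh.query(m) if d not in seen]
    seen.update(cluster)
    representatives.append(cluster[0])   # one representative per cluster

print(representatives)   # use these as the small dev/tuning set
```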

When you reach those diminishing returns, search your hyperparameters randomly within a distribution and then train with the best ones on the full dataset. Depending on the network, the diminishing-returns point will be anywhere from 1k samples (CV ResNets) to 100k samples (finetuning transformers).

9

suflaj t1_ir1fjgt wrote

This is not true in practice for modern DL models, especially those trained with modern optimization methods like Adam(W). Adam(W) can give the best performance at the start, but then it's anyone's game until the end of training.

In other words, not only will the optimal hyperparameters probably be different (because you need to switch to SGD to reach max performance), you will also have to retune the hyperparameters you already accepted as optimal. Successful early training only somewhat guarantees you won't diverge; to end up with the best final weights you'll have to do an additional hyperparameter search (and there is no guarantee your early-training checkpoint will lead to the best weights in the end either).

20

ButthurtFeminists t1_ir1h1ir wrote

This could work as well, but there may be slight differences: it's inherently harder to converge on larger datasets. So if your goal is to see how the model performs once it has converged on the dataset, then just running fewer epochs may not be the best choice.

16

FinalNail t1_ir1o6p8 wrote

Downsample the data, and look into representative sampling or naive stratified sampling.

3

Ttttrrrroooowwww t1_ir1tfxo wrote

Mini-set training. This partial dataset should roughly reflect the mean/distribution of your actual dataset. Also, if it is very small, the validation set should be a little larger.

For the learning rate, tune a “base learning rate” and scale it to your desired batch size using the sqrt_k or linear_k rule: https://stackoverflow.com/questions/53033556/how-should-the-learning-rate-change-as-the-batch-size-change. Personally, the sqrt_k rule works very well for me, but linear_k works too (depending on the problem/model).
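
Concretely, with made-up numbers (the base values are assumed to come from tuning on the mini set):

```python
import math

base_lr = 1e-3           # tuned at the reference batch size on the mini set
base_batch_size = 64
target_batch_size = 512  # the batch size you actually want to train with

k = target_batch_size / base_batch_size
lr_linear = base_lr * k            # linear_k rule
lr_sqrt = base_lr * math.sqrt(k)   # sqrt_k rule

print(f"linear rule: {lr_linear:.2e}, sqrt rule: {lr_sqrt:.2e}")
```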

1

bbstats t1_ir1u2ny wrote

2 solutions:

  • automatic resource allocation (successive halving): HalvingRandomSearchCV (sklearn)
  • a very good algorithm for finding good hyperparameters quickly: Huawei's HEBO

The first is probably your best option
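
For reference, a minimal sklearn sketch of that first option on a synthetic dataset (the estimator and search ranges here are placeholders; note the experimental enable import):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required to expose the class)
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Two log-scale hyperparameters, as in the original post.
param_distributions = {
    "alpha": loguniform(1e-7, 1e-1),
    "eta0": loguniform(1e-4, 1e0),
}

# Early rounds run many candidates on small data subsets; only the
# best survivors get the full training budget.
search = HalvingRandomSearchCV(
    SGDClassifier(learning_rate="constant", random_state=0),
    param_distributions,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```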

2

suflaj t1_ir26f5c wrote

What is there to explain? The statement is quite self-explanatory: by fixing the number of training steps, you are not exploring other numbers of training steps as a hyperparameter. It's as if you fixed any other hyperparameter to a constant - you're going to have an incomplete search.

However, a user usually has an idea of how many steps the training should take. So you don't do random or grid search on the number of steps; instead you fix it to the number you will need for your final training run.

If you wanted to fully search the hyperparameters, then you'd also do grid search on the number of steps. This shouldn't come as a surprise when, e.g., the XGBoost equivalent of training steps, the number of estimators, is one of the most important hyperparameters you do search over.

Where I work, we do this as the last step, isolated from the other hyperparameters - but only to find out whether we need MORE training than we originally estimated. This is mostly done to account for the stochasticity of augmentations screwing up the model.

5

The_Bundaberg_Joey t1_ir2ca0a wrote

Yo! All good ideas so far, but have you considered using a smaller or non-grid-based experimental design?

For only 2 hyperparameters you could likely get away with far fewer points, then build a model on those results to better understand their relationship to your target (however you're evaluating your model in your original grid search).

Best of luck to you!

1

Doppe1g4nger t1_ir2xuso wrote

What's the value in this relative to just using an early stopping criterion and holding out some of your dataset as a validation set to monitor for overfitting / when the model has maxed out its performance?

3

suflaj t1_ir2yl5r wrote

Because early-stopping performance is not indicative of final performance, even more so when using Adam(W).

I don't know why I have to keep repeating this: early stopping is analogous to fixing a hyperparameter to a constant. It doesn't matter whether you stop at N steps, at a plateau, or at an accuracy threshold. You can do it, but then it's not a thorough search.

You can consider it thorough if the number of steps is comparable to the number of steps you will use to actually train the model. You can even consider it thorough if you slightly increase the number of training steps for the final model, since effects related to overparametrization take a long time to converge.

As long as the increase in training steps is shorter than the time it takes for those overparametrization side effects to converge, your results will be representative of the actual final training run. If it is longer, it's again anyone's game, only this time potentially even more dependent on initialization than any step before (but in reality those effects are not yet understood well enough to conclude anything relevant here).

Personally, if accounting for the side effects of overparametrization, I would not do hyperparameter tuning at all - instead I would retrain from scratch several times with "good" hyperparameters for as long as it takes and play around with weight-averaging schemes.

1

Dubgarden t1_ir33dhb wrote

Maybe check out the Asynchronous Successive Halving Algorithm (ASHA).

1

king_of_walrus t1_ir3egvh wrote

I have a similar problem - some of my models have taken upwards of 10 days to train! So, I have developed a strategy that is working reasonably well.

First, I work with image data and I always start by training and evaluating models at a lower resolution. For example, if I were using the CelebA-HQ dataset, I would do all initial development with 128x128 images, then scale up the resolution once my results are good. Things often translate reasonably well when scaling up, and this allows much more rapid prototyping.
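
With torchvision, for example, switching resolution is a one-line change in the input pipeline (the dataset path below is a placeholder):

```python
from torchvision import datasets, transforms

def make_dataset(root, resolution):
    tf = transforms.Compose([
        transforms.Resize(resolution),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
    ])
    return datasets.ImageFolder(root, transform=tf)

# Hyperparameter search and prototyping at low resolution, final runs at full resolution.
dev_set = make_dataset("path/to/celeba_hq", resolution=128)
final_set = make_dataset("path/to/celeba_hq", resolution=1024)
```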

Another strategy that has worked well for me is fine-tuning. I train a base model with “best guess” hyperparameters to completion. Then I fine-tune for a quarter of the total training time, modifying one hyperparameter of interest while keeping everything else the same. For my work, this amount of time has been enough to see the effects of the changes and to determine clear winners. In a few cases, I have been able to verify my fine-tuning results by training the model from scratch under the different configurations - this is what gives me confidence in the approach. I find that this strategy still works when I have hyperparameters which impact one another; holding one constant and optimizing the other works pretty well to balance them.

I should note that you probably don’t need to tune most hyperparameters, unless it is one you are adding. If it isn’t something novel I feel like there is bound to be a reference in the literature that has what you’re looking for. This is worth keeping in mind, I think.

Overall, it’s not really worth going to great lengths to tune things unless your results are really bad or you’re being edged out by a competitor. However, if your results are really bad that probably speaks to a larger issue.

2

red_dragon t1_ir3t4b6 wrote

I'm running into this problem with Adam(W). Are there any suggestions on how to avoid it? Many of my experiments start off better than the baseline but ultimately do worse.

1

VirtualHat t1_ir3tiey wrote

Here are some options:

  1. Tune a smaller network, then apply the hyperparameters to the larger one and 'hope for the best'.
  2. As others have said, train for less time, for example 10 epochs rather than 100. I typically find this produces misleading results though (the best performer is often poor early on).
  3. For low dim (2D), perform a very coarse grid search (space samples an order of magnitude apart, maybe two), then just use the best model. This is often the best method, as you don't want to over-tune the hyperparameters.
  4. For high dim, just use random search, then marginalize over all but one parameter using the mean of the best 5 runs. This works really well (see the sketch after this list).
  5. If the goal is to compare two methods rather than to maximize the score, you can often just use other people's hyperparameters.
  6. Bayesian optimization is usually not worth the time. In small dims do grid search, in large dims do random search.
  7. If you have the resources, train your models in parallel. This is a really easy way to make use of multiple GPUs if you have them.
  8. In some cases you can perform early stopping for models which are clearly not working. I try not to do this though.
  9. When I do HPS, I do it on a different dataset than my main one. This helps make things quicker. I'm doing RL though, so it's a bit different I guess.
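
A sketch of option 4, with a toy objective standing in for a real training run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective standing in for "train the network, return validation accuracy".
def val_accuracy(lr, wd):
    return (1.0
            - 0.05 * (np.log10(lr) + 3) ** 2
            - 0.02 * (np.log10(wd) + 4) ** 2
            + rng.normal(0.0, 0.01))

# Random search: sample both hyperparameters log-uniformly.
lrs = 10 ** rng.uniform(-6, 0, size=50)
wds = 10 ** rng.uniform(-8, -2, size=50)
scores = np.array([val_accuracy(lr, wd) for lr, wd in zip(lrs, wds)])

# "Marginalize" over weight decay: for each coarse learning-rate bin,
# average the best 5 runs that landed in that bin.
bins = np.floor(np.log10(lrs)).astype(int)
for b in sorted(set(bins)):
    top = np.sort(scores[bins == b])[::-1][:5]
    print(f"lr ~ 1e{b}: mean of best {len(top)} runs = {top.mean():.3f}")
```
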
1

b4shyou t1_ir47sbo wrote

Typically you just run the training in parallel 36 times; that's why many papers that include hyperparameter tuning come from big institutes.

1

StephenSRMMartin t1_ir49a19 wrote

No one seems to be mentioning Bayesian optimization - but I'll suggest Bayesian optimization.

Yes, you probably need to use a subsample or a reduced model. But Bayesian optimization is a principled approach to exactly this problem.

1

bill_klondike t1_ir4aqt6 wrote

I'm using Latin hypercube sampling with positive results.
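
For two log-scale hyperparameters, SciPy's QMC module makes this a few lines (the bounds below are just an example):

```python
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=12)                 # 12 points in the unit square

# Map to log10-space bounds, then exponentiate:
# learning rate in [1e-6, 1e0], weight decay in [1e-8, 1e-2].
log_lo, log_hi = [-6, -8], [0, -2]
samples = 10 ** qmc.scale(unit, log_lo, log_hi)

for lr, wd in samples:
    print(f"lr={lr:.2e}, wd={wd:.2e}")
```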

1

ginsunuva t1_ir4gyuv wrote

Ones that do well initially usually don’t correspond to those that do the best by the end.

A simple example is higher learning rates, but other parameters can affect this unexpectedly as well.

1

phat-gandalf t1_ir4hud3 wrote

Subset your data, parallelize, and split tuning into multiple rounds: start with a lower-density search to narrow down a reasonable range of values first.

1

suflaj t1_ir4ow8t wrote

Switch to SGD after 1 epoch or so

But if they do worse than the baseline, something else is likely the problem. Adam(W) does not kill performance; it just, for some reason, isn't as effective at reaching the best final performance as simpler optimizers.
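
In PyTorch the switch is just a matter of constructing a new optimizer over the same parameters; a toy sketch:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)   # stand-in for your network
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(100)]
loss_fn = nn.CrossEntropyLoss()

def run_epoch(optimizer):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Warm up with Adam(W) for the first epoch...
run_epoch(optim.AdamW(model.parameters(), lr=1e-3))

# ...then hand over to SGD (with momentum) for the remaining epochs.
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
for _ in range(9):
    run_epoch(sgd)
```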

0

SatoshiNotMe t1_ir4wbon wrote

Technically, what you're talking about is early stopping of “trials” in HP tuning. PBT is different - that involves changing the hyperparameters during training. And yes, you can use PBT in tuning.

1

encord_team t1_ir511vx wrote

Use Bayesian optimisation! Fit a Gaussian process to your model's performance as a function of the hyperparameters. Run your network on a fraction of your dataset a few times until the GP has a few samples to work with, then search the hyperparameters by evaluating the GP at different points.
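
A sketch of that GP-based loop using scikit-optimize's gp_minimize (assuming it's installed), where the objective below is a stand-in for "train on a data fraction, return validation loss":

```python
import math

from skopt import gp_minimize
from skopt.space import Real

# Stand-in for "train on a fraction of the data, return validation loss".
def objective(params):
    lr, weight_decay = params
    return (math.log10(lr) + 3) ** 2 + 0.1 * (math.log10(weight_decay) + 4) ** 2

space = [
    Real(1e-6, 1e0, prior="log-uniform", name="lr"),
    Real(1e-8, 1e-2, prior="log-uniform", name="weight_decay"),
]

# A Gaussian process is fit to the evaluations so far and proposes the next point.
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print(result.x, result.fun)
```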

1

aWildTinoAppears t1_irdceof wrote

>early stopping performance is not indicative of final performance [...] early stopping is analogous to fixating a hyperparameter value to a constant

These statements aren't true. The whole point of checkpointing on your target validation metric and using a large enough early-stopping patience is that it is a very reasonable proxy for final or peak performance. Google's Vizier and other black-box hparam search methods are built with this as a core underlying assumption.

1

suflaj t1_irdvhm4 wrote

You should probably read up on double descent and the lottery ticket hypothesis. Google engineers have been wrong plenty of times in their "hypotheses". Furthermore, you're referring to a system from 2017, so 2.5 eternities ago, when these phenomena were not even known.

Also, what does "reasonable" mean? I would argue that it depends heavily on the other hyperparameters, the architecture of the model, and the data, and as such isn't generally applicable. It's about as reasonable as assuming 3e-4 is a good learning rate: there are plenty of counterexamples where the network doesn't converge with it, so it can't be considered generally reasonable.

0

aWildTinoAppears t1_irg015e wrote

I'm a researcher at Google AI and use Vizier daily, as does everyone in Brain. I've also published work extending the lottery ticket hypothesis (which is from March 2018, so only 2 eternities ago?). Optimal convergence of iteratively pruned networks happens at or before the original max number of steps, so with best checkpointing and early stopping, the max training step will rarely be hit by lottery tickets. This doesn't support your claims at all.

"Reasonable" here means "good enough relative to a network that trains for the max number of steps without hitting the early stopping criterion". You realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc., right?

0

suflaj t1_irg1qkg wrote

> reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion".

First off, how would you know there is a max number of steps (before even knowing what it is)? There is previous work on certain architectures which can give you an idea of what a good number of steps is, but:

  • there is no theoretical guarantee that it is the optimum
  • there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

So the statement is ill-defined and, at best, an opinion.

> you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?

Yes. And it is at the same time an incomplete search, and it takes far more than a guess to determine how it should be done and to what extent. There is a fair number of counterexamples where we didn't know something was suboptimal until it was proven otherwise, most famously with double descent. This is something you can't just ignore, and it's a clear and very common case where any kind of early stopping and checkpointing will fail to find even a reasonably good configuration.

I feel like, as a researcher, you shouldn't double down on things you should know are neither conclusive nor built on strong fundamentals. "Good enough" does not cut it in the general case, especially given that modern models do not even go over the whole dataset in an epoch, and as such might be broken at any one of your checkpoints.

And given that you say you work at Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making educated guesses about the hyperparameters, rather than by thoroughly exploring them.

−1

aWildTinoAppears t1_irg5akq wrote

> First of, how would you know there is a max number of steps (before even knowing what they are)?

Set an arbitrarily high max number of steps and early-stopping patience, run a hyperparameter sweep, and look at validation performance on TensorBoard to make sure everything converges/overfits prior to hitting max steps or the early stopping criterion. The point here is that the max number of training steps does not need to be a tuned hyperparameter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early-stopping performance because of the checkpointing strategy.
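
For reference, the protocol being described is roughly the following (a minimal sketch with toy stand-ins for the training step and validation metric):

```python
import random

# Toy stand-ins: replace with your real training step and validation metric.
def train_one_step():
    pass

def evaluate(step):
    # fake validation accuracy: improves, then plateaus with noise
    return min(0.9, step / 5000) + random.gauss(0.0, 0.01)

best_metric = float("-inf")
evals_since_improvement = 0
patience = 10        # number of evaluations without improvement before stopping
max_steps = 10**6    # deliberately far beyond what you expect to need
eval_every = 200

for step in range(1, max_steps + 1):
    train_one_step()
    if step % eval_every == 0:
        metric = evaluate(step)
        if metric > best_metric:
            best_metric = metric
            evals_since_improvement = 0
            # save_checkpoint() would go here; the best checkpoint is what gets reported
        else:
            evals_since_improvement += 1
            if evals_since_improvement >= patience:
                break

print(f"stopped at step {step}, best validation metric {best_metric:.3f}")
```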

> there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

> And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

I feel like you are missing the core point, which is that early stopping + checkpointing is literally how we capture final performance. We are running O(100) workers in parallel for a total of O(1000) model configurations selected by Vizier (i.e. Bayesian optimization of hparam choices based on past candidate performance). I never said that models are "developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters." I said that this statement is incorrect: "early stopping performance is not indicative of final performance".

2

suflaj t1_irg8l9l wrote

> The point here is that max training steps does not need to be a tuned hyperparamter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.

My point is that you CANNOT guarantee your model is done learning. Again, I will say it, and please don't ignore it if you wish to discuss this further: double descent (or overparametrization side effects in general). Also, there are training setups where you cannot even process the whole dataset, and you are basically gambling that the dev set you chose is as representative as whatever the model will see in training. You can both overshoot and undershoot. This is not only about the number of steps, but also about the batch size and learning rate schedules.

> Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

I was not saying that. What I was saying is that even the other hyperparameters might be wildly wrong. I take it you have worked with Adam-based optimizers. They generally do not care about hyperparameters in the training period where they are most effective, but other incorrect hyperparameters can have more severe consequences that you will simply not explore if you stop early. In the modern era, if you have a budget for hyperparameter optimization, you check a number of steps well beyond what you intend to train for, so early stopping has no place outside of very old models, 3+ eternities old. Those are nowadays a special case, given the sheer size of modern models.

> I said that this statement is incorrect "early stopping performance is not indicative of final performance".

And in doing so you have ignored a very prevalent counterexample: double descent. It is not rare (anymore), it is not made up, and it is well documented, just poorly understood.

0

aWildTinoAppears t1_irj3m1p wrote

> My point is that you CANNOT guarantee your model is done learning

Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

I've ignored it because the papers you are citing aren't claiming exactly what you hope they are:

> Further, we show at least one setting where model-wise double descent can still occur even with optimal early stopping (ResNets on CIFAR-100 with no label noise, see Figure 19). *We have not observed settings where more data hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude this from occurring. We leave fully understanding the optimal early stopping behavior of double descent as an important open question for future work.*

They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

> 3+ eternities

Moving goalposts again; also, double descent is from the end of 2019.

> Again, I will say it, please don't ignore it if you wish to discuss this further

I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

1

suflaj t1_irj50vf wrote

> Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

Great. Now notice we are speaking of theory. In practice in DL trial and error is usually better than formally analyzing or optimizing something.

> They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

Great. One thing to notice: you are making the claim that early stopping is good enough. I am making the claim that, because of double descent and our incomplete understanding of it, you cannot make such claims. Those are just guesses, and not even well-informed ones.

To make such claims, the prerequisite would be to first show, beyond reasonable doubt, that your model does not exhibit overparametrization side effects. That would mean that, instead of stopping early, you run it for much longer than you intend to. THEN you can do these checkpointing optimizations, if it turns out you don't have to worry about it.

But usually it is enough to just get it working well enough instead of formally optimizing the hyperparameters, because whatever optimization you do, it cannot account for unseen data. My point is not that this is better; it's that whatever you do, you are guessing, and you might as well take cheaper guesses if you're not interested in it being very robust.

> Moving goal posts again, also dd is from eoy 2019.

What do you mean, moving goalposts again? 3 eternities refers to 6 years ago, i.e. 2016. That is the last time models were small enough for double descent to be basically undetectable, since Attention Is All You Need was released in June 2017 and had been worked on for quite some time before that. Double descent was formally described in 2019, yes, but the phenomena it describes happened way before that, and in my experience transformers were the first to exhibit it in pretraining. Maybe it was even more than 3+ eternities ago that we had models that experienced double descent; I have not been doing DL seriously for that long.

> I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

You might have gotten the wrong person for the job, as we mostly do engineering, but love that you put in the effort to try and stalk me :)

Given that this has become personal, rather than sticking to the topic, I will not respond anymore either.

1