Submitted by twocupv60 t3_xvem36 in MachineLearning

My network takes about 24 hours to train. I have 2 hyperparameters to tune, and assuming each parameter is worth trying at roughly 6 values (one per order of magnitude), I would have to run my network 36 times to cover the grid and find the best combination. That would take me over a month! This seems quite long.

I see a lot of papers doing hyperparameter tuning. Do they have smaller networks that can train faster? Is some trick used to speed up the search process?

87

Comments

RandomIsAMyth t1_ir0hh8v wrote

Smaller networks are one way to go, indeed. Use a similar architecture but much smaller, small enough that you can get a result in ~1 hour. Then you can distribute the sweep using Weights & Biases or another similar framework.

2

neu_jose t1_ir0hurf wrote

I would tune on a smaller version of your model.

8

neato5000 t1_ir0rkr8 wrote

You do not need to train to completion to discard hyperparameter settings that will not perform well. In general, early relative performance is a good predictor of final performance, so if, within the early stages of training, a certain hp vector is performing worse than its peers, kill it and start training with the next hp vector.

This is roughly the logic behind population-based training (PBT).
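
A minimal sketch of that early-discard loop (plain successive halving rather than full PBT), with a toy stand-in for the real train/validate step:

```python
import itertools
import math
import random

# Toy stand-in for "train one more epoch, then measure validation loss".
# Replace with your real train/validate step; here the loss is just a noisy
# function of the hyperparameters that improves with more epochs.
def noisy_val_loss(lr, wd, epoch):
    base = abs(math.log10(lr) + 3) + 100 * wd   # pretend lr=1e-3, wd=0 is best
    return base / math.sqrt(epoch) + random.gauss(0.0, 0.05)

# Candidate hyperparameter vectors (learning rate x weight decay).
candidates = list(itertools.product([1e-4, 1e-3, 1e-2], [0.0, 1e-4, 1e-2]))

# After every epoch, rank the candidates and keep only the better half.
epoch = 0
while len(candidates) > 1 and epoch < 20:
    epoch += 1
    ranked = sorted(candidates, key=lambda hp: noisy_val_loss(*hp, epoch))
    candidates = ranked[: max(1, len(ranked) // 2)]

print(f"surviving hyperparameters after {epoch} epochs: {candidates[0]}")
```

Proper PBT would also mutate the hyperparameters of the survivors between rounds, but the early-discard logic is the same.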

52

boggog t1_ir0xat9 wrote

You can try Hyperband and only go to 5 or 10 epochs, hoping that the hyperparameters that look better at low epoch counts are also better at convergence. You might also try to optimize the hyperparameters on less data.
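
If you don't want to wire that up yourself, Optuna's HyperbandPruner (assuming Optuna is available) does the low-epoch triage for you; the training loop below is just a toy placeholder:

```python
import optuna


def objective(trial):
    # Two hyperparameters searched on a log scale, as in the original post.
    lr = trial.suggest_float("lr", 1e-6, 1e0, log=True)
    wd = trial.suggest_float("weight_decay", 1e-8, 1e-2, log=True)

    # Toy stand-in for a training loop: report an intermediate validation
    # metric after every "epoch" so the pruner can stop bad trials early.
    score = 0.0
    for epoch in range(10):
        score = -((lr - 1e-3) ** 2 + (wd - 1e-4) ** 2) + 0.01 * epoch  # fake metric
        trial.report(score, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score


study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=10),
)
study.optimize(objective, n_trials=30)
print(study.best_params)
```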

9

XtremePocket t1_ir0xpjw wrote

μTransfer offers a (sort of) theoretically guaranteed way of transferring the optimal hyperparameters of scaled-down versions of a model to the full-size one. I haven't tried it in practice, but maybe give it a try?

3

caedin8 t1_ir0z49w wrote

Hyperparameter tuning should be a last step. It's not really necessary for 99% of production workloads; it's mostly for getting results publishable in papers.

I'd avoid it if possible and just go with reasonable hyperparameters. If you hit a point where you can't get any better without tuning, decide what you actually need: if you are trying to publish and need more accuracy, bite the bullet and hold the paper until the search finishes; if it's a business case, work out whether the extra revenue from the extra accuracy offsets the cost of the extra compute.

3

ButthurtFeminists t1_ir0zlg1 wrote

I'm surprised this one hasn't been mentioned already.

Long training could be due to model complexity and/or dataset size. So if it's difficult to downscale your model, you can use a subset of your dataset instead. For example, say I'm training a ResNet-152 model on ImageNet: if I wanted to reduce tuning time, I could sample a subset of ImageNet (maybe 1/10 the size), tune hyperparameters on that, and then test the best hyperparameters on the full dataset.
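
With a PyTorch ImageFolder-style dataset, carving out a fixed random tenth could look like this (the path is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Placeholder path; assumes an ImageNet-style class-per-folder layout.
full_train = datasets.ImageFolder("path/to/imagenet/train",
                                  transform=transforms.ToTensor())

# Fixed seed so every hyperparameter trial sees the exact same 1/10 subset.
g = torch.Generator().manual_seed(0)
indices = torch.randperm(len(full_train), generator=g)[: len(full_train) // 10]
small_train = Subset(full_train, indices.tolist())

loader = DataLoader(small_train, batch_size=256, shuffle=True, num_workers=8)
# Tune on `loader`, then retrain the best configuration on `full_train`.
```

A purely random subset keeps ImageNet's class balance roughly intact; for heavily skewed datasets you'd want stratified sampling instead.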

101

caedin8 t1_ir0zmpw wrote

I'll add that the value of machine learning is the dynamic nature of the solution. In a production situation, retraining quickly every day with weaker hyperparameters will most likely lead to higher total applied accuracy than retraining once a month with hyperparameter tuning. If the tuned solution really is better, then the problem space is very static, and you might want to rethink your ML approach.

4

techlos t1_ir131zp wrote

Two things you can do are early stopping and using a subset of your dataset.

In my experience, hyperparams that have the best convergence at 3~5 epochs will generalize to pretty good convergence on a full training run. It won't guarantee the best performance, but if you're on a budget it's a great compromise.

14

HennesTD t1_ir1b0gg wrote

I don't quite get the idea behind training on a smaller subset of the data, although it might be just that it doesn't work in my case.

In my specific case I tried training an ASR model on LibriSpeech. Training it on 1/10th of the LibriSpeech 360h data gave me pretty much the exact same loss curve in the first hours of training, and no better HP setting that I could have spotted earlier. It gets through more epochs in that time, yes, but seeing a real difference between the curves of two HP settings took basically the same wall-clock time.

2

bphase t1_ir1dktk wrote

Wouldn't it be more beneficial to just perform 1/10 of the steps or epochs? No need to use a subset of the data, just train for less time. The end result is that you won't get the best performance anyway.

14

suflaj t1_ir1ejki wrote

You should probably try to reduce your dataset size first and then tune hyperparameters with that.

What I would do is start with randomly sampled 100 samples. Train fully with that. Then double it for the same hyperparameters and see how the performance changes. You want to stop when the performance no longer changes significantly after doubling the data.

How much is "significantly"? Well, I would personally stop when doubling the data no longer halves the test error. But that criterion is arbitrary, so ymmv, and you should adjust it based on how fast the error improves. Think of what performance would be acceptable to an average person who is neither stupid nor informed enough to know your model could be much better. You just need enough data for your hyperparameter comparisons to be representative.
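
A sketch of that doubling loop, with a toy error curve standing in for "train from scratch on n random samples and measure test error":

```python
import random

# Toy stand-in: error halves with each doubling for a while, then hits a floor.
# Replace with a real "train on n samples, return test error" function.
def train_and_eval(n_samples):
    return max(0.4 * (100 / n_samples), 0.03) + random.gauss(0.0, 0.002)

n = 100
prev_err = train_and_eval(n)
while True:
    n *= 2
    err = train_and_eval(n)
    print(f"{n:>6} samples -> test error {err:.3f}")
    # Stop once doubling the data no longer (roughly) halves the test error.
    if err > 0.5 * prev_err:
        break
    prev_err = err

print(f"tune hyperparameters on ~{n} samples, train the final model on everything")
```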

If you do not know how to tune that, then try clustering your data strictly. E.g., if you have text, you could split it into word 2-grams, use MinHashes, and then set the threshold for a cluster at 1% similarity. This will give you very few clusters, from which you can pick one representative each and use those as your dev/tuning set.
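
A rough sketch of that clustering step using the datasketch library (assuming it's available); the threshold and the greedy cluster assignment are just illustrative:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    # Hash the word 2-grams of a document into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for a, b in zip(words, words[1:]):
        m.update(f"{a} {b}".encode("utf8"))
    return m

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat sat on a mat",
    "d3": "completely different text about hyperparameters",
}
signatures = {k: minhash_of(v) for k, v in docs.items()}

# Very permissive threshold (1% estimated Jaccard similarity) -> few, broad clusters.
lsh = MinHashLSH(threshold=0.01, num_perm=128)
for k, m in signatures.items():
    lsh.insert(k, m)

# Greedy clustering: each unassigned doc pulls in everything the LSH index returns.
seen, representatives = set(), []
for k, m in signatures.items():
    if k in seen:
        continue
    cluster = [d for d in lsh.query(m) if d not in seen]
    seen.update(cluster)
    representatives.append(cluster[0])   # one representative per cluster

print(representatives)   # use these as the small dev/tuning set
```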

When you reach those diminishing returns, search your hyperparameters randomly within a distribution and then train with the best ones on the full dataset. Depending on the network, the diminishing-returns point will be anywhere from 1k samples (CV ResNets) to 100k samples (finetuning transformers).

9

suflaj t1_ir1fjgt wrote

This is not true in practice for modern DL models, especially those trained with modern optimization methods like Adam(W). Adam(W) can give the best performance at the start, but then it's anyone's game until the end of training.

In other words, not only will the optimal hyperparameters probably be different (because you need to switch to SGD to reach max performance), you will also have to retune the hyperparameters you already accepted as optimal. Successful early training only somewhat guarantees you won't diverge; to end up with the best final weights you'll have to do an additional hyperparameter search (and there is no guarantee your early-training checkpoint will lead to the best weights in the end either).

20

ButthurtFeminists t1_ir1h1ir wrote

This could work as well, but there may be slight differences: it's inherently harder to converge on larger datasets. So if your goal is to see how the model performs once it has converged on the dataset, then just running fewer epochs may not be the best choice.

16

FinalNail t1_ir1o6p8 wrote

Downsample the data, and look into representative sampling or naive stratified sampling.

3

Ttttrrrroooowwww t1_ir1tfxo wrote

Mini-set training. This partial dataset should roughly reflect the mean/distribution of your actual dataset. Also, if it is very small, the validation set should be a little larger.

For the learning rate, tune a “base learning rate” and scale it to your desired batch size using the sqrt_k or linear_k rule: https://stackoverflow.com/questions/53033556/how-should-the-learning-rate-change-as-the-batch-size-change. Personally, the sqrt_k rule works very well for me, but linear_k works too (depending on the problem/model).
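
Concretely, with made-up numbers (the base values are assumed to come from tuning on the mini set):

```python
import math

base_lr = 1e-3           # tuned at the reference batch size on the mini set
base_batch_size = 64
target_batch_size = 512  # the batch size you actually want to train with

k = target_batch_size / base_batch_size
lr_linear = base_lr * k            # linear_k rule
lr_sqrt = base_lr * math.sqrt(k)   # sqrt_k rule

print(f"linear rule: {lr_linear:.2e}, sqrt rule: {lr_sqrt:.2e}")
```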

1

bbstats t1_ir1u2ny wrote

2 solutions:

  • automatic resource allocation (successive halving): HalvingRandomSearchCV (sklearn)
  • a very good algorithm for finding good hyperparameters quickly: Huawei's HEBO

The first is probably your best option
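
For reference, a minimal sklearn sketch of that first option on a synthetic dataset (the estimator and search ranges here are placeholders; note the experimental enable import):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required to expose the class)
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Two log-scale hyperparameters, as in the original post.
param_distributions = {
    "alpha": loguniform(1e-7, 1e-1),
    "eta0": loguniform(1e-4, 1e0),
}

# Early rounds run many candidates on small data subsets; only the
# best survivors get the full training budget.
search = HalvingRandomSearchCV(
    SGDClassifier(learning_rate="constant", random_state=0),
    param_distributions,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```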

2

suflaj t1_ir26f5c wrote

What is there to explain? The statement is quite self-explanatory: by fixing the number of training steps, you are not exploring other numbers of training steps as a hyperparameter. It's as if you fixed any other hyperparameter to a constant - you're going to have an incomplete search.

However, a user usually has an idea of how many steps the training should take. So you don't do random or grid search on the number of steps; instead you fix it to the number you will need for your final training run.

If you wanted to fully search the hyperparameters, then you'd also do grid search on the number of steps. This shouldn't come as a surprise when, e.g., the XGBoost equivalent of training steps, the number of estimators, is one of the most important hyperparameters you do search over.

Where I work, we do this as the last step, isolated from the other hyperparameters - but only to find out whether we need MORE training than we originally estimated. This is mostly done to account for the stochasticity of augmentations screwing up the model.

5

The_Bundaberg_Joey t1_ir2ca0a wrote

Yo! All good ideas so far, but have you considered using a smaller or non-grid-based experimental design?

For only 2 hyperparameters you could likely get away with far fewer points, then build a model on those results to better understand their relationship to your target (however you're evaluating your model in your original grid search).

Best of luck to you!

1

Doppe1g4nger t1_ir2xuso wrote

What's the value in this relative to just using an early stopping criterion and holding out some of your dataset as a validation set to monitor for overfitting / when the model has maxed out its performance?

3

suflaj t1_ir2yl5r wrote

Because early-stopping performance is not indicative of final performance, even more so when using Adam(W).

I don't know why I have to keep repeating this: early stopping is analogous to fixing a hyperparameter to a constant. It doesn't matter whether you stop at N steps, at a plateau, or at an accuracy threshold. You can do it, but then it's not a thorough search.

You can consider it thorough if the number of steps is comparable to the number of steps you will use to actually train the model. You can even consider it thorough if you slightly increase the number of training steps for the final model, since effects related to overparametrization take a long time to converge.

As long as the increase in training steps is shorter than the time it takes for those overparametrization side effects to converge, your results will be representative of the actual final training run. If it is longer, it's again anyone's game, only this time potentially even more dependent on initialization than any step before (but in reality those effects are not yet understood well enough to conclude anything relevant here).

Personally, if accounting for the side effects of overparametrization, I would not do hyperparameter tuning at all - instead I would retrain from scratch several times with "good" hyperparameters for as long as it takes and play around with weight-averaging schemes.

1

Dubgarden t1_ir33dhb wrote

Maybe check out the Asynchronous Successive Halving Algorithm (ASHA).

1

king_of_walrus t1_ir3egvh wrote

I have a similar problem - some of my models have taken upwards of 10 days to train! So, I have developed a strategy that is working reasonably well.

First, I work with image data and I always start by training and evaluating models at a lower resolution. For example, if I were using the CelebA-HQ dataset, I would do all initial development with 128x128 images, then scale up the resolution once my results are good. Things often translate reasonably well when scaling up, and this allows much more rapid prototyping.
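
With torchvision, for example, switching resolution is a one-line change in the input pipeline (the dataset path below is a placeholder):

```python
from torchvision import datasets, transforms

def make_dataset(root, resolution):
    tf = transforms.Compose([
        transforms.Resize(resolution),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
    ])
    return datasets.ImageFolder(root, transform=tf)

# Hyperparameter search and prototyping at low resolution, final runs at full resolution.
dev_set = make_dataset("path/to/celeba_hq", resolution=128)
final_set = make_dataset("path/to/celeba_hq", resolution=1024)
```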

Another strategy that has worked well for me is fine-tuning. I train a base model with “best guess” hyperparameters to completion. Then I fine-tune for a quarter of the total training time, modifying one hyperparameter of interest while keeping everything else the same. For my work, this amount of time has been enough to see the effects of the changes and to determine clear winners. In a few cases, I have been able to verify my fine-tuning results by training the model from scratch under the different configurations - this is what gives me confidence in the approach. I find that this strategy still works when I have hyperparameters which impact one another; holding one constant and optimizing the other works pretty well to balance them.

I should note that you probably don’t need to tune most hyperparameters, unless it is one you are adding. If it isn’t something novel I feel like there is bound to be a reference in the literature that has what you’re looking for. This is worth keeping in mind, I think.

Overall, it’s not really worth going to great lengths to tune things unless your results are really bad or you’re being edged out by a competitor. However, if your results are really bad that probably speaks to a larger issue.

2

red_dragon t1_ir3t4b6 wrote

I'm running into this problem with Adam(W). Are there any suggestions on how to avoid it? Many of my experiments start off better than the baseline but ultimately do worse.

1

VirtualHat t1_ir3tiey wrote

Here are some options:

  1. Tune a smaller network, then apply the hyperparameters to the larger one and 'hope for the best'.
  2. As others have said, train for less time, for example 10 epochs rather than 100. I typically find this produces misleading results though (the best performer is often poor early on).
  3. For low dim (2D), perform a very coarse grid search (space samples an order of magnitude apart, maybe two), then just use the best model. This is often the best method, as you don't want to over-tune the hyperparameters.
  4. For high dim, just use random search, then marginalize over all but one parameter using the mean of the best 5 runs. This works really well (see the sketch after this list).
  5. If the goal is to compare two methods rather than to maximize the score, you can often just use other people's hyperparameters.
  6. Bayesian optimization is usually not worth the time. In small dims do grid search, in large dims do random search.
  7. If you have the resources, train your models in parallel. This is a really easy way to make use of multiple GPUs if you have them.
  8. In some cases you can perform early stopping for models which are clearly not working. I try not to do this though.
  9. When I do HPS, I do it on a different dataset than my main one. This helps make things quicker. I'm doing RL though, so it's a bit different I guess.
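
A sketch of option 4, with a toy objective standing in for a real training run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective standing in for "train the network, return validation accuracy".
def val_accuracy(lr, wd):
    return (1.0
            - 0.05 * (np.log10(lr) + 3) ** 2
            - 0.02 * (np.log10(wd) + 4) ** 2
            + rng.normal(0.0, 0.01))

# Random search: sample both hyperparameters log-uniformly.
lrs = 10 ** rng.uniform(-6, 0, size=50)
wds = 10 ** rng.uniform(-8, -2, size=50)
scores = np.array([val_accuracy(lr, wd) for lr, wd in zip(lrs, wds)])

# "Marginalize" over weight decay: for each coarse learning-rate bin,
# average the best 5 runs that landed in that bin.
bins = np.floor(np.log10(lrs)).astype(int)
for b in sorted(set(bins)):
    top = np.sort(scores[bins == b])[::-1][:5]
    print(f"lr ~ 1e{b}: mean of best {len(top)} runs = {top.mean():.3f}")
```
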
1

b4shyou t1_ir47sbo wrote

Typically you just run the training in parallel 36 times; that's why many papers that include hyperparameter tuning come from big institutes.

1

StephenSRMMartin t1_ir49a19 wrote

No one seems to be mentioning Bayesian optimization - but I'll suggest Bayesian optimization.

Yes, you probably need to use a subsample or a reduced model. But Bayesian optimization is a principled approach to exactly this problem.

1

bill_klondike t1_ir4aqt6 wrote

I'm using Latin hypercube sampling with positive results.
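
For two log-scale hyperparameters, SciPy's QMC module makes this a few lines (the bounds below are just an example):

```python
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=12)                 # 12 points in the unit square

# Map to log10-space bounds, then exponentiate:
# learning rate in [1e-6, 1e0], weight decay in [1e-8, 1e-2].
log_lo, log_hi = [-6, -8], [0, -2]
samples = 10 ** qmc.scale(unit, log_lo, log_hi)

for lr, wd in samples:
    print(f"lr={lr:.2e}, wd={wd:.2e}")
```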

1

ginsunuva t1_ir4gyuv wrote

Ones that do well initially usually don’t correspond to those that do the best by the end.

A simple example is higher learning rates, but other parameters can affect this unexpectedly as well.

1

phat-gandalf t1_ir4hud3 wrote

Subset your data, parallelize, and split tuning into multiple rounds: start with a lower-density search to narrow down a reasonable range of values first.

1

suflaj t1_ir4ow8t wrote

Switch to SGD after 1 epoch or so

But if they do worse than the baseline, something else is likely the problem. Adam(W) does not kill performance; it just, for some reason, isn't as effective at reaching the best final performance as simpler optimizers.
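
In PyTorch the switch is just a matter of constructing a new optimizer over the same parameters; a toy sketch:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)   # stand-in for your network
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(100)]
loss_fn = nn.CrossEntropyLoss()

def run_epoch(optimizer):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Warm up with Adam(W) for the first epoch...
run_epoch(optim.AdamW(model.parameters(), lr=1e-3))

# ...then hand over to SGD (with momentum) for the remaining epochs.
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
for _ in range(9):
    run_epoch(sgd)
```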

0

SatoshiNotMe t1_ir4wbon wrote

Technically, what you're talking about is early stopping of “trials” in HP tuning. PBT is different - that involves changing the hyperparameters during training. And yes, you can use PBT in tuning.

1

encord_team t1_ir511vx wrote

Use Bayesian optimisation! Fit a Gaussian process to your model's performance as a function of the hyperparameters. Run your network on a fraction of your dataset a few times until the GP has a few samples to work with, then search the hyperparameters by evaluating the GP at different points.
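
A sketch of that GP-based loop using scikit-optimize's gp_minimize (assuming it's installed), where the objective below is a stand-in for "train on a data fraction, return validation loss":

```python
import math

from skopt import gp_minimize
from skopt.space import Real

# Stand-in for "train on a fraction of the data, return validation loss".
def objective(params):
    lr, weight_decay = params
    return (math.log10(lr) + 3) ** 2 + 0.1 * (math.log10(weight_decay) + 4) ** 2

space = [
    Real(1e-6, 1e0, prior="log-uniform", name="lr"),
    Real(1e-8, 1e-2, prior="log-uniform", name="weight_decay"),
]

# A Gaussian process is fit to the evaluations so far and proposes the next point.
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print(result.x, result.fun)
```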

1

aWildTinoAppears t1_irdceof wrote

>early stopping performance is not indicative of final performance [...] early stopping is analogous to fixating a hyperparameter value to a constant

These statements aren't true. The whole point of checkpointing on your target validation metric and using a large enough early-stopping patience is that it is a very reasonable proxy for final or peak performance. Google's Vizier and other black-box hparam search methods are built with this as a core underlying assumption.

1

suflaj t1_irdvhm4 wrote

You should probably read up on double descent and the lottery ticket hypothesis. Google engineers have been wrong plenty of times in their "hypotheses". Furthermore, you're referring to a system from 2017, so 2.5 eternities ago, when these phenomena were not even known.

Also, what does "reasonable" mean? I would argue that it depends heavily on the other hyperparameters, the architecture of the model, and the data, and as such isn't generally applicable. It's about as reasonable as assuming 3e-4 is a good learning rate: there are plenty of counterexamples where the network doesn't converge with it, so it can't be considered generally reasonable.

0

aWildTinoAppears t1_irg015e wrote

I'm a researcher at Google AI and use Vizier daily, as does everyone in Brain. I've also published work extending the lottery ticket hypothesis (which is from March 2018, so only 2 eternities ago?). Optimal convergence of iteratively pruned networks happens at or before the original max number of steps, so with best checkpointing and early stopping, the max training step will rarely be hit by lottery tickets. This doesn't support your claims at all.

"Reasonable" here means "good enough relative to a network that trains for the max number of steps without hitting the early stopping criterion". You realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc., right?

0

suflaj t1_irg1qkg wrote

> reasonable here means "good enough relative to a network that trains for the max number of steps without hitting early stopping criterion".

First off, how would you know there is a max number of steps (before even knowing what it is)? There is previous work on certain architectures which can give you an idea of what a good number of steps is, but:

  • there is no theoretical guarantee that it is the optimum
  • there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

So the statement is ill-defined and, at best, an opinion.

> you realize the entire point of early stopping + best checkpointing is to help prune the hparam search space so that you can focus on more impactful parameters like batch size, learning rate, etc, right?

Yes. And it is at the same time an incomplete search, and it takes far more than a guess to determine how it should be done and to what extent. There is a fair number of counterexamples where we didn't know something was suboptimal until it was proven otherwise, most famously with double descent. This is something you can't just ignore, and it's a clear and very common case where any kind of early stopping and checkpointing will fail to find even a reasonably good configuration.

I feel like, as a researcher, you shouldn't double down on things you should know are neither conclusive nor built on strong fundamentals. "Good enough" does not cut it in the general case, especially given that modern models do not even go over the whole dataset in an epoch, and as such might be broken at any one of your checkpoints.

And given that you say you work at Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making educated guesses about the hyperparameters, rather than by thoroughly exploring them.

−1

aWildTinoAppears t1_irg5akq wrote

> First of, how would you know there is a max number of steps (before even knowing what they are)?

Set an arbitrarily high max number of steps and early-stopping patience, run a hyperparameter sweep, and look at validation performance on TensorBoard to make sure everything converges/overfits prior to hitting max steps or the early stopping criterion. The point here is that the max number of training steps does not need to be a tuned hyperparameter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early-stopping performance because of the checkpointing strategy.
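
For reference, the protocol being described is roughly the following (a minimal sketch with toy stand-ins for the training step and validation metric):

```python
import random

# Toy stand-ins: replace with your real training step and validation metric.
def train_one_step():
    pass

def evaluate(step):
    # fake validation accuracy: improves, then plateaus with noise
    return min(0.9, step / 5000) + random.gauss(0.0, 0.01)

best_metric = float("-inf")
evals_since_improvement = 0
patience = 10        # number of evaluations without improvement before stopping
max_steps = 10**6    # deliberately far beyond what you expect to need
eval_every = 200

for step in range(1, max_steps + 1):
    train_one_step()
    if step % eval_every == 0:
        metric = evaluate(step)
        if metric > best_metric:
            best_metric = metric
            evals_since_improvement = 0
            # save_checkpoint() would go here; the best checkpoint is what gets reported
        else:
            evals_since_improvement += 1
            if evals_since_improvement >= patience:
                break

print(f"stopped at step {step}, best validation metric {best_metric:.3f}")
```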

> there is no theoretical guarantee that the hyperparameters explored and finally used are the optimum

Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

> And given that you say you work for Google, maybe you shouldn't pretend that most SOTA models nowadays aren't developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters.

I feel like you are missing the core point, which is that early stopping + checkpointing is literally how we capture final performance. We are running O(100) workers in parallel for a total of O(1000) model configurations selected by Vizier (i.e. Bayesian optimization of hparam choices based on past candidate performance). I never said that models are "developed by simply backtracking and making an educated guess on the hyperparameters, rather than thoroughly exploring hyperparameters." I said that this statement is incorrect: "early stopping performance is not indicative of final performance".

2

suflaj t1_irg8l9l wrote

> The point here is that max training steps does not need to be a tuned hyperparamter under this experimental setup--you allow models to train until convergence and stop them once they are clearly overfitting. In this scenario, final performance is always strictly worse than early stopping performance because of the checkpointing strategy.

My point is that you CANNOT guarantee your model is done learning. Again, I will say it, and please don't ignore it if you wish to discuss this further: double descent (or overparametrization side effects in general). Also, there are training setups where you cannot even process the whole dataset, and you are basically gambling that the dev set you chose is as representative as whatever the model will see in training. You can both overshoot and undershoot. This is not only about the number of steps, but also about the batch size and learning rate schedules.

> Yes, I have bad news for you if you are under the impression that all published work is fully exploring the entire hparam search space... Are you sampling every learning rate between 1e-7 and 10? That's intractable. Hence, "good enough" or "best seen so far after an exhaustive search".

I was not saying that. What I was saying is that even the other hyperparameters might be wildly wrong. I take it you have worked with Adam-based optimizers. They generally do not care about hyperparameters in the training period where they are most effective, but other incorrect hyperparameters can have more severe consequences that you will simply not explore if you stop early. In the modern era, if you have a budget for hyperparameter optimization, you check a number of steps well beyond what you intend to train for, so early stopping has no place outside of very old models, 3+ eternities old. Those are nowadays a special case, given the sheer size of modern models.

> I said that this statement is incorrect "early stopping performance is not indicative of final performance".

And in doing so you have ignored a very prevalent counterexample: double descent. It is not rare (anymore), it is not made up, and it is well documented, just poorly understood.

0

aWildTinoAppears t1_irj3m1p wrote

> My point is that you CANNOT guarantee your model is done learning

Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

I've ignored it because the papers you are citing aren't claiming exactly what you hope they are:

> Further, we show at least one setting where model-wise double descent can still occur even with optimal early stopping (ResNets on CIFAR-100 with no label noise, see Figure 19). *We have not observed settings where more data hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude this from occurring. We leave fully understanding the optimal early stopping behavior of double descent as an important open question for future work.*

They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

> 3+ eternities

Moving goalposts again; also, double descent is from the end of 2019.

> Again, I will say it, please don't ignore it if you wish to discuss this further

I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

1

suflaj t1_irj50vf wrote

> Only theoretical papers are publishing guarantees. DeepMind and OpenAI aren't claiming their networks are "done" training or are perfectly optimal, just that they have passed a performance threshold in which the scientific contribution is worth sharing and they have done an extensive hparam search to reach that point.

Great. Now notice we are speaking of theory. In practice in DL trial and error is usually better than formally analyzing or optimizing something.

> They literally say they sometimes see it, more data isn't bad, and they aren't making any claims around it because it deserves more work.

Great. One thing to notice: you are making the claim that early stopping is good enough. I am making the claim that, because of double descent and our incomplete understanding of it, you cannot make such claims. Those are just guesses, and not even well-informed ones.

To make such claims, the prerequisite would be to first show, beyond reasonable doubt, that your model does not exhibit overparametrization side effects. That would mean that, instead of stopping early, you run it for much longer than you intend to. THEN you can do these checkpointing optimizations, if it turns out you don't have to worry about it.

But usually it is enough to just get it working well enough instead of formally optimizing the hyperparameters, because whatever optimization you do, it cannot account for unseen data. My point is not that this is better; it's that whatever you do, you are guessing, and you might as well take cheaper guesses if you're not interested in it being very robust.

> Moving goal posts again, also dd is from eoy 2019.

What do you mean, moving goalposts again? 3 eternities refers to 6 years ago, i.e. 2016. That is the last time models were small enough for double descent to be basically undetectable, since Attention Is All You Need was released in June 2017 and had been worked on for quite some time before that. Double descent was formally described in 2019, yes, but the phenomena it describes happened way before that, and in my experience transformers were the first to exhibit it in pretraining. Maybe it was even more than 3+ eternities ago that we had models that experienced double descent; I have not been doing DL seriously for that long.

> I won't be responding here again but encourage you and RealNetworks to publish some peer reviewed research highlighting the claims you're making in this thread.

You might have gotten the wrong person for the job, as we mostly do engineering, but love that you put in the effort to try and stalk me :)

Given that this has become personal, rather than sticking to the topic, I will not respond anymore either.

1