Comments


neuralbeans t1_ivyhjxo wrote

Usually it's whatever the experimenter likes using, together with a little tuning of the numbers.

12

lucidrage t1_iw0fvn4 wrote

It's mostly trial and error and cobbling together training methods used in whatever paper the devs most recently read.

8

arhetorical t1_iw16x4q wrote

It looks like a lot but there's nothing especially weird in there. If you spend some time tuning your model you'll probably end up with something like that too.

Adam - standard.

Linear warmup and decay - warmup and decay are very common. The exact shape might vary, but cosine decay is often used.

Decreasing the update frequency - probably something you'd come up with after inspecting the training curve and trying to get a little more performance out of it.

Clipping the gradients - pretty common solution for "why isn't my model training properly". Maybe a bit hacky but if it works, it works.

The numbers themselves are usually just a matter of hand tuning and/or hyperparameter search.
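
A minimal PyTorch-style sketch of the recipe above (Adam, linear warmup followed by linear decay, gradient clipping). The model, data, and every number here are made up for illustration; in practice they come out of the hand tuning / search just mentioned.

```python
import torch
from torch import nn

# Toy model and data, just to make the sketch runnable.
model = nn.Linear(32, 1)
data = [(torch.randn(16, 32), torch.randn(16, 1)) for _ in range(100)]

total_steps = len(data)
warmup_steps = 10          # made-up number; normally hand-tuned or searched
max_grad_norm = 1.0        # gradient clipping threshold, also hand-tuned

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

def lr_lambda(step):
    # Linear warmup, then linear decay to zero (cosine decay is another common choice).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
loss_fn = nn.MSELoss()

for x, y in data:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip gradients before the update -- the "why isn't my model training" fix.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
```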

5

BrotherAmazing t1_iw18e2b wrote

Usually, if they share their dataset and problem with you, and you spend just a few hours on it with extensive experience designing and training deep NNs from scratch, you can find something incredibly simple (just normal learning rate decay) plus an alternative to gradient clipping that works just as well. That shows the clipping was only “crucial” for their setup, not “crucial” in general.

Often you can analyze the dataset to see which mini-batches produced gradients exceeding various thresholds, understand which training examples led to the large gradients and why, and then pre-process the data to get rid of the need for clipping. And since the whole thing is nonlinear, that might completely invalidate their other hyperparameters once the training set is “cleaned up”.
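
A rough sketch of that kind of gradient audit, assuming a stand-in model, dataloader, and threshold (all hypothetical; swap in your own):

```python
import torch
from torch import nn

# Stand-ins for a real model and dataset; swap in your own.
model = nn.Linear(32, 1)
loader = [(torch.randn(16, 32), torch.randn(16, 1)) for _ in range(200)]
loss_fn = nn.MSELoss()
threshold = 5.0  # arbitrary "suspiciously large gradient" cutoff

suspect_batches = []
for i, (x, y) in enumerate(loader):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    # Total gradient norm for this mini-batch (the same quantity clipping acts on).
    total_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).item()
    if total_norm > threshold:
        suspect_batches.append((i, total_norm))

# Inspect the raw examples in these batches: outliers, mislabeled data,
# bad normalization, etc. Cleaning them up can remove the need for clipping.
print(suspect_batches)
```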

Not saying this is what is going on here with this research group, but you’d be amazed how often this is the case and some complex trial-and-error is being done just to avoid debugging and understanding why the simpler approach that should have worked didn’t.

3

chengstark t1_iw0xw0w wrote

Some trial and error and some common techniques. Warmup and LR scheduling are not hard to think of.

1

ConsiderationCivil74 t1_iw1vtqd wrote

Like the words of the villain in Agents of S.H.I.E.L.D.: discovery requires experimentation.

1

artsybashev t1_iw29zh1 wrote

A lot of deep learning has been the modern equivalent of witchcraft: just some ideas that might make sense, squashed together.

Hyperparameter tuning is one of the most obscure and hardest-to-learn parts of neural network training, since it is hard to do multiple runs for models that take more than a few weeks or thousands of dollars to train. Most researchers have just learned some good initial guesses and might run the model with a handful of hyperparameter sets, from which the best result is chosen.

Some of the hyperparameter tuning can also be done on a smaller model, and the amount of tuning can be reduced while growing the model to the target size.
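
A rough sketch of that idea: a cheap random search over a couple of hyperparameters on a scaled-down stand-in model, with the best config carried over as the starting point for the full-size run. Everything here (model, data, search ranges) is made up for illustration.

```python
import random
import torch
from torch import nn

def train_small(lr, clip, hidden=32, steps=200):
    """Train a scaled-down stand-in model and return its final loss."""
    model = nn.Sequential(nn.Linear(16, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loss = torch.tensor(0.0)
    for _ in range(steps):
        x = torch.randn(64, 16)
        y = x.sum(dim=1, keepdim=True)  # synthetic target
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        opt.step()
    return loss.item()

# Cheap random search on the small model; the best config becomes the
# "good initial guess" for the expensive full-size run.
trials = [(lr, clip, train_small(lr, clip))
          for lr, clip in [(10 ** random.uniform(-4, -2), random.uniform(0.5, 5.0))
                           for _ in range(10)]]
best = min(trials, key=lambda t: t[2])
print("best lr / clip:", best[0], best[1])
```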

1

vk6flab t1_ivxzjs5 wrote

It depends on who's paying.

If it works, it's the idea that the head of marketing came up with over lunch and he'll let everyone know about how insightful and brilliant he is.

If it doesn't work, it's the boat anchor devised by the idiot consultant, hired by the former head of marketing, who is now sadly no longer with the company due to family reasons.

In actuality, likely the intern did it.

Source: I work in IT.

−13