
Daos-Lies t1_jcctn72 wrote

Could I pick you up on your point about it not being interesting enough for a paper?


A comprehensive, properly conducted hyperparameter sweep across a selection of state-of-the-art models would provide useful information to the community at large: namely, which settings are ideal for training any particular model architecture (or checkpoint of that architecture) on any particular type of dataset.


The exact hyperparameters that are best for the particular dataset of cat pictures the paper used would differ a bit from the best ones for your own dataset of cat pictures, but for any set of cat pictures, on that particular model, the best hyperparameters are probably going to be quite similar.

And so it is useful to have that knowledge, presented in this hypothetical paper, to refer to when you start training a model on cat pictures.
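
To be concrete about the kind of sweep I mean, here's a minimal sketch: nothing fancier than looping over a grid of candidate values and scoring each configuration on a held-out set. (The model, dataset and candidate values below are toy stand-ins for illustration, not recommendations from any actual paper.)

```python
import itertools
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for "a model on cat pictures": a small MLP on the digits dataset.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Made-up candidate values, purely for illustration.
learning_rates = [1e-4, 1e-3, 1e-2]
epoch_counts = [20, 50, 100]  # max_iter plays the role of "epochs" here

results = {}
for lr, epochs in itertools.product(learning_rates, epoch_counts):
    model = MLPClassifier(learning_rate_init=lr, max_iter=epochs, random_state=0)
    model.fit(X_train, y_train)
    results[(lr, epochs)] = model.score(X_val, y_val)

best = max(results, key=results.get)
print(f"best (lr, epochs) = {best}, val accuracy = {results[best]:.3f}")
```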


---


I have a tendency to treat epochs and learning rate like the accelerator on a car: pumping them up when I want to go faster, and bringing them down when I want more control and the ability to check where I'm going so I don't miss my exit.


Whereas with a hyperparameter like C in an SVM, I'm much more likely to actually bother formally looping through candidate values and finding the 'right' C, rather than just trying a few and going with it.
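
(By 'formally looping through' I just mean something like sklearn's GridSearchCV; the C grid below is purely illustrative.)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative C grid -- the point is only that SVMs are cheap enough
# that cross-validating every candidate is no big deal.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"], "cv accuracy:", round(search.best_score_, 3))
```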


And the key point there is that SVMs tend to train much, much faster than NNs, so I don't bother taking the massive extra time it would take to find the 'right' epochs and learning rate. (Also, epochs and LR are quite intuitive in what they actually mean, which does make them a bit easier to guess at.)


But if someone had already put the effort in to find the 'right' epoch and LR, even if I was aware that they'd only be approximately 'right' for my particular problem, I'd definitely use that as my starting point.


---


Ok, I've written quite a lot here already, but I'm going to end by mentioning that the paper accompanying the GPT-4 release had a whole section on predicting the loss that would be achieved at a certain point in GPT-4's training procedure. When you get to training on that scale, it's pretty costly to guess at your training procedure, so any metrics you have at all on how to get it right the first time are valuable purely in terms of the cost of compute time.
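
(The rough shape of that idea, as I understand it, is to fit a simple scaling law to the losses of much cheaper runs and extrapolate to the big one. A hand-wavy sketch with made-up numbers, not their actual method or data:)

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (relative compute, final loss) pairs from hypothetical small runs.
compute = np.array([1.0, 10.0, 100.0, 1000.0])
loss = np.array([3.9, 3.3, 2.9, 2.6])

# Power law plus an irreducible-loss floor: L(C) = a * C^(-b) + c
def scaling_law(c, a, b, irreducible):
    return a * c ** (-b) + irreducible

params, _ = curve_fit(scaling_law, compute, loss, p0=[4.0, 0.1, 1.0])

# Extrapolate to a (hypothetical) full-scale run at 1,000,000x the smallest one.
print("predicted final loss:", round(scaling_law(1e6, *params), 2))
```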


So yes, u/TheWittyScreenName, it is worth a paper, and my recommendation would be to focus it around conducting a solid and systematic analysis to present.


Edit: Well gosh, I've just reread your comment, u/camp4climber, and you are basically saying the same thing. But maybe my fleshing it out is useful for OP.
