Submitted by TheWittyScreenName t3_11s1zfh in MachineLearning

I've found that by dramatically lowering the LR and increasing the number of epochs, very simple baseline models can outperform SoTA models that use far more parameters. Is this considered "cheating" when comparing models? Is it interesting enough to warrant a short paper? I'm not sure what to do with this information.

For example, in the original VGAE paper, when training a GAE, they use an LR of 0.01 and train for 200 epochs to get 0.91 AUC / 0.92 AP on a link prediction experiment. Rerunning the same experiment with an LR of 5e-5 for 1500 epochs gets 0.97 AUC / 0.97 AP, which is better than the current leader on Papers with Code for this dataset.
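
For reference, a minimal sketch of the kind of rerun I mean, using PyTorch Geometric's GAE on Cora; the split fractions and encoder sizes are the common defaults from PyG's autoencoder example, not necessarily the paper's exact setup:

```python
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.nn import GAE, GCNConv

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

dataset = Planetoid(root="data", name="Cora")
transform = RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True,
                            split_labels=True, add_negative_train_samples=False)
train_data, val_data, test_data = transform(dataset[0])

model = GAE(Encoder(dataset.num_features, 32, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # vs. 0.01 in the paper

for epoch in range(1500):  # vs. 200 in the paper
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)
    loss = model.recon_loss(z, train_data.pos_edge_label_index)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    z = model.encode(test_data.x, test_data.edge_index)
    auc, ap = model.test(z, test_data.pos_edge_label_index,
                         test_data.neg_edge_label_index)
print(f"AUC: {auc:.3f}  AP: {ap:.3f}")
```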

It needs more epochs but has way, way fewer parameters than SoTA models. Is this a valid trade-off? Is it even a fair comparison?

22

Comments


killver t1_jcbpq7c wrote

You've actually found an issue with many research papers: they make unfair comparisons between methods based on un-tuned hyperparameters. If you run an EfficientNet and a ViT with the same learning rate, you will get vastly different results.

23

camp4climber t1_jcbomzj wrote

Generally it would be unfair to claim that you beat benchmark results if you train for 8x more epochs than other methods. Benchmarks exist to ensure that methods are on a somewhat level playing field. There's certainly some wiggle room depending on the task, but in this case I don't believe that a lower learning rate and more epochs is novel or interesting enough to warrant a full paper.

That's not to say the work isn't worth anything, though! There may be a paper in there somewhere if you can further explore some theoretical narrative for why that would be the case. Perhaps you can compare against the large models with the total number of FLOPs held fixed. In that case, the claim that a smaller model trained for more epochs is more efficient than a larger model trained for fewer epochs would be interesting.
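
As a rough back-of-envelope of what I mean by fixing compute (the 6 x params x examples rule of thumb comes from the LM scaling-law literature, and every number below is invented rather than taken from the VGAE setup):

```python
def approx_training_flops(n_params, n_examples_per_epoch, n_epochs):
    # ~6 FLOPs per parameter per example (forward + backward), a crude proxy
    return 6 * n_params * n_examples_per_epoch * n_epochs

small = approx_training_flops(n_params=50_000, n_examples_per_epoch=10_000, n_epochs=1500)
large = approx_training_flops(n_params=5_000_000, n_examples_per_epoch=10_000, n_epochs=200)
print(f"small model, 1500 epochs: {small:.2e} FLOPs")
print(f"large model,  200 epochs: {large:.2e} FLOPs")
# If the small model matches the big one at equal or lower total FLOPs,
# that's the efficiency claim worth writing up explicitly.
```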

For what it's worth, the optimizer settings in the VGAE paper do not appear to be tuned. I imagine you could improve on their results in far fewer than 1500 epochs by implementing some simple stuff like learning rate decay.
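
Something along these lines, for example (just a sketch with a stand-in linear model rather than the actual GAE; the 0.01 starting LR and 300-epoch budget are guesses):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the GAE encoder
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # placeholder for recon_loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # LR decays along a cosine from 0.01 towards 0
```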

10

farmingvillein t1_jccqy2i wrote

> Generally it would be unfair to claim that you beat benchmark results if you train for 8x more epochs than other methods. Benchmarks exist to ensure that methods are on a somewhat level playing field. There's certainly some wiggle room depending on the task, but in this case I don't believe that a lower learning rate and more epochs is novel or interesting enough to warrant a full paper.

Although the LLaMA paper is a bit of a rejoinder here, since training longer is (arguably) its core contribution.

3

camp4climber t1_jccy0qa wrote

Yeah, that's a fair point. These kinds of examples certainly exist, and they often come from large research labs at the very edge of the state of the art, where the interesting narrative point is scale. The context of specific benchmarks or applications certainly matters.

I still think my point stands in the general case, at least for most of us independent researchers. Ultimately, research is about revealing novel insights. "Train for longer" is not that interesting on its own. But an LLM that fits on a single GPU, has 13B parameters, and is capable of outperforming a 175B-parameter model certainly is.

2

Daos-Lies t1_jcctn72 wrote

Could I pick you up on your point about this not being interesting enough for a paper?

A comprehensive, properly conducted hyperparameter sweep across a selection of state-of-the-art models would provide useful information to the community at large: it would tell you which settings are ideal for training a particular model architecture (or checkpoint of that architecture) on a particular type of dataset.

The exact hyperparameters that are best for the particular dataset of cat pictures the paper used would differ from those for your own dataset of cat pictures, but the best hyperparameters for any set of cat pictures, on that particular model, are probably going to be quite similar.

And so it is useful to have that knowledge, presented in this hypothetical paper, to refer to when you start training a model on cat pictures.
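
To make that concrete, the sweep I have in mind looks roughly like this, sketched on a toy sklearn setup (the model, dataset, and value grids are all placeholders for whatever the paper would actually cover):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
param_grid = {
    "learning_rate_init": [1e-2, 1e-3, 1e-4],
    "max_iter": [200, 500, 1500],  # sklearn's rough analogue of an epoch budget
}
search = GridSearchCV(MLPClassifier(hidden_layer_sizes=(32,)), param_grid, cv=3)
search.fit(X, y)
print("best settings:", search.best_params_, "cv accuracy:", search.best_score_)
```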

---

I have a tendency to treat epochs and learning rate like an accelerator on a car, pumping them up when you want to go faster and bringing them down when you want more control and the ability to check where you're going so you don't miss your exit.

Whereas with a hyperparameter like C for an SVM, I'm much more likely to actually bother formally looping through and finding the 'right' C rather than just trying some values and going with it.

And the key point there is that SVMs tend to train much, much faster than NNs, so I don't bother taking the massive extra time it would take to find the 'right' epoch count and learning rate. (Also, epochs and LR are quite intuitive in what they actually mean, which does make them a bit easier to guess at.)
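
(For concreteness, the kind of C loop I mean, on toy data with an arbitrary grid:)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
for C in [0.01, 0.1, 1, 10, 100]:
    score = cross_val_score(SVC(C=C), X, y, cv=5).mean()
    print(f"C={C:<6} cv accuracy={score:.3f}")
```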

But if someone had already put the effort in to find the 'right' epoch and LR, even if I was aware that they'd only be approximately 'right' for my particular problem, I'd definitely use that as my starting point.

---

OK, I've written quite a lot here already, but I'm going to end by mentioning that the paper accompanying the GPT-4 release had a whole section on predicting the loss that would be achieved at a certain point in GPT-4's training procedure. When you're training at that scale, it's pretty costly to guess at your training procedure, so any metrics you have on how to get it right the first time are valuable purely in terms of compute cost.
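
The simplest possible version of that idea is to fit a trend to small pilot runs and extrapolate; the numbers here are completely made up, and GPT-4's actual fits are proper power laws rather than a straight line in log-compute:

```python
import numpy as np

# final losses from (fake) small pilot runs at increasing compute budgets
log_compute = np.log10([1e15, 1e16, 1e17, 1e18])
loss = np.array([3.9, 3.2, 2.7, 2.35])

# crude scaling fit: loss ~ m * log10(compute) + b, extrapolated to the big run
m, b = np.polyfit(log_compute, loss, 1)
print(f"predicted loss at 1e21 FLOPs: {m * 21 + b:.2f}")
```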

So yes, u/TheWittyScreenName, it is worth a paper, and my recommendation would be to focus it around conducting and presenting a solid, systematic analysis.

Edit: Well gosh, I've just reread your comment, u/camp4climber, and you're basically saying the same thing. But maybe my fleshing it out is useful for OP.

2

AccountGotLocked69 t1_jceti2q wrote

I mean... if this holds true for other benchmarks, it would be a huge shock for the entire community. If someone published a paper showing that AlexNet beats ViT on ImageNet if you simply train it for ten million epochs, that would be insane. It would mean all the research into architectures we've done in the last ten years could be replaced by a good hyperparameter search and longer training.

2

MrTacobeans t1_jcbj1kw wrote

This is coming from a total layman's point of view, albeit one that follows AI schmutz pretty closely, but anyway...

Wouldn't running a tighter learning rate and a longer epoch count reduce many of the benefits of an NN outside of a synthetic benchmark?

From what I know, an NN can be loosely trained, helpfully "hallucinate" over the gaps it doesn't know, and still be useful. When the network is constricted, it might be extremely accurate and smaller than the loose model, but the intrinsically useful/good hallucinations will be lost, and things outside the benchmark will come out worse than with the loose model.

I give props to AI engineers; this all seems like an incredibly delicate balance, which is probably why massive amounts of data are needed to avoid either side of this situation.

I feel like there's no need to enforce an epoch/learning-rate budget in benchmarks, because models usually converge to their best versions at different points regardless of the data used, and if someone is writing a paper they likely tweaked something that was worth training and writing about beyond just beating a benchmark.

−7

AccountGotLocked69 t1_jcesw8m wrote

I assume by "hallucinate the gaps" you mean interpolate? In general it's the opposite: smaller, simpler models are better at generalizing. Of course there are a million exceptions to this rule, but in the simple picture of using stable combinations of batch sizes and learning rates, big models will be more prone to overfitting the data. Most of this rests on the assumption that the "ground truth" is always a simpler function than memorizing the entire dataset.
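
A toy illustration of that intuition, with polynomial degree standing in for model size (synthetic data, arbitrary degrees):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 20)  # simple "ground truth" + noise
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 200)

for degree in (3, 15):  # "small" vs. "big" model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```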

2