camp4climber t1_jccy0qa wrote

Yea that's a fair point. These kinds of examples certainly exist, and they often come from large research labs at the very edge of the state of the art, where the interesting narrative point is scale. The context of the specific benchmark or application certainly matters.

I still think my point stands in the general case, at least for most of us independent researchers. Ultimately research is about revealing novel insights. "Train for longer" is not that interesting. But a 13B-parameter LLM that fits onto a single GPU and outperforms a 175B-parameter model certainly is.

2

camp4climber t1_jcbomzj wrote

Generally it would be unfair to claim that you beat benchmark results if you train for 8x more epochs than other methods. Benchmarks exist to ensure that methods are compared on a somewhat level playing field. There's certainly some wiggle room depending on the task, but in this case I don't believe that a lower learning rate and more epochs is novel or interesting enough to warrant a full paper.

That's not to say the work is worthless though! There may be a paper in there somewhere if you can further explore a theoretical narrative for why this happens. Perhaps you can make comparisons against large models where the total number of FLOPs is fixed. In that case, a claim that a smaller model trained for more epochs is more efficient than a larger model trained for fewer epochs would be interesting.
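To make the fixed-FLOP framing concrete, here's a rough back-of-the-envelope sketch. The `6 * params * tokens` approximation for training FLOPs and the per-epoch token count are assumptions for illustration, not numbers from this thread; the 13B/175B sizes echo the models mentioned above.

```python
# Rough sketch of a fixed-FLOP comparison between a small model trained
# for many epochs and a large model trained for few. The 6 * N * D rule
# of thumb for training FLOPs and the token counts are assumptions.

def train_flops(params: float, tokens_per_epoch: float, epochs: int) -> float:
    """Approximate total training FLOPs as 6 * parameters * tokens seen."""
    return 6 * params * tokens_per_epoch * epochs

small = train_flops(13e9, 1e9, epochs=8)   # 13B model, 8 epochs
large = train_flops(175e9, 1e9, epochs=1)  # 175B model, 1 epoch

# Even trained 8x longer, the small model uses less total compute here,
# so "more epochs" is not automatically an unfair advantage at fixed FLOPs.
print(small < large)  # True
```

The point of the sketch is just that epoch counts alone don't measure compute; a fixed-FLOP budget puts both models on the same axis.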

For what it's worth, the optimizer settings in the VGAE paper do not appear to be tuned. I imagine you could improve on their results in far fewer than 1500 epochs by implementing some simple stuff like learning rate decay.
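For the curious, a minimal framework-agnostic sketch of what exponential learning rate decay looks like (the function name, base rate, and decay schedule are all illustrative, not from the VGAE paper):

```python
# Minimal sketch of exponential learning rate decay. All values here are
# illustrative assumptions, not settings from the VGAE paper.

def decayed_lr(base_lr: float, decay_rate: float, epoch: int,
               decay_every: int = 100) -> float:
    """Learning rate at a given epoch: base_lr * decay_rate^(epoch / decay_every)."""
    return base_lr * decay_rate ** (epoch / decay_every)

# Example: start at 0.01 and halve the learning rate every 100 epochs.
for epoch in (0, 100, 200):
    print(f"epoch {epoch:4d}: lr = {decayed_lr(0.01, 0.5, epoch):.6f}")
```

Most frameworks ship this as a built-in scheduler (e.g. exponential or step decay), so in practice it's a couple of lines rather than hand-rolled code.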

10