
nonotan t1_j1yovo3 wrote

It's probably not the most efficient method. In general, though, methods that converge faster tend to reach slightly worse minima (think momentum-based methods vs "plain" SGD). That "intuitively" makes some degree of sense: the additional time spent training isn't completely wasted, since some of it effectively helps explore the possibility space, optimizing the model in ways that simple gradient-following might miss entirely.
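For concreteness, here's a minimal sketch of the two update rules being contrasted. Everything here (the toy quadratic loss, the `lr` and `mu` values) is illustrative, not from any particular library, and a convex toy problem can only show the speed difference, not the "worse minima" effect, which needs a non-convex landscape:

```python
def sgd_step(w, grad, lr=0.1):
    """Plain SGD: step directly along the negative gradient."""
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, mu=0.5):
    """Momentum SGD: accumulate a velocity that carries past gradients forward.
    mu=0.5 is chosen so momentum clearly wins on this toy problem; 0.9 is
    a more common default in practice."""
    v = mu * v - lr * grad
    return w + v, v

# Toy loss f(w) = 0.5 * w^2, whose gradient is just w.
w_sgd, w_mom, v = 5.0, 5.0, 0.0
for _ in range(20):
    w_sgd = sgd_step(w_sgd, w_sgd)          # gradient at w_sgd is w_sgd
    w_mom, v = momentum_step(w_mom, v, w_mom)

# After 20 steps, plain SGD is still visibly away from the minimum at 0,
# while the momentum iterate has (oscillated) much closer to it.
print(w_sgd, w_mom)
```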

I would be shocked if there doesn't exist a method that does even better than SGD while also being significantly more efficient. But it's probably not going to be easy to find, and I expect most simple heuristics ("this seems to be helping, do it more" or "this doesn't seem to be helping, do it less") will lead to training time vs accuracy tradeoffs, rather than universal improvements.
