AccountGotLocked69 t1_jceti2q wrote

I mean... If this holds true for other benchmarks, it would be a huge shock for the entire community. If someone published a paper showing that AlexNet beats ViT on ImageNet if you simply train it for ten million epochs, that would be insane. It would mean all the architecture research we did in the last ten years could be replaced by a good hyperparameter search and longer training.

2

AccountGotLocked69 t1_jcesw8m wrote

I assume by "hallucinate gaps" you mean interpolate? In general it's the opposite: smaller, simpler models are better at generalizing. Of course there are a million exceptions to this rule, but in the simple picture of using stable combinations of batch sizes and learning rates, big models will be more prone to overfitting the data. Most of this rests on the assumption that the "ground truth" is always a simpler function than memorizing the entire dataset.

2
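A minimal sketch of that last point (not from the original comment; the toy sine ground truth, the noise level, and the polynomial degrees are all assumptions): a high-capacity model fit to a small noisy dataset can drive its training error very low while its test error gets worse, whereas a simpler model tracks the underlying function better.

```python
# Toy illustration: model capacity vs. overfitting on a small noisy dataset.
# Polynomial degree stands in for model size; the "ground truth" is a simple sine.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)  # simple function + noise
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # held-out data from the same distribution

for degree in (3, 15):  # small vs. large "model"
    coeffs = np.polyfit(x_train, y_train, degree)  # may warn about conditioning at high degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-15 fit typically lands near-zero training error but a noticeably higher test error than the degree-3 fit, which is the overfitting behavior the comment describes.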