Submitted by netw0rkf10w t3_zmpdo0 in MachineLearning

I am looking for the hyper-parameter settings that produce the highest accuracy for plain ViT (i.e., without modifying the model architecture) on ImageNet-1K when training from scratch. A lot of people in this sub have experience with ViT, so I hope I can get some help here.

For ViT-S, there is a recipe that achieves 80.0% top-1 accuracy in the paper "Better plain ViT baselines for ImageNet-1k". Unfortunately, the authors did not experiment with larger architectures (ViT-B or ViT-L).

For ViT-B, ViT-L and ViT-H, the authors of MAE claimed to achieve 82.3%, 82.6% and 83.1%, respectively (see their Table 3). However, I was unable to reproduce these results using their code and their reported hyper-parameters.

Any references to strong ViT baselines with reproducible results would be very much appreciated! Thanks.




TimDarcet t1_j0cpta3 wrote

I think DeiT III is pretty much SOTA.


TimDarcet t1_j0cpy9m wrote

There's also this one with very strong results, but it's a bit less straightforward to train


netw0rkf10w OP t1_j0gcgxy wrote

Thanks. DeiT is actually a very nice paper from which one can learn a lot. But the training schedules they used seem a bit long to me: 300 to 800 epochs. The authors of MAE managed to achieve 82.3% for ViT-B after only 100 epochs, so I'm wondering if anyone in the literature has ever been able to match that.
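To put those epoch counts in perspective, here is a rough sketch of how many optimizer steps each schedule implies on ImageNet-1K. The batch size of 1024 and the helper name `optimizer_steps` are assumptions for illustration only, not values taken from any of the papers mentioned here; 1,281,167 is the standard ImageNet-1K training-set size.

```python
import math

# Standard size of the ImageNet-1K (ILSVRC-2012) training split.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167


def optimizer_steps(epochs: int, batch_size: int = 1024, drop_last: bool = True) -> int:
    """Return the total number of optimizer steps for a given schedule.

    batch_size=1024 is an assumed global batch size, not a value from a
    specific paper. drop_last mirrors the usual data-loader option of
    discarding the final incomplete batch of each epoch.
    """
    if drop_last:
        steps_per_epoch = IMAGENET_1K_TRAIN_IMAGES // batch_size
    else:
        steps_per_epoch = math.ceil(IMAGENET_1K_TRAIN_IMAGES / batch_size)
    return epochs * steps_per_epoch


if __name__ == "__main__":
    # Compare the schedule lengths discussed in this thread.
    for epochs in (100, 300, 800):
        print(f"{epochs:>3} epochs -> {optimizer_steps(epochs):,} steps")
```

With everything else held fixed, an 800-epoch schedule costs roughly eight times the compute of a 100-epoch one, which is why the epoch budget matters when comparing recipes.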


TimDarcet t1_j1w6ifs wrote

I think the supervised training they report in MAE is 300 epochs; they used a different recipe from the one used for fine-tuning (appendix, page 12, Table 11).


netw0rkf10w OP t1_j2939o2 wrote

You are right indeed. Not sure why I missed that. I guess one can conclude that DeiT III is currently SOTA for training from scratch.