recordertape t1_iqvs43g wrote

While I have no idea about the low-level impact of attention, I'd like to note that (at least for image recognition) many transformer architecture papers achieve SOTA results due to superior data augmentation and losses rather than architectural advances, even though they might write a whole paper about the architecture and only mention the training tricks in the supplementary material. Many transformer papers use quite complicated setups with RandAug/CutMix augmentations, distillation losses, EMA weights, etc. "ResNet strikes back" shows that ResNet accuracy is significantly boosted by a similar training pipeline, and ConvNeXt achieves results comparable to transformers.

While I'm not an expert in DNN architectures, I'd guess some hybrid with interleaved conv/transformer layers could be optimal: conv layers extracting local features, transformer layers capturing long-range relationships. Probably something like MobileViT.

But if I had to pick something for production/prototyping right now, it'd just be a ResNet: well supported and optimized in libraries/hardware, and forgiving to train.