
pia322 t1_iqtfrt3 wrote

I really like this question. I agree with you that an NN is an arbitrary function approximator, and it could easily implicitly learn the attention function.
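
For concreteness, here's a minimal NumPy sketch of the attention function we're talking about (scaled dot-product attention from the transformer paper); the function and variable names are just my own notation. This is the mapping a plain NN would have to rediscover implicitly:

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of queries, keys, and values.
    # Scaled similarity between every query and every key.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Row-wise softmax turns scores into mixing weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the values.
    return w @ V
```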

I personally embrace the empiricism. We try to make theoretical justifications, but in reality attention/transformers just happen to work better, and no one really knows why. One could argue that 95% of deep learning research follows this empirical methodology, with the "theory" added as an afterthought to make the papers sound nicer.

Why is ResNet better than VGG? Or ViT better than ResNet? They're all arbitrary function approximators, so in theory they should all be able to perform identically well. But empirically, that's not the case (see the toy sketch below).
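
To illustrate how small the structural difference can be: here's a toy dense version of the skip connection that separates a ResNet block from a VGG-style block (the real blocks are convolutional with batch norm; this is just a sketch, and the names are mine):

```python
import numpy as np

def plain_block(x, W1, W2):
    # VGG-style: just stack transforms.
    # x: (n, d), W1: (d, d_h), W2: (d_h, d).
    return np.maximum(np.maximum(x @ W1, 0) @ W2, 0)

def residual_block(x, W1, W2):
    # ResNet-style: same transforms, plus an identity shortcut,
    # so the layers only need to learn a correction to x.
    h = np.maximum(x @ W1, 0) @ W2
    return np.maximum(x + h, 0)
```

Both can represent the same functions, yet one trains far better at depth, and the honest answer for why is still mostly empirical.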
