
tysam_and_co t1_izviq7y wrote

I believe that it depends. I've played around with this on one of my projects, though not exhaustively, and it seemed like the averaging helped a lot. Here's the line where the global average pooling happens: https://github.com/tysam-code/hlb-CIFAR10/blob/d683ee95ff0a8dde5e4b9c9d4425f49de7fe9805/main.py#L332

This isn't originally my network, but there the pooling is only a small step down from 4x4 to 1x1 -- a 16x reduction in spatial information.
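For reference, here's a minimal sketch of what that kind of pooling step looks like in PyTorch. The shapes are made up for illustration, not the repo's actual channel counts:

```python
import torch
import torch.nn as nn

# Hypothetical feature map: batch of 8, 512 channels, on a 4x4 spatial grid
x = torch.randn(8, 512, 4, 4)

# Global average pooling collapses the 4x4 grid to 1x1,
# i.e. the 16 spatial values per channel are averaged into one.
gap = nn.AdaptiveAvgPool2d(1)
pooled = gap(x)            # shape: (8, 512, 1, 1)
flat = pooled.flatten(1)   # shape: (8, 512), ready for a linear classifier head
```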

That being said, at higher dimensions neural networks -- convnets in this case -- can operate like loose classifiers over a "bag of textured features", as was the craze in some smaller research threads around the 2016-2019 range or so. In that regime you're essentially just "feature voting" over spatial locations anyway, so you really don't gain or lose too much with the global average pooling.
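One way to see the "feature voting" framing: with a purely linear classifier head, global average pooling before the head gives exactly the same logits as classifying each spatial position separately and averaging the votes, since the linear layer commutes with the mean. A small sketch with hypothetical shapes and class count, not taken from the repo:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 512, 4, 4)          # hypothetical feature map
head = nn.Linear(512, 10)              # hypothetical 10-class linear head

# Path 1: global average pool over the 4x4 grid, then classify.
logits_gap = head(x.mean(dim=(2, 3)))                          # (8, 10)

# Path 2: classify every spatial position separately, then average the "votes".
per_location = head(x.permute(0, 2, 3, 1).reshape(-1, 512))    # (8*16, 10)
logits_vote = per_location.reshape(8, 16, 10).mean(dim=1)      # (8, 10)

print(torch.allclose(logits_gap, logits_vote, atol=1e-5))      # True: same result
```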

I'm sure Transformers handle this in a different kind of way, but they're working with a vastly different inductive bias.
