Comments


tysam_and_co t1_izviq7y wrote

I believe it depends. I've played around with this on one of my projects, though not exhaustively, and the averaging seemed to help a lot. Here's the line where the global average pooling happens: https://github.com/tysam-code/hlb-CIFAR10/blob/d683ee95ff0a8dde5e4b9c9d4425f49de7fe9805/main.py#L332

This is not originally my network, but the step down there is small, from 4x4 to 1x1 -- only a 16x reduction in spatial information.
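As a minimal sketch of that kind of step down (a hypothetical PyTorch example, not the linked code), global average pooling collapses each channel's 4x4 grid into a single value:

```python
import torch
import torch.nn as nn

# Hypothetical feature map: batch of 8, 64 channels, 4x4 spatial grid.
x = torch.randn(8, 64, 4, 4)

# Global average pooling collapses the 4x4 grid to 1x1:
# a 16x reduction in spatial information per channel.
gap = nn.AdaptiveAvgPool2d(1)
pooled = gap(x)            # shape: (8, 64, 1, 1)
flat = pooled.flatten(1)   # shape: (8, 64), ready for a linear classifier

print(pooled.shape, flat.shape)
```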

That being said, at higher dimensions, neural networks -- convnets in this case -- can operate like loose classifiers over a "bag of textured features", an idea explored in some smaller research threads around the 2016-2019 range or so. In that regime you're effectively just "feature voting" anyway, so you don't gain or lose too much with the global average pooling.
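The "feature voting" intuition can be made concrete: because a linear head commutes with averaging, pooling-then-classifying is exactly the same as classifying each spatial location and averaging the votes. A hypothetical sketch (shapes and class count are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feats = torch.randn(2, 32, 4, 4)    # hypothetical convnet feature map
head = nn.Linear(32, 10)            # hypothetical 10-class linear head

# Path A: global average pool over space, then classify.
logits_gap = head(feats.mean(dim=(2, 3)))           # (2, 10)

# Path B: classify every spatial location, then average the "votes".
per_loc = head(feats.permute(0, 2, 3, 1))           # (2, 4, 4, 10)
logits_vote = per_loc.mean(dim=(1, 2))              # (2, 10)

# Because the head is linear, the two paths agree: GAP is feature voting.
print(torch.allclose(logits_gap, logits_vote, atol=1e-5))
```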

I'm sure transformers work in a different kind of way, but that's a vastly different kind of inductive bias they're using.

2

DeepGamingAI t1_izwiy5k wrote

Don't vision transformers do this? Instead of gradually compressing the input like a typical convnet, they maintain the high dimensionality throughout all the blocks of the deep network, and then simply use a global pooling at the end to compress the "channel" dimension into a single representation. I have no idea why that works, but we have seen that it does, and the model still learns despite the gradients flowing through this average pooling layer at the end. Would be great if someone could help clarify this for me.

1

tdgros t1_izwkav3 wrote

ViTs keep the same dimension because of the residual connections in the transformer blocks.

At the very end, you want to sum up the information if you want to do classification, but because all tokens are equivalent, you can just average them before further decoding, i.e. if you concatenated all the tokens before a linear layer, it would end up looking like a global pooling.
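That concatenation equivalence can be checked numerically: averaging the tokens and decoding with a linear head is the same as concatenating all the tokens and decoding with a linear layer whose weight tiles the head's weight scaled by 1/N. A hypothetical sketch with made-up ViT-ish shapes (196 tokens, 768 dims, 1000 classes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tokens = torch.randn(2, 196, 768)   # hypothetical encoder output
head = nn.Linear(768, 1000)

# Mean-pool the tokens, then decode to class logits.
logits = head(tokens.mean(dim=1))                      # (2, 1000)

# Equivalent view: concatenate all tokens and use one big linear layer
# whose weight repeats head.weight / 196 for every token position.
W_concat = head.weight.repeat(1, 196) / 196            # (1000, 196*768)
logits_concat = tokens.flatten(1) @ W_concat.t() + head.bias

print(torch.allclose(logits, logits_concat, atol=1e-4))
```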

2

DeepGamingAI t1_izwolrw wrote

Thanks, that clarifies some things. I have also seen a parameter in the ViT head that simply returns the first token representation instead of averaging across all tokens. I never understood why that made sense, and why only the first token and not some other random token.

This also reminds me of another confusion I have about transformers: would they lose meaning if we gradually compressed the embedding size after every MLP in the transformer block?

1

tdgros t1_izwppx0 wrote

You can take all the existing tokens, average them, and decode the result into logits. But if you can do that, you can just as well decode from one single token.

Or you can append a special learned token at some point that gets its own decoder; I believe that's what you're describing. You can find this approach in BERT, where a CLS token is inserted before every sentence. One final, similar approach is Perceiver IO's, where the decoder is a transformer whose query is a learned array.
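A hypothetical sketch of the CLS-token approach (names and sizes are made up; this is the general pattern, not any particular model's code). The "first" token is special only because it's the learned one we prepended and attached the head to, not because position 0 matters intrinsically:

```python
import torch
import torch.nn as nn

class CLSPooling(nn.Module):
    """Sketch: prepend a learned CLS token, run the encoder,
    and decode only that token (the BERT/ViT-style approach)."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):               # tokens: (B, N, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1)  # (B, N+1, dim)
        x = self.encoder(x)
        return self.head(x[:, 0])            # decode the CLS token only

model = CLSPooling()
out = model(torch.randn(2, 196, 768))
print(out.shape)
```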

You can change the dimensionality with 1x1 convolutions in between transformer blocks; you wouldn't lose meaning so much as expressivity or capacity. I'm not sure it's recommended, but it's not immoral or illegal.
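A minimal sketch of that idea, assuming a per-token linear projection (which is exactly what a 1x1 convolution is, applied token-wise) between two hypothetical transformer stages:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: shrink the token dimension between transformer
# stages with a linear projection, the token-wise equivalent of a 1x1 conv.
def stage(dim):
    layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=1)

stage1 = stage(768)
shrink = nn.Linear(768, 384)   # 1x1-style projection: per-token, no mixing
stage2 = stage(384)

x = torch.randn(2, 196, 768)
x = stage1(x)
x = shrink(x)                  # (2, 196, 384): capacity drops, tokens remain
x = stage2(x)
print(x.shape)
```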

1

DeepGamingAI t1_izwql0s wrote

>I'm not sure that's recommended, it's not immoral or illegal.

Humans may not consider that design choice immoral but I don't want to offend our soon-to-be AI overlords. Maybe I'll ask chatGPT if it will judge me for doing that.

1

tdgros t1_izwrdlr wrote

>I am designing a vision transformer for image classification. What do you feel about inserting 1x1 convolutions in between transformer blocks to reduce the dimensionality? Would you feel offended if I gradually did that throughout my backbone?

>As a large language model trained by OpenAI, I don't have feelings and I am not capable of being offended. I am also not able to browse the internet, so I am unable to provide specific information about using 1x1 convolutions in a vision transformer for image classification. However, in general, using 1x1 convolutions can be a useful technique for reducing the dimensionality of an input in a convolutional neural network, which can help to improve the computational efficiency of the network and reduce the number of parameters it has. Whether or not this technique is appropriate for your specific use case will depend on the details of your model and the data you are working with. It's always a good idea to experiment with different architectures and techniques to see what works best for your particular application.

1

DeepGamingAI t1_izwthb9 wrote

It's just like a girlfriend. "No I will not be offended if you did this" but then goes ahead and takes it personally when you do it.

1