
IntelArtiGen t1_izipwih wrote

1x1 convolutions are practical when you need to change the shape of a tensor. If you have a tensor of shape (B, H, W, 128), you can use a 1x1 to get a tensor of shape (B, H, W, 64) without losing too much information.
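A minimal PyTorch sketch of that channel reduction (note PyTorch is channels-first, so the (B, H, W, 128) example becomes (B, 128, H, W) here; the batch and spatial sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical input: (B, C, H, W) = (8, 128, 32, 32)
x = torch.randn(8, 128, 32, 32)

# A 1x1 convolution mixes 128 input channels down to 64,
# leaving the spatial dimensions untouched.
reduce = nn.Conv2d(in_channels=128, out_channels=64, kernel_size=1)

y = reduce(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```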

You can use a 1x1 with stride 2 in place of a max pooling, depending on your constraints. It could perform better, or it could be more computationally intensive or require extra memory you don't have.
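For comparison, a sketch of both downsampling options side by side (channel counts and shapes are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)

# Max pooling: parameter-free, looks at every value in each 2x2 window.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Strided 1x1 convolution: learned, but samples one position per 2x2 window.
conv_down = nn.Conv2d(64, 64, kernel_size=1, stride=2)

print(pool(x).shape)       # torch.Size([8, 64, 16, 16])
print(conv_down(x).shape)  # torch.Size([8, 64, 16, 16])
```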

For MobileNetV2, I think you're talking about the inverted residual / linear bottleneck block? The point of this layer is to expand and then compress the information, and it's also a residual layer. Because a 1x1 lets you expand and compress a tensor efficiently, you can use it for those steps in this block, and to reshape the tensor so it can be added as a residue. It seems that "expand / process (dwise) / compress / residue" needs fewer parameters for the same result than just doing "process / process / process" as we usually do, or even "process / residue / ..." as in ResNet. However, it's not easier for the network to learn, so training might take longer while still being more parameter-efficient.
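A rough PyTorch sketch of such a block, assuming stride 1 and equal input/output channels so the residue can be added; the expansion factor of 6 and ReLU6 follow the MobileNetV2 paper, everything else is simplified:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual block:
    expand (1x1) -> process (3x3 depthwise) -> compress (linear 1x1) -> add residue."""

    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # Expand: cheap 1x1 conv lifts the channel count.
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Process: depthwise 3x3 conv (groups=hidden) filters each channel separately.
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Compress: linear 1x1 bottleneck, deliberately no activation after it.
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residue

x = torch.randn(8, 24, 32, 32)
print(InvertedResidual(24)(x).shape)  # torch.Size([8, 24, 32, 32])
```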

If you're working on new neural network architectures, you have to be able to manipulate tensors of different shapes; 1x1 convolutions essentially let you change a tensor's shape while keeping its information.


Ananth_A_007 OP t1_izq8op0 wrote

But if we use a 1x1 with stride 2, aren't we just skipping half the information without even looking at it? At least in max pooling, the filters see all the pixels before shrinking dimensions.


IntelArtiGen t1_izqc26r wrote

The information a layer receives is conditioned by how it gets there. At first, what flows into that layer is noise; the weights then change according to the loss, so that what reaches the layer reduces the loss and becomes something meaningful.

So the question would be: is it better for information processing in the neural network to compare the 2x2 values and take the max, or to train the network so it puts the correct information in one of the 2x2 positions and always keeps that one?

I think the answer depends on the dataset, the model and the training process.

And I think the point of that layer isn't necessarily to look at everything but just to shrink dimensions without losing too much information. Perhaps looking at everything is not required to keep enough information.
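A tiny check illustrating that point: with the single weight of a stride-2 1x1 conv fixed to 1, the output is literally one sampled value per 2x2 window, so training has to learn to route useful information into those positions (toy values, just for illustration):

```python
import torch
import torch.nn as nn

# Which input positions does a stride-2 1x1 conv actually read?
x = torch.arange(16.0).reshape(1, 1, 4, 4)
conv = nn.Conv2d(1, 1, kernel_size=1, stride=2, bias=False)
with torch.no_grad():
    conv.weight.fill_(1.0)  # identity weight: output = sampled input
print(x[0, 0])
print(conv(x)[0, 0])  # tensor([[ 0.,  2.], [ 8., 10.]]) -- one value per 2x2 window
```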
