Submitted by Ananth_A_007 t3_zgpmtn in MachineLearning

I am aware that a 1x1 convolution is needed for separable convolution, but when else is it useful? I see it used in MobileNetV2 before the depthwise separable convolution later in the bottleneck, but I'm not sure why. I also see it used with stride 2 when max pooling could be used instead. Could someone please explain the logic behind this? Thanks.

3

Comments


MathChief t1_izjarfb wrote

A 1x1 conv is essentially a linear transformation (over the channel dimension), as the other redditor suggests, the same as nn.Linear in PyTorch.

What I would add is that in PyTorch the 1x1 conv accepts tensors of shape (B, C, *) by default, for example (B, C, H, W) in 2D, which is convenient for implementation purposes. If you use nn.Linear, the channel dimension first has to be permuted to the last position, then the linear transformation applied, then the result permuted back. With the 1x1 conv, which is essentially a wrapper for the C function that does this einsum automatically, it is just a single line, so the code is cleaner and less error prone.
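
For example, here's a minimal sketch of that equivalence (the sizes and the weight-copying step are just for illustration, not from the comment above):

```python
import torch
import torch.nn as nn

B, C_in, C_out, H, W = 2, 16, 32, 8, 8
x = torch.randn(B, C_in, H, W)

conv1x1 = nn.Conv2d(C_in, C_out, kernel_size=1, bias=True)
linear = nn.Linear(C_in, C_out, bias=True)

# Share weights so the two modules compute the same channel-wise linear map.
with torch.no_grad():
    linear.weight.copy_(conv1x1.weight.view(C_out, C_in))
    linear.bias.copy_(conv1x1.bias)

out_conv = conv1x1(x)                                         # (B, C_out, H, W), one line
out_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # permute, apply, permute back

print(torch.allclose(out_conv, out_lin, atol=1e-5))  # True (up to float tolerance)
```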

7

quagg_ t1_izjvgkv wrote

To add to this point about it being the same as nn.Linear in PyTorch: it is useful in hypernetwork applications where you have a data-conditioned hypernetwork (conditioned on a context set, a partially observed sequence, etc.). Because of the data-conditioning, each sample has a different main-network MLP, which doesn't inherently allow for batching.

If you still want to parallelize over multiple different MLPs at once, you can use 1x1 convolutions in 1D together with the "groups" argument to run all of those separate networks at the same time, saving you from sequential processing at the cost of a larger CNN (NumNodes * BatchSize filters) in each convolution layer. There's a sketch of the idea below.
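
Rough sketch of that grouped-1x1 trick (all sizes and the fake per-sample weights are invented for illustration, and it only shows a single linear layer per sample):

```python
import torch
import torch.nn.functional as F

B, d_in, d_hidden, n_points = 4, 3, 16, 100

# Pretend a hypernetwork produced a separate weight matrix per sample.
per_sample_w = torch.randn(B, d_hidden, d_in)   # one (d_hidden x d_in) layer per sample
per_sample_b = torch.randn(B, d_hidden)

# Inputs for each sample's own MLP: (B, d_in, n_points)
x = torch.randn(B, d_in, n_points)

# Fold the batch into the channel dimension and use groups=B so each
# sample only sees its own weights.
x_folded = x.reshape(1, B * d_in, n_points)
w_folded = per_sample_w.reshape(B * d_hidden, d_in, 1)  # (out_ch, in_ch/groups, k=1)
b_folded = per_sample_b.reshape(B * d_hidden)

out = F.conv1d(x_folded, w_folded, b_folded, groups=B)  # (1, B*d_hidden, n_points)
out = out.reshape(B, d_hidden, n_points)

# Same result as looping over samples and applying each layer separately.
ref = torch.einsum('boi,bin->bon', per_sample_w, x) + per_sample_b[..., None]
print(torch.allclose(out, ref, atol=1e-5))  # True
```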

4

Mefaso t1_izmyrsw wrote

Oh, that sounds very useful. You don't happen to know of a code example of that, do you?

1

quagg_ t1_izn5ha7 wrote

No 3rd-party ones (that I know of), but I have an implementation of my own. Give me a day and I'll set up a repo to share it!

2

quagg_ t1_izr231j wrote

Here's the first (quick) version up! https://github.com/qu-gg/torch-hypernetwork-tutorials

It currently only contains simple examples on MNIST to highlight the implementation structure; however, I'll add time-series (e.g. neural ODE) examples and better README explanations over time as I can. Let me know if you have any questions, or feel free to just ask in the Issues.

Thanks!

1

jacobgorm t1_izkpyw6 wrote

1x1 conv allows you to connect a set of input activations to a set of outputs. In Mobilenet v1/v2 this is necessary because the 3x3 convs are done separately for each channel, with no cross-channel information flow, unlike in a normal full 3x3 conv where information is able to flow freely across all channels.

In this way, you can view the separable 3x3 as a simple spatial gathering step whose main purpose is to grow the receptive field, and the 1x1 as the place that most of the work happens. It has been shown that you can leave out the 3x3 convolution ENTIRELY and do everything in the 1x1, as long as you are gathering the data in a way that grows the receptive field, e.g., see https://openaccess.thecvf.com/content_cvpr_2018/papers/Wu_Shift_A_Zero_CVPR_2018_paper.pdf .

However, the MobileNet approach just makes more sense in practice, because if you are going to be reading the data anyway, you may as well compute on it and apply bias/BN + activation to the result while you have it loaded into CPU or GPU registers.
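
As a concrete illustration, here's a minimal sketch of that depthwise 3x3 + 1x1 pattern (channel counts and input size are made up; BN/activation omitted for brevity):

```python
import torch
import torch.nn as nn

C_in, C_out = 32, 64

# 3x3 applied per channel (groups=C_in): spatial gathering, no cross-channel flow.
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in, bias=False)
# 1x1 afterwards: this is where the channels actually get mixed.
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)

x = torch.randn(1, C_in, 56, 56)
y = pointwise(depthwise(x))   # (1, C_out, 56, 56)
```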

2

arg_max t1_izpadl8 wrote

I think the most prominent use case in CNNs is as a very simple, localised and fast operation that changes the number of channels without touching the spatial dimensions.

For example, deep ResNets have a bottleneck design. The input is something like an Nx256xHxW tensor (N batch size, H, W spatial dimensions) with 256 channels. To save compute/memory, we might not want to actually run the 3x3 conv on all 256 channels. Thus we use a 1x1 conv first to change the number of channels from 256 to 64. On this smaller tensor, we then apply a 3x3 conv that doesn't change the number of channels. Finally, we use another 1x1 conv to convert back from 64 to 256 channels. So here the first 1x1 conv decreases the number of channels while the second one restores the output to the original shape with 256 channels.
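
A rough sketch of that bottleneck pattern (the BN/ReLU placement follows the usual ResNet recipe but isn't taken verbatim from any particular implementation):

```python
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),             # 1x1: shrink 256 -> 64 channels
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),   # 3x3 on the cheap 64-channel tensor
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),             # 1x1: restore 64 -> 256 channels
    nn.BatchNorm2d(256),
)
# The residual connection then adds the block's input back to this output.
```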

2

IntelArtiGen t1_izipwih wrote

1x1 convolutions are practical when you need to change the shape of a tensor. If you have a tensor of shape (B, H, W, 128), you can use a 1x1 to get a tensor of shape (B, H, W, 64) without losing too much information.

You can use a 1x1 with stride 2 in place of max pooling, depending on your constraints. It could perform better, or it could be more computationally intensive or take extra memory you don't have (see the sketch below).
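
For instance, a quick sketch comparing the two downsampling options (sizes are arbitrary); the strided 1x1 has learnable weights, the max pool does not:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)

downsample_conv = nn.Conv2d(64, 64, kernel_size=1, stride=2)  # learned downsampling
downsample_pool = nn.MaxPool2d(kernel_size=2, stride=2)       # parameter-free downsampling

print(downsample_conv(x).shape)  # torch.Size([8, 64, 16, 16])
print(downsample_pool(x).shape)  # torch.Size([8, 64, 16, 16])
```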

For MobileNetV2, I think you're talking about the inverted residual / linear bottleneck? I think the point of this block is to expand and then compress the information, plus it's a residual layer. Because the 1x1 lets you efficiently expand and compress a tensor, you can use it for those steps in this block, and to reshape the tensor so it can be added as a residual. It seems that "expand / process (dwise) / compress / residual" requires fewer parameters for the same result than just doing "process / process / process" as we usually do, or even "process / residual ..." as in ResNet. However, it's not easier for the network to learn, so training might take longer while still being more parameter efficient.
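
Here's a rough sketch of that expand / process (dwise) / compress structure (the channel count and expansion factor are illustrative, not MobileNetV2's exact config):

```python
import torch.nn as nn

C, expansion = 32, 6
hidden = C * expansion

inverted_residual = nn.Sequential(
    nn.Conv2d(C, hidden, kernel_size=1, bias=False),   # 1x1 expand
    nn.BatchNorm2d(hidden),
    nn.ReLU6(inplace=True),
    nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),  # depthwise 3x3
    nn.BatchNorm2d(hidden),
    nn.ReLU6(inplace=True),
    nn.Conv2d(hidden, C, kernel_size=1, bias=False),   # 1x1 compress (linear: no activation after it)
    nn.BatchNorm2d(C),
)
# When input and output shapes match, the block's input is added back as the residual.
```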

If you're working on new neural network architectures, you have to be able to manipulate tensors of different shapes; a 1x1 essentially helps you change the shape of a tensor while keeping its information.

1

Ananth_A_007 OP t1_izq8op0 wrote

But if we use a 1x1 with stride 2, aren't we just skipping half the information without even looking at it? At least in max pooling, the filter sees all the pixels before shrinking the dimensions.

1

IntelArtiGen t1_izqc26r wrote

The information you have before a layer is conditioned by how it goes into that layer. At first, what goes into that layer is noise; the weights then change according to the loss, so that the information going into the layer reduces the loss and becomes something meaningful.

So the question would be: is it better for information processing in the neural network to compare 2x2 values and take the max, or is it better to train the network so that it can put the correct information in one of the 2x2 values and always keep that one?

I think the answer depends on the dataset, the model and the training process.

And I think the point of that layer isn't necessarily to look at everything but just to shrink dimensions without losing too much information. Perhaps looking at everything is not required to keep enough information.

2

ML4Bratwurst t1_izkn095 wrote

I used it in an architecture to create a pixel-wise linear combination of a segmentation in latent space.

1