jacobgorm t1_izkpyw6 wrote on December 9, 2022 at 8:35 PM

1x1 conv allows you to connect a set of input activations to a set of outputs. In Mobilenet v1/v2 this is necessary because the 3x3 convs are done separately for each channel, with no cross-channel information flow, unlike in a normal full 3x3 conv where information is able to flow freely across all channels.

In this way, you can view the separable 3x3 as a simple spatial gathering step whose main purpose is to grow the receptive field, and the 1x1 as the place that most of the work happens. It has been shown that you can leave out the 3x3 convolution ENTIRELY and do everything in the 1x1, as long as you are gathering the data in a way that grows the receptive field, e.g., see https://openaccess.thecvf.com/content_cvpr_2018/papers/Wu_Shift_A_Zero_CVPR_2018_paper.pdf .

However, the Mobilenet approach just makes more sense in practice, because if you are going to be reading the data you may as well compute on them and bias/bn+activate the result while you have them loaded into CPU or GPU registers.