Submitted by _Arsenie_Boca_ t3_118cypl in MachineLearning

Many neural architectures use bottleneck layers somewhere in the network. By bottleneck I mean projecting activations to a lower dimension and then back up, as is done e.g. in ResNet blocks.
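
Roughly something like this, just to illustrate what I mean (a hypothetical PyTorch-style sketch loosely following the ResNet bottleneck block; the sizes are made up):

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Project channels down, process, project back up (ResNet-style)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction                                 # the "bottleneck" width
        self.down = nn.Conv2d(channels, mid, kernel_size=1)         # project down
        self.conv = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.up = nn.Conv2d(mid, channels, kernel_size=1)           # project back up
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.down(x))
        out = self.relu(self.conv(out))
        out = self.up(out)
        return self.relu(out + x)                                   # residual connection

x = torch.randn(1, 256, 32, 32)
print(BottleneckBlock(256)(x).shape)                                # torch.Size([1, 256, 32, 32])
```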

What is your intuition on why this is beneficial? From an information theory standpoint, it creates potential information loss due to the lower dimensionality. Can we see this as a form of regularisation that makes the model learn more meaningful representations?

I'm interested in your intuitions on the matter, or empirical results that might support those intuitions. Are you aware of other works that use bottlenecks, and what is their underlying reasoning?

42

Comments


Professional_Poet489 t1_j9gh652 wrote

The theory is that bottlenecks are a compression / regularization mechanism. If the bottleneck has far fewer parameters than the network overall, and you still get high-quality results at the output, then the bottleneck layer must be capturing the information required to drive the output to the correct results. The fact that these intermediate layers are often used for embeddings indicates that this is a real phenomenon.
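
For instance (just a sketch with made-up dimensions and names, assuming a trained encoder with a 16-dim bottleneck), you can reuse the bottleneck code directly as an embedding for a downstream task:

```python
import torch
import torch.nn as nn

# Hypothetical encoder with a 16-dim bottleneck (imagine it has already been
# trained as part of an autoencoder or classifier).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
probe = nn.Linear(16, 10)              # small classifier trained on top of the frozen code

x = torch.randn(32, 1, 28, 28)         # a batch of fake 28x28 images
with torch.no_grad():
    z = encoder(x)                     # (32, 16): the bottleneck code, i.e. the "embedding"
logits = probe(z)                      # if this works well, the bottleneck kept what matters
```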

32

_Arsenie_Boca_ OP t1_j9gix7q wrote

If I understand you correctly, that would mean that bottlenecks are only interesting when

a) you further use the lower-dimensional features as output, like in autoencoders, or b) you are interested in knowing whether your features have a lower intrinsic dimension.

Neither is met in many cases, such as plain ResNets. Could you elaborate on how you believe bottlenecks act as regularizers?

2

Professional_Poet489 t1_j9gk545 wrote

Re: regularization - by using fewer numbers to represent the same output info, you are implicitly reducing the dimensionality of your function approximator.

Re: (a), (b) Generally in big nets, you want to regularize because you will otherwise overfit. It's not about the output dimension; it's that you have a giant approximator (i.e. a billion params) fitting data with a much smaller intrinsic dimensionality, and you have to do something about that. The output can be “cat or not” and you’ll still have the same problem.
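
To put rough numbers on it (an illustrative PyTorch sketch, not taken from any particular paper): a 1x1-down / 3x3 / 1x1-up bottleneck block has far fewer parameters than a plain pair of full-width 3x3 convs.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

plain = nn.Sequential(                      # two full-width 3x3 convs
    nn.Conv2d(256, 256, 3, padding=1),
    nn.Conv2d(256, 256, 3, padding=1),
)
bottleneck = nn.Sequential(                 # squeeze to 64 channels in the middle
    nn.Conv2d(256, 64, 1),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.Conv2d(64, 256, 1),
)
print(n_params(plain), n_params(bottleneck))  # roughly 1.18M vs 0.07M parameters
```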

9

currentscurrents t1_j9gvv4k wrote

a) Lower-dimensional features are useful for most tasks, not just the output, and b) real data almost always has a lower intrinsic dimension.

For example, if you want to recognize faces, you'd have a much easier time recognizing patterns in things like gender, the shape of facial features, hair color, etc. than in raw pixel data. Most pixel values are irrelevant.

5

currentscurrents t1_j9gp4uq wrote

> From an information theory standpoint, it creates potential information loss due to the lower dimensionality.

Exactly! That's the point.

The bottleneck forces the network to throw away the parts of the data that don't contain much information. It learns to encode the data in an information-dense representation so that the decoder on the other side of the bottleneck can work with high-level ideas instead of pixel values.

If you manually tweak the values in the bottleneck, you'll notice it changes high-level ideas in the data, like the gender or shape of a face, not pixel values. This is how autoencoders work; a U-Net is basically an autoencoder with skip connections.
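
A minimal autoencoder sketch (toy dimensions, nothing specific to any real model) showing the squeeze-and-decode structure:

```python
import torch
import torch.nn as nn

# Toy autoencoder: 784 pixel values squeezed through a 16-number bottleneck.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(8, 1, 28, 28)
z = encoder(x)                                # (8, 16): everything the decoder gets to see
x_hat = decoder(z).view(8, 1, 28, 28)         # reconstruction driven entirely by those 16 numbers
loss = nn.functional.mse_loss(x_hat, x)       # training signal: reconstruct through the squeeze
```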

Interestingly, biological neural networks that handle feedforward perception seem to do the same thing. Take a look at the structure of an insect antenna; thousands of input neurons bottleneck down to only 150 neurons, before expanding again for processing in the rest of the brain.

26

txhwind t1_j9n63wz wrote

One of the keys to intelligence is learning to forget noncritical information. I think it might be a weak point of large language models.

1

MediumOrder5478 t1_j9ggg6y wrote

Usually it is to increase the receptive field of the network at a given location (more spatial context). Higher-resolution features are then recovered via skip connections if necessary.
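
Roughly, in code (a toy PyTorch sketch, not any specific architecture): downsampling lets a subsequent 3x3 conv cover a larger patch of the original image, and a skip connection carries the high-resolution features forward.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)

down = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)   # halves resolution
body = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # each 3x3 now covers ~7x7 of the input
up   = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)     # back to full resolution

h = body(torch.relu(down(x)))
y = up(h) + x                                                  # skip connection restores high-res detail
print(y.shape)                                                 # torch.Size([1, 32, 64, 64])
```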

21

_Arsenie_Boca_ OP t1_j9ghq1m wrote

That makes a lot of sense. Following that train of thought, bottlenecks are somewhat specific to CNNs, right? Or do you see similar reasoning in fully connected networks or transformers?

2

buyIdris666 t1_j9h8zqj wrote

I like to think of networks with residual connections as an ensemble of small models cascaded together. The residual connections were created to avoid vanishing/exploding gradients in deep networks.

It's important to realize that each residual connection exposes the layer to the entire input. I don't like the name "residual" because it implies a small amount of data is transferred. Nope, it's the entire input.
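
In code it's literally just this (sketch):

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f

    def forward(self, x):
        # The skip hands the *entire* input forward unchanged, not just a "residue".
        return x + self.f(x)
```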

Latent information the model has learned along the way passes through the bottleneck, which supposedly forces it to keep only the most important information. But the explanation above about receptive fields is also relevant.

Beyond the vanishing gradient problem that affects all deep networks, one of the biggest challenges with image models is getting them to understand the big picture. Pixels close to each other are very strongly related, so the network preferentially learns these close relations. The bottleneck can be seen as forcing the model to learn global things about the image, since resolution is usually halved at each layer. A form of image compression, if you want to think of it that way.

So the residual connections keep the model from forgetting what it started with, and the bottleneck forces it to learn more than just the relations between close pixels. Similar to the attention mechanism used in Transformers.

CNNs tend to work better than transformers for images because their receptive fields build in the assumption that nearby pixels affect each other more. This makes them easier to train on images. Whether Transformers would work equally well with more training is an open question.

For a model similar to U-Net and other "bottlenecked" CNN architectures, check out denoising autoencoders: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798

There is currently speculation that Diffusion models are simply a return of the ancient (7-year-old) denoising autoencoder with a fancy training schedule tacked on.

This explanation is entirely my intuition because nobody really knows how this stuff works lol

6

Optimal-Asshole t1_j9jo26z wrote

Residual refers to the fact that the NN/bottleneck learns the residual left over after accounting for the entire input. Anyone calling the skip connections “residual connections” should stop though lol

5

TemperatureStatus435 t1_j9gk2mn wrote

Regularization in some vague sense applies, but there are different kinds of it, so you must be more specific. For example, an autoencoder uses a bottleneck layer to learn information-dense representations of the domain space, and it may employ some mathematical regularization so that the raw numbers don’t explode to infinity.

However, a Variational Autoencoder employs the methods above plus an additional type of regularization, whose effect is to push the distribution of the bottleneck layer to be close to Gaussian. This is extremely useful, but for entirely different reasons.
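
Very roughly, the extra VAE term looks like this (a sketch of the standard KL penalty on the bottleneck, not tied to any particular codebase):

```python
import torch

def vae_loss(recon_loss, mu, logvar, beta=1.0):
    # KL divergence between N(mu, sigma^2) and N(0, I); pulls the bottleneck toward a Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return recon_loss + beta * kl
```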

Long story short, don’t just say “regularization” and think you understand what’s going on.

6

MonsieurBlunt t1_j9glzsp wrote

Accommodating as much space for information as you can is not really a good idea: it is prone to overfitting and also makes learning harder. You can think of the bottleneck as a form of regularisation; you are forcing the model to keep the useful information and discard the rest, or, put another way, you leave it less space in which to memorize the training data and overfit.

3

baffo32 t1_j9j4qbh wrote

Reducing the information a system can represent requires it to learn generalized patterns rather than memorize events. In machine learning this tends to improve transfer somewhat.

2

LudaChen t1_j9i0vmp wrote

To put it simply, a bottleneck layer is a process of first reducing dimensionality and then increasing it again. So why do we need to do this?

In theory, not reducing dimensionality preserves the most information and the most features, which is not a problem in itself. However, for a specific task not all features are equally important, and some may even have a negative impact on the results. We therefore need some means of selecting the features that deserve more attention, and reducing dimensionality achieves this to some extent. Increasing the dimensionality again, on the other hand, restores the representational capacity of the network: although the number of channels after the expansion is the same as before the reduction, those features have been reconstructed from the low-dimensional ones and can be considered more specific to the current task.

1

aMericanEthnic t1_j9gf0l3 wrote

Bottlenecks are typically a point that is outside of your control; purposely implementing one can only be explained as an attempt at ambiguity, in the sense that it tries to create the feel of a real-world issue. These "bottlenecks" are unnecessary and should be removed…

−14

_Arsenie_Boca_ OP t1_j9gg06n wrote

Thanks for your comment. Could you elaborate? Do you mean bottlenecks don't have any benefit? If so, why would people use them?

1