_Arsenie_Boca_ OP t1_j9ghq1m wrote

That makes a lot of sense. Following that train of thought, bottlenecks are somewhat specific to CNNs, right? Or do you see similar reasoning applying to fully connected networks or transformers?

2

buyIdris666 t1_j9h8zqj wrote

I like to think of networks with residual connections as an ensemble of small models cascaded together. Residual connections were created to avoid vanishing/exploding gradients in deep networks.

It's important to realize that each residual connection exposes the layer to the entire input. I don't like the name "residual" because it implies that only a small amount of data is carried across. Nope, it's the entire input.
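
To make that concrete, here is a minimal PyTorch-style sketch. The two-conv `F` and the channel count are just stand-ins I made up, not any particular architecture; the point is that the skip adds the whole input tensor back, element-wise.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The skip adds the *entire* input tensor back, element-wise,
        # not just some small leftover piece of it.
        return self.f(x) + x

block = ResidualBlock(16)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 16, 32, 32])
```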

Latent information the model has learned along the way passes through the bottleneck, which supposedly forces it to keep only the most important information. But the explanation above about receptive fields is also pertinent.

Beyond the vanishing gradient problem that affects all deep networks, one of the biggest challenges with image models is getting them to understand the big picture. Pixels close to each other are very strongly related, so the network preferentially learns those local relations. The bottleneck can be seen as forcing the model to learn global things about the image, since the resolution is usually halved at each layer. A form of image compression, if you want to think of it that way.
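
Here's a rough sketch of what I mean by halving. The input size and channel counts are arbitrary toy numbers; each stride-2 conv halves the spatial resolution, so deeper features summarize progressively larger regions of the image.

```python
import torch
import torch.nn as nn

# Each stride-2 conv halves H and W, so deeper features cover wider regions.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 64x64 -> 32x32
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 32x32 -> 16x16
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
)

x = torch.randn(1, 3, 64, 64)
print(encoder(x).shape)  # torch.Size([1, 128, 8, 8])
```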

So the residual connections keep the model from forgetting what it started with, and the bottleneck forces it to learn more than just the relations between nearby pixels. This is similar in spirit to the attention mechanism used in Transformers.

CNNs tend to work better than Transformers for images because their receptive fields build in the assumption that nearby pixels affect each other more strongly. This makes them easier to train on images. Whether Transformers would work equally well with more training is an open question.
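
To illustrate the receptive field point, here's a back-of-the-envelope calculation using the standard receptive-field recursion (the layer configs are made up): stacked 3x3 convs only grow the receptive field slowly, while stride-2 downsampling widens it much faster.

```python
# Standard recursive receptive-field formula for a stack of conv layers.
# r = receptive field in input pixels, j = distance between feature centers.
def receptive_field(layers):
    r, j = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * j
        j *= stride
    return r

print(receptive_field([(3, 1)] * 3))  # 7  : three 3x3 convs, no downsampling
print(receptive_field([(3, 2)] * 3))  # 15 : same convs with stride-2 downsampling
```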

For a model similar to U-Net and other "bottlenecked" CNN architectures, check out denoising autoencoders: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
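
If it helps, a minimal denoising autoencoder training step looks roughly like this. It's a toy fully-connected version with made-up layer sizes and noise level, not the architecture from the linked article; the idea is just to corrupt the input and train the bottlenecked model to reconstruct the clean version.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Tiny fully-connected autoencoder with a 32-dim bottleneck."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32))
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)                    # stand-in for a batch of flattened images
noisy = clean + 0.3 * torch.randn_like(clean)  # corrupt the input

recon = model(noisy)
loss = nn.functional.mse_loss(recon, clean)    # reconstruct the *clean* target
loss.backward()
opt.step()
```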

There is currently speculation that diffusion models are simply a return of the ancient (7-year-old) denoising autoencoder with a fancy training schedule tacked on.

This explanation is entirely my intuition because nobody really knows how this stuff works lol

6

Optimal-Asshole t1_j9jo26z wrote

"Residual" refers to the fact that the NN/bottleneck learns the residual left over after accounting for the entire input, which the skip passes through unchanged. Anyone calling the skip connections "residual connections" should stop though lol
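
In code terms (a single conv standing in for the learned function F, purely for illustration): if the block outputs x + F(x), then what F actually learns is the residual, i.e. the output minus the input.

```python
import torch
import torch.nn as nn

channels = 16
f = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # stands in for F

x = torch.randn(1, channels, 8, 8)
out = x + f(x)        # the block's output H(x) = x + F(x)
residual = out - x    # what F contributed, i.e. the residual H(x) - x
print(torch.allclose(residual, f(x), atol=1e-6))  # True (up to float error)
```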

5