Submitted by bjergerk1ng t3_11542tv in MachineLearning
When designing neural network architectures, it is common to think in terms of "information flow", e.g. how information is propagated, where the "information bottlenecks" are, and so on. Another example is that some people invoke "information loss" to explain why Transformers work better than RNNs.
Most papers seem to discuss this in a rather hand-wavy way. Is there any work on formalising such ideas to better guide our understanding of various model architectures? If so, what are the core ideas?
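To make the question concrete: the usual quantitative counterpart of "information flow" is mutual information, the quantity at the heart of information-bottleneck-style analyses. Below is a minimal sketch (not part of the original post) of what "how much label information a layer carries" can mean in practice: a histogram-based estimate of I(T; Y) between a single hidden unit's activation T and discrete labels Y. The function name, bin count, and toy data are all illustrative assumptions.

```python
import numpy as np

def mutual_information(activations, labels, n_bins=30):
    """Histogram-based estimate of I(T; Y), in nats, for a 1-D activation T and discrete labels Y."""
    # Discretize the continuous activations into equal-width bins.
    edges = np.histogram_bin_edges(activations, bins=n_bins)
    t_binned = np.digitize(activations, edges)  # bin indices in 0..n_bins+1

    # Empirical joint distribution p(t, y) from co-occurrence counts.
    joint = np.zeros((n_bins + 2, labels.max() + 1))
    for t, y in zip(t_binned, labels):
        joint[t, y] += 1
    joint /= joint.sum()

    # Marginals p(t) and p(y).
    p_t = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)

    # I(T; Y) = sum_{t,y} p(t,y) * log( p(t,y) / (p(t) p(y)) ), skipping empty cells.
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (p_t @ p_y)[nonzero])))

# Toy example: a unit whose activation is a noisy copy of a binary label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
activations = labels + 0.2 * rng.normal(size=5000)
# Roughly ln(2) ~ 0.69 nats (the label's full entropy), since the activation
# almost completely determines the label here.
print(mutual_information(activations, labels))
```

A caveat worth keeping in mind: binning estimators like this are crude and biased, and they get much harder to apply once T is a high-dimensional layer rather than a single unit, which is part of why the "formalisation" question is non-trivial.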