Submitted by bjergerk1ng t3_11542tv in MachineLearning

When designing neural network architectures, it is common to think about "information flow", e.g. how information is propagated, where the "information bottlenecks" are, and so on. Another example is that some people use "information loss" to explain why transformers work better than RNNs.

It seems like most papers discuss this in a rather hand-wavy way. Has any work been done on formalising these ideas to better guide our understanding of various model architectures? What are the core ideas?

34

Comments

currentscurrents t1_j8zq4tn wrote

>I wouldn’t say it’s common to design networks with information flow in mind

I disagree. The entire point of the attention mechanism in transformers is to have a second neural network to control the flow of information.
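To make the "controlling the flow" point concrete, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The toy keys, values and query are made up for illustration; real transformers use learned projections and batched tensors. The softmax weights decide how much of each value vector is allowed through.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # how much of each value gets through
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy inputs (assumed for illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

output, weights = attention(query, keys, values)
print(weights)  # the second key is orthogonal to the query, so its weight is small
print(output)
```

The weights are computed from the inputs themselves, which is the sense in which one part of the network routes information for another.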

Similarly, the autoencoder structure is ubiquitous these days, and it's based around the idea of forcing information to flow through a bottleneck. Some information must be thrown away, so the neural network learns which parts of the data are most important to keep, and you get a good understanding of the structure of the data.
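As a rough illustration of the bottleneck idea, the sketch below uses the simplest possible case: a linear autoencoder that squeezes 2-D points through a single number. For a linear encoder/decoder with tied weights, the best bottleneck direction is the first principal component, so it is computed in closed form here rather than trained. The synthetic data and noise level are assumptions chosen for the example.

```python
import math
import random

random.seed(0)

# Synthetic 2-D data that mostly varies along the direction (1, 2).
data = []
for _ in range(200):
    t = random.gauss(0, 1)
    data.append((t + random.gauss(0, 0.05), 2 * t + random.gauss(0, 0.05)))

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
sxx = sum((p[0] - mx) ** 2 for p in data) / n
syy = sum((p[1] - my) ** 2 for p in data) / n
sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n

# Orientation of the major axis of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
u = (math.cos(theta), math.sin(theta))

def encode(p):
    # Bottleneck: two coordinates in, one number out.
    return (p[0] - mx) * u[0] + (p[1] - my) * u[1]

def decode(z):
    return (mx + z * u[0], my + z * u[1])

err = 0.0
for p in data:
    r = decode(encode(p))
    err += (p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2
err /= n

fraction_lost = err / (sxx + syy)
print(u, fraction_lost)
```

Only the small off-axis noise is thrown away; the one number that survives the bottleneck captures the dominant structure of the data.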

I'd say many of the recent great ideas in the field have come from manipulating information flow in interesting ways.

11

filipposML t1_j90m8x0 wrote

You may be interested in Tishby's work on rate-distortion theory (the information bottleneck). For example, in this paper they analyse the behaviour of mutual information in the hidden layers as a neural network is trained to convergence.
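For intuition about what such an analysis measures, here is a toy sketch (not the method of any particular paper) that estimates mutual information from discrete samples and applies it to two hand-made "layers". The variables `inputs`, `labels`, `layer1` and `layer2` are hypothetical stand-ins for an input X, a label Y and successive hidden representations T.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

inputs = list(range(8)) * 10             # X: uniform over 8 symbols (3 bits)
labels = [int(x >= 4) for x in inputs]   # Y: one bit of X
layer1 = [x // 2 for x in inputs]        # T1: drops one bit of X
layer2 = [x // 4 for x in inputs]        # T2: keeps only the label bit

for name, t in [("layer1", layer1), ("layer2", layer2)]:
    print(name, mutual_information(inputs, t), mutual_information(t, labels))
```

Each layer keeps less information about the input while keeping everything about the label, which is the kind of compression such an analysis looks for. Real activations are continuous, so in practice they have to be binned or handled with another estimator, and the result can depend heavily on that choice.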

11

afireohno t1_j9230xx wrote

Two lines of work come to mind that you might be interested in.

  1. Geometric deep learning primarily studies various types of invariances (translation, permutation, etc.) that can be encoded in DL architectures.
  2. Algorithmic alignment studies the relationship between information flow in classical algorithms and DL architectures and how "aligning" the latter to the former can improve performance.
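
As a small illustration of the first item, the sketch below builds a permutation-invariant encoder by applying a per-element map and then sum-pooling. The functions `phi` and `set_encoder` and their constants are hypothetical choices for this example, not taken from any specific paper.

```python
import math
import random

def phi(x):
    # Per-element feature map (arbitrary choice for illustration).
    return (x, x * x)

def set_encoder(xs):
    """Encode a collection so that the result ignores element order."""
    feats = [phi(x) for x in xs]
    # Sum pooling: math.fsum is exactly rounded, so the order of terms does not matter.
    pooled = [math.fsum(f[i] for f in feats) for i in range(2)]
    return math.tanh(0.1 * pooled[0]) + 0.01 * pooled[1]

random.seed(0)
items = [0.5, -1.25, 3.0, 2.0, 0.75]
shuffled = items[:]
random.shuffle(shuffled)

original_code = set_encoder(items)
shuffled_code = set_encoder(shuffled)
print(original_code, shuffled_code)
```

The invariance is encoded in the architecture itself: no amount of training data is needed to teach the model that order is irrelevant.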

Edit: Spelling

6

velcher t1_j934snb wrote

You might be interested in V-information, which specifically looks at information from a computational efficiency point of view.

For example, classical mutual information says that an encrypted message and the original message have high MI, but in practice it is hard to recover the message from the ciphertext. The V-information is therefore low in this case.
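
A toy sketch of that contrast, under loud assumptions: the "cipher" is just a fixed random substitution table over 8-bit messages (not real encryption), and the restricted predictor family is a hypothetical one that may read at most one bit of the ciphertext. V-information is taken here as the entropy of the target with no side information minus the best achievable conditional log loss within the family.

```python
import math
import random
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def conditional_entropy(ys, xs):
    # H(Y | X) = H(X, Y) - H(X)
    return entropy(list(zip(xs, ys))) - entropy(xs)

random.seed(0)
messages = list(range(256))            # every 8-bit message, once each
table = messages[:]
random.shuffle(table)                  # toy stand-in for encryption
ciphertexts = [table[m] for m in messages]
targets = [m & 1 for m in messages]    # Y: lowest bit of the message

# Unrestricted observer: the substitution is a bijection, so Y is fully determined.
mi = entropy(targets) - conditional_entropy(targets, ciphertexts)

# Restricted family (hypothetical): read at most one ciphertext bit, or none at all.
best_single_bit = min(
    conditional_entropy(targets, [(c >> j) & 1 for c in ciphertexts])
    for j in range(8)
)
v_info = entropy(targets) - min(entropy(targets), best_single_bit)

print(mi, v_info)
```

Classical MI reports the full bit, because an unbounded observer could invert the table, while the restricted family extracts close to nothing.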

2