Submitted by bjergerk1ng t3_11542tv in MachineLearning
When designing neural network architectures, it is common to think in terms of "information flow", e.g. how information is propagated, where the "information bottlenecks" are, and so on. Another example is that some people invoke "information loss" to explain why Transformers work better than RNNs.
Most papers seem to discuss this in a rather hand-wavy way. Is there any work on formalising such ideas to better guide our understanding of various model architectures? If so, what are the core ideas?
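To make the question concrete: the usual quantitative counterpart of "information flow" is mutual information, the quantity at the heart of information-bottleneck-style analyses. Below is a minimal sketch (not part of the original post) of what "how much label information a layer carries" can mean in practice: a histogram-based estimate of I(T; Y) between a single hidden unit's activation T and discrete labels Y. The function name, bin count, and toy data are all illustrative assumptions.

```python
import numpy as np

def mutual_information(activations, labels, n_bins=30):
    """Histogram-based estimate of I(T; Y), in nats, for a 1-D activation T and discrete labels Y."""
    # Discretize the continuous activations into equal-width bins.
    edges = np.histogram_bin_edges(activations, bins=n_bins)
    t_binned = np.digitize(activations, edges)  # bin indices in 0..n_bins+1

    # Empirical joint distribution p(t, y) from co-occurrence counts.
    joint = np.zeros((n_bins + 2, labels.max() + 1))
    for t, y in zip(t_binned, labels):
        joint[t, y] += 1
    joint /= joint.sum()

    # Marginals p(t) and p(y).
    p_t = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)

    # I(T; Y) = sum_{t,y} p(t,y) * log( p(t,y) / (p(t) p(y)) ), skipping empty cells.
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (p_t @ p_y)[nonzero])))

# Toy example: a unit whose activation is a noisy copy of a binary label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
activations = labels + 0.2 * rng.normal(size=5000)
# Roughly ln(2) ~ 0.69 nats (the label's full entropy), since the activation
# almost completely determines the label here.
print(mutual_information(activations, labels))
```

A caveat worth keeping in mind: binning estimators like this are crude and biased, and they get much harder to apply once T is a high-dimensional layer rather than a single unit, which is part of why the "formalisation" question is non-trivial.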