TheJulianInside t1_iquno3u wrote
Reply to [D] - Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention) by 029187
I like the geometric interpretation. It's not exactly a theory that explains everything, but I find it a very useful framework for thinking about this type of question.

The way I understand it is in terms of the space of possible functions a model has to consider. This is closely related to the "curse of dimensionality": since data is finite, it cannot fill a high-dimensional input space densely, so a generic universal function approximator will never see enough of the input space to learn useful representations. Geometric priors are therefore needed to shrink the space of functions the model has to search.
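The sparsity point can be sketched numerically. This is a small illustration under my own assumption of data drawn uniformly from the unit hypercube (not anything from the thread): with a fixed sample size, the mean nearest-neighbour distance grows with dimension, so the samples cover the space less and less densely.

```python
import numpy as np

# Curse-of-dimensionality sketch (assumption: uniform samples in [0, 1]^d).
# Fixed sample size, increasing dimension -> samples get sparser.
rng = np.random.default_rng(0)
n = 200
mean_nn = []
for d in (2, 10, 100):
    x = rng.random((n, d))
    diff = x[:, None, :] - x[None, :, :]       # (n, n, d) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) Euclidean distances
    np.fill_diagonal(dist, np.inf)             # ignore distance to self
    mean_nn.append(dist.min(axis=1).mean())    # mean nearest-neighbour distance
    print(f"d={d:>3}: mean nearest-neighbour distance = {mean_nn[-1]:.3f}")
```

The distances climb steadily with d, which is the sense in which finite data "cannot fill the space in all dimensions".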
I'm fascinated by the work of Michael Bronstein and friends.
Note: I'm very far from an expert on any of this