
Pitiful-Ad2546 t1_izke0jv wrote

I second the PCA suggestion. Also sufficient statistics, inductive bias, and learning theory (not that the generalization of MLPs is well understood, but the concepts of true vs. empirical data distributions, Bayes risk, etc. are still useful).

The answer to your question depends on the data and the model. If the extrinsic dimension of the data (e.g., a 10-dimensional data vector) is higher than the intrinsic dimension (say the data are distributed on a low-dimensional subspace or manifold), then you don't necessarily need the full data representation to solve any problem. Even if the data don't have a lower intrinsic dimension, if the features relevant to the problem you're trying to solve are intrinsically low-dimensional, you still don't need the full data representation to solve that problem. A quick sketch of the first case is below.
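A minimal sketch of that first case, assuming numpy and scikit-learn are available: 10-dimensional vectors whose intrinsic dimension is 2, where PCA exposes the gap between extrinsic and intrinsic dimension. The sizes and noise level are illustrative choices, not anything from the original discussion.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Latent 2-D factors embedded into 10 extrinsic dimensions by a fixed linear map.
latent = rng.normal(size=(1000, 2))                 # intrinsic dimension = 2
embedding = rng.normal(size=(2, 10))                # random linear embedding
X = latent @ embedding + 0.01 * rng.normal(size=(1000, 10))  # small noise

pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_.round(3))
# The first two components carry essentially all of the variance, so a 2-D
# representation suffices for any task defined on these data.
```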

The universal approximation theorem is great, but in practice it can be very hard to learn certain functions with certain architectures. This is why papers proposing a new architecture can get state-of-the-art performance on a problem: they have found an architecture with a good inductive bias for that particular problem.
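A rough illustration of the "approximable in principle vs. learnable in practice" gap, assuming scikit-learn; the architecture, targets, and training budget are my own choices for the sketch. The same small MLP fits a smooth target easily but struggles with a high-frequency one, even though both are continuous functions covered by the universal approximation theorem.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))

for freq in (1, 30):  # low- vs. high-frequency target function
    y = np.sin(freq * np.pi * X).ravel()
    mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
    mlp.fit(X, y)
    print(f"freq={freq}: training R^2 = {mlp.score(X, y):.3f}")

# The high-frequency target typically fits far worse with the same architecture
# and training budget: a 64-unit ReLU network simply cannot carve enough
# linear pieces to track 30 oscillations, no matter what the theorem promises
# about sufficiently large networks.
```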
