Submitted by OutOfCharm t3_zgvr0h in MachineLearning

I am wondering: is there any relationship between the dimension of a feature vector and what it is able to represent? (E.g., can I say that a 10-dimensional feature vector has better representational ability than a 5-dimensional one, assuming the data are sufficient to train a model?) If not, can you suggest any reference with a formal derivation of that relationship?

1

Comments


Zealousideal_Golf252 t1_iziufir wrote

It may not directly answer your question, but you can look up the universal approximation theorem (roughly: a feedforward network with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy).

4

OutOfCharm OP t1_izix0ig wrote

I get it. Is there any relevant theorem about lower-dimensional representations, e.g. when the data x lives in R^N but we are interested in a representation x' in a subspace R^M (M < N)?

1

IntelArtiGen t1_iziveld wrote

I don't know if I entirely got the question, but I can try to answer. With one number you can, in theory, represent an infinite amount of information. In practice, on computers, we don't have infinite precision for a single number (fp16, fp32, etc.), and a DL algorithm can't interpret that number with infinite precision either. If -1 and +1 are two different pieces of information, that's fine. If 0.9999999 and 1.000001 are two different pieces of information, a DL algorithm will have trouble learning to separate them.
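To make that concrete, here is a minimal numpy sketch (the specific values are just illustrative) showing how two "different" inputs can collapse to the same number at low precision:

```python
import numpy as np

# In float16, two "different" inputs can round to the same value,
# so nothing downstream can tell them apart.
a16, b16 = np.float16(0.9999999), np.float16(1.000001)
print(a16 == b16)   # True: both round to 1.0 in fp16

# float32 still separates them, but only barely.
a32, b32 = np.float32(0.9999999), np.float32(1.000001)
print(a32 == b32)   # False
print(b32 - a32)    # ~1e-06, a tiny signal for a network to exploit
```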

So there is a relationship, because for practical reasons we can't represent everything in one number. But there is also a limit: if you can fit all the information you need in 10 values, using 100000 values to represent it won't help. And if you want to know what the right number of values is in theory, I'm afraid you can't, because it depends on the dataset, the model and the training process.

Perhaps this has a bit to do with information theory. But I'm not aware of an information theory that focuses on DL; that area is maybe under-investigated.

1

OutOfCharm OP t1_iziwmbt wrote

That's a point! Would you agree that if 10 values are sufficient to hold all the information you need, decreasing the number of values to, e.g., 3 must harm the performance?

1

IntelArtiGen t1_izj2jc3 wrote

The number of values must be sufficient, and the model must be able to process these values. We could imagine a model that doesn't perform well with 10 values because that's too much for it to process, but performs better with 3 values, even though the "perfect model" would need 10 values to give the best results.

1

MRsockman314 t1_izjyueu wrote

You will get a lot of value out of looking at PCA.

Additionally, autoencoders. Just a quick overview: if you have a set of images of faces as your input data, you should be able to see that a random collection of pixels is almost never a valid face. We can use deep learning models to "encode" this structure of valid faces into lower dimensions. Where you originally have a 28x28 greyscale image with values 0-255 (784 dimensions), it can often be compressed down to maybe 100 dimensions. There are quite a few subtleties; Kingma has a great review paper that should help.
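Here is a minimal sketch of that idea (using PyTorch; the hidden size of 256 and the random input batch are made up for illustration, and the 100-dimensional bottleneck just matches the number above):

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: 28x28 greyscale image (784 values) -> 100-dim code -> back to 784.
encoder = nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 100))
decoder = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Sigmoid())

x = torch.rand(64, 28 * 28)            # stand-in for a batch of images scaled to [0, 1]
code = encoder(x)                      # 64 x 100 compressed representation
recon = decoder(code)                  # 64 x 784 reconstruction
loss = nn.functional.mse_loss(recon, x)
loss.backward()                        # training minimizes the reconstruction error
print(code.shape, recon.shape, loss.item())
```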

For a 10-dimensional vector it really depends on the data: the more structure it has, the more easily it can be compressed.
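A rough illustration of that point with PCA (scikit-learn; the synthetic "structured" data, driven by 2 hidden factors, is invented for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Structured data: 10-dimensional observations driven by only 2 underlying factors plus noise.
factors = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 10))
structured = factors @ mixing + 0.01 * rng.normal(size=(1000, 10))

# Unstructured data: 10 independent noise dimensions.
unstructured = rng.normal(size=(1000, 10))

for name, X in [("structured", structured), ("unstructured", unstructured)]:
    var2 = PCA(n_components=2).fit(X).explained_variance_ratio_.sum()
    print(f"{name}: first 2 components explain {var2:.1%} of the variance")
```

The structured data compresses to 2 dimensions almost losslessly, while the unstructured data does not.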

Information theory will be a great thing to Google. I would recommend David MacKay's book (Information Theory, Inference, and Learning Algorithms) if you are really interested.

1

Pitiful-Ad2546 t1_izke0jv wrote

I second the PCA suggestion. Also sufficient statistics, inductive bias, and learning theory (not that the generalization of MLPs is well understood, but the concepts of true vs. empirical data distribution, Bayes risk, etc.).

The answer to your question depends on the data and the model. If the extrinsic data dimension (e.g., you have a 10-dimensional data vector) is higher than the intrinsic dimension (maybe the data is distributed on a low-dimensional subspace or manifold), then you don't necessarily need the full data representation to solve a given problem. Even if the data don't have a lower intrinsic dimension, if the features relevant to the problem you are trying to solve are intrinsically low-dimensional, you still don't necessarily need the full data representation to solve the problem.
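A small numpy sketch of the extrinsic vs. intrinsic distinction (the 10-dimensional data living on a 3-dimensional subspace is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Extrinsic dimension 10: each sample is a 10-dimensional vector.
# Intrinsic dimension 3: the samples actually live in a 3-dimensional subspace.
latent = rng.normal(size=(500, 3))
embed = rng.normal(size=(3, 10))
X = latent @ embed

# The singular values of the centered data reveal the intrinsic dimension:
# only 3 are numerically nonzero, so a 3-dimensional representation loses nothing here.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(s, 3))   # roughly [large, large, large, 0, 0, ...]
```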

The universal approximation theorem is great, but in practice it can be very hard to learn certain functions with certain architectures. This is why papers proposing a new architecture might get SOTA performance on a problem: they have found an architecture with a good inductive bias for their particular problem.

1

th3liasm t1_izkz62t wrote

Generally, no. More specifically: it clearly depends on the task at hand.

1