Submitted by **YamEnvironmental4720** t3_zaifed
in **MachineLearning**

I am thinking about the relative "importance" of the variables in a given data set for a general class of classifiers, and in particular about the role that entropy could play here. Below, I sketch a heuristic "pseudo-argument" that could perhaps motivate using entropy to reduce the number of variables.

For simplicity, let's assume that we have only three variables, so the vectors in the data set have the form (x, y, z) and are labelled as either 0 or 1. Let's now investigate the role of the z-variable in terms of entropy. For each value c, we consider the coordinate hyperplane H(z, c) := {(x, y, z) : z = c} and compute the entropies e(z, c, L) and e(z, c, R) defined by the probability distributions of the labels on the two sides of the hyperplane. We then use the probabilities p(z, c, L) and p(z, c, R) of a data vector lying on the left and right side of the hyperplane, respectively, to compute the expected entropy for the hyperplane H(z, c) as the weighted average p(z, c, L)*e(z, c, L) + p(z, c, R)*e(z, c, R).

The hyperplane above is considered to carry important information, in terms of entropy, if this expected entropy is lower than the entropy defined by the label distribution over the full data set; the difference is exactly the information gain used when growing decision trees. The bigger the decrease in entropy, the better the hyperplane is considered to be in this sense.

Now, recall that the lower the entropy of a probability distribution is, the more unequal the probabilities are, that is, the less mixed the labelled vectors are on each side. So, a good enough hyperplane already serves as a (rough) classifier.
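To make the computation above concrete, here is a minimal NumPy sketch. The toy data set (whose labels depend only on x and y by construction) and the single threshold c = 0 are my own illustrative choices:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def expected_entropy_of_split(X, y, feature, c):
    """Weighted average of the entropies on both sides of the hyperplane x[feature] = c."""
    left = y[X[:, feature] < c]
    right = y[X[:, feature] >= c]
    p_left = len(left) / len(y)
    p_right = len(right) / len(y)
    return p_left * entropy(left) + p_right * entropy(right)

# Toy data: 200 points (x, y, z) with labels that depend only on x and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

full_entropy = entropy(y)
for feature, name in enumerate("xyz"):
    gain = full_entropy - expected_entropy_of_split(X, y, feature, c=0.0)
    print(f"information gain for splitting on {name} at c=0: {gain:.3f}")
```

On this toy data the gain for the z-split should come out close to zero, while the x- and y-splits reduce the entropy noticeably.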

We now look at a general classifier given by a differentiable function f(x, y, z) of three variables: we predict a point (x, y, z) to be positive if f(x, y, z) is greater than or equal to zero, and to be negative otherwise. So, the hypersurface

H(f,0):={(x,y,z): f(x,y,z)=0} separates the positive predictions from the negative predictions.
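As a toy instance of such an f (the particular smooth function below is made up for illustration), one can locate a point on the separating hypersurface H(f, 0) by bisecting between a positive and a negative prediction:

```python
import numpy as np

# A made-up smooth score function f(x, y, z); predictions are positive iff f >= 0.
def f(v):
    x, y, z = v
    return np.tanh(x + 2 * y) - 0.05 * np.sin(z)

def predict(v):
    return 1 if f(v) >= 0 else 0

# Bisect along the segment between a positive and a negative point to land
# (approximately) on the decision hypersurface H(f, 0).
pos, neg = np.array([1.0, 1.0, 0.0]), np.array([-1.0, -1.0, 0.0])
print(predict(pos), predict(neg))   # 1 and 0: the two points lie on opposite sides
for _ in range(50):
    mid = (pos + neg) / 2
    if f(mid) >= 0:
        pos = mid
    else:
        neg = mid
print("approximate point on H(f, 0):", np.round(mid, 4), " f there:", f(mid))
```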

We now assume that all the hyperplanes defined by the z-variable and possible values of c are of little importance for reducing the entropy, and we assume further that f, on the other hand, is a "good" classifier. Let's fix a point (a,b,c) on the hypersurface H(f,0) and look at the tangent space, T, of this hypersurface at (a,b,c). Sufficiently close to (a,b,c), the hypersurface can be approximated by its tangent space T, so for data vectors close enough to (a,b,c) we could use T instead of H(f,0) as a classifier: we predict vectors to be positive or negative depending on which side of T they lie on. Now, compare the tangent space T to the coordinate hyperplane H(z,c). We have assumed that H(z, c) is a "bad" classifier, at least globally. So, if (a, b, c) is a general enough point on H(f, 0), the hyperplane H(z, c) should also be a "bad" classifier locally. Since f is assumed to be "good", the tangent space T should not be parallel (or nearly parallel) to H(z, c), meaning that the unit normal vector n to H(f,0) at (a,b,c), i.e. the normalized gradient of f at (a,b,c), should not point along the z-axis. In fact, n should have a very small z-coordinate, since it should be far from the unit normal vector of H(z,c), which is (0,0,1).

But if the gradient of f points almost entirely in the (x, y)-directions (for general enough points on H(f,0)), this means that the hypersurface H(f,0) is close to being "constant" in the z-direction. So, for points (x, y, z) close enough to H(f,0), the value of f depends (almost) only on x and y.
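A rough numerical version of this check is sketched below. The classifier is made up and built to depend only weakly on z, so it illustrates the check itself rather than proving the claim; for a trained neural net one would read off the gradient with autodiff instead of finite differences:

```python
import numpy as np

# Made-up smooth classifier that, by construction, depends only weakly on z.
def f(v):
    x, y, z = v
    return np.tanh(x + 2 * y) + 0.02 * z

def grad(func, v, eps=1e-5):
    """Central finite-difference gradient of func at v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (func(v + e) - func(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
for _ in range(3):
    # Find a point on H(f, 0) by bisection along a random segment through the origin.
    pos = rng.normal(size=3)
    neg = -pos
    if f(pos) < 0:
        pos, neg = neg, pos
    for _ in range(50):
        mid = (pos + neg) / 2
        if f(mid) >= 0:
            pos = mid
        else:
            neg = mid
    n = grad(f, mid)
    n /= np.linalg.norm(n)           # unit normal to H(f, 0) at mid
    print("z-component of the unit normal:", round(float(n[2]), 4))
```

For this toy f the printed z-components stay close to zero, which is the situation the argument above is after.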

It would be tempting to conclude that the z-variable plays only a very small role for classifier functions of the type f above, and that we could in fact look for classifier functions of two variables instead. In particular, when training neural nets, we could perform entropy computations for all variables, and for sufficiently many threshold values, in order to reduce the number of variables of the training vectors; a sketch of this is given below. However, I have not managed to convince myself by the above "argument": in particular, the step from a "global" classifier to a "local" classifier at "general enough" points on the hypersurface seems quite shaky, and would depend very much on the distribution of the training vectors in the full 3-dimensional space. (I suppose the "argument" works better if they are very evenly distributed.)
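For what it's worth, here is a sketch of that variable-reduction step under the assumptions above; the toy data (whose labels ignore z), the quantile grid of thresholds, and the gain cutoff of 0.05 are all my own illustrative choices:

```python
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_gain(X, y, feature, n_thresholds=20):
    """Best information gain over a grid of candidate thresholds for one variable."""
    full = entropy(y)
    values = X[:, feature]
    thresholds = np.quantile(values, np.linspace(0.05, 0.95, n_thresholds))
    gains = []
    for c in thresholds:
        left, right = y[values < c], y[values >= c]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gains.append(full - expected)
    return max(gains)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels ignore z by construction

gains = [best_gain(X, y, j) for j in range(X.shape[1])]
keep = [j for j, g in enumerate(gains) if g > 0.05]   # hypothetical cutoff
print("best gain per variable (x, y, z):", np.round(gains, 3))
print("variables kept for training:", keep)
X_reduced = X[:, keep]                     # reduced training vectors
```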

I would be interested in hearing your thoughts on this problem, in particular for neural net classifiers. Do you know of any results that support this line of thought?

**UnusualClimberBear** t1_iylwrfc wrote

ID3 classically builds a tree by maximizing entropy gains in the leaves, thus removing some irrelevant variables.

You may also be interested in energy-based models.