Submitted by **YamEnvironmental4720** t3_zaifed
in **MachineLearning**

I am thinking about the relative "importance" of the variables in a given data set for a general class of classifiers, and in particular about the role that entropy could play here. Below, I sketch a heuristic "pseudo-argument" that could perhaps motivate using entropy to reduce the number of variables.

For simplicity, let's assume that we have only three variables, so the vectors in the data set have the form (x, y, z) and are labelled as either 0 or 1. Let's now investigate the role of the z-variable in terms of entropy. For each value c, we consider the coordinate hyperplane H(z, c) := {(x, y, z) : z = c} and compute the entropies e(z, c, L) and e(z, c, R) defined by the probability distributions of the labels on the two sides of the hyperplane. We then use the probabilities p(z, c, L) and p(z, c, R) of a data vector lying on the left and right side of the hyperplane, respectively, to compute the expected entropy for the hyperplane H(z, c) as the weighted average p(z, c, L)*e(z, c, L) + p(z, c, R)*e(z, c, R).

The hyperplane above is considered to carry important information, in terms of entropy, if this expected entropy is lower than the entropy defined by the label distribution over the full data set; the difference is exactly the information gain used when growing decision trees. The bigger the decrease in entropy, the better the hyperplane is considered to be in this sense.

Now, recall that the lower the entropy of a probability distribution is, the more unequal the probabilities are, that is, the less mixed the labelled vectors are on each side. So, a good enough hyperplane already serves as a (rough) classifier.
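To make the computation above concrete, here is a minimal NumPy sketch. The toy data set (whose labels depend only on x and y by construction) and the single threshold c = 0 are my own illustrative choices:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def expected_entropy_of_split(X, y, feature, c):
    """Weighted average of the entropies on both sides of the hyperplane x[feature] = c."""
    left = y[X[:, feature] < c]
    right = y[X[:, feature] >= c]
    p_left = len(left) / len(y)
    p_right = len(right) / len(y)
    return p_left * entropy(left) + p_right * entropy(right)

# Toy data: 200 points (x, y, z) with labels that depend only on x and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

full_entropy = entropy(y)
for feature, name in enumerate("xyz"):
    gain = full_entropy - expected_entropy_of_split(X, y, feature, c=0.0)
    print(f"information gain for splitting on {name} at c=0: {gain:.3f}")
```

On this toy data the gain for the z-split should come out close to zero, while the x- and y-splits reduce the entropy noticeably.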

We now look at a general classifier given by a differentiable function f(x, y, z) of three variables: we predict a point (x, y, z) to be positive if f(x, y, z) is greater than or equal to zero, and to be negative otherwise. So, the hypersurface

H(f,0):={(x,y,z): f(x,y,z)=0} separates the positive predictions from the negative predictions.
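As a toy instance of such an f (the particular smooth function below is made up for illustration), one can locate a point on the separating hypersurface H(f, 0) by bisecting between a positive and a negative prediction:

```python
import numpy as np

# A made-up smooth score function f(x, y, z); predictions are positive iff f >= 0.
def f(v):
    x, y, z = v
    return np.tanh(x + 2 * y) - 0.05 * np.sin(z)

def predict(v):
    return 1 if f(v) >= 0 else 0

# Bisect along the segment between a positive and a negative point to land
# (approximately) on the decision hypersurface H(f, 0).
pos, neg = np.array([1.0, 1.0, 0.0]), np.array([-1.0, -1.0, 0.0])
print(predict(pos), predict(neg))   # 1 and 0: the two points lie on opposite sides
for _ in range(50):
    mid = (pos + neg) / 2
    if f(mid) >= 0:
        pos = mid
    else:
        neg = mid
print("approximate point on H(f, 0):", np.round(mid, 4), " f there:", f(mid))
```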

We now assume that all the hyperplanes defined by the z-variable and possible values of c are of little importance for reducing the entropy, and we assume further that f, on the other hand, is a "good" classifier. Let's fix a point (a,b,c) on the hypersurface H(f,0) and look at the tangent space, T, of this hypersurface at (a,b,c). Sufficiently close to (a,b,c), the hypersurface can be approximated by its tangent space T, so for data vectors close enough to (a,b,c) we could use T instead of H(f,0) as a classifier: we predict vectors to be positive or negative depending on which side of T they lie on. Now, compare the tangent space T to the coordinate hyperplane H(z,c). We have assumed that H(z, c) is a "bad" classifier, at least globally. So, if (a, b, c) is a general enough point on H(f, 0), the hyperplane H(z, c) should also be a "bad" classifier locally. Since f is assumed to be "good", the tangent space T should not be parallel (or nearly parallel) to H(z, c), meaning that the unit normal vector n to H(f,0) at (a,b,c), i.e. the normalized gradient of f at (a,b,c), should not point along the z-axis. In fact, n should have a very small z-coordinate, since it should be far from the unit normal vector of H(z,c), which is (0,0,1).

But if the gradient of f points almost entirely in the (x, y)-directions (for general enough points on H(f,0)), this means that the hypersurface H(f,0) is close to being "constant" in the z-direction. So, for points (x, y, z) close enough to H(f,0), the value of f depends (almost) only on x and y.
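A rough numerical version of this check is sketched below. The classifier is made up and built to depend only weakly on z, so it illustrates the check itself rather than proving the claim; for a trained neural net one would read off the gradient with autodiff instead of finite differences:

```python
import numpy as np

# Made-up smooth classifier that, by construction, depends only weakly on z.
def f(v):
    x, y, z = v
    return np.tanh(x + 2 * y) + 0.02 * z

def grad(func, v, eps=1e-5):
    """Central finite-difference gradient of func at v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (func(v + e) - func(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
for _ in range(3):
    # Find a point on H(f, 0) by bisection along a random segment through the origin.
    pos = rng.normal(size=3)
    neg = -pos
    if f(pos) < 0:
        pos, neg = neg, pos
    for _ in range(50):
        mid = (pos + neg) / 2
        if f(mid) >= 0:
            pos = mid
        else:
            neg = mid
    n = grad(f, mid)
    n /= np.linalg.norm(n)           # unit normal to H(f, 0) at mid
    print("z-component of the unit normal:", round(float(n[2]), 4))
```

For this toy f the printed z-components stay close to zero, which is the situation the argument above is after.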

It would be tempting to conclude that the z-variable plays only a very small role for classifier functions of the type f above, and that we could in fact look for classifier functions of two variables instead. In particular, when training neural nets, we could perform entropy computations for all variables, and for sufficiently many threshold values, in order to reduce the number of variables of the training vectors; a sketch of this is given below. However, I have not managed to convince myself by the above "argument": in particular, the step from a "global" classifier to a "local" classifier at "general enough" points on the hypersurface seems quite shaky, and would depend very much on the distribution of the training vectors in the full 3-dimensional space. (I suppose the "argument" works better if they are very evenly distributed.)
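For what it's worth, here is a sketch of that variable-reduction step under the assumptions above; the toy data (whose labels ignore z), the quantile grid of thresholds, and the gain cutoff of 0.05 are all my own illustrative choices:

```python
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_gain(X, y, feature, n_thresholds=20):
    """Best information gain over a grid of candidate thresholds for one variable."""
    full = entropy(y)
    values = X[:, feature]
    thresholds = np.quantile(values, np.linspace(0.05, 0.95, n_thresholds))
    gains = []
    for c in thresholds:
        left, right = y[values < c], y[values >= c]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gains.append(full - expected)
    return max(gains)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels ignore z by construction

gains = [best_gain(X, y, j) for j in range(X.shape[1])]
keep = [j for j, g in enumerate(gains) if g > 0.05]   # hypothetical cutoff
print("best gain per variable (x, y, z):", np.round(gains, 3))
print("variables kept for training:", keep)
X_reduced = X[:, keep]                     # reduced training vectors
```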

I would be interested in hearing your thoughts on this problem, in particular for neural net classifiers. Do you know of any results that support this line of thought?

**UnusualClimberBear** t1_iylwrfc wrote

ID3 classically builds a tree by maximizing entropy gains in the leaves, thus removing some irrelevant variables.

You may also be interested in energy-based models.