Submitted by qazokkozaq t3_102983u in MachineLearning

Hi,

I'm facing a classification problem like this: given measurements of 18K different variables for 42 samples, each sample is labelled class_0 or class_1, split nearly equally (19 belong to class_0, 23 to class_1). What is the right approach to reduce these features to a minimum while the classifier still predicts the correct classes?

I'm not providing any domain knowledge for now, but I can hint a little more if needed.

6

Comments

ResponsibilityNo7189 t1_j2sgk3g wrote

Decision trees would help in this precise case, by selecting the right features as they split.

Random forests to improve results.
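A minimal sketch of this idea, assuming scikit-learn; `X` and `y` are hypothetical stand-ins for the real 42×18K data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 18000))   # placeholder for the real measurements
y = np.array([0] * 19 + [1] * 23)  # 19 class_0, 23 class_1

# A shallow tree can only split on a handful of features, which is
# itself a crude form of feature selection.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
used = np.flatnonzero(tree.feature_importances_)
print("features the tree actually split on:", used)
```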

1

magical_mykhaylo t1_j2sj2fc wrote

This is a very general issue, often called "the curse of dimensionality" or the "short and wide" problem. There are a number of ways to address it, which fall generally under the umbrella term "dimensionality reduction". It's really tricky not to over-fit these kinds of models, but here are some things you can try:

You can reduce the number of features using Principal Component Analysis (PCA), Independent Component Analysis (ICA), or UMAP. With PCA or ICA, broadly speaking, you are not training your model on the individual variables themselves, but rather on linear combinations of those variables that act as "latent variables".
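A minimal sketch of the PCA route, assuming scikit-learn; `X` and `y` are hypothetical placeholders, and the key point is that the PCA is fit inside each cross-validation fold to avoid leakage:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 18000))   # placeholder for the real measurements
y = np.array([0] * 19 + [1] * 23)

# With 42 samples, PCA can yield at most 42 components,
# no matter how many raw variables you start with.
clf = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy:", scores.mean())
```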

You can select the most relevant features using feature (variable) selection before training your algorithm. In the context of Random Forests, this can be done with Gini importance or any number of other similar metrics.
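One hedged way to do this is scikit-learn's `SelectFromModel`, which here ranks features by the forest's impurity-based (Gini) importances; `X` and `y` are again hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 18000))
y = np.array([0] * 19 + [1] * 23)

# Keep the 50 features with the highest Gini importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=0),
    max_features=50,
).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (42, 50)
```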

If you are training a linear model, such as Linear Discriminant Analysis (LDA), there are generally variants designed for high-dimensional data that incorporate elastic net regularisation. Look up "sparse regression" for more information. Some of these algorithms also use Partial Least Squares (PLS) as a way around the problem, but it has fallen out of fashion in most fields.
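A minimal sketch of a sparse linear classifier: elastic-net-penalised logistic regression stands in here for the regularised LDA variants mentioned above, and `X`/`y` are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 18000))
y = np.array([0] * 19 + [1] * 23)

# The L1 part of the elastic net zeroes out most coefficients,
# effectively doing feature selection while fitting the model.
clf = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
).fit(X, y)
print("features with non-zero weight:", int((clf.coef_ != 0).sum()))
```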

If you are building a neural network (generally a bad idea with so few samples), you might consider applying regularisation to the hidden layers.
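If you do go that route, here is a hedged sketch of what such regularisation might look like, assuming PyTorch (the tensors are dummies): weight decay gives an L2 penalty, and dropout acts as a second, independent regulariser.

```python
import torch
import torch.nn as nn

X = torch.randn(42, 18000)                                # dummy data
y = torch.cat([torch.zeros(19), torch.ones(23)]).long()   # dummy labels

model = nn.Sequential(
    nn.Linear(18000, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2)
)
# weight_decay applies an L2 penalty to every parameter at each step.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```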

7

xx14Zackxx t1_j2tnl3k wrote

Sounds like a good fit for an SVM (support vector machine) to me.
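For what it's worth, a minimal sketch with a linear-kernel SVM and leave-one-out cross-validation, which suits n = 42; `X` and `y` are hypothetical stand-ins for the real data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 18000))
y = np.array([0] * 19 + [1] * 23)

# A linear kernel is the usual choice when features vastly outnumber samples.
clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```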

2