Submitted by hopedallas t3_zmaobm in MachineLearning

I am working on a problem where the ratio of negative/0 labels to positive/1 labels is 180MM/10MM. The data size is around 25GB and I have >500 features. Certainly, I don't want to use all 180MM rows of the majority class to train my model, due to computational limitations. Currently, I simply under-sample the majority class. However, I have been reading that this may cause loss of useful information or make it harder to determine the decision boundary between the classes (see https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/). When I under-sample, I try to make sure that the distribution of my data stays the same. I am wondering if there is a better way to handle this?

20

Comments


Far-Butterscotch-436 t1_j0a9083 wrote

5% imbalance isn't bad. Just use a cost function that accounts for the imbalance, e.g. a weighted average binomial deviance, and you'll be fine.
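A minimal sketch of that weighted binomial deviance (essentially a class-weighted log loss, up to a factor of 2); the `y_true`/`y_prob` arrays here are toy stand-ins:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy stand-ins: y_true holds 0/1 labels, y_prob the model's
# predicted probability of the positive class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.1, 0.3, 0.2, 0.1, 0.15, 0.25, 0.7])

# Weight each sample inversely to its class frequency so the rare
# positives contribute as much to the deviance as the many negatives.
class_counts = np.bincount(y_true)
weights = len(y_true) / (2 * class_counts[y_true])

weighted_deviance = log_loss(y_true, y_prob, sample_weight=weights)
print(weighted_deviance)
```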

Also, you can build a downsampling ensemble and compare performance. Don't downsample to 50/50; try for at least 10%.
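A hedged sketch of the downsampling-ensemble idea: each member is trained on all positives plus a different random slice of negatives (roughly 10% positives per member here), and the members' predicted probabilities are averaged. `X_pos`/`X_neg` and the XGBoost settings are just placeholders.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

def train_downsampling_ensemble(X_pos, X_neg, n_members=5, neg_ratio=9):
    """Train one model per random negative subsample (~10% positives each)."""
    members = []
    n_neg = neg_ratio * len(X_pos)
    for _ in range(n_members):
        idx = rng.choice(len(X_neg), size=n_neg, replace=False)
        X = np.vstack([X_pos, X_neg[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(n_neg)]).astype(int)
        model = XGBClassifier(n_estimators=200, tree_method="hist")
        model.fit(X, y)
        members.append(model)
    return members

def predict_ensemble(members, X):
    # Average the positive-class probabilities across ensemble members.
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)

# Tiny illustrative usage on synthetic data.
X_pos = rng.normal(loc=1.0, size=(100, 5))
X_neg = rng.normal(loc=0.0, size=(1800, 5))
members = train_downsampling_ensemble(X_pos, X_neg, n_members=3)
scores = predict_ensemble(members, np.vstack([X_pos[:5], X_neg[:5]]))
```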

You've got a good problem: lots of observations with few features.

20

trendymoniker t1_j0acn6e wrote

👆

1e6:1 is extreme. 1e3:1 is often realistic (think views to shares on social media). 18:1 is actually a pretty good real-world ratio.

If it were me, I’d just change the weights for each class in the loss function to get them more or less equal.

190m examples isn’t that many either — don’t worry about it. Compute is cheap — it’s ok if it takes more than one machine and/or more time.

9

hopedallas OP t1_j0ailbn wrote

Thanks for the hint. Sorry, not sure what you mean by "try for at least 10%"?

2

skelly0311 t1_j0adie9 wrote

What algorithm are you using? If it learns in an iterative fashion, such as with gradient descent, you can downsample a different random subset of the majority class for every epoch of feedforward/backprop, so you don't lose any information from the class that has more data.

I currently do this with multi-label classification problems in NLP, where the classes are much more skewed than in your use case.
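A hedged sketch of that per-epoch resampling, using scikit-learn's `SGDClassifier` (sklearn >= 1.1) as a stand-in for any model trained iteratively; the `X_pos`/`X_neg` arrays below are toy stand-ins for the real positive/negative rows.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Toy stand-ins; in practice these would be the real positive/negative rows
# (kept on disk or memory-mapped and sliced per epoch).
X_pos = rng.normal(loc=1.0, size=(500, 20))
X_neg = rng.normal(loc=0.0, size=(5000, 20))

clf = SGDClassifier(loss="log_loss")  # logistic loss, trained incrementally
n_epochs = 10
neg_per_epoch = 4 * len(X_pos)  # e.g. a 4:1 negative:positive ratio per epoch

for epoch in range(n_epochs):
    # Draw a fresh random subset of the majority class each epoch, so no
    # single negative row is permanently discarded.
    idx = rng.choice(len(X_neg), size=neg_per_epoch, replace=False)
    X_epoch = np.vstack([X_pos, X_neg[idx]])
    y_epoch = np.concatenate(
        [np.ones(len(X_pos), dtype=int), np.zeros(neg_per_epoch, dtype=int)]
    )
    order = rng.permutation(len(y_epoch))
    clf.partial_fit(X_epoch[order], y_epoch[order], classes=[0, 1])
```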

3

hopedallas OP t1_j0c0eui wrote

I'm using both random forest and XGBoost. For your NLP problem, do you give higher weights to the sparse classes in each epoch?

1

JackandFred t1_j0a1gv2 wrote

If you think the real-world data will be similar to your samples, it's fine. But that's unlikely given how skewed your dataset is. Look up alternative metrics like the F-score so you can weight what matters during training (false positives vs. false negatives, etc.).

What you linked is about algorithms for imbalanced classification; usually the same algorithm is fine, but you want a different loss metric.
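A small sketch of what those alternative metrics look like in code; `y_val`/`p_val` are synthetic stand-ins for held-out labels and predicted positive-class probabilities. Precision/recall-based scores make the false-positive vs. false-negative trade-off explicit instead of being dominated by the majority class.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve

rng = np.random.default_rng(1)
y_val = (rng.random(1000) < 0.05).astype(int)                # ~5% positives
p_val = np.clip(0.6 * y_val + 0.5 * rng.random(1000), 0, 1)  # rough scores

ap = average_precision_score(y_val, p_val)        # area under the PR curve
f1 = f1_score(y_val, (p_val >= 0.5).astype(int))  # F-score at a 0.5 threshold

# The operating threshold can be tuned on the PR curve instead of defaulting
# to 0.5, depending on the relative cost of false positives vs. false negatives.
precision, recall, thresholds = precision_recall_curve(y_val, p_val)
```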

2

katerdag t1_j0dp1zw wrote

You could look into anomaly detection strategies and see if any fit your type of data / your use case.

2

stu_art0 t1_j0az170 wrote

It’s not extreme at all… for fraud detection cases you can easily get a ratio of 500:1…

1

bimtuckboo t1_j0b1lti wrote

The issue described in the article you linked only becomes relevant when you are throwing away data you otherwise would have trained on purely to rectify class imbalance. If computational limitations mean you couldn't train on that data anyway, even if the classes were 50/50 balanced, then there is nothing else to be done.

Of course, more data can often lead to better performance, and if you find your model is below par you may want to explore ways to engineer around whatever computational limitations you are encountering so that you can train on more data. In that case you may want to revisit your approach to rectifying the class imbalance, but don't do it if you don't need to.

Ultimately, anytime you are developing a model and you don't know what to do next, check if the model's performance is acceptable as is. You might not need to do anything.

1

biophysninja t1_j0a4nc2 wrote

There are a few ways to approach this depending on the nature of the data, complexity, and compute available.

1- Using SMOTE (see the sketch after this list): https://towardsdatascience.com/stop-using-smote-to-handle-all-your-imbalanced-data-34403399d3be

2- If your data is sparse, you can use PCA or autoencoders to reduce the dimensionality, then follow up with SMOTE.

3- Using GANs to generate negative samples is another alternative.
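For reference, a minimal sketch of SMOTE via the imbalanced-learn package on toy data; as the replies below suggest, it may not be worth it at this scale.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(3)

# Toy imbalanced data: 950 negatives, 50 positives.
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 10)),
               rng.normal(1.0, 1.0, size=(50, 10))])
y = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# SMOTE interpolates between minority-class neighbours to create synthetic positives.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # classes are now balanced
```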

−1

Far-Butterscotch-436 t1_j0a8ny2 wrote

Regarding 2: there are only 500 features, so dimensionality reduction isn't needed.

1 and 3 are last resorts.

1

shaner92 t1_j0amnbc wrote

  1. Has anyone ever seen SMOTE give good results on real-world data?
  2. It depends what the 500 features are; you could very well benefit from dimensionality reduction, or at least from pruning some features, if they are not all equally useful. That is a separate topic, though.
  3. It's a lot of work to create fake data when he already has this much.

Playing with the loss functions/metrics is probably the best way to go, as you (u/Far-Butterscotch-436) pointed out.

3

daavidreddit69 t1_j0b5292 wrote

  1. I believe not; it's just a concept to me, not a solving method in general.

2