
biophysninja t1_j0a4nc2 wrote

There are a few ways to approach this depending on the nature of the data, complexity, and compute available.

1- Using SMOTE to oversample the minority class: https://towardsdatascience.com/stop-using-smote-to-handle-all-your-imbalanced-data-34403399d3be

2- If your data is sparse, you can use PCA or autoencoders to reduce the dimensionality, then follow up with SMOTE (rough sketch below).

3- Using GANs to generate negative samples is another alternative.
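
A minimal sketch of points 1 and 2, assuming a NumPy feature matrix `X` and binary labels `y` (faked here with `make_classification`), using scikit-learn for PCA and imbalanced-learn for SMOTE; the dimensions and class ratio are placeholders:

```python
# Sketch of points 1 and 2: optional PCA before SMOTE oversampling.
# Assumes scikit-learn + imbalanced-learn; the dataset here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# Stand-in for an imbalanced problem: 500 features, ~5% positives.
X, y = make_classification(n_samples=10_000, n_features=500,
                           weights=[0.95, 0.05], random_state=0)

# 2) Reduce dimensionality first if the data is sparse/high-dimensional.
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# 1) Oversample the minority class with SMOTE in the reduced space.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_reduced, y)

print(np.bincount(y), "->", np.bincount(y_res))  # roughly [9500 500] -> [9500 9500]
```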

−1

Far-Butterscotch-436 t1_j0a8ny2 wrote

Regarding 2: with only 500 features, dimensionality reduction isn't needed.

Options 1 and 3 are last resorts.

1

shaner92 t1_j0amnbc wrote

  1. Has anyone ever seen SMOTE give good results on real-world data?
  2. It depends on what the 500 features are; you could very well benefit from dimensionality reduction, or at least from pruning some features if they are not all equally useful (see the sketch after this list). That is a separate topic, though.
  3. It's a lot of work to create fake data when he already has that much.
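
For the pruning point, a rough sketch using a model-based selector (scikit-learn's `SelectFromModel` with a random forest); the synthetic data and the median threshold are illustrative choices, not recommendations:

```python
# Sketch of point 2: prune weak features with a model-based selector
# instead of (or before) full dimensionality reduction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 500-feature imbalanced dataset.
X, y = make_classification(n_samples=10_000, n_features=500,
                           weights=[0.95, 0.05], random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, class_weight="balanced",
                           random_state=0, n_jobs=-1),
    threshold="median",   # keep the more important half of the 500 features
)
X_pruned = selector.fit_transform(X, y)
print(X.shape, "->", X_pruned.shape)  # (10000, 500) -> roughly (10000, 250)
```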

Playing with the loss functions/metrics is probably the best way to go, as you (u/Far-Butterscotch-436) pointed out.
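
For instance (a sketch under the same synthetic-data assumption, not a prescription): weight the loss by class frequency via `class_weight="balanced"` and evaluate with precision/recall-based metrics rather than accuracy:

```python
# Sketch of reweighting the loss instead of resampling: class_weight="balanced"
# scales each class inversely to its frequency; evaluate with PR-AUC / F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=500,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=2000)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", average_precision_score(y_te, proba))
print("F1:    ", f1_score(y_te, clf.predict(X_te)))
```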

3

daavidreddit69 t1_j0b5292 wrote

  1. I believe not; to me it's just a concept, not a practical solution in general.
2