Far-Butterscotch-436 t1_j0a9083 wrote on December 15, 2022 at 4:29 AM

5% imbalance isn't bad. Just use a cost function that accounts for the imbalance, e.g. the weighted average binomial deviance, and you'll be fine.
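For anyone who wants to see what that metric looks like, here is a minimal sketch, assuming binary 0/1 labels and predicted probabilities. The function name, the toy data, and the choice of giving each class half of the total weight are illustrative assumptions, not something specified above.

```python
import math

def weighted_binomial_deviance(y_true, p_pred, weights, eps=1e-12):
    """Weighted average of the per-example binomial deviance (2 * log loss)."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += w * -2.0 * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# Toy data with 5% positives (made up for illustration).
y = [1] + [0] * 19
p = [0.3] + [0.05] * 19

# Assumed weighting scheme: each class carries half of the total weight.
w = [0.5 if label == 1 else 0.5 / 19 for label in y]

print(weighted_binomial_deviance(y, p, w))
```

Because the result is divided by the sum of the weights, only the relative weights matter, not their absolute scale.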

You can also build a downsampling ensemble and compare its performance. Don't downsample to 50/50; try for at least 10%.

You've got a good problem: lots of observations with few features.

trendymoniker t1_j0acn6e wrote on December 15, 2022 at 5:03 AM

👆

1e6:1 is extreme. 1e3:1 is often realistic (think views to shares on social media). 18:1 is actually a pretty good real-world ratio.

If it were me, I’d just change the weights for each class in the loss function to get them more or less equal.
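A rough sketch of how those weights could be computed, assuming the goal is for each class to contribute the same total weight to the loss. The helper name and the toy 18:1 labels are hypothetical; the formula `n / (k * count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight n / (k * count), so every class sums to the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Toy labels at an 18:1 majority:minority ratio (made up for illustration).
labels = [0] * 18 + [1]
weights = balanced_class_weights(labels)

# Per-example weights, ready to pass to a loss that accepts sample weights.
sample_weights = [weights[y] for y in labels]
print(weights)
```

Most libraries accept either per-class or per-example weights, so which form you need depends on the training API you use.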

190m examples isn’t that many either — don’t worry about it. Compute is cheap — it’s ok if it takes more than one machine and/or more time.

hopedallas OP t1_j0ailbn wrote on December 15, 2022 at 6:04 AM

Thanks for the hint. Sorry, I'm not sure what you mean by “try for 10%”?

Far-Butterscotch-436 t1_j0ajlfn wrote on December 15, 2022 at 6:15 AM

When you downsample, try to get at least a 1:10 ratio (minority:majority).
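Something like the following is one way to do it. This is a minimal sketch, assuming the minority class is labeled 1 and the majority 0; the function name, the toy data, and the five seeds are illustrative only. Each seed gives a different majority subset, so you can train one model per subset to form the ensemble.

```python
import random

def downsample_majority(labels, ratio=10, seed=0):
    """Return row indices: all minority rows plus at most `ratio` majority rows per minority row."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), ratio * len(minority))
    return sorted(minority + rng.sample(majority, keep))

# Toy data at roughly 1:100 minority:majority (made up for illustration).
labels = [1] * 50 + [0] * 5000

# One index subset per ensemble member, each at a 1:10 ratio.
subsets = [downsample_majority(labels, ratio=10, seed=s) for s in range(5)]
print([len(s) for s in subsets])
```

Keep in mind that a model trained on downsampled data will overestimate the minority probability, so the predictions may need recalibrating if you care about the actual probabilities.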