BM-is-OP t1_jccin4h wrote on March 15, 2023 at 9:12 PM

When dealing with an imbalanced dataset, I have been taught to oversample on only the train samples and not the entire dataset to avoid overfitting, however this was for structured text based data in pandas using simple models from sklearn. However is this still the case for image based datasets that will be trained on a CNN? I have been trying to oversample only the train data by applying augmentations to the images. However, for some reason I get a train accuracy of 1.0 and a validation accuracy of 0.25 which does not make sense to me on the very first epoch, where the numbers dont really change as the epochs progress which doesn't make sense to me. Should the image augmentations via oversamping be applied to the entire dataset? (fyi I am using PyTorch)

josejo9423 t1_jchq421 wrote on March 16, 2023 at 10:24 PM

I am not quite familiar with deep learning but don’t you have loss function where you can maximize recall precision or AUC? I believe accuracy would not apply in this case since you have imbalanced dataset, also over sampling as it dealed in random forest you are making up new images i don’t know how good is that, why don’t you try under sampling better or weight adjustments?