No_Remote5392 t1_j2nltj5 wrote

Hello, I'm trying to develop a 1D CNN with gene expression as input to predict cancer type.
The problem is that my labels are very unbalanced, and I am wondering what I should do.
Squamous cell carcinoma, NOS: 368
Transitional cell carcinoma: 66
Papillary transistional cell carcinoma: 1
Carcinoma NOS: 1
Papillary transitional cell carcinoma: 1
What should I do with the labels that have only 1 observation?
Thank you very much.

jakderrida t1_j2zy3s2 wrote

I would recommend considering the following strategies to handle imbalanced labels in your dataset:

Oversampling: You can oversample the minority classes by generating synthetic examples or by sampling with replacement from the minority classes. This can help to balance the class distribution and improve the model's performance on the minority classes.

Undersampling: You can undersample the majority classes by randomly sampling a smaller number of examples from the majority classes. This can help to balance the class distribution and prevent the model from being biased towards the majority classes.

Weighted loss: You can assign higher weights to the minority classes in the loss function to give them more influence on the model's learning. This compensates for the imbalance without changing the data itself and can improve the model's performance on the minority classes.

Class-specific metrics: Rather than relying on plain accuracy, you can use metrics that are better suited to evaluating performance on imbalanced datasets, such as the per-class or macro-averaged F1 score, or the AUC (Area Under the Curve) of a precision-recall curve.
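To illustrate why the metric matters, here is a minimal sketch in plain Python. The toy labels and the function names `per_class_f1` and `macro_f1` are made up for illustration and are not from the original data. It compares accuracy with macro-averaged F1 for a model that always predicts the majority class.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data (hypothetical): 8 majority samples, 2 minority samples,
# and a model that always predicts the majority class.
y_true = ["squamous"] * 8 + ["transitional"] * 2
y_pred = ["squamous"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
print(per_class_f1(y_true, y_pred))
```

Accuracy looks respectable here even though the minority class is never predicted, while the macro F1 is pulled down by the zero score on that class.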

In your particular case, you may want to consider oversampling or using weighted loss, since you have only one example for some of the minority classes. It may also be helpful to combine these strategies to achieve the best results.
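As a rough sketch of the two suggestions, assuming the class counts exactly as posted in the question (labels copied verbatim), the snippet below computes inverse-frequency class weights and does simple random oversampling with replacement. The helper names `balanced_weights` and `oversample` are hypothetical. The weight formula is the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is only one of several reasonable choices.

```python
import random

# Class counts as posted in the question (labels copied verbatim).
counts = {
    "Squamous cell carcinoma, NOS": 368,
    "Transitional cell carcinoma": 66,
    "Papillary transistional cell carcinoma": 1,
    "Carcinoma NOS": 1,
    "Papillary transitional cell carcinoma": 1,
}


def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def oversample(samples_by_class, target, seed=0):
    """Pad each class up to `target` samples by drawing with replacement."""
    rng = random.Random(seed)
    out = {}
    for label, samples in samples_by_class.items():
        extra = max(0, target - len(samples))
        out[label] = list(samples) + rng.choices(samples, k=extra)
    return out


weights = balanced_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.4f}")

# Toy stand-in for real expression profiles (hypothetical data).
toy = {"majority": ["a", "b", "c", "d", "e"], "rare": ["z"]}
balanced = oversample(toy, target=5)
print(balanced["rare"])
```

Note that a class with a single observation is just repeated by this kind of oversampling, so it adds no new information. The weights would typically be passed to whatever weighted loss your framework provides, and oversampling is normally applied to the training split only.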

comradeswitch t1_j33zsw6 wrote

Do you have data with no cancer? The categories with only one example are going to require careful treatment, but one-shot learning is an active research area that addresses exactly this problem. Starting there should be helpful.

Also, you have "transistional" and "transitional" listed with 1 each. If that typo is in the original data, you should fix it! Then you'll have 2 examples in that class.
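A minimal sketch of that cleanup, assuming the labels are available as plain strings and that the two spellings really do refer to the same class. The `CORRECTIONS` table and the `normalize` helper are hypothetical names for illustration.

```python
from collections import Counter

# Hypothetical correction table: misspelled label -> intended label.
CORRECTIONS = {
    "Papillary transistional cell carcinoma": "Papillary transitional cell carcinoma",
}


def normalize(label):
    """Collapse stray whitespace, then apply known spelling corrections."""
    cleaned = " ".join(label.split())
    return CORRECTIONS.get(cleaned, cleaned)


# Labels rebuilt from the counts posted in the question.
raw_labels = (
    ["Squamous cell carcinoma, NOS"] * 368
    + ["Transitional cell carcinoma"] * 66
    + ["Papillary transistional cell carcinoma"]
    + ["Carcinoma NOS"]
    + ["Papillary transitional cell carcinoma"]
)

merged = Counter(normalize(label) for label in raw_labels)
for label, n in merged.most_common():
    print(f"{label}: {n}")
```

Printing the counts before and after normalization is a quick way to spot other near-duplicate labels in the full dataset.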

Unfortunately, the answer here may be "acquire more data", because you have many categories relative to your total number of samples, and several of them have only 1 example.
