Submitted by [deleted] t3_11yccp8 in MachineLearning
Alternative_iggy t1_jd7bgm4 wrote
I don’t typically deal in breast cancer histopathology models but I do work with medical imaging full time as my day job - if I’m reading this correctly they use the Wisconsin Breast Cancer dataset (originally released in 1995!: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic))
First question - have breast cancer histopathology evaluation techniques changed since 1995? Checking out a quick lit review - yes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8642363/#Sec2
So is this dataset likely to be useful today? Well… we don’t know the demographics of the population, and we don’t know the split of tumor severity in the population (this could be all easy cancers, and so not very generalizable or useful for what someone sees day to day!). The preprocessing would also require someone to take the digital image and extract all these features, which honestly probably takes about as long as the pathologist just looking at the image and evaluating it. Also, it sort of looks like they just used the features that came with the dataset…
They report 100% accuracy on the training set and 99% on the testing set - great, but theoretically any sufficiently flexible model can get to 100% accuracy on the training set, so I almost always ignore this number completely when papers report it, unless there is a substantial drop-off between training and testing or vice versa. But next question - are these results in line with similar published results on this particular dataset? Here’s an arXiv paper from 2019 with similar results: https://arxiv.org/pdf/1902.03825.pdf
So nothing new here… it seems it’s possible and has been previously published to get 99% accuracy on this dataset…
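On the training accuracy point, here is a toy sketch of why I ignore that number. It is not the paper's model or data: it assumes a plain 1-nearest-neighbor classifier and purely random features and labels (all names here are made up for illustration). The model memorizes the training set perfectly even though there is nothing to learn, so only the held-out number tells you anything.

```python
import random

# Illustrative only: synthetic random data, not the Wisconsin dataset.
rng = random.Random(0)


def make_random_data(n, n_features=5):
    """Random features with coin-flip labels, so there is no real signal."""
    xs = [tuple(rng.random() for _ in range(n_features)) for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


def nearest_neighbor_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point` (1-NN)."""
    best_index = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    return train_y[best_index]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the 1-NN model labels correctly."""
    correct = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(xs, ys)
    )
    return correct / len(xs)


train_x, train_y = make_random_data(200)
test_x, test_y = make_random_data(1000)

# Each training point is its own nearest neighbor, so the model memorizes it.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# On fresh random points the labels are unpredictable by construction.
test_acc = accuracy(train_x, train_y, test_x, test_y)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The training accuracy is perfect by construction, while the test accuracy should hover around chance. That gap, not the training number itself, is the thing worth looking at in a paper.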
Next question - is Procedia a good journal? It publishes conference proceedings and has an impact factor of 0.8 (kind of low). It’s unlikely this went through a rigorous peer review process, although I don’t like to throw out conference journals outright, since some of the big cool clinical trial results and huge breakthroughs get dumped in places like that. But in this case it seems like two researchers trying to get a paper out and not necessarily a groundbreaking discovery (people have published on this dataset before and gotten 99% with random forest!).
Final conclusion: meh.