I am working on a medical ML project, and my advisor does not want to publish our dataset. I would like to publish our results at a top-tier ML conference. Would this affect us during the review process? If so, are there any ways to mitigate it, such as also including results on separate, publicly available datasets?

Just to note, not publishing the research dataset seems much more common in medical publication venues.

Comments

[deleted] t1_j3yaz0i wrote on January 11, 2023 at 10:11 PM

#1,335,979

You don't have to publish your dataset, but you should:

benchmark on a public dataset
benchmark other approaches on your private dataset

cosentiyes t1_j3yxw3t wrote on January 12, 2023 at 12:42 AM

#1,337,012

Replying to [deleted] (#1,335,979)

Additionally, release model checkpoints and code that can be applied to those public datasets to validate your claims.

newperson77777777 OP t1_j3z0dot wrote on January 12, 2023 at 1:00 AM

#1,337,127

Replying to [deleted] (#1,335,979)

Thanks for the suggestion!

newperson77777777 OP t1_j3z0ep3 wrote on January 12, 2023 at 1:00 AM

#1,337,131

Replying to cosentiyes (#1,337,012)

Thanks for the suggestion!

chatterbox272 t1_j40jame wrote on January 12, 2023 at 9:29 AM

#1,339,713

Not publishing the dataset is becoming less common as we start inching our way slowly to reproducible science. Public code with public data is the simplest form of reproducible research, where we can re-run your experiments with the same code and should get the same result (modulo some extremely low-level randomness or hardware differences that we may not be able to control).
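
As a rough illustration of what "same code, same data, same result" means in practice, here is a minimal sketch that fixes the random seed so two runs give identical numbers. Note that `run_experiment` is a hypothetical stand-in for a real training and evaluation run, and the simulated metric is made up for the example; it is not taken from any actual project.

```python
import random


def run_experiment(seed: int, n_samples: int = 100) -> float:
    """Hypothetical stand-in for a training/evaluation run (an assumption)."""
    # Use a local RNG so no other code can perturb the random state.
    rng = random.Random(seed)
    # Simulate a noisy metric; a real run would train and evaluate a model here.
    scores = [rng.gauss(0.8, 0.05) for _ in range(n_samples)]
    return sum(scores) / len(scores)


first = run_experiment(seed=42)
second = run_experiment(seed=42)
different = run_experiment(seed=7)

print(first == second)     # same seed: the two runs match exactly
print(first == different)  # different seed: the runs almost certainly differ
```

In a real pipeline, libraries such as NumPy and PyTorch keep their own random state and need to be seeded separately, and some GPU operations can still be nondeterministic, which is the low-level randomness mentioned above.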

Not publishing the dataset isn't enough to kill a paper on its own, but it doesn't help. As another commenter said, showing your approach on public datasets and other approaches on your dataset will help, as it gives the rest of the community something that is reproducible.

It's more common in medical venues for a few reasons:

Difficulties around safely releasing medical data, such as proper anonymisation and informed consent.
It is more common in medical science to go for a higher level of reproducibility, where the same or a similar study will be done on a different population (i.e. same method, different data). This is pretty uncommon in ML, where it's hard to get papers accepted in this format.

Insighteous t1_j430uuc wrote on January 12, 2023 at 8:47 PM

#1,344,243

Replying to chatterbox272 (#1,339,713)

Publishing everything is a good thing. At the moment I am trying to reproduce some results from a paper and have to work with "we created X datasets by three methods". And NOWHERE in the paper is it stated what these three methods are. There is also no code.

It is so annoying. I cannot put it into words.

newperson77777777 OP t1_j437zvq wrote on January 12, 2023 at 9:30 PM

#1,344,620

Replying to chatterbox272 (#1,339,713)

Thanks for your perspective.