Submitted by iknowjerome t3_yrfzcf in MachineLearning

Hi everybody, here is the complete relabelling of the COCO 2017 dataset for segmentation. This is all free of charge, un-gated, and was done by Sama, a labelling company for CV data. The dataset is available on the Sama website under a Creative Commons license: https://www.sama.com/sama-coco-dataset/. I would also love to hear any feedback.
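If you want to poke around in the annotations right away, something like this minimal sketch should work, assuming you point pycocotools at whichever of the downloaded JSON files you want to inspect (the file path below is just a placeholder, not the actual file name):

```python
# Minimal sketch: browsing the Sama-COCO annotations with pycocotools,
# assuming they use the standard COCO JSON format.
# "annotations/sama_coco_val.json" is a placeholder path.
from pycocotools.coco import COCO

coco = COCO("annotations/sama_coco_val.json")
person_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_ids)
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1], catIds=person_ids))
print(f"{len(anns)} 'person' instances annotated in the first matching image")
```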
Disclaimer: I work for Sama

266

Comments


Mozillah0096 t1_ivtqxxc wrote

Thanks for the effort. I read in an article on Medium that the COCO dataset has a lot of errors in it. Is that true?

17

that_username__taken t1_ivts6r0 wrote

Yup, there are a lot of inconsistencies there. Engineers never learn about the importance of annotation/data quality; at best they skim over the topic. In reality, successful companies spend most of their budget on data annotation.

22

iknowjerome OP t1_ivtti9x wrote

Every dataset has errors and inconsistencies. It is true that some have more than others, but what really matters is how that affects the end goal. Sometimes, the level of inconsistency doesn't impact model performance as much as one would expect. In other cases, it is the main cause of poor model performance, at least in one area (for instance, for a specific set of classes). I totally agree with you that companies that succeed in putting and maintaining AI models in production pay particular attention to the quality of the datasets created for training and testing purposes.

12

that_username__taken t1_ivttp7n wrote

It really depends on the size and budget of the project. If both are large enough, you should outsource this task. SuperAnnotate has a great platform (free for academics; I used it), and according to G2 they are the highest rated. A friend of mine told me that if you want automotive data, Scale is more specialized there.

2

Ulfgardleo t1_ivtuenl wrote

It is such a rare opportunity to get a better-labelled dataset. If I were still working in the field, I would use this dataset to evaluate noisy-label techniques.

40

iknowjerome OP t1_ivtw0xs wrote

The trick is not to wait for the end of the cycle to make the appropriate adjustments. And there are now a number of solutions on the market that help with understanding and visualizing your image/video data and labels.
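Even without a dedicated tool, a quick sketch like the one below is often enough to eyeball the labels on a single image. It assumes standard COCO-format annotations and images on disk; the paths are placeholders:

```python
# Quick-and-dirty label inspection with pycocotools + matplotlib.
# Paths are placeholders; assumes standard COCO-format annotations.
import matplotlib.pyplot as plt
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")
img_info = coco.loadImgs(coco.getImgIds()[0])[0]
image = Image.open(f"val2017/{img_info['file_name']}")

plt.imshow(image)
coco.showAnns(coco.loadAnns(coco.getAnnIds(imgIds=img_info["id"])))
plt.axis("off")
plt.show()
```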

5

DigThatData t1_ivtx4y1 wrote

The summary reporting you offer describes some of the net differences, but I'd be interested in seeing numbers describing the distribution of what your team considered to be incorrect labels in the original dataset.
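Even a rough per-category diff of instance counts between the original and relabelled files would be a useful start, along the lines of the sketch below (paths are placeholders, the two files are assumed to share COCO category ids, and count deltas are of course only a proxy for actual corrections):

```python
# Rough per-category comparison of the original COCO 2017 annotations
# and the relabelled file. Paths are placeholders; assumes both files
# follow the COCO JSON format and share category ids.
from pycocotools.coco import COCO

orig = COCO("annotations/instances_val2017.json")
sama = COCO("annotations/sama_coco_val.json")

for cat in orig.loadCats(orig.getCatIds()):
    n_orig = len(orig.getAnnIds(catIds=[cat["id"]]))
    n_sama = len(sama.getAnnIds(catIds=[cat["id"]]))
    print(f"{cat['name']:>15}: {n_orig:6d} -> {n_sama:6d} ({n_sama - n_orig:+d})")
```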

3

iknowjerome OP t1_ivtytgc wrote

That's a great suggestion. We will eventually post more details about this. It will make more sense to share them alongside the results of some data quality experiments we are currently running. Stay tuned! :)

2

iknowjerome OP t1_ivu165k wrote

It really depends on what you are trying to achieve, what your budget is, and where you are in your model development cycle.
Nevertheless, I would recommend starting in self-service mode with the simplest tool you can find. This might be something like CVAT, though there are a number of other options (paid, free, SaaS, etc.) out there that a simple Google search will return. Once you're ready to scale, you might want to consider handing off your annotations to a specialized company like Sama. And yes, we also do 3D annotations. :)
(disclaimer: I work for Sama)

3

Gaudy_ t1_ivuhtyx wrote

Thanks a lot. It feels like every 6 months some company writes an article about how they fixed a great number of annotation errors in various public datasets, yet they always fail to release the result. Not so this time; looking forward to testing it out.

34

iknowjerome OP t1_ivuua0z wrote


Looking forward to hearing what you think. Just to be clear, I'm always reluctant to call one dataset better than another because it always depends on what you're trying to achieve with it. With Sama-COCO, we tried to fix some of the misclassification errors where possible, but we also put a significant amount of effort into drawing precise polygons around the objects of interest because of experiments we are currently running. And, of course, we wanted to capture as many instances of the COCO classes as possible. This resulted in a dataset with close to 25% more object instances than the original COCO 2017 dataset. But that's not to say we solved all "errors" in COCO. :)

24

adamzhangchao t1_ivwhjj3 wrote

Great work! I am curious, though: how did you find the erroneous labels and fix them?

3

canyonkeeper t1_ivxnppt wrote

Can it detect the bike features though?

1

i_sanitize_my_hands t1_ivxozuh wrote

Thanks, Sama! I had the opportunity to work with you all in a professional capacity in the past. It was great :)

3

iknowjerome OP t1_ivylty6 wrote

We had our annotation associates look at every image and its labels and correct any mistakes they found. In some cases, they started from the existing annotations; in others, they decided to start from scratch.

3