
iknowjerome OP t1_ivuua0z wrote


Looking forward to hearing what you think. Just to be clear: I'm always reluctant to call one dataset better than another, because it always depends on what you're trying to achieve with it. With Sama-Coco, we were trying to fix some of the misclassification errors where possible, but we also put a significant amount of effort into drawing precise polygons around the objects of interest, because of experiments we are currently running. And, of course, we wanted to capture as many instances of the COCO classes as possible. This resulted in a dataset with close to 25% more object instances than the original COCO 2017 dataset. But that's not to say we solved all the "errors" in COCO. :)
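
To make that last point concrete, here is a minimal sketch (mine, not the poster's) of how the instance-count difference could be checked with pycocotools, assuming both annotation files are available locally; the file paths are placeholders, not the actual release file names.

```python
# Sketch: compare instance counts between two COCO-format annotation files,
# e.g. original COCO 2017 vs. a relabeled set such as Sama-Coco.
from collections import Counter

from pycocotools.coco import COCO

coco_orig = COCO("annotations/instances_train2017.json")   # original COCO 2017
coco_relab = COCO("annotations/sama_coco_train.json")       # hypothetical path


def instances_per_class(coco):
    """Count annotated instances for each category name."""
    counts = Counter()
    for ann in coco.dataset["annotations"]:
        counts[coco.cats[ann["category_id"]]["name"]] += 1
    return counts


orig = instances_per_class(coco_orig)
relab = instances_per_class(coco_relab)

total_orig, total_relab = sum(orig.values()), sum(relab.values())
print(f"Total instances: {total_orig} -> {total_relab} "
      f"({100 * (total_relab - total_orig) / total_orig:+.1f}%)")

# Classes with the largest change in instance count
diffs = {c: relab[c] - orig.get(c, 0) for c in relab}
for name, delta in sorted(diffs.items(), key=lambda kv: -abs(kv[1]))[:10]:
    print(f"{name:>20}: {delta:+d}")
```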


iknowjerome OP t1_ivu165k wrote

It really depends on what you are trying to achieve, what your budget is, and where you are in your model development cycle.
Nevertheless, I would recommend starting in self-service mode with the simplest tool you can find. This might be something like CVAT, though there are a number of other options (paid, free, SaaS, etc.) out there that a simple Google search will return. Once you're ready to scale, you might want to consider handing off your annotations to a specialized company like Sama. And yes, we also do 3D annotations. :)
(disclaimer: I work for Sama)


iknowjerome OP t1_ivtti9x wrote

Every dataset has errors and inconsistencies. It is true that some have more than others, but what really matters is how that affects the end goal. Sometimes, the level of inconsistency doesn't impact model performance as much as one would expect. In other cases, it is the main cause of poor model performance, at least in one area (for instance, for a specific set of classes). I totally agree with you that companies that succeed in putting AI models into production and maintaining them pay particular attention to the quality of the datasets created for training and testing purposes.
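
One way to see whether a specific set of classes is what's hurting the overall numbers is to break a standard COCO evaluation down per category. A rough sketch (my own illustration, with placeholder file names) using pycocotools:

```python
# Sketch: per-class AP from a standard COCOeval run, to spot categories where
# label noise or inconsistency may be dragging performance down.
import numpy as np

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("detections_val2017.json")   # model output in COCO results format

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()

# eval["precision"] has shape [IoU thresholds, recall levels, classes, areas, max dets]
precision = ev.eval["precision"]
for k, cat_id in enumerate(ev.params.catIds):
    name = coco_gt.cats[cat_id]["name"]
    p = precision[:, :, k, 0, -1]                       # all areas, top-100 detections
    ap = np.mean(p[p > -1]) if (p > -1).any() else float("nan")
    print(f"{name:>20}: AP = {ap:.3f}")
```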
