Submitted by quantifiedvagabond t3_y992gf in MachineLearning

My ML team is looking to buy/source a dataset of videos of people performing certain niche tasks to train a business-critical model. From our research, it seems like Scale AI, Toloka, Appen, Defined AI, and Clickworker offer solutions in that space.

Has anyone used any of these before and would recommend (or recommend avoiding) them? Are we better off just running the crowdsourcing of the data in-house?

13

Comments

suflaj t1_it4no84 wrote

If you have the means to record the dataset in-house, that's the best way. You can talk directly to the annotators and the subjects, you can make sure the data cannot be redistributed unless someone leaks it, and you will have a better grasp of the privacy policies involved. It is also likely to be cheaper.

With external data it is almost impossible to prove you are allowed to have it, and this data can then just be resold to someone else, potentially a competitor.

8

seiqooq t1_it4vr7f wrote

Curious about others' experiences as well. We opted to go the data-capturing infra route so I’m in the other boat.

3

DigThatData t1_it59gaq wrote

It depends on the data. Considering that the kind of data you're working with (video) is one of the least mature media in the analytics industry, it might be both significantly more cost-effective and more likely to produce a high-quality result if you buy the dataset. That said, if you were thinking of spinning up an in-house data annotation resource, this might be a good opportunity to go that route, and I'm sure the ML team wouldn't have any complaints if you gave them a persistent data-generating resource like that.

3

seiqooq t1_it7ss31 wrote

We’re in surveillance and so vertically integrating was (fortunately) an option for us. It was certainly worth it since our org had the means, but the build vs buy trade-off is always a thing.

2