no_witty_username t1_iyiqgp4 wrote

Data curation is the biggest bottleneck for making AAA quality models.

4

fourcornerclub OP t1_iylpygx wrote

u/no_witty_username and yet the standard in data sourcing still seems to be "let me see what's open source, and what I can scrape from the internet, and then I'll tune the model from there". Makes no sense to me!

2

FlattenLayer t1_iylz5rc wrote

A CTR model is built to predict click-through rates in recommendation systems like TikTok and Google, and it is fed tens of billions of samples from the exposure logs. In this case, the most important thing is keeping the exposure log clean. But that's not easy, because there is a long and complex pipeline from the exposure log to the training samples.
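
To make the "exposure log to training samples" step concrete, here is a minimal sketch of one cleaning stage. It is not any real system's pipeline: the `Exposure` record, its field names, and the `build_samples` helper are all hypothetical, and it assumes that "clean" means dropping rows with missing ids and duplicate log entries before attaching click labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    # Hypothetical schema: one row per item shown to a user.
    request_id: str
    user_id: str
    item_id: str


def build_samples(exposures, clicks):
    """Turn raw exposure rows into (user_id, item_id, label) samples.

    exposures: iterable of Exposure, possibly with duplicates or empty ids.
    clicks: set of (request_id, item_id) pairs that were clicked.
    """
    seen = set()
    samples = []
    for e in exposures:
        key = (e.request_id, e.item_id)
        # Skip malformed rows and repeated log entries for the same impression.
        if not e.user_id or not e.item_id or key in seen:
            continue
        seen.add(key)
        # Label is 1 if this impression was clicked, else 0.
        samples.append((e.user_id, e.item_id, 1 if key in clicks else 0))
    return samples
```

In a real pipeline this would be only one of many stages (delayed click joins, bot filtering, and so on), which is where the difficulty comes from.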

2

cantfindaname2take t1_iym5ipx wrote

Is it though? One thing that keeps coming up is the comparison to human learning. Do humans get clean training samples? I like to think not. Instead, humans learn to separate signal from noise much better, and also learn to model hidden causes.

2

no_witty_username t1_iymgyhr wrote

Humans do get clean data when learning. Here is what bad data looks like for humans: ocular degeneration, deafness, neurological disorders, etc. Children who have sensory deformities or diseases that damage their sensory organs all have severe learning difficulties. The same goes for machines when they are presented with shit data. A machine's ability to understand anything depends on many factors, and one of the most important is presenting it with data it was built to process. Showing a machine a badly cropped image of a person, where the top half is fully missing and only the neck down is visible, and telling it that's what a person is, is bad data just as much as showing any image to a child with ocular degeneration. The image is severely distorted, and while the child's brain is quite capable of proper learning, its sensors (the eyes) are presenting shit data, so no proper learning will occur.

5