
cjschnyder t1_ixcqy7t wrote

It sounds like the big assumption here is that scraping = taking literally anything and everything. You're correct in the sense that it's semi-curated, but my guess is that no AI generator is doing that with teams of people each looking across the web and loading images into a dataset. It's doing it with bots. And those bots are scraping the web.

What they pull definitely goes through a filtering process, but just because it's a "semi-curated dataset" doesn't mean that data wasn't scraped. In my day job I build analytics data pipelines for web traffic, so I'm familiar with how vast amounts of data are aggregated and put together.

1

olemeloART t1_ixcti51 wrote

The point was that the LAION dataset, on which the model was trained, is static. It was created once and isn't continuously updated - a snapshot of a point in time. The curation was also automated, but biased toward the aesthetic score ("how likely would a human find this pretty"). That's why so much art was captured in it. So, if you put an artwork out on the internet today, it will not be used by the current crop of AI art generators. Whether this is "stealing" is a matter of opinion, as all of those images were public and went into a common "melting pot". I was only specifically addressing the "scraping" statement, because there is a lot of misinformation and confusion around that - some people literally seem to think there are bots hiding in the shadows waiting to snatch their works.

Mind you, that is different from someone downloading a specific artist's collection from Artstation and purposely training a checkpoint to specifically imitate that artist. That's shitty and gross, and in my mind definitely amounts to theft and plagiarism, and many in the AI art community agree with me.

0

cjschnyder t1_ixd1qtx wrote

So that's the first time a specific dataset, or rather a dataset manufacturer, was mentioned. While that one may be static, others may not be. It should also be noted that LAION is making other datasets, so while ONE of their sets might be static and complete, they are continuously making updated sets.

I would also hesitate to say it's a "matter of opinion" in the stealing department. It is stealing. It's so commonplace on the internet today that it isn't really noticed, but just because something is in a public space does not mean it is public domain.

What is the difference between what you posed, someone taking a specific artist's work to make a dataset to copy them, and a dataset that includes enough of someone's art to copy them and is then used in such a manner?

1