Submitted by von-hust t3_11jyrfj in MachineLearning

Using our new method, we found that at least 25% of the LAION-2B-en dataset consists of near duplicates (with respect to the image data). You may find the deduplicated set and the code to verify the result here:

https://github.com/ryanwebster90/snip-dedup
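
(For intuition, here is a minimal sketch of feature-space near-duplicate detection over precomputed CLIP image embeddings. This is not the snip-dedup code; the feature file name and the similarity cutoff are assumptions for illustration only.)

    import numpy as np
    import faiss

    # Assumed input: CLIP image features for the dataset, precomputed and
    # saved as an (N, d) float array.
    feats = np.load("clip_features.npy").astype("float32")
    faiss.normalize_L2(feats)

    index = faiss.IndexFlatIP(feats.shape[1])   # exact inner-product (cosine) search
    index.add(feats)

    # k=2: the first hit is the image itself, the second its nearest neighbor.
    sims, ids = index.search(feats, 2)
    THRESH = 0.96                               # assumed near-duplicate cutoff
    dup_mask = sims[:, 1] > THRESH
    print(f"{dup_mask.mean():.1%} of images have a near duplicate")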

In addition, using the duplicate histograms, we found a handful of "verbatim copied" images generated by Stable Diffusion, with far fewer resources than DeepMind (our process runs on a standard computer), like the following:

[Image: Stable Diffusion verbatim copy]

Disclaimer: This is a fairly new result; we'll publish once we've done more verification. Take it with a grain of salt. You are welcome to explore and verify the deduplicated set we've released.

375

Comments

JrdnRgrs t1_jb53xvx wrote

Very interesting, so what is the implication for stable diffusion?

Does this mean that if the dataset were corrected for these duplicated images, a model retrained on the corrected dataset would be of even "higher quality"? Can't wait.

75

AuspiciousApple t1_jb557q8 wrote

Not obviously so.

First, de-duplicating text data didn't help much in the cramming paper. Second, even if the images are duplicates, the captions might be different so you still learn more than if you only had one copy of each image.

Finally, even with exact copies of text and image, it would just weigh those images more heavily than the rest - which could harm performance, not matter at all, or even help performance (for instance if those images tend to be higher quality/more interesting/etc.)
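
(A toy illustration of the reweighting point: with a sum-reduced loss, k exact copies of a sample contribute exactly k times its loss and gradient. The numbers here are made up.)

    import torch

    pred = torch.randn(3, 4)                    # toy model outputs
    target = torch.randn(3, 4)
    counts = torch.tensor([5, 1, 1])            # sample 0 appears 5 times in the data

    # (a) literally keep the 5 duplicate rows
    loss_dup = ((pred.repeat_interleave(counts, dim=0)
                 - target.repeat_interleave(counts, dim=0)) ** 2).sum()

    # (b) keep one copy, but weight its loss by its duplicate count
    loss_weighted = (counts[:, None] * (pred - target) ** 2).sum()

    print(torch.allclose(loss_dup, loss_weighted))  # True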

98

AuspiciousApple t1_jb6gzcd wrote

Can't wait to see this replicated!

6

astrange t1_jb6hn1a wrote

The Stable Diffusion team claims they also dedupe along these lines, in SD 2.x at least.

Though, deduplicating images feels incomplete to me - what if the same thing appears in different images? That's kind of what you want, but also not what you want.

11

PacmanIncarnate t1_jb5vofk wrote

You sell it short. You could deduplicate while merging the associated text (see the sketch below), which solves half your problem. And the goal of base SD is to be as generic as possible, so there's little value in allowing duplicates to impact the weights in most situations, and there's a significant downside of overfitting. Fine-tuning then allows more customized models to choose where weights are adjusted.

The only downside is if the dataset ends up with fewer quality images overall because 100000 legit painting dups got removed, leaving a larger percentage of memes and other junk.
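
(A rough sketch of the caption-merge idea. The records and grouping key here are hypothetical; a real pipeline would group by a near-duplicate index rather than an exact key.)

    from collections import defaultdict

    # Toy records of (image_key, caption). In practice the key would come from
    # a near-duplicate grouping (perceptual hash or feature clustering), not an
    # exact match.
    records = [
        ("img_a", "a painting of a lighthouse"),
        ("img_a", "lighthouse at sunset, oil on canvas"),
        ("img_b", "a cat on a sofa"),
    ]

    merged = defaultdict(list)
    for key, caption in records:
        merged[key].append(caption)

    # One training row per unique image; captions joined (or sampled per epoch).
    deduped = {key: " | ".join(caps) for key, caps in merged.items()}
    print(deduped["img_a"])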

9

midasp t1_jb5p7v0 wrote

In my experience, all of these issues will occur. It's going to vary from model to model. To be certain, you still have to make an objective test to determine whether the impact is positive or negative, and measure the significance of the impact.

2

TikiTDO t1_jb5f4p2 wrote

Honestly, the biggest problem with the dataset isn't the duplicates. It's the fact that most of the annotations are kinda crap. You know the saying that an image is worth a thousand words. That may be too much for SD, but it will happily chew on 50-75 tokens. SD really wants a bunch of content it can parse in order to understand concepts and how those concepts relate to each other, but most LAION annotations are short and simple.

From my experience, refining the model with a few hundred images with proper long-form annotations describing what you want can go a long way, even for complex things like hands.

25

Jurph t1_jb7kym3 wrote

I wonder whether the author of AUTOMATIC1111 could allow people to opt-in and send their training folder(s) of image-caption pairs to a central repository for use in a mega fine-tuning data set.

3

zaptrem t1_jb8i4cr wrote

> the author of AUTOMATIC1111

…you mean AUTOMATIC1111? That’s their name.

2

Jurph t1_jb9e3cw wrote

Yes, but if you read to the end of the book, you find out that actually, the Doctor is the real monster.

2

alushamir t1_jb9fdgy wrote

I agree that mislabels are also an issue.
You can see some examples in this video:
https://www.youtube.com/watch?v=s6qamoFzyis&t=7s

We have used fastdup to analyse LAION-400M.

1

TikiTDO t1_jb9thji wrote

That's interesting. More similarity than I expected.

That said, with my workflow I tend not to worry too much about dupes, since they are likely to end up with different labels focusing on different things. On the other hand, my approach also requires a lot more manual steps and intervention, so I can definitely see how such a dedupe may help with the current setup.

In case anyone's interested, here's what I find works for me:

  1. First I started with a few hundred manually annotated images. I then used those to fine-tune a version of BLIP VQA.

  2. Whenever I have new images, I have a script that will interrogate VQA for details about the picture (things like camera angle, number of people, the focus of the picture, and whether it satisfies any extra training criteria I have), and then get a gradCAM of key elements I may want to focus on. This generates a JSON file with a lot of image information (a rough sketch of this step is at the end of this comment).

  3. I can then use the JSON file along with a language model to generate multiple information dense prompts that should correspond with the image.

  4. Based on my training goals at the time, I send an image into a generic approval queue where I can validate a few hundred images a day before sending it to my generic training location. In addition, I may also send it into a specialised queue if I'm trying to train up a specific concept or idea. For example, I'm working on hands at the moment. It can still obviously use some more work (it's still not sure what all the fingers are called and how they move), but there's no way I'd be able to get something like that out of vanilla SD 2.1. Note that it's also pretty important to have a good variety of related concepts in a specialised set; for example, for hands you want old hands, young hands, men's hands, women's hands, hand bones, hand muscles, pictures of people practising drawing hands, and pictures of people doing things with hands, all annotated with some connecting terms, but also adding additional context that might not be available in other places.

  5. I alternate a small number of higher-LR training cycles on new concepts with a lower batch size, and then a long low-LR run over the larger training set with a higher batch size. This way I can constantly validate whether it's learning the ideas I want, and then reinforce those ideas. This has the secondary bonus that once I've validated an individual concept I generally won't have to worry about it if I ever restart training, and even if I do, I can always pick out a few hundred images to refine things.

It's obviously a much slower process than just scraping the internet for a bunch of images and shoving them into CLIP, but it's reliable enough that I have tens of thousands of images at this point, which gets me some really nice results.

Incidentally, with the gradCAM data I can also use higher-res pictures, which I can subdivide into zoomed-in portions for studying particular topics.
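
(A minimal sketch of what such a VQA interrogation step could look like, using the stock Salesforce/blip-vqa-base checkpoint as a stand-in for the fine-tuned model; the questions and JSON fields are illustrative, not the actual script.)

    import json
    from PIL import Image
    from transformers import BlipProcessor, BlipForQuestionAnswering

    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

    # Illustrative attributes to extract; a real setup would use its own list.
    questions = {
        "camera_angle": "What is the camera angle of this photo?",
        "num_people": "How many people are in this image?",
        "focus": "What is the main subject of this image?",
    }

    def interrogate(path: str) -> dict:
        image = Image.open(path).convert("RGB")
        record = {"file": path}
        for field, question in questions.items():
            inputs = processor(images=image, text=question, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=10)
            record[field] = processor.decode(out[0], skip_special_tokens=True)
        return record

    print(json.dumps(interrogate("example.jpg"), indent=2))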

2

von-hust OP t1_jb55rhp wrote

I think the first version of SD was trained with duplicates, and they made some effort to remove duplicates for training v2 (people on Discord are saying pHash or something similar). I suppose it'd be interesting to see if the same prompts still produce verbatim copies.
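
(For reference, pHash-style dedup is simple enough to sketch. This is an assumption about how such a dedup pass might work, not Stability's actual pipeline; it needs Pillow and the imagehash package.)

    from pathlib import Path
    from PIL import Image
    import imagehash

    seen = {}
    duplicates = []
    for path in Path("images").glob("*.jpg"):
        h = imagehash.phash(Image.open(path))   # 64-bit perceptual hash
        # Treat a small Hamming distance as a near duplicate.
        match = next((p for p, h2 in seen.items() if h - h2 <= 4), None)
        if match is not None:
            duplicates.append((path, match))
        else:
            seen[path] = h
    print(f"{len(duplicates)} near-duplicate pairs found")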

11

PrimaCora t1_jb5rojz wrote

The quality issues came more from the fact that they square-cropped everything. A photo of a guy wearing a crown isn't great to learn from when he's looking like King Charles I.

The duplication just leads to overfitting. If you train a model on one picture, it's going to make that picture pretty dang good. If you train on millions and have a dozen duplicates of something, it's going to favor those duplicates pretty heavily. And in other combinations, like a duplicate photo that has the unique keyword Zhanfuur, that photo would be the only thing it could make if you just input that keyword.

If they retrain with the new bucketing, it should alleviate the crop issue. Deduplication would help reduce overfitting. Both together should lead to better quality, size variation, and variety of text input (hopefully, for that last one).

4

LetterRip t1_jb5bgvj wrote

Greatly appreciated. You might run it on the aesthetic subset and on 5B also.

15

von-hust OP t1_jb5ef3f wrote

I would, but I don't have the CLIP features. I'll release some training code so that it's possible for others to train their own indices. The method should scale to 5B, even on a single node; you'll just need more RAM.
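
(A hypothetical sketch of one way to keep the memory footprint manageable: a compressed FAISS IVF-PQ index over precomputed CLIP features. This is not the snip-dedup training code; the feature file and parameters are assumptions.)

    import numpy as np
    import faiss

    # Assumed input: precomputed CLIP image features as an (N, d) float array.
    feats = np.load("clip_features.npy").astype("float32")
    faiss.normalize_L2(feats)
    d = feats.shape[1]                        # e.g. 768 for CLIP ViT-L/14

    nlist, m, nbits = 4096, 64, 8             # ~64 bytes per compressed vector
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(feats[:500_000])              # train codebooks on a subsample
    index.add(feats)

    index.nprobe = 16
    dists, ids = index.search(feats[:1000], 2)  # tiny distance => likely near duplicate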

7

Albino_Jackets t1_jb5cq6x wrote

The duplicates aren't perfect duplicates and are added to create more robust model results, like how an image of a giraffe rotated 90 degrees is still a giraffe even though the pixel patterns are no longer the same. The same thing applies to the Stallone pic: the noise and errors help the model deal with suboptimal image quality.

13

von-hust OP t1_jb5fjqo wrote

The Stallone pic is generated by SD, unless I'm misunderstanding something. There are false positives, but they shouldn't be "rotated 90 degrees" as you say. The dups mostly match raw CLIP-feature duplicates.

15

InterlocutorX t1_jb6iw7y wrote

>The duplicates aren't perfect duplicates and are added to create more robust model results

This is incorrect, and anyone who looks at the LAION-5B aesthetic set can tell that pretty easily. It's got easily viewable identical copies of images.

https://imgur.com/a/Mg2xZcT

And the noisy Stallone was an SD image, not an image from the dataset.

[I looked at the images it has for Henry Cavill, and 6 out of 24 are the exact same Witcher promo shot, which is a quarter of the images it has of Cavill.]

Feel free to look for yourself:

https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/

13

SaifKhayoon t1_jb54pnw wrote

Is this why some checkpoints / safetensors make for better results than stable diffusion's 1.5 and 2.1 weights?

Was LAION-2B used to train the base model shared by all other "models"/weights?

5

True_Toe_8953 t1_jb72i4c wrote

> Is this why some checkpoints / safetensors make for better results than stable diffusion's 1.5 and 2.1 weights?

I think this is because of a tradeoff between stylistic range and quality. Your model is only so big, so the more styles it covers, the fewer parameters are available for each.

The base SD model is capable of a very wide range of styles, including a lot of abstract styles that no one ever uses. Most fine-tuned models only support a handful of popular styles (usually anime, digital paintings, or photographs) and other styles are merged with the main style and lost.

MidJourney has a wider range than most fine-tuned SD models but appears to be making the same tradeoff.

2

[deleted] t1_jb5iei8 wrote

[deleted]

1

von-hust OP t1_jb5l3r0 wrote

Well, just to be clear, these are actually near duplicates (i.e., images should only differ up to compression, small artifacts, or even imperceptible differences). I'll try to be more explicit about what I mean by duplicate in the GitHub repo.

9

clueless1245 t1_jb5khy8 wrote

You want this done in a controlled, methodical, and documented manner. Earlier research showed SD 1.5 verbatim copying every line and minute contour of the wood grain in a specific copyrighted "wooden table" background, which turned out after training to be repeated tens of thousands of times in the input dataset (because websites selling phone cases photoshopped phones onto it).

7

enjakuro t1_jb8thcg wrote

Yeah but copying data in a corpus has yielded better results, at least in NLP translation tasks. It's always good to know what's in your data though. Just saying that it might not be a bad thing.

1

graphicteadatasci t1_jb9afw5 wrote

Really? Because copying all your data once is the same as running your dataset twice per epoch instead of once. That doesn't sound right. Unless your test data is drawn from the same dataset and duplication happens before splitting, in which case you would certainly expect metric improvements. Or was this a case of duplicating rare text, in which case it's the opposite of having duplicate images in LAION?

1

enjakuro t1_jb9l86l wrote

Ah, it was the rare-text thing, I believe. Now that I'm more awake, I also realize that they copied the source to the target, meaning the same language appears as both source and target, while keeping the rest bilingual. If I recall correctly, you can have up to 50% copied data, which makes the training set much bigger. I guess if the images aren't exactly the same, this would have the same effect. Basically training a language model.

2

graphicteadatasci t1_jbdt33t wrote

Yeah, because there are some very nice results on classification models where they removed data that doesn't contribute to learning, and it made training faster and more accurate. But of course I can't remember at all what the paper was called.

1

enjakuro t1_jbf0yco wrote

Same hahaha, would've linked it otherwise xD

1