
TikiTDO t1_jb5f4p2 wrote

Honestly, the biggest problem with the dataset isn't the duplicates. It's the fact that most of the annotations are kinda crap. You know the saying: an image is worth a thousand words. That may be too much for SD, but it will happily chew on 50-75 tokens. SD really wants a lot of content it can parse in order to understand concepts and how those concepts relate to each other, but most LAION annotations are short and simple.
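
If you want to see how much of that token budget a caption actually uses, it's a couple of lines with the Hugging Face tokenizer (a minimal sketch, assuming the SD 1.x CLIP text encoder; the caption string is just a placeholder):

```python
from transformers import CLIPTokenizer

# SD 1.x uses CLIP ViT-L/14's text encoder, which has a 77-token
# context window (75 usable tokens plus BOS/EOS); longer captions
# are simply truncated.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

caption = "a photo of a dog"  # placeholder LAION-style caption
ids = tokenizer(caption).input_ids
print(len(ids))  # count includes the BOS and EOS tokens
```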

From my experience, refining the model with a few hundred images with proper long-form annotations describing what you want can go a long way, even for complex things like hands.

25

Jurph t1_jb7kym3 wrote

I wonder whether the author of AUTOMATIC1111 could allow people to opt in and send their training folder(s) of image-caption pairs to a central repository for use in a mega fine-tuning dataset.

3

zaptrem t1_jb8i4cr wrote

> the author of AUTOMATIC1111

…you mean AUTOMATIC1111? That’s their name.

2

Jurph t1_jb9e3cw wrote

Yes, but if you read to the end of the book, you find out that actually, the Doctor is the real monster.

2

alushamir t1_jb9fdgy wrote

I agree that mislabels are also an issue.
You can see some examples in this video:
https://www.youtube.com/watch?v=s6qamoFzyis&t=7s

We used fastdup to analyse LAION-400M.
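
The basic run is only a few lines; here is a rough sketch of what that looks like (directory names are placeholders, using fastdup's v1 `create`/`run` API):

```python
import fastdup

# point fastdup at a folder of images; it computes embeddings and
# flags near-duplicate pairs and outliers
fd = fastdup.create(work_dir="fastdup_out", input_dir="laion_images/")
fd.run()

# table of image pairs ranked by embedding similarity
print(fd.similarity().head())
```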

1

TikiTDO t1_jb9thji wrote

That's interesting. More similarity than I expected.

That said, with my workflow I tend not to worry too much about dupes, since they're likely to end up with different labels focusing on different things. My approach also requires a lot more manual steps and intervention, though, so I can definitely see how a dedupe like this could help with the current setup.

In case anyone's interested, here's what I find works for me:

  1. First I started with a few hundred manually annotated images. I then used those to fine-tune a version of BLIP VQA.

  2. Whenever I have new images, I have a script that interrogates VQA for details about the picture (things like camera angle, number of people, the focus of the picture, and whether it satisfies any extra training criteria I have), and then gets a gradCAM of key elements I may want to focus on. This generates a JSON file with a lot of image information (see the first sketch after this list).

  3. I can then feed that JSON file to a language model to generate multiple information-dense prompts that should correspond to the image (second sketch below).

  4. Based on my training goals at the time, I send each image into a generic approval queue, where I can validate a few hundred images a day before sending them to my generic training location. I may also send an image into a specialised queue if I'm trying to train up a specific concept or idea; for example, I'm working on hands at the moment. The model can still obviously use some more work (it's still not sure what all the fingers are called and how they move), but there's no way I'd get something like that out of vanilla SD 2.1. Note that it's also pretty important to have a good variety of related concepts in a specialised set. For hands, you want old hands, young hands, men's hands, women's hands, hand bones, hand muscles, pictures of people practising drawing hands, and pictures of people doing things with hands, all annotated with some connecting terms but also adding context that might not be available elsewhere.

  5. I alternate a small number of higher-LR training cycles on new concepts with a lower batch size, and a long low-LR run over the larger training set with a higher batch size (see the schedule sketch after this list). This way I can constantly validate that the model is learning the ideas I want, and then reinforce those ideas. This has the secondary bonus that once I've validated an individual concept, I generally won't have to worry about it if I ever restart training, and even if I do, I can always pick out a few hundred images to refine things.
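
For step 2, the interrogation loop looks roughly like this (a sketch only: the questions and file names are illustrative, the stock Salesforce BLIP VQA checkpoint stands in for my fine-tuned one, and the gradCAM pass is left out):

```python
import json
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# stock weights stand in for the fine-tuned checkpoint from step 1
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

QUESTIONS = [
    "What is the camera angle?",
    "How many people are in the picture?",
    "What is the main focus of the picture?",
]

def interrogate(path):
    image = Image.open(path).convert("RGB")
    answers = {}
    for question in QUESTIONS:
        inputs = processor(image, question, return_tensors="pt")
        out = model.generate(**inputs)
        answers[question] = processor.decode(out[0], skip_special_tokens=True)
    return answers

with open("image_info.json", "w") as f:
    json.dump(interrogate("new_image.png"), f, indent=2)
```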
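
For step 3, something along these lines (the checkpoint is a stand-in to keep the sketch self-contained; an instruction-tuned model does much better than base GPT-2 here):

```python
import json
from transformers import pipeline

# placeholder checkpoint; swap in whatever language model you prefer
generator = pipeline("text-generation", model="gpt2")

with open("image_info.json") as f:
    info = json.load(f)
facts = "; ".join(f"{q} {a}" for q, a in info.items())

# sample several differently worded, information-dense captions
seed = f"Facts about an image: {facts}. A detailed caption for this image:"
for candidate in generator(seed, max_new_tokens=60,
                           num_return_sequences=3, do_sample=True):
    print(candidate["generated_text"])
```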
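
And the alternating schedule from step 5, as pseudocode (every number is illustrative; `train` stands in for whatever fine-tuning script you use):

```python
def train(dataset_dir, lr, batch_size, steps):
    # stand-in for your fine-tuning entry point
    # (e.g. a diffusers text-to-image training loop)
    ...

new_concepts = "hands_queue/"        # specialised queue from step 4
full_set = "approved_training_set/"  # everything validated so far

for cycle in range(5):
    # short, higher-LR passes to push a new concept in
    train(new_concepts, lr=5e-6, batch_size=4, steps=500)
    # then a long, low-LR pass over the full set to consolidate
    train(full_set, lr=1e-6, batch_size=16, steps=5000)
    # generate validation images between cycles before continuing
```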

It's obviously a much slower process than just scraping the internet for a bunch of images and shoving them into CLIP, but it's reliable enough that I have tens of thousands of images at this point, which gets me some really nice results.

Incidentally, with the gradCAM data I can also use higher-res pictures, which I can subdivide into zoomed-in crops for studying particular topics.
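
The subdividing itself is trivial once you have a region of interest; a sketch (the box coordinates here are made up, and in practice would come from thresholding the gradCAM heatmap, which I'm skipping):

```python
from PIL import Image

def crop_region(path, box, out_path):
    """Cut a region of interest out of a high-res source image.

    `box` is (left, upper, right, lower) in pixels; in practice it
    comes from the gradCAM heatmap around a salient area.
    """
    Image.open(path).crop(box).save(out_path)

# hypothetical example: pull a hand region out of a 4K photo
crop_region("photo_4k.png", (1800, 900, 2400, 1500), "photo_hand_crop.png")
```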

2