
koolaidman123 t1_j6ug73c wrote

46

IDoCodingStuffs t1_j6uk67h wrote

~~In this case the paper seems to use a very conservative threshold to avoid false positives -- L2 distance < 0.1, full-image comparison. Which makes sense for their purposes, since they are trying to establish the concept rather than investigate its prevalence.

It is definitely a larger number than 0.03% when you pick a threshold to optimize the F score rather than just precision. How much larger? That's a question for a bunch of follow-up studies.~~

17

starstruckmon t1_j6v1qv0 wrote

They also manually annotated the top 1000 results, adding only 13 more images. The number you're replying to counted those.

6

DigThatData t1_j6uxsdj wrote

> full image comparison.

That's not actually the metric they used, for exactly the false-positive reason you mention: a plain full-image comparison turned out not to be strict enough. Specifically, images with large black backgrounds were scoring as spuriously close matches, so they chunked each image into regions and used the distance between the most dissimilar (but corresponding) regions to represent the whole image.
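A rough sketch of that patch-based score (not the paper's actual code; the 4x4 grid, the normalization, and reusing the 0.1 threshold from upthread are my assumptions):

```python
import numpy as np

def patchwise_max_l2(img_a: np.ndarray, img_b: np.ndarray, grid: int = 4) -> float:
    """Split both images into a grid x grid set of corresponding patches and
    return the largest per-patch normalized L2 distance.

    Scoring the whole image by its most dissimilar region means a shared
    black background can no longer make two otherwise different images look
    like a match."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape[:2]
    ph, pw = h // grid, w // grid
    worst = 0.0
    for i in range(grid):
        for j in range(grid):
            pa = img_a[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].astype(np.float64)
            pb = img_b[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].astype(np.float64)
            dist = np.linalg.norm(pa - pb) / np.sqrt(pa.size)  # per-patch normalized L2
            worst = max(worst, dist)
    return worst

# Flag a generation as a near-copy only if even its most dissimilar region is
# still very close to the candidate training image (threshold from upthread):
# is_match = patchwise_max_l2(generated / 255.0, training / 255.0) < 0.1
```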

Further, I think they demonstrated their methodology probably wasn't too conservative when they were able to use the same approach to get a 2.3% hit rate (concretely: 23 memorized images from 1000 tested prompts) from Imagen. That hit rate is very likely a big overestimate of Imagen's propensity to memorize, but it demonstrates that the authors' L2 metric is able to do its job.

Also, it's not like the authors didn't look at the images. They did, and found a handful more hits, which the 0.03% figure already accounts for.

2

-xXpurplypunkXx- t1_j6ulhcj wrote

I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

Very interesting. I'm wondering how sensitive this methodology is for detecting instances of memorization, though; maybe this is the tip of the iceberg.

6

LetterRip t1_j6ut9kc wrote

> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

It sees most images between once (LAION-2B) and roughly ten times (the aesthetic dataset is trained for multiple epochs). It simply can't learn that much about an image from so few exposures. If you've ever tried fine-tuning a model on a handful of images, you know it takes a huge number of exposures to memorize an image.

Also the model capacity is small enough that on average it can learn 2 bits of unique information per image.

10

-xXpurplypunkXx- t1_j6v3fab wrote

Thanks for context. Maybe a little too much woo in my post.

For me, the ability to determine which images are stored essentially in full is either an interesting artifact or an interesting property of the model.

But regardless, it is very unintuitive to me given how diffusion models train and behave, due both to the mutation of the training images and to the foreseeable lack of space to encode that much information into a single model state. Admittedly, I don't have much working experience with these sorts of models.

1

pm_me_your_pay_slips OP t1_j6vgxpe wrote

>on average it can learn 2 bits of unique information per image.

The model capacity is not spent on learning specific images, but on learning the mapping from noise to latent vectors corresponding to natural images. Human-made or human-captured images share common features, and that's what matters for learning the mapping.

As an extreme example, imagine you ask 175 million humans each to draw a random digit between 0 and 9 on a piece of paper. You then collect all of the drawings into a dataset of 256x256 images. Would you still argue that the SD model's capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?
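Just to put numbers on that thought experiment, here is a back-of-envelope sketch; it only counts the information needed to record which digit each image shows, nothing about handwriting style:

```python
import math

# Hypothetical dataset from above: 175 million drawings, each of one digit 0-9.
n_images = 175_000_000
bits_per_label = math.log2(10)              # ~3.32 bits to record which digit was drawn
label_bits_total = n_images * bits_per_label
budget_bits_total = n_images * 2            # the "2 bits per image" budget from upthread

print(f"labels alone: ~{label_bits_total / 8 / 1e6:.0f} MB")   # ~73 MB
print(f"2-bit budget: ~{budget_bits_total / 8 / 1e6:.0f} MB")  # ~44 MB
```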

1

LetterRip t1_j6vo0zz wrote

> The model capacity is not spent on learning specific images

I'm completely aware of this. It doesn't change the fact that the average information retained per image is 2 bits (roughly 2 GB of parameters divided by the total number of images trained on).
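A back-of-envelope version of that parenthetical (the parameter count, fp16 storage, and effective number of image exposures are assumptions on my part, not figures from the thread or the paper):

```python
# Rough capacity-per-image arithmetic behind the "~2 bits" figure.
params = 1.0e9             # assume ~1B parameters total for the SD pipeline
weight_bits = params * 16  # stored at fp16 -> ~2 GB -> ~1.6e10 bits of weights
image_exposures = 5.0e9    # assume a few billion image exposures across all epochs
print(f"~{weight_bits / image_exposures:.1f} bits of weight capacity per image seen")
# -> ~3 bits, i.e. the same low-single-digit ballpark as the 2-bit estimate above
```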

> As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. you then collect all the images into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?

I didn't say it learned 2 bits of pixel data. It learned 2 bits of information. The information is in a higher-dimensional space, so it is much more informative than 2 bits of pixel-space data, but it is still an extremely small amount of information.

Given that it often takes about 1000 repetitions of an image to approximately memorize its key attributes, we can infer it takes about 2**10 bits on average to memorize an image. So on average it learns about 1/1000 of the available image data each time it sees an image, or about 1/2 kB equivalent of compressed image data.

11