IDoCodingStuffs t1_j6uk67h wrote

~~In this case the paper seems to use a very conservative threshold to avoid false positives -- l2 distance < 0.1 on a full-image comparison. That makes sense for their purposes, since they are trying to establish that the phenomenon exists rather than measure its prevalence.

The number is definitely larger than 0.03% if you pick a threshold that optimizes the F score rather than just precision. How much larger? That's a question for a bunch of follow-up studies.~~
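(For illustration, here's a rough sketch of that precision-vs-F-score trade-off. The distances and labels below are synthetic and purely illustrative, not numbers from the paper:)

```python
import numpy as np

def precision_f1(distances, labels, threshold):
    """Treat distance < threshold as a predicted match; return (precision, F1)."""
    pred = distances < threshold
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, f1

# Synthetic distances: true copies cluster low, everything else sits higher.
rng = np.random.default_rng(0)
distances = np.concatenate([rng.normal(0.08, 0.03, 50), rng.normal(0.4, 0.1, 950)])
labels = np.concatenate([np.ones(50, dtype=bool), np.zeros(950, dtype=bool)])

# A strict cutoff keeps precision near 1 but misses borderline copies;
# the F1-optimal cutoff sits higher and admits more of them.
for t in (0.10, 0.15, 0.20):
    p, f = precision_f1(distances, labels, t)
    print(f"threshold {t:.2f}: precision {p:.2f}, F1 {f:.2f}")
```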

17

starstruckmon t1_j6v1qv0 wrote

They also manually annotated the top 1000 results, which added only 13 more images. The number you're replying to already counts those.

6

DigThatData t1_j6uxsdj wrote

> full image comparison.

That's not actually the metric they used, for roughly the reason you're gesturing at: a plain full-image l2 turned out to be unreliable. Specifically, images with large black backgrounds scored as near-matches even when they weren't copies, producing false positives. So they chunked each image into regions and used the score for the most dissimilar (but corresponding) regions to represent the whole image.
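If it helps to see the shape of that idea, here's a minimal sketch in NumPy. The grid size, the normalization, and reusing the 0.1 cutoff quoted upthread are illustrative guesses on my part, not the paper's exact choices:

```python
import numpy as np

def tiled_l2(a: np.ndarray, b: np.ndarray, grid: int = 4) -> float:
    """Score two images by their most dissimilar pair of corresponding tiles.

    a, b: float arrays of identical shape (H, W, C) with values in [0, 1].
    Splits each image into a grid x grid set of non-overlapping tiles and
    returns the largest normalized l2 distance among corresponding tiles,
    so one matching region (e.g. a shared black background) can't make two
    otherwise different images look like near-duplicates.
    """
    th, tw = a.shape[0] // grid, a.shape[1] // grid
    worst = 0.0
    for i in range(grid):
        for j in range(grid):
            ta = a[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            tb = b[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            # Normalize by tile size so the score doesn't scale with resolution.
            d = float(np.linalg.norm(ta - tb) / np.sqrt(ta.size))
            worst = max(worst, d)
    return worst

# Flag a generation as a candidate copy only if even its worst tile is close.
def is_candidate_copy(a, b, threshold=0.1):
    return tiled_l2(a, b) < threshold
```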

Further, I think they demonstrated that their methodology probably wasn't too conservative when they used the same approach to get a 2.3% hit rate from Imagen (concretely: 23 memorized images across 1,000 tested prompts). That hit rate is very likely a big overestimate of Imagen's overall propensity to memorize, but it demonstrates that the authors' L2 metric can do its job.

Also, it's not like the authors never looked at the images themselves. They did, and found a handful more hits, which the 0.03% figure already accounts for.

2