Ronny_Jotten t1_j6wrlvv wrote

I think pretty much everyone would have to agree that the brain - the original neural network - can memorize and reproduce images, though never 100% exactly. That's literally what we mean by the word memorize: to create a representation of something in a biological neural network in a way that it can be recalled and reproduced.

Can those pictures be found somewhere inside the brain, can you open a skull and point to them? Or is it just a function of neuronal connections that outputs such a picture? Is there "a difference between memorizing and pattern recreation"? It sounds like a "how many angels can dance on the head of a pin" sort of question that's not worth spending a lot of time on.

I don't think anyone should be surprised that an artificial neural network can exhibit a similar kind of behaviour, and that for convenience we would call it by the same word: "memorizing". I'm not saying that every single image is memorized, any more than I have memorized every image I've ever seen. But I do remember some very well - especially if I've seen them many times.

Some say that AIs "learn" from the images they "see", but somehow those same people refuse to say that the AIs "memorize" too. If they're going to make such anthropomorphic analogies, applying them so selectively seems inconsistent, if not hypocritical.

The extent to which something is memorized, or the differences in qualities and how it takes place in an artificial vs. organic neural network, is certainly something to be discussed. But if you want to argue that it's not truly memorizing, like the argument that ANNs don't have true intelligence, well, ok... but that's also a kind of "no true Scotsman" argument that's a bit meaningless.

8

visarga t1_j6x1uwy wrote

> The extent to which something is memorized ... is certainly something to be discussed.

A roughly one-in-a-million rate of memorisation, even when you're actively looking for it, is hardly worth discussing.

> We select the 350,000 most-duplicated examples from the training dataset and generate 500 candidate images for each of these prompts (totaling 175 million generated images). We find 109 images are near-copies of training examples.

On the other hand, these models compress billions of images into a few GB. That works out to less than one byte on average per training example, so there's no room for significant memorisation. That's probably why only 109 memorised images were found.
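For a rough sense of scale, here's a back-of-the-envelope version of that calculation. The figures are my assumptions (roughly Stable Diffusion v1 scale: a ~860M-parameter U-Net trained on LAION subsets of around 2 billion images), not numbers from the paper:

```python
# Back-of-the-envelope check of the "less than one byte per example" claim.
# All figures below are assumptions, roughly Stable Diffusion v1 scale.
params = 860e6          # U-Net parameter count (assumed)
bytes_per_param = 2     # fp16 weights
train_images = 2e9      # LAION-scale training set size (assumed)

model_bytes = params * bytes_per_param
print(f"model size: {model_bytes / 1e9:.2f} GB")                      # ~1.72 GB
print(f"bytes per training image: {model_bytes / train_images:.2f}")  # ~0.86
```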

I would say I'm impressed there were so few of them. If you use a blacklist for these images, you can be 100% sure the model is not regurgitating training data verbatim.
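A minimal sketch of how such a blacklist could work, assuming a perceptual-hash comparison via the `imagehash` package; the file paths and distance threshold are made up for illustration:

```python
from PIL import Image
import imagehash

# Placeholder paths to the known-memorized training images.
known_memorized = ["memorized/img_001.png", "memorized/img_002.png"]

# Hash each blacklisted image once, up front.
blacklist = {imagehash.phash(Image.open(p)) for p in known_memorized}

def is_blacklisted(generated: Image.Image, max_distance: int = 4) -> bool:
    """True if the generated image is a near-copy of a blacklisted one."""
    h = imagehash.phash(generated)
    # Subtracting two ImageHash objects gives their Hamming distance.
    return any(h - b <= max_distance for b in blacklist)
```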

I would suggest the model developers remove these images from the training set and replace them with variations generated by the previous model, so the new model learns only the style and not the exact composition of the original. Replacing originals with variations - same style, different composition - would be a legitimate way to avoid close duplication.
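A sketch of what that replacement step might look like, using the Hugging Face diffusers img2img pipeline as the "previous model"; the model name, strength value, and caption source are illustrative assumptions, not a tested recipe:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# The "previous model" used to generate replacement variations (assumed).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def make_variation(original: Image.Image, caption: str) -> Image.Image:
    # High strength adds a lot of noise before denoising, so the output
    # keeps the style/subject from the caption while the composition
    # drifts away from the original pixels.
    return pipe(prompt=caption, image=original, strength=0.8).images[0]

# For each heavily duplicated training image:
#   dataset[i].image = make_variation(dataset[i].image, dataset[i].caption)
```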

2