Submitted by pm_me_your_pay_slips t3_10r57pn in MachineLearning
https://twitter.com/eric_wallace_/status/1620449934863642624?s=46&t=GVukPDI7944N8-waYE5qcw
Extracting training data from diffusion models is possible by following, more or less, these steps:
- Compute CLIP embeddings for the images in a training dataset.
- Perform an all-pairs comparison of those embeddings and mark pairs with L2 distance below some threshold as near duplicates (a rough sketch of this step follows the list).
- Use the prompts of the training samples marked as near duplicates to generate N synthetic samples with the trained model (second sketch below).
- Compute the all-pairs L2 distance between the embeddings of the generated samples for a given training prompt. Build a graph where the nodes are generated samples and an edge exists if the L2 distance is less than some threshold. If the largest clique in the resulting graph has size at least 10, the training sample is considered memorized (third sketch below).
- Visually inspect the results to determine if the samples considered to be memorized are similar to the training data samples.
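For illustration, a minimal sketch of the first two steps (embedding the training images and flagging near-duplicate pairs) might look like the following. The CLIP checkpoint, batch size, and distance threshold here are my assumptions, not values from the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; the paper's exact CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embeddings(image_paths, batch_size=32):
    """Return an (N, D) tensor of CLIP image embeddings."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        imgs = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=imgs, return_tensors="pt")
        with torch.no_grad():
            feats.append(model.get_image_features(**inputs))
    return torch.cat(feats)

def near_duplicate_pairs(embeddings, threshold=1.0):
    """All-pairs L2 comparison; the threshold value here is a placeholder."""
    dists = torch.cdist(embeddings, embeddings)                      # (N, N) L2 distances
    upper = torch.triu(torch.ones_like(dists, dtype=torch.bool), 1)  # keep each pair once, skip self-pairs
    return ((dists < threshold) & upper).nonzero().tolist()          # [[i, j], ...] near duplicates
```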
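Generating the candidate samples for each near-duplicate prompt could then be done with any diffusion pipeline. Here is a sketch using the public Stable Diffusion v1.4 weights via diffusers; the checkpoint and the sample count are assumptions, since the paper attacks specific trained models:

```python
import torch
from diffusers import StableDiffusionPipeline

# Stand-in checkpoint, not necessarily the model evaluated in the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def generate_samples(prompt, n=500, batch_size=10, seed=0):
    """Generate n images for one prompt (n=500 is a guess at the order of magnitude)."""
    generator = torch.Generator("cuda").manual_seed(seed)
    images = []
    for _ in range(0, n, batch_size):
        out = pipe(prompt, num_images_per_prompt=batch_size, generator=generator)
        images.extend(out.images)
    return images[:n]
```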
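And the clique test over the generated samples, again with a placeholder distance threshold (the size-10 clique criterion comes from the summary above):

```python
import networkx as nx
import torch

def is_memorized(gen_embeddings, dist_threshold=0.5, min_clique=10):
    """Flag a prompt as memorized if enough generated samples are mutually close.

    gen_embeddings: (N, D) CLIP embeddings of the N generated images for one prompt.
    dist_threshold is a placeholder; min_clique=10 follows the criterion above.
    """
    dists = torch.cdist(gen_embeddings, gen_embeddings)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(gen_embeddings)))
    for i in range(len(gen_embeddings)):
        for j in range(i + 1, len(gen_embeddings)):
            if dists[i, j] < dist_threshold:
                graph.add_edge(i, j)  # edge = these two generations look the same to CLIP
    largest = max((len(c) for c in nx.find_cliques(graph)), default=0)
    return largest >= min_clique
```

Maximum clique is NP-hard in general, but the graph here has one node per generated sample for a single prompt, so exact enumeration is usually fast enough in practice.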
With this method, the authors were able to get Stable Diffusion and Imagen to generate samples that are near copies of copyrighted training images.
mongoosefist t1_j6ufv6a wrote
Is this really that surprising? Theoretically, every image from CLIP should be in the latent space in close-to-original form. Obviously these guys went through a fair amount of trouble to recover these images, but it shouldn't surprise anyone that it's possible.