Submitted by vyasnikhil96 t3_1190lw8 in MachineLearning

Hi everyone, in a new paper we give a way to certify that a generative model does not infringe on the copyright of data that was in its training set.

Twitter thread:



Abstract: >There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data C that was in their training set. We give a formal definition of near access-freeness (NAF) and prove bounds on the probability that a model satisfying this definition outputs a sample similar to C, even if C is included in its training set. Roughly speaking, a generative model p is k-NAF if for every potentially copyrighted data C, the output of p diverges by at most k-bits from the output of a model q that did not access C at all. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content.




ichiichisan t1_j9k3sx1 wrote

Although this is interesting work, you are not lawyers and will not be able to provide "provable copyright protection".


vyasnikhil96 OP t1_j9k57az wrote

I agree that the final say rests with the courts. But do you think anything specific that we use or claim differs from how copyright law is currently applied?


bluemason t1_j9kqrdg wrote

Not a lawyer, but yes.

Copyright infringement goes beyond reproduction of a work. An original adaptation can also violate copyright.


vyasnikhil96 OP t1_j9ksj4v wrote

We already handle that, as our notion is not based on reproduction but is rather an information-theoretic notion. We also have a parameter that measures how much information we have "reproduced" vs. adapted, which can be set depending on the underlying models and the use case.


sam__izdat t1_j9mgsbr wrote

There's no information-theoretic notion of character copyright, for example. It's a game of calvinball, and has been since the Stationers' Company. It's true that copyright is badly misunderstood and over-generalized to things that it has absolutely nothing to do with, like plagiarism and other notions of (nonexistent) authorship rights, but it isn't a measurable thing either and you can't guarantee that the law and policy will agree with your model.


WikiSummarizerBot t1_j9mgtut wrote

Copyright protection for fictional characters

>Copyright protection is available to the creators of a range of works including literary, musical, dramatic and artistic works. Recognition of fictional characters as works eligible for copyright protection has come about with the understanding that characters can be separated from the original works they were embodied in and acquire a new life by featuring in subsequent works.



vyasnikhil96 OP t1_j9mu7ak wrote

Thanks for the interesting link! For the kind of copyright in your link, deduplicating the data might be hard: we cannot assume that works with access to the copyrighted material occur only once or a few times, since the material is a character and we don't really know which characters are copyrighted and which are not. Our paper, however, assumes deduplication has been done beforehand.

Coming back to our notion, I am not trying to say that there is an already established information-theoretic notion of copyright.

Copyright law (as we state in the paper) relies on two things: 1. access to the copyrighted work must be proved and 2. substantial similarity to the copyrighted work must be established. Our notion cleanly separates these two, and we come up with an information-theoretic way to quantify the "substantial similarity" aspect (denoted by k-NAF). For example, the strongest setting of k = 0 will never violate copyright (but might also degrade performance or be impossible to achieve) because it is equivalent to not having access. Larger values of k trade off model performance against a possible increase in "substantial similarity". Which k is valid in which setting to prevent copyright violation is not something we establish; that depends on the specific setting and must be determined by the law. The user can tune k (assuming the value is feasible) to whatever value the law considers acceptable.
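To make the role of k concrete, here is a minimal sketch on toy discrete distributions. It assumes the divergence is the max-divergence (the largest log-ratio of p to q over all outputs), which is one natural reading of "diverges by at most k bits". The function names, the output labels, and the probabilities are hypothetical and are not taken from the paper.

```python
import math


def max_divergence_bits(p, q):
    """Largest log2 ratio p(y)/q(y) over outputs y with p(y) > 0.

    p and q are dicts mapping outputs to probabilities. Returns infinity
    if p puts mass on an output that q can never produce.
    """
    worst = -math.inf
    for y, p_y in p.items():
        if p_y == 0.0:
            continue
        q_y = q.get(y, 0.0)
        if q_y == 0.0:
            return math.inf
        worst = max(worst, math.log2(p_y / q_y))
    return worst


def event_probability(dist, event):
    """Total probability that dist assigns to a set of outputs."""
    return sum(dist.get(y, 0.0) for y in event)


# Hypothetical toy models over three possible outputs.
# q stands in for a model that never had access to the copyrighted work C.
q = {"copy_of_C": 0.001, "original_a": 0.599, "original_b": 0.400}
# p stands in for the deployed model, trained on data that included C.
p = {"copy_of_C": 0.008, "original_a": 0.596, "original_b": 0.396}

k = max_divergence_bits(p, q)  # about 3 bits for these toy numbers

# If the max-divergence is at most k bits, then for any set of outputs E,
# p(E) <= 2**k * q(E). Here E is "the output is a copy of C".
event = {"copy_of_C"}
upper_bound = min(1.0, 2 ** k * event_probability(q, event))
```

With k = 0 the two distributions must coincide, which matches the remark that k = 0 is equivalent to not having access. Larger k loosens the bound by a factor of 2 per bit.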


sam__izdat t1_j9ngakh wrote

I don't have any technical criticism that would be useful to you (and frankly it's above my pay grade), but to expand on what I meant when I said that it's a game of calvinball, there's some history here worth considering. Copyright has gone through myriad justifications.

If we wanted to detect offending content by the original standards of the Stationers' Company, then it may be useful to look for signs of sedition and heresy, since the stated purpose was "to stem the flow of seditious and heretical texts."

By the justification of the liberals who came after, typesetting, being a costly and error-prone process, forced their hand to protect the integrity of the text. So, if for some reason we wanted to take that goal seriously, it might make sense to look for certain kinds of dissimilarity instead: errors and distortions in reproductions. After all, that was the social purpose of the monopoly right.

If the purpose of the copyright regime today is to secure the profits of private capital in perpetuity, then simple metrics of similarity aren't enough to guarantee a virtual Blackstonian land right either.

For example:

> In our discussions, we refer to C ∈ C abstractly as a “piece of copyrighted data”, but do not specify it in more detail. For example, in an image generative model, does C correspond to a single artwork, or the full collected arts of some artists? The answer is the former. The reason is that if a generative model generates data that is influenced by the full collected artworks of X, but not by any single piece, then it is not considered a copyright violation. This is due to that it is not possible to copyright style or ideas, only a specific expression. Hence, we think of C as a piece of content that is of a similar scale to the outputs of the model.

That sounds reasonable. Is it true?

French and Belgian IP laws, for example, consider taking an original photo of a public space showing protected architecture a copyright violation. Prior to mid 2016, taking a panoramic photo with the Atomium in the background was copyright infringement. Distributing a night photo of the Eiffel Tower is still copyright infringement today. So, how would you guarantee that a diffusion model falls within the boundaries of arbitrary rules when those tests of "substantial similarity" suddenly become a lot more ambiguous than anticipated?


iidealized t1_j9kwe6h wrote

Are adversarial examples (e.g., minimally perturbed versions of images) considered a violation of copyright? Or are they a sufficient “remix”?


currentscurrents t1_j9l30jy wrote

It's definitely a derivative work, but whether it violates copyright is complicated and depends on what you're doing with it.

Similarly, a scaled-down thumbnail of an image is also a derivative work. You couldn't print and sell thumbnail-sized reproductions of copyrighted artworks. But many uses of thumbnails, for example in search engine results, do not violate copyright.


visarga t1_j9qwl8q wrote

Diffusion models take about 1 byte of information from each training image: 5B images, 5 GB. So much less than a thumbnail.
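The back-of-the-envelope figure above can be checked directly. The sketch below uses the round numbers from the comment, reading "5Gb" as roughly 5 gigabytes of model weights. Treating model size divided by image count as "information taken per image" is a simplifying assumption, and the thumbnail dimensions are hypothetical, chosen only for comparison.

```python
# Round figures from the comment above; both are approximations.
num_training_images = 5_000_000_000   # ~5B images
model_size_bytes = 5 * 10**9          # ~5 GB, using decimal gigabytes

# Average model capacity available per training image.
bytes_per_image = model_size_bytes / num_training_images
bits_per_image = bytes_per_image * 8

# Hypothetical comparison point: an uncompressed 64x64 RGB thumbnail.
thumbnail_bytes = 64 * 64 * 3
```

This is an average over the whole dataset. It says nothing about how unevenly that capacity is spent, for example on images that are repeated many times in the training data.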


vyasnikhil96 OP t1_j9kzwpq wrote

Assuming you are asking from the perspective of copyright law, I am not sure. I think the notion of “remix”/sufficient transformation also depends on the context in which the new work is being used.


Battleagainstentropy t1_j9lpzkd wrote

This is really interesting work. I wonder how meaningful these metrics can be made. For example, if I write a book about Harry Potter, the expert mug maker, then your metric is x. If I write about Barry Blotter, the boy wizard at Smogwarts, then your metric is y. IANAL but I think that the value needed to prove derivative work is a question of fact that would be up to a jury to decide, so being able to explain such a metric to laypeople could make this work. It’s somewhat similar to the way that DNA testing required a certain amount of education for juries (one in a million match and all that)


vyasnikhil96 OP t1_j9ltbq3 wrote

I agree. Note that overall there are two things we can hope for: 1. Using this approach with an appropriate k removes most of the "obvious" copyright violations and 2. for the remaining images the value k can be interpreted to determine whether there was a copyright violation or not, where the interpretation will necessarily be application- and context-dependent.


IdentifiableParam t1_j9ksw3a wrote

Really interesting work. Might be worth doing even if courts decide this isn't going to work as a legal defense.


1973DodgeChallenger t1_j9kz2wb wrote

It's an interesting problem... I ask ChatGPT for code, it spits out something that it mined from GitHub. Microsoft didn't just buy GitHub to spend money. They knew it was one of the best, if not the best, sources for AI code mining. So Yoink! I set all of my GitHub projects to private but I don't know if that helps. The user agreement may be structured to "anonymously mine" code, private or otherwise.

So ya... if you store your code on GitHub, I'd bet a dollar Microsoft/OpenAI will be mining it and eventually burp it out in ChatGPT.


currentscurrents t1_j9ld7we wrote

>it spits out something that it mined from GitHub.

Having used GitHub Copilot a bunch, it's doing a lot more than just mining snippets. It learns patterns and can use them to creatively solve new problems.

It does memorize short snippets in some cases (especially when a snippet is repeated many times in training data), but in the general case it comes up with new code to match your specifications.

>I set all of my github projects to private but I don't know if that helps.

Honestly, kinda selfish. We'll all benefit from these powerful new tools and I don't appreciate you trying to hamper them.


Disastrous_Elk_6375 t1_j9nrm6w wrote

> It does memorize short snippets in some cases (especially when a snippet is repeated many times in training data)

And, to be fair, how can it not? How many different ways can you write a simple for loop to print some objects, or match a regex, call an API, and so on?


visarga t1_j9qxgt2 wrote

If you go down to individual words or characters, everything is reused. If you go up, a random 10-word snippet usually appears nowhere else on the internet. But boilerplate and basic things might be replicated in all shapes and forms.
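A small sketch of that point: measure what fraction of a text's n-word windows also appear in a reference corpus, for short and long n. The helper names and the two toy sentences are hypothetical, and the example uses 5-word windows rather than 10 only because the toy texts are short.

```python
def ngrams(words, n):
    """Return the set of all contiguous n-word tuples in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_fraction(candidate, corpus, n):
    """Fraction of the candidate's distinct n-grams that occur in the corpus."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    seen = ngrams(corpus.lower().split(), n)
    return len(cand & seen) / len(cand)


# Hypothetical toy texts that share a vocabulary but not their phrasing.
corpus = "the model learns patterns from code and the model writes new code"
candidate = "the new model writes code and learns the patterns from new code"

# Every individual word of the candidate already occurs in the corpus...
unigram_overlap = shared_fraction(candidate, corpus, 1)
# ...but none of its 5-word windows do.
long_overlap = shared_fraction(candidate, corpus, 5)
```

On real data, boilerplate such as license headers or common loops would show high overlap even at long window sizes, which is the exception the comment mentions.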


1973DodgeChallenger t1_j9lgjq4 wrote

Just for example, say you work at a company that has spent millions investing in a proprietary software product. You're saying everyone should have access to the source code, through ChatGPT or otherwise?

Can I have all of your and your company's source code, please? I'll send you my email address.


currentscurrents t1_j9pb0by wrote

You had your source code public until you got freaked out by ChatGPT, so you were entirely okay with publishing it for everyone to see.

ChatGPT doesn't even allow direct access to source code, it's just learning how to solve problems using existing source code as training examples.


visarga t1_j9qxt97 wrote

Well, you can't, because it is really hard to extract any verbatim replications of training data from ChatGPT. You need to put a considerable portion of the work in the prompt to put the model in the right place, and then sample your way ahead. It doesn't work for most stuff, like 99%.


visarga t1_j9qwzlf wrote

> Honestly, kinda selfish. We'll all benefit from these powerful new tools and I don't appreciate you trying to hamper them.

They took their little pebble from the beach back home, that'll show them.


hpstring t1_j9kkkr0 wrote

Interesting research! Let's try to convince artists with this kind of work


currentscurrents t1_j9l0g64 wrote

As long as it's capable of making art that can compete with human art, they're still not going to like it.

"Never argue with a man whose job depends on not being convinced. It is difficult to get a man to understand something, when his salary depends upon his not understanding it." - Upton Sinclair