
vyasnikhil96 OP t1_j9ksj4v wrote

We already handle that, as our notion is not based on reproduction but is rather an information-theoretic notion. We also have a parameter that measures how much information we have "reproduced" vs. adapted, which can be set depending on the underlying models and the use case.


sam__izdat t1_j9mgsbr wrote

There's no information-theoretic notion of character copyright, for example. It's a game of calvinball, and has been since the Stationers' Company. It's true that copyright is badly misunderstood and over-generalized to things that it has absolutely nothing to do with, like plagiarism and other notions of (nonexistent) authorship rights, but it isn't a measurable thing either and you can't guarantee that the law and policy will agree with your model.


WikiSummarizerBot t1_j9mgtut wrote

Copyright protection for fictional characters

>Copyright protection is available to the creators of a range of works including literary, musical, dramatic and artistic works. Recognition of fictional characters as works eligible for copyright protection has come about with the understanding that characters can be separated from the original works they were embodied in and acquire a new life by featuring in subsequent works.



vyasnikhil96 OP t1_j9mu7ak wrote

Thanks for the interesting link! For the kind of copyright in your link, deduplication of the data might be hard. That is, it is hard to assume that works containing the copyrighted material occur only once or a few times, since the protected thing is a character and we don't really know which characters are copyrighted and which are not. But our paper assumes deduplication has been done beforehand.

Coming back to our notion, I am not trying to say that there is an already-established information-theoretic notion of copyright.

Copyright law (as we state in the paper) relies on two things: 1. access to the copyrighted work must be proved, and 2. substantial similarity to the copyrighted work must be established. Our notion cleanly separates these two, and we come up with an information-theoretic way to quantify the "substantial similarity" aspect (denoted by k-NAF). For example, the strongest setting of k = 0 will never violate copyright (but might also degrade performance or be impossible to achieve) because it is equivalent to not having access. Larger values of k trade off model performance against a possible increase in "substantial similarity". Which k is valid in which setting to prevent copyright violation is not something we are establishing; that depends on the specific setting and must be determined by the law. The user can tune k (assuming the value is feasible) to whatever the law considers acceptable.
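
To make the role of k concrete, here is a toy sketch of what checking such a bound could look like. It is only an illustration under stated assumptions: the divergence is not spelled out in this thread, so the sketch assumes a max-divergence (the log, in bits, of the largest probability ratio) between the model's output distribution and a "safe" distribution that never had access to the work. The names `max_divergence`, `is_k_naf`, `model` and `safe`, and the numbers, are all hypothetical.

```python
import math

def max_divergence(p, q):
    """Largest log2 ratio p(y)/q(y) over outputs y that p can produce.

    ASSUMPTION: this is the divergence used for the bound; the thread
    does not specify it. p and q are dicts mapping outputs to probabilities.
    """
    worst = 0.0  # for two normalized distributions the maximum is never below 0
    for y, py in p.items():
        if py == 0.0:
            continue
        qy = q.get(y, 0.0)
        if qy == 0.0:
            return math.inf  # p emits something the safe model never would
        worst = max(worst, math.log2(py / qy))
    return worst

def is_k_naf(p, safe_models, k):
    """True if p stays within k bits of every safe model (hypothetical check)."""
    return all(max_divergence(p, q) <= k for q in safe_models)

# Toy distributions over two possible outputs (made-up numbers).
model = {"a": 0.5, "b": 0.5}
safe = {"a": 0.25, "b": 0.75}

print(max_divergence(model, safe))    # bits of divergence from the safe model
print(is_k_naf(model, [safe], k=1.0))
print(is_k_naf(model, [safe], k=0.0))
```

With k = 0 the check passes only when the model matches the safe distribution exactly, which is the "equivalent to not having access" case; larger k tolerates larger probability ratios.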


sam__izdat t1_j9ngakh wrote

I don't have any technical criticism that would be useful to you (and frankly it's above my pay grade), but to expand on what I meant when I said that it's a game of calvinball, there's some history here worth considering. Copyright has gone through myriad justifications.

If we wanted to detect offending content by the original standards of the Stationers' Company, then it may be useful to look for signs of sedition and heresy, since the stated purpose was "to stem the flow of seditious and heretical texts."

By the justification of the liberals who came after, typesetting, being a costly and error-prone process, forced their hand to protect the integrity of the text. So, if for some reason we wanted to take that goal seriously, it might make sense to look for certain kinds of dissimilarity instead: errors and distortions in reproductions. After all, that was the social purpose of the monopoly right.

If the purpose of the copyright regime today is to secure the profits of private capital in perpetuity, then simple metrics of similarity aren't enough to guarantee a virtual Blackstonian land right either.

For example:

> In our discussions, we refer to C ∈ C abstractly as a “piece of copyrighted data”, but do not specify it in more detail. For example, in an image generative model, does C correspond to a single artwork, or the full collected arts of some artists? The answer is the former. The reason is that if a generative model generates data that is influenced by the full collected artworks of X, but not by any single piece, then it is not considered a copyright violation. This is due to that it is not possible to copyright style or ideas, only a specific expression. Hence, we think of C as a piece of content that is of a similar scale to the outputs of the model.

That sounds reasonable. Is it true?

French and Belgian IP laws, for example, consider taking an original photo of a public space showing protected architecture a copyright violation. Prior to mid-2016, taking a panoramic photo with the Atomium in the background was copyright infringement. Distributing a night photo of the Eiffel Tower is still copyright infringement today. So, how would you guarantee that a diffusion model falls within the boundaries of arbitrary rules when those tests of "substantial similarity" suddenly become a lot more ambiguous than anticipated?