vyasnikhil96

vyasnikhil96 OP t1_j9mu7ak wrote

Thanks for the interesting link! For the kind of copyright in your link, deduplicating the data might be hard, i.e., it is hard to assume that works which had access to the copyrighted material occur only once or a few times, since the material is a character and we don't really know which characters are copyrighted and which are not. But our paper assumes deduplication has been done beforehand.

Coming back to our notion, I am not trying to say that there is an already established information-theoretic notion of copyright.

Copyright law (as we state in the paper) relies on two things: 1. access to the copyrighted work must be proved and 2. substantial similarity to the copyrighted work must be established. Our notion cleanly separates these two, and we come up with an information-theoretic way to quantify the "substantial similarity" aspect (denoted by k-NAF). For example, the strongest setting of k = 0 will never violate copyright (but might also degrade performance or be impossible to achieve) because it is equivalent to not having access. Larger values of k trade off model performance against a possible increase in "substantial similarity". Which k is valid in which setting to prevent copyright violation is not something we establish; rather, it depends on the specific setting and must be determined by the law. The user can tune k (assuming the value is feasible) to the value considered acceptable by the law.
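
To make the role of k concrete, here is a minimal sketch of what checking such a bound could look like. It rests on assumptions not stated in this comment: output distributions for one fixed prompt are small discrete dicts, the divergence is the max-divergence measured in bits, and each "safe" model stands in for a model trained without access to one copyrighted work. The names `max_divergence_bits` and `is_k_naf` are hypothetical helpers for illustration, not code from the paper.

```python
import math


def max_divergence_bits(p, q):
    """Return log2 of the largest ratio p(y)/q(y) over outputs y with p(y) > 0."""
    worst = 0.0  # for two valid distributions the largest ratio is at least 1
    for y, p_y in p.items():
        if p_y <= 0.0:
            continue
        q_y = q.get(y, 0.0)
        if q_y <= 0.0:
            return math.inf  # p can emit y but the safe model never can
        worst = max(worst, math.log2(p_y / q_y))
    return worst


def is_k_naf(p, safe_models, k):
    """Hypothetical check: the bound k must hold against every safe model."""
    return all(max_divergence_bits(p, q) <= k for q in safe_models)


# Toy distributions over two possible outputs for one fixed prompt.
model = {"a": 0.5, "b": 0.5}
safe = {"a": 0.25, "b": 0.75}  # assumed stand-in for a model without access

print(max_divergence_bits(model, safe))   # 1.0
print(is_k_naf(model, [safe], k=1.0))     # True
print(is_k_naf(model, [safe], k=0.5))     # False
print(max_divergence_bits(safe, safe))    # 0.0
```

Under these assumptions, k = 0 passes only when the model matches every safe model on every output, which mirrors the "equivalent to not having access" case; larger k allows the model to be up to 2^k times more likely than a safe model to produce any given output.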


vyasnikhil96 OP t1_j9ltbq3 wrote

I agree. Note that overall there are two things we can hope for: 1. using this approach with an appropriate k removes most of the "obvious" copyright violations and 2. for the remaining images, the value of k can be interpreted to determine whether there was a copyright violation or not, where the interpretation will necessarily be application- and context-dependent.
