Submitted by pm_me_your_pay_slips t3_10r57pn in MachineLearning
enryu42 t1_j6wme0p wrote
Nice! It is pretty clear that big models memorize some of their training examples, but the ease of extraction is impressive.
I wonder what the best mitigation strategies would be (besides the obvious one of de-duplicating training images). Theoretically sound approaches (like differential privacy) would probably cripple training too much. Perhaps some simple hacks would work, e.g.: train the model as-is first, then generate an entirely new training set using the model and synthetic prompts, and train a second model from scratch only on the generated data (rough sketch below).
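To make that two-stage idea concrete, here is a minimal sketch. Everything in it is hypothetical pseudocode: `train_diffusion_model`, `teacher.sample`, and `prompt_source` are placeholder names I'm making up, not a real library API.

```python
# Hypothetical sketch of the "retrain on self-generated data" idea.
# train_diffusion_model, teacher.sample, and prompt_source are
# placeholders, not a real API.

def train_via_synthetic_regeneration(real_dataset, prompt_source, n_synthetic):
    # Stage 1: train a model on the (de-duplicated) real data as usual.
    teacher = train_diffusion_model(real_dataset)

    # Stage 2: build a fresh dataset purely from the teacher's samples,
    # driven by synthetic prompts so no real image/caption pair is reused.
    synthetic_dataset = [
        (prompt, teacher.sample(prompt)) for prompt in prompt_source(n_synthetic)
    ]

    # Stage 3: train a second model from scratch only on generated data.
    # A memorized image can only leak through if the teacher happens to
    # regenerate it, and de-duplicating the synthetic set would suppress
    # that further.
    student = train_diffusion_model(synthetic_dataset)
    return student
```

Whether the student loses much quality relative to the teacher is an open question; this is only meant to show where the memorization bottleneck would sit.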
Another aspect of this is the user experience side. People can reproduce copyrighted images with just pen and paper, but in that case they're fully aware of what they're doing. With diffusion models, the danger is that a user can reproduce an existing image without realizing it. Maybe augmenting the various UIs with reverse image search / nearest-neighbor lookup would be a good idea (see the sketch below)? Or computing training-set attributions for generated images with something along the lines of TracIn.
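For the nearest-neighbor check, something like the following could sit behind the UI. This is a minimal sketch assuming a precomputed, L2-normalized matrix of CLIP embeddings for the training images (`train_embeddings`); the 0.95 threshold is an illustrative guess, not a calibrated value.

```python
# Sketch: flag a generated image if it is suspiciously close to a
# training image, using CLIP embeddings and cosine similarity.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def flag_near_duplicates(generated_image, train_embeddings, top_k=5, threshold=0.95):
    """Return the top-k most similar training-set indices, their scores,
    and whether the best match exceeds the threshold (i.e. should be
    surfaced to the user)."""
    inputs = processor(images=generated_image, return_tensors="pt")
    with torch.no_grad():
        query = model.get_image_features(**inputs)
    query = query / query.norm(dim=-1, keepdim=True)

    # Cosine similarity against every (already normalized) training embedding.
    sims = (train_embeddings @ query.T).squeeze(-1)
    scores, indices = sims.topk(top_k)
    return indices.tolist(), scores.tolist(), bool(scores[0] >= threshold)
```

The lookup itself is cheap; the real cost is maintaining the embedding index for the whole training set, which model providers would have to host.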