Argamanthys t1_j6w9gal wrote on February 2, 2023 at 9:36 AM

Reply to comment by HateRedditCantQuitit in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips

There is a short story called The Library of Babel about a near-infinite library that contains every possible permutation of a book with 1,312,000 characters. It is not hard to recreate that library in code. You can explore it if you want.
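
As a rough illustration of the "not hard to recreate in code" claim, here is a minimal sketch that treats every text as a number written in base 29. The 29-symbol alphabet and the function names `text_to_index` and `index_to_text` are assumptions made for this example, not details from the story or from any existing implementation.

```python
import string

# Assumed alphabet: 26 lowercase letters plus space, comma, and period.
ALPHABET = string.ascii_lowercase + " ,."
BASE = len(ALPHABET)


def text_to_index(text: str) -> int:
    """Return the position of `text` among all texts of the same length."""
    index = 0
    for ch in text:
        # ALPHABET.index raises ValueError for symbols outside the alphabet.
        index = index * BASE + ALPHABET.index(ch)
    return index


def index_to_text(index: int, length: int) -> str:
    """Return the `length`-character text stored at position `index`."""
    if not 0 <= index < BASE ** length:
        raise ValueError("index out of range for this length")
    chars = []
    for _ in range(length):
        # Peel off base-29 digits, least significant first.
        index, digit = divmod(index, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


if __name__ == "__main__":
    passage = "call me ishmael."
    position = text_to_index(passage)
    # Looking the position up again returns the same passage.
    print(index_to_text(position, len(passage)) == passage)
```

In this sketch, the index needed to find a text has about as many digits as the text has characters, so "knowing where to look" takes roughly as much information as the book itself.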

Contained within that library is a copy of every book ever written, freely available to read.

Is that book piracy? It's right there if you know where to look.

That's pretty much what's going on here. They searched the latent space for an image and found it. But that's because the latent space, like the Library of Babel, is really big and contains not just that image but also near-infinite permutations of it.

SuddenlyBANANAS t1_j6waypu wrote on February 2, 2023 at 9:58 AM

If diffusion models were a perfect bijection between the latent space and the space of possible images, that would make sense, but they're obviously not. If you could repeat this procedure and find exact duplicates of images which were not in the training data, you'd have a point.

starstruckmon t1_j6xbhe1 wrote on February 2, 2023 at 3:40 PM

>find exact duplicates of images which were not in the training data, you'd have a point

The process isn't exactly the same, but isn't this how all the diffusion-based editing techniques work?

WikiSummarizerBot t1_j6w9h7w wrote on February 2, 2023 at 9:36 AM

The Library of Babel

>"The Library of Babel" (Spanish: La biblioteca de Babel) is a short story by Argentine author and librarian Jorge Luis Borges (1899–1986), conceiving of a universe in the form of a vast library containing all possible 410-page books of a certain format and character set. The story was originally published in Spanish in Borges' 1941 collection of stories El jardín de senderos que se bifurcan (The Garden of Forking Paths). That entire book was, in turn, included within his much-reprinted Ficciones (1944).

maxToTheJ t1_j6x4vrz wrote on February 2, 2023 at 2:55 PM

> That's pretty much what's going on here.

No, it's not. If that were the case, we wouldn't need training sets, just like in the scenario described, where the whole library can be generated by a known algorithm.