gkaykck t1_j1v72ay wrote on December 27, 2022 at 5:25 PM

Reply to comment by Featureless_Bug in [Discussion] 2 discrimination mechanisms that should be provided with powerful generative models e.g. ChatGPT or DALL-E by Exnur0

I think if this is going to be implemented, it has to be at model level, not as an extra layer on top. Just thinking outloud with my not so great ML knowledge, if we mark every image in training data with some special and static "noise" which is unnoticable to human eyes, all the images generated will be marked with the same "noise". So this is for running open source alternatives on your own cluster. I think if this kind of "watermarking" will be implemented, it needs to be done in the model itself.

When it comes to "why would OpenAI do it", it would be nice for them to be able to track where does their generated pictures/content end up to for investors etc. This can also help them "license" the images generated with their models instead of charging per run.

Exnur0 OP t1_j1vgj6f wrote on December 27, 2022 at 6:26 PM

You don't actually have to watermark images in order to know that you generated them, at least not if you're checking exactly the same image - you can just hash the image, or store a low-dimension representation of it as a fingerprint (people sometimes use color histograms, in principle you could use anything). Then, you can look up images against that data to see if it's one of the ones you produced.

Brudaks t1_j21x8ut wrote on December 29, 2022 at 1:39 AM

Thing is, we can't really do that for text, natural language doesn't no free variation where you could insert a sufficient bits of special noise unnoticable to human eyes. Well, you might add some data with various formatting or unicode trickery, but that would be trivially removable by anyone who cared.

Featureless_Bug t1_j1veefz wrote on December 27, 2022 at 6:12 PM

>I think if this is going to be implemented, it has to be at model level, not as an extra layer on top. Just thinking outloud with my not so great ML knowledge, if we mark every image in training data with some special and static "noise" which is unnoticable to human eyes, all the images generated will be marked with the same "noise".

This is already wrong - it might work, it might not work

>So this is for running open source alternatives on your own cluster.

Well, of course the open source models will be trained on data without any noise added, people are not stupid

>When it comes to "why would OpenAI do it", it would be nice for them to be able to track where does their generated pictures/content end up to for investors etc. This can also help them "license" the images generated with their models instead of charging per run.

Well, open AI won't do it because no one wants watermarked images. Consequently, if they tried to watermark their outputs, people will be even more likely to use open-source alternatives. That's why open AI won't do it

Eggy-Toast t1_j1vhtgg wrote on December 27, 2022 at 6:34 PM

“This is already wrong — it might work” disingenuous much?

The point of that proposed watermark is that it can be imperceptible to the human eye but perceptible by some algorithm or model nevertheless. It only adds value to the product, but perhaps not as much as it would take to implement.

I think in your comments though you entirely overlooked the fact that DALLE 2 has watermark implementation and it is in no way subtle, but it can be cropped out.