As I'm learning about how stable diffusion works, I can't figure out why during image generation there's a need to deal with 'noise'.

I know I'm glossing over a lot of details, but my understanding is that the algorithm is trained by gradually adding noise to an image and then de-noising it to recover the initial image. Wouldn't this be functionally equivalent to a machine that starts with an image, gradually reduces it to a blank canvas (all white), and then gradually reconstructs the original image? Then, post training, the generative process would just start with a blank canvas and gradually generate the image based on the input string provided.

The idea of generating an image from a blank canvas feels more satisfying to me than revealing an image hidden by noise, but I'm sure there's a mathematical/technical reason why what I'm suggesting doesn't work. Appreciate any insight into this!

Comments

You must log in or register to comment.

NoLifeGamer2 t1_j8hmag2 wrote on February 14, 2023 at 11:08 AM

To my understanding, if you use noise, then you can generate different images using the same algorithm, just by changing the noise. If you have a blank canvas, there is only 1 initial starting position (blank), so there would be only 1 output image.

2blazen t1_j8i5fyx wrote on February 14, 2023 at 2:17 PM

That was my understanding as well, noise ensures "randomness"

[deleted] t1_j8if3mw wrote on February 14, 2023 at 3:26 PM

[deleted]

gopher9 t1_j8hl55i wrote on February 14, 2023 at 10:52 AM

There's a paper that does that and also other transformations as well: https://arxiv.org/pdf/2208.09392.pdf

tdgros t1_j8hn77o wrote on February 14, 2023 at 11:20 AM

This one as well: https://openreview.net/pdf?id=QsVditUhXR

AnotsuKagehisa t1_j8i3ttc wrote on February 14, 2023 at 2:04 PM

Its a lot easier to create a variety of shapes this way, instead of being stuck with a predetermined shape.

teenaxta t1_j8qvnx0 wrote on February 16, 2023 at 8:21 AM

I think this has more to do with probability, the sum of all random variables approaches a gaussian distribution. We can prove it using Central limit theorem. So what that really means is that the noise can map all sorts of information. Also when you add noise consistently, at one point you reach the normal distribution however, the noise pattern at hand is unique. Think of it as this way, 0,0 have a mean of 0 while -1,1 also have a mean of 0. The unique noise pattern actually contains useful information where as if you were to create a blank canvas, your generator would have no idea about what to generate from it for it is a many to one mapping. The additive noise process is a unique mapping

martianunlimited t1_j8tm2qd wrote on February 16, 2023 at 9:27 PM

This is an ELI5 explanation as to why we use noise and conditionally denoise the noise with the text encoder: Look at the clouds, and I tell you that I see an elephant in the clouds. It is easier to imagine the elephant in the clouds than if i tell you to imagine that there is an elephant in the piece of white paper.

(the less ELI5 explanation is that the entropy going from noise to an image is lower than that of from a uniform image) If you want to see that for yourself, with a bit of programming knowledge you can write your own diffuser pipeline to skip the noise adding stage and try img2img from a blank image. (it's literally just ~3 lines of edits)

(side note: someone brought up a similar question but in a different vein, (removing the random seed)