Submitted by AdministrationOk2735 t3_11229f7 in MachineLearning

As I'm learning about how Stable Diffusion works, I can't figure out why image generation needs to involve 'noise' at all.

I know I'm glossing over a lot of details, but my understanding is that the algorithm is trained by gradually adding noise to an image and then de-noising it to recover the initial image. Wouldn't this be functionally equivalent to a machine that starts with an image, gradually reduces it to a blank canvas (all white), and then gradually reconstructs the original image? Then, post training, the generative process would just start with a blank canvas and gradually generate the image based on the input string provided.
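To make the comparison concrete, here's a rough sketch of the two corruption processes I have in mind (all the constants and sizes are made up for illustration, and I may be over-simplifying the real thing):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, size=(64, 64))   # stand-in for a training image
T = 1000

# (a) What diffusion actually does (as I understand it): mix in a bit of Gaussian noise each step.
x = x0.copy()
for _ in range(T):
    x = np.sqrt(1 - 0.001) * x + np.sqrt(0.001) * rng.standard_normal(x.shape)

# (b) What I'm proposing: deterministically fade toward an all-white canvas.
white = np.ones_like(x0)
for t in range(1, T + 1):
    y = (1 - t / T) * x0 + (t / T) * white   # ends at exactly `white` for every image

print(x.mean(), x.std())          # ~0 and ~1: a random pattern, different every run
print(np.abs(y - white).max())    # 0.0: the same white canvas no matter the input image
```

After T steps, (a) ends at a fresh random pattern every run, while (b) ends at the same blank canvas no matter which image you started from.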

The idea of generating an image from a blank canvas feels more satisfying to me than revealing an image hidden by noise, but I'm sure there's a mathematical/technical reason why what I'm suggesting doesn't work. Appreciate any insight into this!

0

Comments


NoLifeGamer2 t1_j8hmag2 wrote

To my understanding, if you use noise, then you can generate different images with the same algorithm just by changing the noise. If you start from a blank canvas, there is only one possible starting position (blank), so there would be only one output image per prompt.
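For example (a rough sketch with the diffusers library; the model name and settings are just illustrative), the same prompt with different seeds, i.e. different starting noise, gives different images:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
for seed in (0, 1, 2):
    g = torch.Generator("cuda").manual_seed(seed)   # the seed determines the initial noise
    pipe(prompt, generator=g).images[0].save(f"astronaut_seed{seed}.png")  # three different images
```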

18

AnotsuKagehisa t1_j8i3ttc wrote

It's a lot easier to create a variety of shapes this way, instead of being stuck with a predetermined shape.

2

teenaxta t1_j8qvnx0 wrote

I think this has more to do with probability: the sum of many independent random variables approaches a Gaussian distribution (you can prove this with the Central Limit Theorem), so Gaussian noise can absorb all sorts of information. When you keep adding noise, at some point you reach the normal distribution, but the particular noise pattern you end up with is still unique. Think of it this way: {0, 0} has a mean of 0, while {-1, 1} also has a mean of 0; same statistics, different samples. That unique noise pattern actually contains useful information, whereas if you were to collapse everything to a blank canvas, your generator would have no idea what to generate from it, because blanking is a many-to-one mapping. The additive noise process is effectively a unique mapping.
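You can see this numerically with a quick sketch (the beta schedule below is just the commonly cited linear one, an assumption on my part): keep applying the additive-noise step to any starting image and you end up with something statistically indistinguishable from a standard Gaussian, yet a different sample every run.

```python
import numpy as np

rng = np.random.default_rng()
x = rng.uniform(0, 1, size=(64, 64))      # any starting "image"
betas = np.linspace(1e-4, 0.02, 1000)     # typical linear noise schedule (assumption)

for beta in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * eps   # one forward noising step

print(round(x.mean(), 3), round(x.std(), 3))  # ~0 and ~1, but the actual pattern differs every run
```

Two runs give the same statistics but different patterns, and it's that particular pattern the reverse process has to work with.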

1

martianunlimited t1_j8tm2qd wrote

Here's an ELI5 explanation of why we use noise and conditionally denoise it with the text encoder: look at the clouds, and I tell you that I see an elephant in them. It's much easier for you to imagine the elephant in the clouds than if I tell you to imagine an elephant on a blank piece of white paper.

(The less-ELI5 explanation is that the entropy gap going from noise to an image is lower than going from a uniform image to an image.) If you want to see this for yourself, with a bit of programming knowledge you can modify a diffusers pipeline to skip the noise-adding stage and try img2img from a blank image (it's literally just ~3 lines of edits).
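A quick sketch of roughly the same experiment using the stock pipeline (the model name, shapes, and prompt are illustrative; the `latents=` argument simply lets you override the usual random starting noise):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

shape = (1, pipe.unet.config.in_channels, 64, 64)                # latent shape for a 512x512 image
noise = torch.randn(shape, dtype=torch.float16, device="cuda")   # normal starting point
blank = torch.zeros(shape, dtype=torch.float16, device="cuda")   # the "blank canvas" starting point

prompt = "an elephant"
pipe(prompt, latents=noise).images[0].save("from_noise.png")   # normal generation
pipe(prompt, latents=blank).images[0].save("from_blank.png")   # starting from a blank latent
```

Compare the two outputs and you'll see the difference the starting noise makes.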

(Side note: someone brought up a similar question in a different vein: removing the random seed.)

1