Submitted by AdministrationOk2735 t3_11229f7 in MachineLearning
As I'm learning about how Stable Diffusion works, I can't figure out why image generation needs to involve 'noise' at all.
I know I'm glossing over a lot of details, but my understanding is that the algorithm is trained by gradually adding noise to an image and then denoising it to recover the initial image. Wouldn't this be functionally equivalent to a machine that starts with an image, gradually reduces it to a blank canvas (all white), and then gradually reconstructs the original image? Then, post-training, the generative process would just start with a blank canvas and gradually generate the image based on the input string provided.
The idea of generating an image from a blank canvas feels more satisfying to me than revealing an image hidden by noise, but I'm sure there's a mathematical/technical reason why what I'm suggesting doesn't work. Appreciate any insight into this!
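To make the part I'm glossing over a bit more concrete, here's roughly how I understand the noising/training step (a minimal DDPM-style sketch; the names are just illustrative, not from any particular library):

    import torch

    # Minimal sketch of the training setup I'm describing (DDPM-style).
    # All names here are illustrative, not from a specific library.

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)               # noise schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

    def add_noise(x0, t):
        """Forward process: blend the clean image with Gaussian noise at step t."""
        noise = torch.randn_like(x0)
        signal = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
        sigma = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
        return signal * x0 + sigma * noise, noise

    def training_loss(model, x0):
        """Train the network to predict the noise that was mixed in."""
        t = torch.randint(0, T, (x0.shape[0],))
        xt, noise = add_noise(x0, t)
        return torch.nn.functional.mse_loss(model(xt, t), noise)

My question is basically: why Gaussian noise here, rather than a degradation that fades everything to a blank canvas?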
gopher9 t1_j8hl55i wrote
There's a paper that does exactly that, along with other transformations as well: https://arxiv.org/pdf/2208.09392.pdf
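Very roughly, the recipe there replaces the Gaussian noising step with an arbitrary degradation (blur, masking, fading to a blank canvas, ...) and trains a restoration network to undo it. A quick sketch of the idea as I read it (the toy degradation and the names below are mine, not the paper's code):

    import torch

    # Rough sketch of the degradation-agnostic idea in the linked paper: the
    # Gaussian noising step is replaced by an arbitrary degradation D(x0, t),
    # here a toy fade toward a blank canvas. Names/choices are illustrative.

    T = 100

    def degrade(x0, t):
        """Toy degradation D(x0, t): linearly fade the image toward a blank (all-ones) canvas."""
        frac = t / T
        return (1.0 - frac) * x0 + frac * torch.ones_like(x0)

    def training_loss(restorer, x0):
        """Train a restoration network to recover the clean image from its degraded version."""
        t = int(torch.randint(1, T + 1, (1,)))
        xt = degrade(x0, t)
        return torch.nn.functional.l1_loss(restorer(xt, t), x0)

    @torch.no_grad()
    def sample(restorer, x_T):
        """Paper's improved sampling rule: x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1)."""
        x = x_T
        for t in range(T, 0, -1):
            x0_hat = restorer(x, t)
            x = x - degrade(x0_hat, t) + degrade(x0_hat, t - 1)
        return x

One caveat, as I read the paper: for actual generation you still need some source of variation in the fully degraded state (they sample it from a simple fitted distribution), otherwise every output starting from one fixed blank canvas would come out identical.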