ZeronixSama

ZeronixSama OP t1_iuitqgd wrote

Ok, I think this blog post helped me understand: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

Essentially, the idea is that models with tractable log-likelihoods are usually not flexible enough to capture the rich structure in real datasets, and flexible models usually don't have tractable log-likelihoods. So explicitly trying to model the log-likelihood for such datasets is a doomed endeavour, but modelling the gradient of the log-likelihood with respect to the data (the score) is both tractable and flexible 'enough' to be practically useful.
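A toy sketch of why the gradient is the easier target, assuming a made-up 1-D double-well energy (the `energy` function and all names here are illustrative, not from the blog post). Evaluating log p(x) needs the normalizing constant Z, which takes an integral over the whole space, while the gradient of log p(x) with respect to x drops Z entirely:

```python
import numpy as np

def energy(x):
    # Hypothetical double-well energy; p(x) is proportional to exp(-energy(x)).
    return x**4 / 4 - x**2 / 2

def score(x):
    # d/dx log p(x) = -dE/dx. The normalizer Z never appears.
    return x - x**3

def log_normalizer(lo=-10.0, hi=10.0, n=200_001):
    # log Z via a Riemann sum. Only feasible here because x is 1-D;
    # in high dimensions this integral is the intractable part.
    xs = np.linspace(lo, hi, n)
    dx = xs[1] - xs[0]
    return np.log(np.sum(np.exp(-energy(xs))) * dx)

def log_density(x, log_z):
    # Full log-likelihood: needs log Z.
    return -energy(x) - log_z

log_z = log_normalizer()
x, h = 0.7, 1e-5
# Finite-difference gradient of the normalized log density.
fd = (log_density(x + h, log_z) - log_density(x - h, log_z)) / (2 * h)
print(fd, score(x))  # the two values should agree closely
```

In a real diffusion or score-based model the score would come from a trained network rather than a closed form; this only illustrates that Z cancels out of the gradient.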

P.S. That does make me wonder if it's turtles all the way down... In a sense, distributions whose grad(log-likelihood) can be tractably modelled could also be argued to be less flexible than distributions that fall outside this class, so in the future there may be some second-order diffusion method that operates on grad(grad(log-likelihood)) instead. The downside is the huge compute required for the second derivative, but the upside could be much more flexible modelling capability.
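A rough illustration of the compute concern, assuming a Gaussian toy case (all names are hypothetical, and this is not an existing second-order method): for d-dimensional data the score has d entries, while grad(grad(log-likelihood)) is a d x d matrix, so the object being modelled grows quadratically with dimension.

```python
import numpy as np

def gaussian_score(x, mean, precision):
    # grad_x log N(x; mean, precision^-1) = -precision @ (x - mean)
    return -precision @ (x - mean)

def numerical_jacobian(f, x, h=1e-5):
    # Central finite differences; column j is the derivative of f along x_j.
    d = x.size
    jac = np.empty((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = h
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return jac

d = 4
rng = np.random.default_rng(0)
a = rng.normal(size=(d, d))
precision = a @ a.T + d * np.eye(d)  # symmetric positive definite
mean = rng.normal(size=d)
x = rng.normal(size=d)

s = gaussian_score(x, mean, precision)  # first order: d numbers
hess = numerical_jacobian(lambda v: gaussian_score(v, mean, precision), x)  # second order: d*d numbers
print(s.shape, hess.shape)
```

For a Gaussian the second derivative is just the constant matrix -precision, so nothing is gained here; the sketch only shows the size difference between the two objects.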
