
dasayan05 t1_it6u4ho wrote

To clarify, "score matching" itself is quite theoretically grounded -- what is not is the coupling between score matching and Langevin dynamics: the two are not theoretically tied together. Langevin dynamics is chosen more as an intuitive way of "using" the score estimates. Moreover, Langevin dynamics theoretically takes infinite time to reach the true distribution, and its convergence depends on a proper choice of `\delta`, a tiny number that acts as the step size:

x_{t-1} = x_t + \frac{\delta}{2} s(x_t, t) + \sqrt{\delta} z, \qquad z \sim \mathcal{N}(0, I)
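Here's a rough sketch of that sampler (not from any paper, just illustrative -- the toy `score` of a standard Gaussian stands in for a learned s(x, t), and `delta`/`n_steps` are arbitrary choices):

```python
import numpy as np

def score(x):
    # Toy stand-in for a learned score s(x, t): the score of a
    # standard Gaussian N(0, 1) is just -x.
    return -x

def langevin_sample(x, delta=1e-2, n_steps=5000, seed=0):
    # x <- x + (delta / 2) * score(x) + sqrt(delta) * z, z ~ N(0, I).
    # Converges to the target only as n_steps -> inf and delta -> 0,
    # which is exactly the hyperparameter sensitivity mentioned above.
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * delta * score(x) + np.sqrt(delta) * z
    return x

samples = langevin_sample(np.full(10000, 5.0))
print(samples.mean(), samples.std())  # should be close to 0 and 1
```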

Now, look at DDPM. DDPM's training objective is totally "coupled" with its sampling process -- it all comes from very standard calculations on the underlying PGM (probabilistic graphical model). Notice that DDPM's reverse process does not involve a hyperparameter like `\delta`; everything is tied to the known \beta schedule, which tells you exactly what step size to take in order to converge in finitely many (T) steps. DDPM's reverse process is not Langevin dynamics -- it just looks like it, but has a stronger guarantee on convergence.
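To make the contrast concrete, here's a minimal sketch of one DDPM reverse step (following Ho et al.'s ancestral sampler; `eps_model` is a placeholder for the trained noise predictor, and the linear schedule values are the ones from that paper):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # the fixed, known beta schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_reverse_step(x_t, t, eps_model, rng):
    # One reverse step. Every coefficient is determined by the beta
    # schedule -- no free step size, and exactly T steps by construction.
    eps = eps_model(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z
```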

This makes it more robust than score-based Langevin dynamics.


UncleVesem1r t1_it801hk wrote

Thank you! My intuition was that score matching + Langevin doesn’t have a forward diffusion process, which probably contributed to why there has to be a step size (right?) and I agree that LD seemed to be an easy way to use the scores.

How about the SDE formulation of score matching? They also claimed that DDPM is a discretization of a variance-preserving SDE. As far as I can tell, the reverse SDE is available in closed form given the forward SDE and doesn't require extra hyperparameters.


dasayan05 t1_it95xq7 wrote

IMO, the forward diffusion process isn't really a "process" -- it need not be sequential; it's parallelizable. The sole purpose of the forward process is to simulate noisy data from a set of "noisy data distributions" crafted with a known set of noise scales -- that's it. SBMs (score-based models) and DDPM both have this. For SBMs, choosing the correct largest noise scale is again a heuristic hyperparameter: it must overpower the data variance so that the chain reaches an uninformative prior. For DDPM, the process always reaches the prior due to the way the noise scales and attenuation coefficients are computed from \beta_t.
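In code, the "process" is just one closed-form sampling step at any t, which is why it's trivially parallelizable (a sketch reusing the DDPM schedule from above; `x0` would be a batch of clean data):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noisify(x0, t, rng):
    # q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I):
    # sample any timestep t directly, no sequential simulation needed.
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# alpha_bar at t = T-1 is ~4e-5 with this schedule, so x_{T-1} is
# essentially pure noise for any x0 -- the prior is reached by construction.
```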

Agree with your second part. The SDE formulation is good -- it basically puts SBMs on a much stronger theoretical footing. SDEs offer an analytic reverse process in which the score naturally appears -- i.e., again, not many hyperparameters.
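For instance, a minimal Euler-Maruyama discretization of the analytic VP reverse SDE (the 0.1/20 beta range is the one used in Song et al., 2021 -- an assumption on my part, as is the `score_model` placeholder; note dt here is just the integration grid, not a tuned Langevin step size):

```python
import numpy as np

def beta(t):
    # Continuous-time VP schedule with beta_min = 0.1, beta_max = 20.
    return 0.1 + (20.0 - 0.1) * t

def reverse_vp_sde(x, score_model, n_steps=1000, seed=0):
    # Euler-Maruyama on the analytic reverse SDE
    #   dx = [-0.5*beta(t)*x - beta(t)*score(x, t)] dt + sqrt(beta(t)) dw,
    # integrated backward from t = 1 to t = 0.
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        drift = -0.5 * beta(t) * x - beta(t) * score_model(x, t)
        z = rng.standard_normal(x.shape)
        x = x - drift * dt + np.sqrt(beta(t) * dt) * z
    return x
```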
