Red-Portal t1_it516cq wrote on October 20, 2022 at 11:58 PM

I actually think it's the opposite. Although the "learning the noise" part is voodoo, the probabilistic model itself is quite sound, if you're slightly Bayesian. What DDPM is doing is that, assuming the transition is Gaussian, then let's find this Gaussian. There's nothing inherently wrong about this since you're conditioning on the assumption. I have a problem with the "let's learn the noise like psychopaths" part too; but I think it has something to do with scaling of the variational objective. Score matching on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.

UncleVesem1r t1_it5bxzm wrote on October 21, 2022 at 1:18 AM

Thank you for the reply. It was very helpful.

>Score matching on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.

Sorry if I'm being dense. Could you expound on this? Or could you be more explicit regarding which part of DDPM provides such theoretical guarantee, while SBM fails to do so, perhaps providing eq numbers from the papers? I'm fairly new to this and it's hard for me to parse all the equations and understand which part is fluff and which is the real meat. Thank you very much!

Red-Portal t1_it5ciyb wrote on October 21, 2022 at 1:22 AM

DDPM doesn't aim to produce anything related to Langevin sampling. However, it's objective function is equivalent to the KL divergence between the "true" Gaussian and the neural network parameterized Gaussian. Thus, as long as SGD does optimize the DDPM objective, you'll get something that is close to the true Gaussian according to the KL divergence. The problem is that learning the noise with MSE kindda ruins all of this...

UncleVesem1r t1_it5rffe wrote on October 21, 2022 at 3:15 AM

I see! I understand why DDPM is good now. I should go back to the paper and pay more attention to the KL divergence part of it.

If I could borrow a few more minutes of your time, could you explain more about what's not as good about score matching?

So to be explicit, my understanding Langevin sampling is correct, i.e., if there's a model that can accurately model the score function, one should be able to recover the true data distribution. If this is true, then I guess the criticism regarding SM is about its objective function, i.e., there's no guarantee that it leads to accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to solve the objective function involving grad_x log p(x)?

Or perhaps Langevin sampling is the problem. The paper does say that with small enough noise and enough steps, we would end up in an exact sample from the data set. Yet if we don't have small enough noise and enough steps, perhaps we end up somewhere but it doesn't guarantee to be the true data distribution?

I really appreciate this! Thanks again.

Red-Portal t1_it5v31k wrote on October 21, 2022 at 3:46 AM

>there's no guarantee that it leads to accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to solve the objective function involving grad_x log p(x)?

Oh no it's not. All it's doing is to minimize the mean-squares error against the score function. Minimizing this objective does not mean sampling using this score function will be a good idea; which it isn't. This is exactly why score modelling has to rely on adding noise. And by doing this, they converged to DDPM.

UncleVesem1r t1_it5wdui wrote on October 21, 2022 at 3:58 AM

Very cool! I think the pitfalls mentioned in the SM paper also make more sense now.

Thank you kind sir/madam

dasayan05 t1_it6u4ho wrote on October 21, 2022 at 10:58 AM

To clarify, "score matching" itself is quite theoretically grounded -- what is not, is the fact that score matching and langevin dymanics is not theoretically "coupled". Langevin dynamics is chosen more like an intuitive way of "using" the score-estimates. Moreover, langevin dynamics theretically takes infinite time to reach the true distribution and it's convergence depends on proper choice of `\delta`, a tiny number that acts like step size.

x_{t-1} = x_t + s(x_t, t) \delta / 2 + sqrt{\delta} z

Now, look at DDPM. DDPM's training objective is totally "coupled" with it's sampling process -- it all comes from very standard calculations on the underlying PGM (probabilistic graphical model). Notice that DDPMs reverse process do not involve a hyperparam like `\delta`, everything is tied to the known \beta schedule -- which tells you what exact step size to take in order to converge in finitely many (T) steps. DDPM's reverse process is not langevin dynamics -- it just looks like it, but has stronger gurantee on convergence.

This makes it more robust compared to Score based langevin dynamics.

UncleVesem1r t1_it801hk wrote on October 21, 2022 at 4:26 PM

Thank you! My intuition was that score matching + Langevin doesn’t have a forward diffusion process, which probably contributed to why there has to be a step size (right?) and I agree that LD seemed to be an easy way to use the scores.

How about the SDE formulation of score matching? They also claimed that DDPM is a variance preserving discretization SDE. As far as I can tell, the reverse SDE is a closed form solution of forward SDE and doesn’t require extra hyper parameters.

dasayan05 t1_it95xq7 wrote on October 21, 2022 at 9:07 PM

IMO, forward diffusion process isn't really a "process" -- it's need not be sequential, it's parallelizable. The sole purpose of forward process is simulating noisy data from a set of "noisy data distributions" crafted with a known set of noise-scales -- that's it. SBM and DDPM both have this process. For SBMs, it is again a heuristic HP to choose the correct largest scale so that it can overpower the data variance and reach an uninformative prior. For DDPM, it always reaches the prior due to the way noise-scales and attenuation coefficients are computed from \beta_t.

Agree with your second part. SDE formulation is good -- it basically brings SBMs into a more stronger theoretical framework. SDEs offer a reverse process which is analytic where the score naturally appears -- i.e. again not much HP.

[D] DDPM vs Score Matching

dasayan05 t1_it46pby wrote on October 20, 2022 at 8:22 PM

WallabyDue2778 OP t1_it4zvey wrote on October 20, 2022 at 11:48 PM