Red-Portal t1_it516cq wrote

I actually think it's the opposite. Although the "learning the noise" part is voodoo, the probabilistic model itself is quite sound, if you're slightly Bayesian. What DDPM is doing is: assuming the transition is Gaussian, let's find that Gaussian. There's nothing inherently wrong with this, since you're conditioning on the assumption. I have a problem with the "let's learn the noise like psychopaths" part too, but I think it has something to do with the scaling of the variational objective. Score matching, on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.
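To make the "assuming the transition is Gaussian" part concrete, here's a minimal sketch of the DDPM forward (noising) process. The linear beta schedule and the toy values are my own illustrative choices, not taken from either paper:

```python
import numpy as np

# Hypothetical toy setup: linear beta schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Forward process: q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).

    Returns the noised sample and the noise that produced it (the thing
    DDPM's network is trained to predict).
    """
    eps = rng.standard_normal(np.shape(x0))
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```

By the last step, `alpha_bars[-1]` is tiny, so `x_T` is essentially pure Gaussian noise regardless of `x_0`, which is what makes the reverse (denoising) chain well defined.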

UncleVesem1r t1_it5bxzm wrote

Thank you for the reply. It was very helpful.


>Score matching on the other hand, has no theoretical guarantee that it will produce something accurate enough to be used for Langevin sampling.

Sorry if I'm being dense. Could you expound on this? Or could you be more explicit about which part of DDPM provides such a theoretical guarantee while SBM fails to, perhaps with equation numbers from the papers? I'm fairly new to this, and it's hard for me to parse all the equations and tell which parts are fluff and which are the real meat. Thank you very much!

Red-Portal t1_it5ciyb wrote

DDPM doesn't aim to produce anything related to Langevin sampling. However, its objective function is equivalent to the KL divergence between the "true" Gaussian and the neural-network-parameterized Gaussian. Thus, as long as SGD actually optimizes the DDPM objective, you'll get something that is close to the true Gaussian in KL divergence. The problem is that learning the noise with MSE kinda ruins all of this...
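To see how that KL connects to a squared error: for two Gaussians with the same fixed variance, the KL divergence is just a scaled squared difference of the means, and DDPM's reparameterization of the mean in terms of the noise turns this into the familiar "predict the noise" MSE. A tiny sketch in my own notation (not from the paper):

```python
import numpy as np

def kl_gauss_same_var(mu_q, mu_p, sigma):
    """KL( N(mu_q, sigma^2) || N(mu_p, sigma^2) ) in closed form.

    With a shared variance, the log-det and trace terms cancel, leaving
    only the squared distance between the means, scaled by 1/(2 sigma^2).
    """
    return 0.5 * (mu_q - mu_p) ** 2 / sigma**2
```

So minimizing the KL with fixed variance is exactly minimizing an MSE on the means; substituting DDPM's noise parameterization of the mean is what produces the eps-prediction loss.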

UncleVesem1r t1_it5rffe wrote

I see! I understand why DDPM is good now. I should go back to the paper and pay more attention to the KL divergence part of it.

If I could borrow a few more minutes of your time, could you explain more about what's not as good about score matching?

So to be explicit: is my understanding of Langevin sampling correct, i.e., that if a model can accurately approximate the score function, one should be able to recover the true data distribution? If so, then I guess the criticism of SM is about its objective function, i.e., there's no guarantee that it leads to an accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to optimize the objective involving grad_x log p(x)?

Or perhaps Langevin sampling is the problem. The paper does say that with small enough noise and enough steps, we would end up with an exact sample from the data distribution. Yet if the noise isn't small enough or there aren't enough steps, perhaps we end up somewhere, but with no guarantee that it's a sample from the true data distribution?

I really appreciate this! Thanks again.

Red-Portal t1_it5v31k wrote

>there's no guarantee that it leads to accurate score function? But aren't the score matching algorithms (denoising, projection) supposed to be able to solve the objective function involving grad_x log p(x)?

Oh no, it's not. All it's doing is minimizing the mean-squared error against the score function. Minimizing this objective does not mean sampling with the learned score will be a good idea; and in practice it isn't. This is exactly why score modelling has to rely on adding noise. And by doing this, it converged to DDPM.
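To spell out what "minimize the mean-squared error against the score function" means, here's a toy Monte Carlo version of that objective for a target whose true score is known (function names are mine, purely illustrative):

```python
import numpy as np

def score_mse(model_score, xs, true_score):
    """Monte Carlo estimate of E_p[ (s_theta(x) - grad_x log p(x))^2 ].

    The expectation is over x ~ p, so the loss is dominated by the
    high-density region. A model can make this tiny while still being
    badly wrong in low-density regions -- exactly the regions a Langevin
    chain has to cross, which is one motivation for adding noise.
    """
    return np.mean((model_score(xs) - true_score(xs)) ** 2)
```

E.g. for p = N(0, 1) the true score is -x; a model equal to -x gets loss 0, while one shifted by a constant delta gets loss delta², no matter how that error is distributed over rare regions.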

UncleVesem1r t1_it5wdui wrote

Very cool! I think the pitfalls mentioned in the SM paper also make more sense now.

Thank you kind sir/madam
