
schwagggg t1_iswqh92 wrote

cool stuff!

2 things:

  1. i am still trying to wrap my head around how to do this stuff: say we have a 2-layer NN with Bernoulli neurons, how do you take the derivative wrt the first layer's weights in this case?

  2. seems to me that this approach needs many function evaluations. does it scale well wrt the # of stochastic variables? if i use it for a VAE with an expensive decoder and, say, 1024 stochastic latents, would it be bad?


ChrisRackauckas OP t1_iswr0wc wrote

(1) While running your primal program, you run a second program alongside it that propagates the infinitesimal probabilities of certain discrete pieces flipping, and it chooses those flips in the right proportion (as derived in the paper) to give two correlated but different runs whose difference is Y(p). This Y(p) is defined to have the property that E[Y(p)] = dE[X(p)]/dp with low variance, so you do this a few times and the average is your gradient estimate.

(2) Unlike previous algorithms with known exponential cost scaling (for example, see https://openreview.net/forum?id=KAFyFabsK88 for a deep discussion of prior work's performance), this scales linearly, so 1024 should be fine. Note that this is related to forward-mode AD, so "really big" needs more work, but that size is fine.
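To make the Y(p) idea concrete, here's a toy sketch in Python (my own illustration of the coupled-runs principle for a single Bernoulli, not the paper's actual algorithm; `bernoulli_derivative_estimate` and its flip weight are my own construction). Since E[f(B)] = p·f(1) + (1-p)·f(0) for B ~ Bernoulli(p), the true derivative is f(1) - f(0), and the estimator below hits it in expectation by flipping a sampled 0 to 1 with weight 1/(1-p):

```python
import numpy as np

def bernoulli_derivative_estimate(f, p, rng):
    """One sample of Y(p) with E[Y(p)] = d/dp E[f(B)], B ~ Bernoulli(p)."""
    b = rng.random() < p               # primal sample
    if b:
        # increasing p can't flip a sampled 1 downward in this
        # right-derivative direction, so the contribution is zero
        return 0.0
    # a sampled 0 flips to 1 with infinitesimal probability dp/(1-p);
    # difference the two correlated runs and weight accordingly
    weight = 1.0 / (1.0 - p)
    return (f(1) - f(0)) * weight

rng = np.random.default_rng(0)
p = 0.3
f = lambda x: 3.0 * x + 1.0            # true derivative: f(1) - f(0) = 3
est = np.mean([bernoulli_derivative_estimate(f, p, rng)
               for _ in range(100_000)])
```

Averaging a modest number of these low-variance samples gives the gradient estimate, which is the "do this a few times" step above.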


schwagggg t1_isxz4cw wrote

then this sounds a bit like measure-valued derivatives? you perturb, then calculate the derivative. but wouldn't this be at least O(D) expensive for one layer, and O(LD) for L layers of D-dim rvs?


ChrisRackauckas OP t1_isy96fg wrote

O(LD), yes. So ideally you'd want a reverse-mode O(L+D) version, but without bias and at low variance, and that's the next step here.
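The O(D)-per-layer scaling of forward mode can be seen in a small sketch (again my own toy construction, not the package's implementation; `directional_estimate` is a hypothetical helper): each parameter direction needs its own flip-propagation pass, so a full gradient over D Bernoulli probabilities costs D forward evaluations.

```python
import numpy as np

def directional_estimate(f, ps, i, rng):
    """Unbiased sample of d/dp_i E[f(B)] for B ~ Bernoulli(ps), coord-wise."""
    b = (rng.random(ps.size) < ps).astype(float)  # shared primal sample
    if b[i] == 1.0:
        return 0.0                                 # no upward flip possible
    alt = b.copy()
    alt[i] = 1.0                                   # flip coordinate i
    return (f(alt) - f(b)) / (1.0 - ps[i])         # correlated difference

rng = np.random.default_rng(1)
D = 8
ps = np.full(D, 0.4)
f = lambda b: float(np.sum(b))                     # dE[f]/dp_i = 1 for all i

# one directional pass per coordinate: D passes total, hence O(D) per layer
grad = np.array([np.mean([directional_estimate(f, ps, i, rng)
                          for _ in range(20_000)])
                 for i in range(D)])
```

A reverse-mode variant would get the whole gradient from one backward pass instead of D forward ones, which is the O(L+D) goal mentioned above.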
