
ChrisRackauckas OP t1_iswr0wc wrote

(1) While running your primal program, you run a second program alongside it that propagates infinitesimal probabilities of certain discrete pieces changing, and it chooses the flips in the right proportion (as derived in the paper) to give two correlated but different runs to difference, producing Y(p). This Y(p) is constructed to have the property that E[Y(p)] = dE[X(p)]/dp with low variance, so you do this a few times, average, and that is your gradient estimate. (2) Unlike previous algorithms with known exponential cost scaling (see https://openreview.net/forum?id=KAFyFabsK88 for a deep discussion of prior work's performance), this scales linearly, so 1024 should be fine. Note that this is related to forward-mode AD, so "really big" needs more work, but that size is fine.
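To make (1) concrete, here's a minimal sketch for the simplest possible case: a single Bernoulli(p) coin fed into a downstream function f. This is not the paper's actual implementation (which propagates these perturbations through whole programs); the weight 1/(1-p) and the helper name are just an illustrative toy construction for the scalar case, where the "alternative run" is the single 0 → 1 flip.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_right_derivative(p, f, n_samples=100_000):
    """Toy estimator Y(p) with E[Y(p)] = d/dp E[f(X(p))] for X ~ Bernoulli(p).

    The primal run draws X; the correlated "alternative" run is the coin
    flipped to 1, weighted so the difference is unbiased for the derivative.
    """
    estimates = np.empty(n_samples)
    for i in range(n_samples):
        x = rng.random() < p              # primal draw: X = 1 with probability p
        if not x:
            w = 1.0 / (1.0 - p)           # importance weight for the 0 -> 1 flip
            estimates[i] = w * (f(1) - f(0))   # difference of the two correlated runs
        else:
            estimates[i] = 0.0            # no right-perturbation when X is already 1
    return estimates.mean()

# E[f(X(p))] = p*f(1) + (1-p)*f(0), so the true derivative is f(1) - f(0).
f = lambda x: 3.0 * x + 1.0
print(bernoulli_right_derivative(0.3, f))  # ~ 3.0
```

Averaging over n_samples plays the role of "do this a few times"; a full program would carry these weighted alternatives through every downstream discrete and continuous operation rather than stopping at one coin.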


schwagggg t1_isxz4cw wrote

Then this sounds a bit like the measure-valued derivative: you perturb, then calculate the derivative. But wouldn't this be at least O(D) expensive for one layer, and O(LD) for L layers of D-dimensional random variables?


ChrisRackauckas OP t1_isy96fg wrote

O(LD), yes, so you'd want a reverse-mode O(L+D) version, but without bias and at low variance, and that's the next step here.
