farmingvillein t1_izxuivo wrote

> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why

Isn't this just because there is no backward pass being calculated? That is, where you take a loss and then need to compute a gradient, etc.

Or am I missing something.

17

gwern t1_izxv8bw wrote

Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?

9

midasp t1_izybqp8 wrote

You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does, and it would not identify common features. If you tried to train a typical U-Net-style architecture this way instead of with backprop, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.

7

Akrenion t1_j011yxv wrote

U-Net is specifically designed for backprop; its skip connections are what make bp work well there. We might need to rethink architectures for other approaches as well.

3

ChuckSeven t1_j016l2h wrote

That's actually a fair point. The optimisation lottery, if you will: architectures are biased because they are designed around the algorithms that can be scaled and have been shown to "work".

2

farmingvillein t1_izy22t6 wrote

Got it.

I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).
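A rough sketch of that reading (the rounding "blackbox", layer sizes, and the placeholder local objective are all made up for illustration, not from the paper): each trainable layer gets its own local loss, so no gradient ever has to cross the blackbox.

```python
import torch

def blackbox(x):
    # Stand-in for some non-differentiable component (hard rounding here,
    # purely for illustration).
    return torch.round(x)

layer_a = torch.nn.Linear(784, 500)
layer_b = torch.nn.Linear(500, 500)
opt_a = torch.optim.SGD(layer_a.parameters(), lr=1e-2)
opt_b = torch.optim.SGD(layer_b.parameters(), lr=1e-2)

def local_objective(h):
    # Placeholder for whatever per-layer objective is used; all that matters
    # here is that it depends only on this layer's own output.
    return h.pow(2).sum(dim=1).mean()

def train_step(x):
    # Layer A updates from its own local objective.
    h_a = torch.relu(layer_a(x))
    opt_a.zero_grad()
    local_objective(h_a).backward()
    opt_a.step()

    # Layer B treats the blackbox output as plain data, so no gradient
    # ever has to flow through blackbox() itself.
    h_b = torch.relu(layer_b(blackbox(h_a).detach()))
    opt_b.zero_grad()
    local_objective(h_b).backward()
    opt_b.step()
```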

4

mgostIH t1_j02vbuy wrote

All the layers are trained independently at the same time. You can use gradients, but you don't need backprop, because each layer's objective can be written down explicitly: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer receives a normalized version of the previous layer's output).
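A minimal PyTorch-style sketch of that per-layer objective (the layer class, threshold value, and optimizer choice are my own illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    """One layer trained only on its own local 'goodness' objective."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold               # illustrative hyperparameter
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Pass only the direction of the previous layer's activity forward.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activations, roughly ||relu(W x)||^2.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push goodness above the threshold for positives, below for negatives.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()      # gradient stays inside this layer
        self.opt.step()
        # Detach outputs so nothing ever backpropagates across layers.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```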

The issue I find with this (besides generating good contrastive examples) is that I don't understand how it would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the output of earlier layers will orient things in a way that doesn't make it too hard for the next layers, which have only linear dynamics.

1