farmingvillein t1_izxuivo wrote on December 12, 2022 at 5:32 PM

> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why

Isn't this just because there is no backwards pass being calculated? Where you're taking a loss, then needing to calculate a gradient, etc.

Or am I missing something.

gwern t1_izxv8bw wrote on December 12, 2022 at 5:37 PM

Yeah, it obviously doesn't have a gradient, but what I don't quite get how the blackbox component trains without a gradient being computed by anything. Is it a finite difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about or is it merely a curiosity?

midasp t1_izybqp8 wrote on December 12, 2022 at 7:21 PM

You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs. Which is largely similar to how many RL policies learn. FF seem like it will be able to discriminate, but it won't be able to encode and embed features the way back prop does. It would not identify common features. If you try to train a typical back prop based u-net architecture network, my instincts say it likely would not work since the discriminating information is not distributed across the entire network.

Akrenion t1_j011yxv wrote on December 13, 2022 at 8:27 AM

U-net is specifically designed for backprop. It the skip connections are helpful for bp. We might need to rethink architectures for other approaches as well.

ChuckSeven t1_j016l2h wrote on December 13, 2022 at 9:33 AM

That's actually a fair point. The optimisation lottery if you will. Where architectures are biased because they are designed around the algorithms that can be scaled and have shown to "work".

farmingvillein t1_izy22t6 wrote on December 12, 2022 at 6:20 PM

Got it.

I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).

mgostIH t1_j02vbuy wrote on December 13, 2022 at 6:11 PM

All the layers are trained independently at the same time, you can use gradients but you don't need backprop because you can use explicit descriptions since each layer will have as a problem maximizing ||W * x||^2 for good samples, minimizing it for bad samples (each layer gets a normalized version of the previous output).

The issue I find in this is (besides generating good contrastive examples) that I don't understand how this would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the output of the past layers will orient things in a way that doesn't make it too hard for the next layers, which have only linear dynamics.

DeepNonseNse t1_izxxdf0 wrote on December 12, 2022 at 5:50 PM

As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox compenents which are not adjusted/learned at all. I.e. model architecture could be something like layer_1 -> blackbox -> layer_2, where layer_i:s are locally optimized using typical gradient based algorithms and the blackbox is just doing some predefined calculations in-between.

So given that, I can't see how the blackbox aspect is really that usefull. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with usefull blackboxes outside of maybe some simple normalization/sampling etc.

Phoneaccount25732 t1_izzmxc0 wrote on December 13, 2022 at 12:36 AM

Useful in what are now operator learning contexts maybe.

ShepardRTC t1_izxiqr1 wrote on December 12, 2022 at 4:15 PM

The blackbox part is very interesting. Perhaps that will open up new avenues that no one had ever thought of.

SoylentRox t1_izy1csf wrote on December 12, 2022 at 6:15 PM

I have thought this is how we get to really robust, high performance AGI. It seems so obvious.

The steps are:. Have a test environment with a diverse number of auto graded tasks requiring varying levels of skill and cognition. "Big bench" but bigger, call it AGI gym.

The "AGI hypothesis" is an architecture of architectures : it's a set of components interconnected in some way, and those components came from a "seed library" or were auto discovered in another step as a composition of seed components.

The files to define a possible "AGI candidate" are simple and made to be manipulable as an output on an AGI gym task....

Recursion....

You see the idea. So basically I think truly effective AGI architectures are going to be very complex and human hypotheses are wrong. So you find them recursively using prior AGIs that did well on "AGI gym" which includes tasks to design other AGIs among the graded challenges...

Note at the end of the day you end up with a model that does extremely well at "AGI gym". With careful selection of the score heuristic we can select for models that are, well, general and as simple as possible.

It doesn't necessarily have any science fiction abilities, only it will do extremely well at tasks that are mutations of the gym task. If some of them are robotics tasks with realistic simulated input from the real world, it would do well in the real world at those tasks also.

Some of the tasks would be to "read this description of what I want you to do in the simulated world and do it with this robot". And the descriptions are procedurally generated from a very large set.

The whole process would be ongoing - each commit makes AGI gym harder, and the population of successful models gets ever more capable. You fund this by selling the services of the current best models.

[D] G. Hinton proposes FF – an alternative to Backprop

ShepardRTC t1_izwzjw5 wrote on December 12, 2022 at 1:55 PM

JackandFred t1_izx3k5r wrote on December 12, 2022 at 2:27 PM

gwern t1_izxgfqm wrote on December 12, 2022 at 3:59 PM