Comments

tysam_and_co t1_izwgts5 wrote

Great paper. I'd love to play with the concepts in it one day; that would be cool. It looks like a new paradigm and, as is usual for Hinton, has several decades of thought behind it. I'm a little worried about the more philosophical bent he took near the end -- he's getting older, and as far as I can recall he doesn't usually take that grave a tone in his papers. I hope he, and everyone around him, is well. :')

25

pr0u t1_izwh94p wrote

Inb4 Schmidhuber invented this 20 years ago

114

rehrev t1_izwhvcp wrote

Dude who tf thinks brains learn with backprop

−14

sea-shunned t1_izwkg9p wrote

For the purpose of designing something more akin to biology, does anyone know why there is no consideration of spiking neural networks (SNNs) in this proposal? Is it just that they are far less practical to implement than the Forward-Forward algorithm (and thus less attractive)? My understanding was that SNNs are more biologically realistic, but maybe not enough to be pursued.

12

IntelArtiGen t1_izwl2wr wrote

>Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn awake too.

In a way you can also do that with a regular NN. Usually we do "long training phase (many backprops) => test phase only". But we can do "backprop => test => backprop => test ..." if it suits our task (it usually doesn't), training and using one model simultaneously.
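
For concreteness, a minimal sketch of that interleaved loop in PyTorch; the model, optimizer, and data stream here are placeholder choices of mine, not anything from the paper:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy stand-in for any network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def stream():
    # hypothetical continuous stream of labeled samples
    for _ in range(100):
        yield torch.randn(1, 10), torch.randint(0, 2, (1,))

for x, y in stream():
    # "test": use the model on the incoming sample
    with torch.no_grad():
        pred = model(x).argmax(dim=1)
    # "backprop": one update on that same sample, then move on
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```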

Also, it's always interesting to try new things, but many proposals like this only seem to work on small image datasets like MNIST or CIFAR-10. For small neural networks and datasets with small inputs, there is always a possibility that the network will find good weights "by chance" and, with enough computing power, converge. But for large networks and large images, these solutions usually don't scale. I think it's important to try them on ImageNet to evaluate how they scale (and to try to make them scale). What made backprop so popular is its ability to scale to very large networks and images.

7

aleph__one t1_izwos7v wrote

I use a custom SNN variant in production on real use cases, and the way we train those is very similar to the FF proposal. Most people just assume SNNs are impossible to train because SGD isn’t immediately available, when in reality there are dozens of ways to train SNNs to achieve solid performance.

23

aleph__one t1_izwyrcf wrote

Yeah, the surrogate gradient stuff works OK. Others that are decent:

1) STDP variants, especially dopamine-modulated STDP (emulates RL-like reinforcement) -- a rough sketch of the basic pairwise rule is below

2) for networks < 10M params, evolution strategies and similar zero-order solvers can work well operating directly on the weights

3) variational solvers can work if you structure the net + activations appropriately
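
For the curious, here's what the classic pairwise STDP update in (1) looks like under the usual exponential-window assumption; the constants are illustrative, not from any particular paper:

```python
import math

A_PLUS, A_MINUS = 0.01, 0.012  # potentiation / depression amplitudes
TAU = 20.0                     # time constant in ms

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        # pre fired before post: strengthen the synapse
        return A_PLUS * math.exp(-dt / TAU)
    # post fired before pre: weaken it
    return -A_MINUS * math.exp(dt / TAU)

# Dopamine-modulated variants scale this update by a reward signal,
# e.g. dw = reward * stdp_dw(t_pre, t_post).
```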

12

jigarthanda-paal t1_izwz4ss wrote

Seems similar to contrastive learning in the way that the artificial dataset is created.

3

ShepardRTC t1_izwzjw5 wrote

>The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.

Key quote. Still very cool though! Perhaps he can take it further and make it better than backprop.

119

AsIAm t1_izx39lx wrote

His take on hardware for neural nets is pretty forward(-forward) thinking. Neural nets started out analog (Rosenblatt's Perceptron) and only later did we start simulating them in software on digital computers. Some recent research (1, 2) suggests that a physical implementation of learnable neural nets is possible and far more efficient in analog circuits. This means we could run extremely large nets on a tiny chip. Which could live in your toaster, or your skull.

15

JackandFred t1_izx3k5r wrote

Rather than treating it as a direct competitor, I wonder if there is a use case where you might apply both backprop and FF at different times and get good results. So it wouldn't have to be directly better than backprop; it could be better only for certain use cases.

34

IntelArtiGen t1_izxdej3 wrote

I agree it's imperfect, as we are. When I tried to do it, I was still able to maintain a bit of knowledge in the network, but I had to continuously re-train on previous data.

It's hard to do "info1,2,3 => train => info4,5,6 => train => info7,8,9 => train [etc.]" and have the model remember info1,2,3

But you can do "info1,2,3 => train => info4,5,1 => train => info6,7,2 => train [etc.]". I used a memory to retain previous information and continuously re-train the network on it, and it works. Of course it's slower, because you don't process only the new information; you mix it with old information. I guess there are better ways to do it.
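
A minimal sketch of that memory idea, assuming a fixed-size buffer with random eviction (the class, sizes, and eviction policy are all my own toy choices):

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = []

    def add(self, sample):
        if len(self.data) >= self.capacity:
            # evict a random old sample to make room
            self.data.pop(random.randrange(len(self.data)))
        self.data.append(sample)

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

def make_batch(buffer, new_samples, n_old=2):
    """Mix new info with a slice of old info, e.g. info4,5 + info1."""
    old = buffer.sample(n_old)
    for s in new_samples:
        buffer.add(s)
    return list(new_samples) + old
```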

1

gwern t1_izxgfqm wrote

It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why, but if it does, that might be interesting for RL and for symbolic purposes: just throw in 'components' like calculators or constrained optimization solvers to augment the native net. (If you can just throw them into your existing net and train with FF as usual, without having to worry about differentiating it or tacking on explicit RL, that would be very freeing.)

23

based_goats t1_izxnblf wrote

This seems like the progression from the Fourier transform to the fast Fourier transform.

3

farmingvillein t1_izxuivo wrote

> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why

Isn't this just because there is no backward pass being calculated? Where you're taking a loss, then needing to calculate a gradient, etc.

Or am I missing something?

17

gwern t1_izxv8bw wrote

Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?

9

DeepNonseNse t1_izxxdf0 wrote

As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.

Given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes beyond maybe some simple normalization/sampling, etc.
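
If that reading is right, here's a toy sketch of why the blackbox is painless in this setup: each layer trains on its own local objective, so no gradient ever has to flow through the blackbox. The hard-sign blackbox and the goodness-style loss are illustrative guesses on my part, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

layer_1, layer_2 = nn.Linear(16, 32), nn.Linear(32, 32)
opt1 = torch.optim.SGD(layer_1.parameters(), lr=1e-2)
opt2 = torch.optim.SGD(layer_2.parameters(), lr=1e-2)

def blackbox(h):
    return torch.sign(h)  # fixed, non-differentiable, never learned

def local_loss(h, positive):
    g = h.pow(2).sum(dim=1)  # goodness-style local objective
    return -g.mean() if positive else g.mean()

x = torch.randn(8, 16)  # pretend this batch is all positive samples

h1 = torch.relu(layer_1(x))
opt1.zero_grad(); local_loss(h1, True).backward(); opt1.step()

# the blackbox input is detached, so layer_2's update never needs a
# gradient through the blackbox (or through layer_1)
h2 = torch.relu(layer_2(blackbox(h1.detach())))
opt2.zero_grad(); local_loss(h2, True).backward(); opt2.step()
```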

3

SoylentRox t1_izy1csf wrote

I have thought this is how we get to really robust, high performance AGI. It seems so obvious.

The steps: have a test environment with a diverse set of auto-graded tasks requiring varying levels of skill and cognition. "Big bench" but bigger; call it AGI gym.

The "AGI hypothesis" is an architecture of architectures : it's a set of components interconnected in some way, and those components came from a "seed library" or were auto discovered in another step as a composition of seed components.

The files that define a possible "AGI candidate" are simple and made to be manipulable as an output of an AGI gym task....

Recursion....

You see the idea. Basically, I think truly effective AGI architectures are going to be very complex and human hypotheses about them are wrong. So you find them recursively, using prior AGIs that did well on "AGI gym", which includes tasks to design other AGIs among the graded challenges...

Note that at the end of the day you end up with a model that does extremely well at "AGI gym". With careful selection of the scoring heuristic we can select for models that are, well, general and as simple as possible.

It doesn't necessarily have any science fiction abilities, only it will do extremely well at tasks that are mutations of the gym task. If some of them are robotics tasks with realistic simulated input from the real world, it would do well in the real world at those tasks also.

Some of the tasks would be to "read this description of what I want you to do in the simulated world and do it with this robot". And the descriptions are procedurally generated from a very large set.

The whole process would be ongoing - each commit makes AGI gym harder, and the population of successful models gets ever more capable. You fund this by selling the services of the current best models.

0

farmingvillein t1_izy22t6 wrote

Got it.

I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).

4

EDMismyO2 t1_izy6ydb wrote

A similar idea is used with experience replay in DQNs. For RL, it's important to ensure failure states are retained in the replay buffer so the agent keeps being reminded that they are failures; otherwise it starts to forget and then does dumb things. In RL the phenomenon is called "catastrophic forgetting".

1

midasp t1_izybqp8 wrote

You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does. It would not identify common features. If you tried to train a typical U-Net-style architecture this way, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.

7

seven-dev t1_izymtdy wrote

How does forward-forward work in relation to backprop?

1

IshKebab t1_izys7ni wrote

The trouble with analogue is that it's not repeatable. Have fun debugging your code when it changes every time you run it.

I mean, I'm sure it's possible... It definitely doesn't sound pleasant though.

3

AllowFreeSpeech t1_izyuqhl wrote

Is this going to be another one of those throwaway ideas like "capsule networks"...

1

modeless t1_izzpcbe wrote

He calls it "mortal computation". Like instead of loading identical pretrained weights into every robot brain you actually train each brain individually, and then when they die their experience is lost. Just like humans! (Except you can probably train them in simulation, "The Matrix"-style.) But the advantage is that by relaxing the repeatability requirement you get hardware that is orders of magnitude cheaper and more efficient, so for any given budget it is much, much more capable. Maybe. I tend to think that won't be the case, but who knows.

5

arhetorical t1_izzryk4 wrote

Oh, I haven't heard about using SNNs for interpretability. I thought they were on the same level as ANNs. Sorry for all the questions, but can you elaborate on how they're more interpretable?

2

Akrenion t1_j011yxv wrote

U-Net is specifically designed for backprop; the skip connections are helpful for BP. We might need to rethink architectures for other approaches as well.

3

ChuckSeven t1_j016l2h wrote

That's actually a fair point. The optimisation lottery, if you will: architectures are biased because they are designed around the algorithms that scale and have been shown to "work".

2

ChuckSeven t1_j016or8 wrote

I know this is a joke, but Juergen doesn't have much work in this direction. He also doesn't care about biological plausibility. So I highly doubt that this will happen.

1

ChuckSeven t1_j016zp2 wrote

Doubt. I know the old stories too, but large language models are essentially trained like that: most never run more than a single epoch, and the model is evaluated periodically.

1

modeless t1_j02fiss wrote

Without the requirement for exact repeatability you can use analog circuits instead of digital, and your manufacturing tolerances are greatly relaxed. You can use error-prone methods like self-assembly instead of EUV photolithography in ten-billion-dollar cleanrooms.

Again, I don't really buy it but there's an argument to be made.

2

mgostIH t1_j02vbuy wrote

All the layers are trained independently and at the same time. You can use gradients, but you don't need backprop, because each layer's objective has an explicit description: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer gets a normalized version of the previous layer's output).
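
A rough sketch of one such layer, following that description (goodness = sum of squared activations, a threshold, length-normalized input); the softplus loss form and constants are my guesses at the general recipe, not an exact reproduction of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # each layer only sees the *direction* of the previous output
        return F.relu(self.linear(F.normalize(x, dim=1)))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness of positives
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness of negatives
        # push positive goodness above the threshold, negative below it
        loss = (F.softplus(self.threshold - g_pos) +
                F.softplus(g_neg - self.threshold)).mean()
        self.opt.zero_grad()
        loss.backward()  # gradient stays local to this layer
        self.opt.step()
        # detach so the next layer's training is also purely local
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Layers stack by feeding one layer's detached outputs into the next, so each trains on its own local loss while the whole stack still runs in a single forward pass.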

The issue I find with this (besides generating good contrastive examples) is that I don't understand how it would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the outputs of the earlier layers will orient things in a way that doesn't make it too hard for the later layers, which have only linear dynamics.

1