Submitted by mrx-ai t3_zjud5l in MachineLearning
Details in the twitter thread:
https://twitter.com/martin_gorner/status/1599755684941557761
Rather than a direct competitor, I wonder if there's a use case where you might use both backprop and FF at different times and get good results. So it wouldn't have to be directly better than backprop; it could be better only for certain use cases.
It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why, but if it does, that might be interesting for RL and for symbolic purposes: just throw in 'components' like calculators or constrained optimization solvers to augment the native net. (If you can just throw them into your existing net and train with FF as usual, without having to worry about differentiating it or tacking on explicit RL, that would be very freeing.)
> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why
Isn't this just because there is no backward pass being calculated, where you take a loss and then need to compute a gradient, etc.?
Or am I missing something?
Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?
You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does; it would not identify common features. If you tried to train a typical backprop-based U-Net architecture this way, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.
U-Net is specifically designed for backprop: its skip connections are there to help BP. We might need to rethink architectures for other approaches as well.
That's actually a fair point. The optimisation lottery, if you will: architectures are biased because they are designed around the algorithms that scale and have been shown to "work".
Got it.
I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).
All the layers are trained independently and at the same time. You can use gradients, but you don't need backprop, because each layer's objective can be written down explicitly: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer gets a normalized version of the previous layer's output).
The issue I find in this is (besides generating good contrastive examples) that I don't understand how this would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the output of the past layers will orient things in a way that doesn't make it too hard for the next layers, which have only linear dynamics.
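For concreteness, here is a minimal PyTorch sketch of that per-layer objective, loosely in the spirit of the repos linked further down the thread; the FFLayer class, the threshold value, and the softplus-style loss are illustrative choices, not code from the paper:

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    """One locally trained layer: goodness = ||relu(Wx)||^2, pushed above a
    threshold for positive samples and below it for negative samples."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only the direction of the previous layer's
        # activity is passed on, not its magnitude (its goodness).
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative data
        # Softplus loss: raise g_pos above the threshold, push g_neg below it.
        loss = torch.log1p(torch.exp(torch.cat([
            self.threshold - g_pos,
            g_neg - self.threshold]))).mean()
        self.opt.zero_grad()
        loss.backward()   # the gradient stays local to this layer
        self.opt.step()
        # Detach the outputs so no gradient ever crosses layer boundaries.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Each layer's update is a purely local optimization of its own goodness; the only thing layers exchange is (detached) activations.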
As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.
So given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes outside of maybe some simple normalization/sampling etc.
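A rough sketch of that composition, reusing the hypothetical FFLayer from the sketch above (the quantization blackbox is just a made-up stand-in for any fixed, non-differentiable component):

```python
import torch

def blackbox(x):
    # Any fixed, non-differentiable computation, e.g. crude quantization or an
    # external solver. It is never trained, so nothing has to flow back through it.
    return torch.round(x * 4) / 4

layer_1 = FFLayer(784, 500)   # FFLayer as sketched earlier in the thread
layer_2 = FFLayer(500, 500)

def train_with_blackbox(x_pos, x_neg):
    h_pos, h_neg = layer_1.train_step(x_pos, x_neg)   # local update; outputs are detached
    h_pos, h_neg = blackbox(h_pos), blackbox(h_neg)   # opaque transform in between
    layer_2.train_step(h_pos, h_neg)                  # layer_2 only needs the values
```

Because each layer only ever sees detached activations, the blackbox never has to be differentiated; it just transforms the data the next layer learns on.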
Maybe useful in what are now called operator-learning contexts.
The blackbox part is very interesting. Perhaps that will open up new avenues that no one had ever thought of.
I have thought this is how we get to really robust, high performance AGI. It seems so obvious.
The steps are: have a test environment with a diverse set of auto-graded tasks requiring varying levels of skill and cognition. "Big bench" but bigger; call it AGI gym.
The "AGI hypothesis" is an architecture of architectures : it's a set of components interconnected in some way, and those components came from a "seed library" or were auto discovered in another step as a composition of seed components.
The files that define a possible "AGI candidate" are simple and made to be manipulable as an output of an AGI gym task....
Recursion....
You see the idea. So basically I think truly effective AGI architectures are going to be very complex and human hypotheses are wrong. So you find them recursively using prior AGIs that did well on "AGI gym" which includes tasks to design other AGIs among the graded challenges...
Note at the end of the day you end up with a model that does extremely well at "AGI gym". With careful selection of the score heuristic we can select for models that are, well, general and as simple as possible.
It doesn't necessarily have any science-fiction abilities; it will just do extremely well at tasks that are mutations of the gym tasks. If some of them are robotics tasks with realistic simulated input from the real world, it would also do well at those tasks in the real world.
Some of the tasks would be to "read this description of what I want you to do in the simulated world and do it with this robot". And the descriptions are procedurally generated from a very large set.
The whole process would be ongoing - each commit makes AGI gym harder, and the population of successful models gets ever more capable. You fund this by selling the services of the current best models.
Inb4 Schmidhuber invented this 20 years ago
Inb4 Graham Sutherland comes back from the grave looking for whatever this photo/painting is
I know this is a joke but Juergen doesn't have much work in this direction. He also doesn't care about biological plausibility. So I highly doubt that this will happen.
I liked a reply to Juergen from some Twitter user saying that if you have already solved AGI, now would be a good time to bring that up.
Right dude? It's almost like he did and is resentful of anyone touching on the path he used to get there... or he's almost there and doesn't want anyone catching up, lol. Does he want the credit or not?
Implementations - PyTorch:
https://github.com/mohammadpz/pytorch_forward_forward
https://github.com/madcato/forward-forward-pytorch
I was at the Vector Institute around 2022-11-24 when he was giving a talk on this paper.
Any cool insights from that talk?
Great paper. I'd love to play with the concepts in it one day. That would be cool. It looks like a new paradigm, and as is usual for Hinton, it has several decades of thought behind it. I'm a little bit worried about the more philosophical bent he took near the end -- he's getting older, and as far as I can recall he doesn't usually take that grave a tone in his papers. I hope he, and everyone around him, is well. :') :')
His take on hardware for neural nets is pretty forward(-forward) thinking. Neural nets started out analog (Rosenblatt's Perceptron), and only later did we start simulating them in software on digital computers. Some recent research (1, 2) suggests that physical implementations of learnable neural nets are possible and far more efficient in analog circuits. This means we could run extremely large nets on a tiny chip. Which could live in your toaster, or your skull.
The trouble with analogue is that it's not repeatable. Have fun debugging your code when it changes every time you run it.
I mean, I'm sure it's possible... It definitely doesn't sound pleasant though.
He calls it "mortal computation". Like instead of loading identical pretrained weights into every robot brain you actually train each brain individually, and then when they die their experience is lost. Just like humans! (Except you can probably train them in simulation, "The Matrix"-style.) But the advantage is that by relaxing the repeatability requirement you get hardware that is orders of magnitude cheaper and more efficient, so for any given budget it is much, much more capable. Maybe. I tend to think that won't be the case, but who knows.
Why exactly is hardware cheaper and more efficient?
Without the requirement for exact repeatability you can use analog circuits instead of digital, and your manufacturing tolerances are greatly relaxed. You can use error-prone methods like self assembly instead of EUV photolithography in ten billion dollar cleanrooms.
Again, I don't really buy it but there's an argument to be made.
For the purpose of designing something more akin to biology, does anyone know why there is no consideration of spiking neural networks (SNNs) in this proposal? Is it just that they are far less practical to implement than the Forward-Forward algorithm (and thus less attractive)? My understanding was that SNNs are more biologically realistic, but maybe not enough to be pursued.
I use a custom SNN variant in production on real use cases, and the way we train those is very similar to the FF proposal. Most people just assume SNNs are impossible to train because SGD isn’t immediately available, when in reality there are dozens of ways to train SNNs to achieve solid performance.
> spiking neural networks
Could you give more details? Books, articles, tutorials, application areas, etc.
I am curious to explore this area.
Thanks.
Unfortunately, beginner literature on this stuff is virtually nonexistent. Your best bet is to read papers and experiment.
Time for you to write a series of beginner tutorials for the community! Now would be a good time.
Yea I was thinking the same thing. I teach some of this stuff at the graduate level but it’s tough for newcomers to get used to even in that setting.
I'm a complete non-expert, just also trying to learn more about SNNs because they sound interesting, but I have this review paper from earlier this year bookmarked: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9313413/. Best I've found so far.
https://doi.org/10.1016%2FS0893-6080%2897%2900011-7 is the key paper to start on
I mostly hear about surrogate gradient descent, what other methods work well in practice?
Yea the surrogate gradient stuff works ok. Others that are decent:
1) STDP variants, especially dopamine-modulated STDP (emulates RL-like reinforcement)
2) for networks < 10M params, evolution strategies and similar zero-order solvers can work well operating directly on the weights
3) variational solvers can work if you structure the net + activations appropriately
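For anyone curious what option 1) can look like, here is a toy reward-modulated STDP update in NumPy. It is only an illustration of the general idea (spike-timing correlations accumulated in an eligibility term, gated by a scalar reward), not the production method described above:

```python
import numpy as np

def rstdp_step(w, pre_trace, post_trace, pre_spikes, post_spikes,
               reward, lr=1e-3, tau=20.0, dt=1.0):
    """One toy reward-modulated STDP step. w is (n_post, n_pre); spikes are 0/1 vectors."""
    decay = np.exp(-dt / tau)
    pre_trace = pre_trace * decay + pre_spikes      # fading memory of pre-synaptic spikes
    post_trace = post_trace * decay + post_spikes   # fading memory of post-synaptic spikes
    # Potentiate pairs where a post-spike follows recent pre-activity,
    # depress pairs where a pre-spike follows recent post-activity.
    eligibility = np.outer(post_spikes, pre_trace) - np.outer(post_trace, pre_spikes)
    # A scalar reward ("dopamine") gates whether the correlation is written
    # into the weights, giving RL-like credit assignment without gradients.
    w = w + lr * reward * eligibility
    return w, pre_trace, post_trace

# Minimal usage with random spike trains and an occasional reward.
rng = np.random.default_rng(0)
n_pre, n_post = 8, 4
w = rng.normal(scale=0.1, size=(n_post, n_pre))
pre_tr, post_tr = np.zeros(n_pre), np.zeros(n_post)
for t in range(100):
    pre = (rng.random(n_pre) < 0.1).astype(float)
    post = (rng.random(n_post) < 0.1).astype(float)
    reward = 1.0 if t % 10 == 0 else 0.0            # stand-in for task reward
    w, pre_tr, post_tr = rstdp_step(w, pre_tr, post_tr, pre, post, reward)
```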
I see, thanks. Why did you choose to use SNNs for your application instead of conventional ANNs? Are you using a neuromorphic chip?
No neuromorphic chip. Main reason is interpretability.
Oh, I haven't heard about using SNNs for interpretability. I thought they were on the same level as ANNs. Sorry for all the questions, but can you elaborate on how they're more interpretable?
The spiking events should be much more sparse and therefore probably easier to interpret.
What's the problem you're using it for?
Asynchronous behavior is hard to parallelize
> Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn awake too.
In a way you can also do that with a regular NN. Usually we do "long training phase (many backprops) => only test phase". But we can do "backprop => test => backprop => test ..." if it applies to our task (it usually doesn't), simultaneously training and using one model.
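Something like this toy loop, where the same model is queried and then updated on each incoming batch (the model, data, and targets here are made-up placeholders):

```python
import torch
import torch.nn as nn

# Toy illustration of "backprop => test => backprop => test" on a data stream.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(1000):
    x = torch.randn(32, 10)                  # a new batch "arrives"
    y = x.sum(dim=1, keepdim=True)           # toy target
    pred = model(x)                          # use the current model first ("test")
    loss = loss_fn(pred, y)                  # then learn from it: one backprop pass
    opt.zero_grad(); loss.backward(); opt.step()
```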
Also, it's always interesting to try new things, but many of these proposals have only been shown to work on small image datasets like MNIST or CIFAR-10. For small neural networks and datasets with small inputs, there is always a possibility that the network will find good weights "by chance", and that with enough computing power it'll converge. But for large networks and large images, these solutions usually don't scale, so I think it's important to try them on ImageNet to evaluate how they scale (and to try to make them scale). What made backprop so popular is its ability to scale to very large networks and images.
Training in that manner tends to "forget" previous knowledge in the net.
I agree it's imperfect, as we are. When I tried it, I was still able to maintain a bit of knowledge in the network, but I had to continuously re-train on previous data.
It's hard to do "info1,2,3 => train => info4,5,6 => train => info7,8,9 => train [etc.]" and have the model remember info1,2,3
But you can do "info1,2,3 => train => info4,5,1 => train => info6,7,2 => train [etc.]". I used a memory to retain previous information and continuously train the network on it, and it works. Of course it's slower, because you don't process only the new information; you mix it with old information. I guess there are better ways to do it.
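A rough sketch of that memory/rehearsal idea (the buffer size, mix ratio, and toy model are all arbitrary illustrative choices):

```python
import random
import torch
import torch.nn as nn

# Keep a small memory of past batches and rehearse a few of them alongside
# each new batch, so old information keeps being revisited.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
memory, MEMORY_SIZE, REPLAY = [], 500, 2   # rehearse 2 old batches per new one

def train_on(batch):
    x, y = batch
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

for step in range(1000):
    new_batch = (torch.randn(32, 10), torch.randint(0, 2, (32,)))  # toy stream
    train_on(new_batch)                                            # learn the new info
    for old_batch in random.sample(memory, min(REPLAY, len(memory))):
        train_on(old_batch)                                        # re-train on old info
    memory.append(new_batch)
    if len(memory) > MEMORY_SIZE:
        memory.pop(random.randrange(len(memory)))                  # keep the memory bounded
```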
A similar idea is used with experience replay in DQNs. For RL, it's important to ensure failure states are retained in the replay buffer so it keeps being reminded they are a failure or it starts to forget and then does dumb things. In RL the phenomenon is called 'catastrophic forgetting'.
Doubt. I know the old stories too, but large language models are essentially trained like that. Most never see the data more than once (a single epoch) and are evaluated periodically along the way.
Seems similar to contrastive learning in the way that the artificial dataset is created.
This seems like the progression from a Fourier transform to a fast Fourier transform.
How does forward forward work in relation to backprop?
Is this going to be another one of those throwaway ideas like "capsule networks"...
Dude who tf thinks brains learn with backprop
Who tf claimed they do?
I was thinking more of actual researchers than internet randos who don't seem to understand how backprop itself works.
> The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.
Key quote. Still very cool though! Perhaps he can take it further and make it better than backprop.