Comments

tysam_and_co t1_izwgts5 wrote

Great paper. I'd love to play with the concepts in it one day; that would be cool. It looks like a new paradigm and, as is usual for Hinton, has several decades of thought behind it. I'm a little worried about the more philosophical bent he took near the end -- he's getting older, and as far as I can recall he doesn't usually take that grave a tone in his papers. I hope he, and everyone around him, is well. :')

25

pr0u t1_izwh94p wrote

Inb4 Schmidhuber invented this 20 years ago

114

rehrev t1_izwhvcp wrote

Dude who tf thinks brains learn with backprop

−14

sea-shunned t1_izwkg9p wrote

For the purpose of designing something more akin to biology, does anyone know why there is no consideration of spiking neural networks (SNNs) in this proposal? Is it just that they are far less practical to implement than the Forward-Forward algorithm (and thus less attractive)? My understanding was that SNNs are more biologically realistic, but maybe not enough to be pursued.

12

IntelArtiGen t1_izwl2wr wrote

>Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn awake too.

In a way you can also do that with a regular NN. Usually we do "long training phase (many backprops) => test phase only". But we can do "backprop => test => backprop => test ..." if it suits our task (it usually doesn't), training and using one model simultaneously.
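
For concreteness, a minimal sketch of that interleaved loop in PyTorch; the model, optimizer, and data stream here are placeholder choices of mine, not anything from the paper:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy stand-in for any network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def stream():
    # hypothetical continuous stream of labeled samples
    for _ in range(100):
        yield torch.randn(1, 10), torch.randint(0, 2, (1,))

for x, y in stream():
    # "test": use the model on the incoming sample
    with torch.no_grad():
        pred = model(x).argmax(dim=1)
    # "backprop": one update on that same sample, then move on
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```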

Also, it's always interesting to try new things, but many proposals like this only seem to work on small image datasets like MNIST or CIFAR-10. For small neural networks and datasets with small inputs, there is always a possibility that the network will find good weights "by chance" and, with enough computing power, converge. But for large networks and large images, these solutions usually don't scale. I think it's important to try them on ImageNet to evaluate how they scale (and to try to make them scale). What made backprop so popular is its ability to scale to very large networks and images.

7

aleph__one t1_izwos7v wrote

I use a custom SNN variant in production on real use cases, and the way we train those is very similar to the FF proposal. Most people just assume SNNs are impossible to train because SGD isn’t immediately available, when in reality there are dozens of ways to train SNNs to achieve solid performance.

23

aleph__one t1_izwyrcf wrote

Yeah, the surrogate gradient stuff works OK. Others that are decent:

1) STDP variants, especially dopamine-modulated STDP (emulates RL-like reinforcement) -- a rough sketch of the basic pairwise rule is below

2) for networks < 10M params, evolution strategies and similar zero-order solvers can work well operating directly on the weights

3) variational solvers can work if you structure the net + activations appropriately
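
For the curious, here's what the classic pairwise STDP update in (1) looks like under the usual exponential-window assumption; the constants are illustrative, not from any particular paper:

```python
import math

A_PLUS, A_MINUS = 0.01, 0.012  # potentiation / depression amplitudes
TAU = 20.0                     # time constant in ms

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        # pre fired before post: strengthen the synapse
        return A_PLUS * math.exp(-dt / TAU)
    # post fired before pre: weaken it
    return -A_MINUS * math.exp(dt / TAU)

# Dopamine-modulated variants scale this update by a reward signal,
# e.g. dw = reward * stdp_dw(t_pre, t_post).
```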

12

jigarthanda-paal t1_izwz4ss wrote

Seems similar to contrastive learning in the way that the artificial dataset is created.

3

ShepardRTC t1_izwzjw5 wrote

>The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.

Key quote. Still very cool though! Perhaps he can take it further and make it better than backprop.

119

AsIAm t1_izx39lx wrote

His take on hardware for neural nets is pretty forward(-forward) thinking. Neural nets started out analog (Rosenblatt's Perceptron) and only later did we start simulating them in software on digital computers. Some recent research (1, 2) suggests that a physical implementation of learnable neural nets is possible and far more efficient in analog circuits. This means we could run extremely large nets on a tiny chip. Which could live in your toaster, or your skull.

15

JackandFred t1_izx3k5r wrote

Rather than treating it as a direct competitor, I wonder if there is a use case where you might apply both backprop and FF at different times and get good results. So it wouldn't have to be directly better than backprop; it could be better only for certain use cases.

34

IntelArtiGen t1_izxdej3 wrote

I agree it's imperfect, as we are. When I tried to do it, I was still able to maintain a bit of knowledge in the network, but I had to continuously re-train on previous data.

It's hard to do "info1,2,3 => train => info4,5,6 => train => info7,8,9 => train [etc.]" and have the model remember info1,2,3

But you can do "info1,2,3 => train => info4,5,1 => train => info6,7,2 => train [etc.]". I used a memory to retain previous information and continuously re-train the network on it, and it works. Of course it's slower, because you don't process only the new information; you mix it with old information. I guess there are better ways to do it.
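
A minimal sketch of that memory idea, assuming a fixed-size buffer with random eviction (the class, sizes, and eviction policy are all my own toy choices):

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = []

    def add(self, sample):
        if len(self.data) >= self.capacity:
            # evict a random old sample to make room
            self.data.pop(random.randrange(len(self.data)))
        self.data.append(sample)

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

def make_batch(buffer, new_samples, n_old=2):
    """Mix new info with a slice of old info, e.g. info4,5 + info1."""
    old = buffer.sample(n_old)
    for s in new_samples:
        buffer.add(s)
    return list(new_samples) + old
```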

1

gwern t1_izxgfqm wrote

It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why, but if it does, that might be interesting for RL and for symbolic purposes: just throw in 'components' like calculators or constrained optimization solvers to augment the native net. (If you can just throw them into your existing net and train with FF as usual, without having to worry about differentiating it or tacking on explicit RL, that would be very freeing.)

23

based_goats t1_izxnblf wrote

This seems like the progression from the Fourier transform to the fast Fourier transform.

3

farmingvillein t1_izxuivo wrote

> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why

Isn't this just because there is no backward pass being calculated? Where you're taking a loss, then needing to calculate a gradient, etc.

Or am I missing something?

17

gwern t1_izxv8bw wrote

Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?

9

DeepNonseNse t1_izxxdf0 wrote

As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.

Given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes beyond maybe some simple normalization/sampling, etc.
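
If that reading is right, here's a toy sketch of why the blackbox is painless in this setup: each layer trains on its own local objective, so no gradient ever has to flow through the blackbox. The hard-sign blackbox and the goodness-style loss are illustrative guesses on my part, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

layer_1, layer_2 = nn.Linear(16, 32), nn.Linear(32, 32)
opt1 = torch.optim.SGD(layer_1.parameters(), lr=1e-2)
opt2 = torch.optim.SGD(layer_2.parameters(), lr=1e-2)

def blackbox(h):
    return torch.sign(h)  # fixed, non-differentiable, never learned

def local_loss(h, positive):
    g = h.pow(2).sum(dim=1)  # goodness-style local objective
    return -g.mean() if positive else g.mean()

x = torch.randn(8, 16)  # pretend this batch is all positive samples

h1 = torch.relu(layer_1(x))
opt1.zero_grad(); local_loss(h1, True).backward(); opt1.step()

# the blackbox input is detached, so layer_2's update never needs a
# gradient through the blackbox (or through layer_1)
h2 = torch.relu(layer_2(blackbox(h1.detach())))
opt2.zero_grad(); local_loss(h2, True).backward(); opt2.step()
```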

3

SoylentRox t1_izy1csf wrote

I have thought this is how we get to really robust, high performance AGI. It seems so obvious.

The steps: have a test environment with a diverse set of auto-graded tasks requiring varying levels of skill and cognition. "Big bench" but bigger; call it AGI gym.

The "AGI hypothesis" is an architecture of architectures : it's a set of components interconnected in some way, and those components came from a "seed library" or were auto discovered in another step as a composition of seed components.

The files that define a possible "AGI candidate" are simple and made to be manipulable as an output of an AGI gym task....

Recursion....

You see the idea. Basically, I think truly effective AGI architectures are going to be very complex and human hypotheses about them are wrong. So you find them recursively, using prior AGIs that did well on "AGI gym", which includes tasks to design other AGIs among the graded challenges...

Note that at the end of the day you end up with a model that does extremely well at "AGI gym". With careful selection of the scoring heuristic we can select for models that are, well, general and as simple as possible.

It doesn't necessarily have any science fiction abilities, only it will do extremely well at tasks that are mutations of the gym task. If some of them are robotics tasks with realistic simulated input from the real world, it would do well in the real world at those tasks also.

Some of the tasks would be to "read this description of what I want you to do in the simulated world and do it with this robot". And the descriptions are procedurally generated from a very large set.

The whole process would be ongoing - each commit makes AGI gym harder, and the population of successful models gets ever more capable. You fund this by selling the services of the current best models.

0

farmingvillein t1_izy22t6 wrote

Got it.

I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).

4

EDMismyO2 t1_izy6ydb wrote

A similar idea is used with experience replay in DQNs. For RL, it's important to ensure failure states are retained in the replay buffer so the agent keeps being reminded that they are failures; otherwise it starts to forget and then does dumb things. In RL the phenomenon is called "catastrophic forgetting".

1

midasp t1_izybqp8 wrote

You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does. It would not identify common features. If you tried to train a typical U-Net-style architecture this way, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.

7

seven-dev t1_izymtdy wrote

How does forward-forward work in relation to backprop?

1

IshKebab t1_izys7ni wrote

The trouble with analogue is that it's not repeatable. Have fun debugging your code when it changes every time you run it.

I mean, I'm sure it's possible... It definitely doesn't sound pleasant though.

3

AllowFreeSpeech t1_izyuqhl wrote

Is this going to be another one of those throwaway ideas like "capsule networks"...

1

modeless t1_izzpcbe wrote

He calls it "mortal computation". Like instead of loading identical pretrained weights into every robot brain you actually train each brain individually, and then when they die their experience is lost. Just like humans! (Except you can probably train them in simulation, "The Matrix"-style.) But the advantage is that by relaxing the repeatability requirement you get hardware that is orders of magnitude cheaper and more efficient, so for any given budget it is much, much more capable. Maybe. I tend to think that won't be the case, but who knows.

5

arhetorical t1_izzryk4 wrote

Oh, I haven't heard about using SNNs for interpretability. I thought they were on the same level as ANNs. Sorry for all the questions, but can you elaborate on how they're more interpretable?

2

Akrenion t1_j011yxv wrote

U-Net is specifically designed for backprop; the skip connections are helpful for BP. We might need to rethink architectures for other approaches as well.

3

ChuckSeven t1_j016l2h wrote

That's actually a fair point. The optimisation lottery, if you will: architectures are biased because they are designed around the algorithms that scale and have been shown to "work".

2

ChuckSeven t1_j016or8 wrote

I know this is a joke, but Juergen doesn't have much work in this direction. He also doesn't care about biological plausibility. So I highly doubt that this will happen.

1

ChuckSeven t1_j016zp2 wrote

Doubt. I know the old stories too, but large language models are essentially trained like that: most never run more than a single epoch, and the model is evaluated periodically.

1

modeless t1_j02fiss wrote

Without the requirement for exact repeatability you can use analog circuits instead of digital, and your manufacturing tolerances are greatly relaxed. You can use error-prone methods like self-assembly instead of EUV photolithography in ten-billion-dollar cleanrooms.

Again, I don't really buy it but there's an argument to be made.

2

mgostIH t1_j02vbuy wrote

All the layers are trained independently and at the same time. You can use gradients, but you don't need backprop, because each layer's objective has an explicit description: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer gets a normalized version of the previous layer's output).
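
A rough sketch of one such layer, following that description (goodness = sum of squared activations, a threshold, length-normalized input); the softplus loss form and constants are my guesses at the general recipe, not an exact reproduction of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # each layer only sees the *direction* of the previous output
        return F.relu(self.linear(F.normalize(x, dim=1)))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness of positives
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness of negatives
        # push positive goodness above the threshold, negative below it
        loss = (F.softplus(self.threshold - g_pos) +
                F.softplus(g_neg - self.threshold)).mean()
        self.opt.zero_grad()
        loss.backward()  # gradient stays local to this layer
        self.opt.step()
        # detach so the next layer's training is also purely local
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Layers stack by feeding one layer's detached outputs into the next, so each trains on its own local loss while the whole stack still runs in a single forward pass.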

The issue I find with this (besides generating good contrastive examples) is that I don't understand how it would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the outputs of the earlier layers will orient things in a way that doesn't make it too hard for the later layers, which have only linear dynamics.

1