Submitted by mrx-ai t3_zjud5l in MachineLearning
Details in the twitter thread:
https://twitter.com/martin_gorner/status/1599755684941557761
Rather than a direct competitor, I wonder if there's a use case where you might use both backprop and FF at different times and get good results. So it wouldn't have to be directly better than backprop; it could be better only for certain use cases.
It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why, but if it does, that might be interesting for RL and for symbolic purposes: just throw in 'components' like calculators or constrained optimization solvers to augment the native net. (If you can just throw them into your existing net and train with FF as usual, without having to worry about differentiating it or tacking on explicit RL, that would be very freeing.)
> It mentions that it can handle non-differentiable blackbox components. I don't quite intuit why
Isn't this just because there is no backward pass being calculated, where you take a loss and then need to compute a gradient, etc.?
Or am I missing something?
Yeah, it obviously doesn't have a gradient, but what I don't quite get is how the blackbox component trains without a gradient being computed by anything. Is it a finite-difference equivalent? Does it reduce down to basically REINFORCE? What is it, and is it really low-variance enough to care about, or is it merely a curiosity?
You are right. Intuitively, it's just rewarding correct inputs and penalizing wrong inputs, which is largely similar to how many RL policies learn. FF seems like it will be able to discriminate, but it won't be able to encode and embed features the way backprop does; it would not identify common features. If you tried to train a typical backprop-based U-Net architecture this way, my instinct says it likely would not work, since the discriminating information is not distributed across the entire network.
U-Net is specifically designed for backprop: its skip connections are there to help BP. We might need to rethink architectures for other approaches as well.
That's actually a fair point. The optimisation lottery, if you will: architectures are biased because they are designed around the algorithms that scale and have been shown to "work".
Got it.
I'm going to guess that the author meant that you could stick a black box in the middle and all of the neurons could still be trained (but not the black box itself).
All the layers are trained independently and at the same time. You can use gradients, but you don't need backprop, because each layer's objective can be written down explicitly: maximize ||W * x||^2 for good samples and minimize it for bad samples (each layer gets a normalized version of the previous layer's output).
The issue I find in this is (besides generating good contrastive examples) that I don't understand how this would lead a big network to discover interesting structure: circuits require multiple layers to do something interesting, but here each layer greedily optimizes its own evaluation. In some sense we are hoping that the output of the past layers will orient things in a way that doesn't make it too hard for the next layers, which have only linear dynamics.
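For concreteness, here is a minimal PyTorch sketch of that per-layer objective, loosely in the spirit of the repos linked further down the thread; the FFLayer class, the threshold value, and the softplus-style loss are illustrative choices, not code from the paper:

```python
import torch
import torch.nn as nn

class FFLayer(nn.Module):
    """One locally trained layer: goodness = ||relu(Wx)||^2, pushed above a
    threshold for positive samples and below it for negative samples."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only the direction of the previous layer's
        # activity is passed on, not its magnitude (its goodness).
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative data
        # Softplus loss: raise g_pos above the threshold, push g_neg below it.
        loss = torch.log1p(torch.exp(torch.cat([
            self.threshold - g_pos,
            g_neg - self.threshold]))).mean()
        self.opt.zero_grad()
        loss.backward()   # the gradient stays local to this layer
        self.opt.step()
        # Detach the outputs so no gradient ever crosses layer boundaries.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Each layer's update is a purely local optimization of its own goodness; the only thing layers exchange is (detached) activations.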
As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.
So given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes outside of maybe some simple normalization/sampling etc.
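A rough sketch of that composition, reusing the hypothetical FFLayer from the sketch above (the quantization blackbox is just a made-up stand-in for any fixed, non-differentiable component):

```python
import torch

def blackbox(x):
    # Any fixed, non-differentiable computation, e.g. crude quantization or an
    # external solver. It is never trained, so nothing has to flow back through it.
    return torch.round(x * 4) / 4

layer_1 = FFLayer(784, 500)   # FFLayer as sketched earlier in the thread
layer_2 = FFLayer(500, 500)

def train_with_blackbox(x_pos, x_neg):
    h_pos, h_neg = layer_1.train_step(x_pos, x_neg)   # local update; outputs are detached
    h_pos, h_neg = blackbox(h_pos), blackbox(h_neg)   # opaque transform in between
    layer_2.train_step(h_pos, h_neg)                  # layer_2 only needs the values
```

Because each layer only ever sees detached activations, the blackbox never has to be differentiated; it just transforms the data the next layer learns on.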
Maybe useful in what are now called operator-learning contexts.
The blackbox part is very interesting. Perhaps that will open up new avenues that no one had ever thought of.
I have thought this is how we get to really robust, high performance AGI. It seems so obvious.
The steps are: have a test environment with a diverse set of auto-graded tasks requiring varying levels of skill and cognition. "Big bench" but bigger; call it AGI gym.
The "AGI hypothesis" is an architecture of architectures : it's a set of components interconnected in some way, and those components came from a "seed library" or were auto discovered in another step as a composition of seed components.
The files that define a possible "AGI candidate" are simple and made to be manipulable as an output of an AGI gym task....
Recursion....
You see the idea. So basically I think truly effective AGI architectures are going to be very complex and human hypotheses are wrong. So you find them recursively using prior AGIs that did well on "AGI gym" which includes tasks to design other AGIs among the graded challenges...
Note at the end of the day you end up with a model that does extremely well at "AGI gym". With careful selection of the score heuristic we can select for models that are, well, general and as simple as possible.
It doesn't necessarily have any science-fiction abilities; it will just do extremely well at tasks that are mutations of the gym tasks. If some of them are robotics tasks with realistic simulated input from the real world, it would also do well at those tasks in the real world.
Some of the tasks would be to "read this description of what I want you to do in the simulated world and do it with this robot". And the descriptions are procedurally generated from a very large set.
The whole process would be ongoing - each commit makes AGI gym harder, and the population of successful models gets ever more capable. You fund this by selling the services of the current best models.
Inb4 Schmidhuber invented this 20 years ago
Inb4 Graham Sutherland comes back from the grave looking for whatever this photo/painting is
I know this is a joke but Juergen doesn't have much work in this direction. He also doesn't care about biological plausibility. So I highly doubt that this will happen.
I liked a reply to Juergen from some Twitter user saying that if you have already solved AGI, now would be a good time to bring that up.
Right dude? It's almost like he did and is resentful of anyone touching on the path he used to get there... or he's almost there and doesn't want anyone catching up, lol. Does he want the credit or not?
Implementations - PyTorch:
https://github.com/mohammadpz/pytorch_forward_forward
https://github.com/madcato/forward-forward-pytorch
I was at the Vector Institute around 2022-11-24 when he was giving a talk on this paper.
Any cool insights from that talk?
Great paper. I'd love to play with the concepts in it one day. That would be cool. It looks like a new paradigm, and as is usual for Hinton, it has several decades of thought behind it. I'm a little bit worried about the more philosophical bent he took near the end -- he's getting older, and as far as I can recall he doesn't usually take that grave a tone in his papers. I hope he, and everyone around him, is well. :') :')
His take on hardware for neural nets is pretty forward(-forward) thinking. Neural nets started out analog (Rosenblatt's Perceptron), and only later did we start simulating them in software on digital computers. Some recent research (1, 2) suggests that physical implementations of learnable neural nets are possible and far more efficient in analog circuits. This means we could run extremely large nets on a tiny chip. Which could live in your toaster, or your skull.
The trouble with analogue is that it's not repeatable. Have fun debugging your code when it changes every time you run it.
I mean, I'm sure it's possible... It definitely doesn't sound pleasant though.
He calls it "mortal computation". Like instead of loading identical pretrained weights into every robot brain you actually train each brain individually, and then when they die their experience is lost. Just like humans! (Except you can probably train them in simulation, "The Matrix"-style.) But the advantage is that by relaxing the repeatability requirement you get hardware that is orders of magnitude cheaper and more efficient, so for any given budget it is much, much more capable. Maybe. I tend to think that won't be the case, but who knows.
Why exactly is hardware cheaper and more efficient?
Without the requirement for exact repeatability you can use analog circuits instead of digital, and your manufacturing tolerances are greatly relaxed. You can use error-prone methods like self assembly instead of EUV photolithography in ten billion dollar cleanrooms.
Again, I don't really buy it but there's an argument to be made.
For the purpose of designing something more akin to biology, does anyone know why there is no consideration of spiking neural networks (SNNs) in this proposal? Is it just that they are far less practical to implement than the Forward-Forward algorithm (and thus less attractive)? My understanding was that SNNs are more biologically realistic, but maybe not enough to be pursued.
I use a custom SNN variant in production on real use cases, and the way we train those is very similar to the FF proposal. Most people just assume SNNs are impossible to train because SGD isn’t immediately available, when in reality there are dozens of ways to train SNNs to achieve solid performance.
> spiking neural networks
Could you give more details? Books, articles, tutorials, application areas, etc.
I am curious to explore this area.
Thanks.
Unfortunately, beginner literature on this stuff is virtually nonexistent. Your best bet is to read papers and experiment.
Time for you to write a series of beginner tutorials for the community! Now would be a good time.
Yea I was thinking the same thing. I teach some of this stuff at the graduate level but it’s tough for newcomers to get used to even in that setting.
I'm a complete non-expert, just also trying to learn more about SNNs because they sound interesting, but I have this review paper from earlier this year bookmarked: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9313413/. Best I've found so far.
https://doi.org/10.1016%2FS0893-6080%2897%2900011-7 is the key paper to start on
I mostly hear about surrogate gradient descent, what other methods work well in practice?
Yea the surrogate gradient stuff works ok. Others that are decent:
1) STDP variants, especially dopamine-modulated STDP (emulates RL-like reinforcement)
2) for networks < 10M params, evolution strategies and similar zero-order solvers can work well operating directly on the weights
3) variational solvers can work if you structure the net + activations appropriately
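For anyone curious what option 1) can look like, here is a toy reward-modulated STDP update in NumPy. It is only an illustration of the general idea (spike-timing correlations accumulated in an eligibility term, gated by a scalar reward), not the production method described above:

```python
import numpy as np

def rstdp_step(w, pre_trace, post_trace, pre_spikes, post_spikes,
               reward, lr=1e-3, tau=20.0, dt=1.0):
    """One toy reward-modulated STDP step. w is (n_post, n_pre); spikes are 0/1 vectors."""
    decay = np.exp(-dt / tau)
    pre_trace = pre_trace * decay + pre_spikes      # fading memory of pre-synaptic spikes
    post_trace = post_trace * decay + post_spikes   # fading memory of post-synaptic spikes
    # Potentiate pairs where a post-spike follows recent pre-activity,
    # depress pairs where a pre-spike follows recent post-activity.
    eligibility = np.outer(post_spikes, pre_trace) - np.outer(post_trace, pre_spikes)
    # A scalar reward ("dopamine") gates whether the correlation is written
    # into the weights, giving RL-like credit assignment without gradients.
    w = w + lr * reward * eligibility
    return w, pre_trace, post_trace

# Minimal usage with random spike trains and an occasional reward.
rng = np.random.default_rng(0)
n_pre, n_post = 8, 4
w = rng.normal(scale=0.1, size=(n_post, n_pre))
pre_tr, post_tr = np.zeros(n_pre), np.zeros(n_post)
for t in range(100):
    pre = (rng.random(n_pre) < 0.1).astype(float)
    post = (rng.random(n_post) < 0.1).astype(float)
    reward = 1.0 if t % 10 == 0 else 0.0            # stand-in for task reward
    w, pre_tr, post_tr = rstdp_step(w, pre_tr, post_tr, pre, post, reward)
```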
I see, thanks. Why did you choose to use SNNs for your application instead of conventional ANNs? Are you using a neuromorphic chip?
No neuromorphic chip. Main reason is interpretability.
Oh, I haven't heard about using SNNs for interpretability. I thought they were on the same level as ANNs. Sorry for all the questions, but can you elaborate on how they're more interpretable?
The spiking events should be much more sparse and therefore probably easier to interpret.
What's the problem you're using it for?
Asynchronous behavior is hard to parallelize
> Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn awake too.
In a way you can also do that with a regular NN. Usually we do "long training phase (many backprops) => only test phase". But we can do "backprop => test => backprop => test ..." if it applies to our task (it usually doesn't), simultaneously training and using one model.
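Something like this toy loop, where the same model is queried and then updated on each incoming batch (the model, data, and targets here are made-up placeholders):

```python
import torch
import torch.nn as nn

# Toy illustration of "backprop => test => backprop => test" on a data stream.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(1000):
    x = torch.randn(32, 10)                  # a new batch "arrives"
    y = x.sum(dim=1, keepdim=True)           # toy target
    pred = model(x)                          # use the current model first ("test")
    loss = loss_fn(pred, y)                  # then learn from it: one backprop pass
    opt.zero_grad(); loss.backward(); opt.step()
```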
Also, it's always interesting to try new things, but many of these proposals have only been shown to work on small image datasets like MNIST or CIFAR-10. For small neural networks and datasets with small inputs, there is always a possibility that the network will find good weights "by chance", and that with enough computing power it'll converge. But for large networks and large images, these solutions usually don't scale, so I think it's important to try them on ImageNet to evaluate how they scale (and to try to make them scale). What made backprop so popular is its ability to scale to very large networks and images.
Training in that manner tends to "forget" previous knowledge in the net.
I agree it's imperfect, as we are. When I tried it, I was still able to maintain a bit of knowledge in the network, but I had to continuously re-train on previous data.
It's hard to do "info1,2,3 => train => info4,5,6 => train => info7,8,9 => train [etc.]" and have the model remember info1,2,3
But you can do "info1,2,3 => train => info4,5,1 => train => info6,7,2 => train [etc.]". I used a memory to retain previous information and continuously train the network on it, and it works. Of course it's slower, because you don't process only the new information; you mix it with old information. I guess there are better ways to do it.
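A rough sketch of that memory/rehearsal idea (the buffer size, mix ratio, and toy model are all arbitrary illustrative choices):

```python
import random
import torch
import torch.nn as nn

# Keep a small memory of past batches and rehearse a few of them alongside
# each new batch, so old information keeps being revisited.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
memory, MEMORY_SIZE, REPLAY = [], 500, 2   # rehearse 2 old batches per new one

def train_on(batch):
    x, y = batch
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

for step in range(1000):
    new_batch = (torch.randn(32, 10), torch.randint(0, 2, (32,)))  # toy stream
    train_on(new_batch)                                            # learn the new info
    for old_batch in random.sample(memory, min(REPLAY, len(memory))):
        train_on(old_batch)                                        # re-train on old info
    memory.append(new_batch)
    if len(memory) > MEMORY_SIZE:
        memory.pop(random.randrange(len(memory)))                  # keep the memory bounded
```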
A similar idea is used with experience replay in DQNs. For RL, it's important to ensure failure states are retained in the replay buffer so it keeps being reminded they are a failure or it starts to forget and then does dumb things. In RL the phenomenon is called 'catastrophic forgetting'.
Doubt. I know the old stories too, but large language models are essentially trained like that. Most never see the data more than once (a single epoch) and are evaluated periodically along the way.
Seems similar to contrastive learning in the way that the artificial dataset is created.
This seems like the progression from a Fourier transform to a fast Fourier transform.
How does forward forward work in relation to backprop?
Is this going to be another one of those throwaway ideas like "capsule networks"...
Dude who tf thinks brains learn with backprop
Who tf claimed they do?
I was thinking more of actual researchers than internet randos who don't seem to understand how backprop itself works.
> The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue.
Key quote. Still very cool though! Perhaps he can take it further and make it better than backprop.