Submitted by currentscurrents t3_10n5e8z in MachineLearning

One problem with distributed learning with backprop is that the first layer can't update its weights until the computation has travelled all the way down to the last layer and then backpropagated back up. If all your layers are on different machines connected by a high-latency internet connection, this will take a long time.

In forward-forward learning, learning is local - each layer operates independently and only needs to communicate with the layers above and below it.

The results are almost-but-not-quite as good as backprop. But each layer can immediately update its weights based only on the information it received from the previous layer. Network latency no longer matters; the limit is just the bandwidth of the slowest machine.
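Concretely, each layer trains against Hinton's "goodness" objective on positive vs. negative data. Here's a minimal PyTorch sketch of one such layer - the class name, threshold, and optimizer settings are illustrative, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One locally-trained layer: no gradients ever flow in from layers above."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Length-normalize the input so a layer can't just reuse the previous layer's goodness.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        h_pos, h_neg = self.forward(x_pos), self.forward(x_neg)
        # "Goodness" = mean squared activation: push it above the threshold for
        # positive data and below it for negative data.
        g_pos, g_neg = h_pos.pow(2).mean(dim=1), h_neg.pow(2).mean(dim=1)
        loss = (F.softplus(self.threshold - g_pos) + F.softplus(g_neg - self.threshold)).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Hand only detached activations to the next layer; that's all it ever needs.
        return h_pos.detach(), h_neg.detach()
```

A stack of these trains as activations flow through it; since only detached activations ever cross a layer boundary, each layer (or machine) can update as soon as its input arrives.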

62

Comments


[deleted] t1_j670cvp wrote

Commenting because I’m also interested.

0

Red-Portal t1_j673lux wrote

> If all your layers are on different machines connected by a high-latency internet connection, this will take a long time.

This is called model parallelism, and this is exactly why you don't want to do it... unless you're forced to. That is, at the scale of current large language monstrosities, the model might not fit on a single node. But other than that, model parallelism is well known to be bad, so people avoid it. Nonetheless, this is a known issue, and lots of work has gone into improving data parallelism instead, with asynchronous updates like HOGWILD! and frameworks like Horovod, because we know it scales better.
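For contrast, the data-parallel recipe keeps a full model copy on every worker and only exchanges gradients. A rough Horovod-on-PyTorch sketch (the model, optimizer, and hyperparameters are placeholders, just to show the shape):

```python
import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(784, 10).cuda()     # every worker holds a full copy of the model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across workers every step, and start everyone from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```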

19

currentscurrents OP t1_j674tf3 wrote

They have some downsides though. HOGWILD! requires a single shared memory, and Horovod requires every machine to have a copy of the entire model.
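To make the first constraint concrete, the usual PyTorch HOGWILD!-style pattern looks roughly like this - every worker process has to attach to the same host memory, so it can't span machines (toy model and training loop, purely illustrative):

```python
import torch
import torch.multiprocessing as mp
import torch.nn.functional as F

def train(model, steps=100):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        x = torch.randn(32, 784)            # stand-in for real minibatches
        y = torch.randint(0, 10, (32,))
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                          # lock-free updates to the shared weights

if __name__ == "__main__":
    model = torch.nn.Linear(784, 10)
    model.share_memory()                    # all workers must see the same host memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```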

A truly local training method would mean your model could be as big as all the machines put together. That order-of-magnitude increase in size could outweigh the poorer performance of forward-forward learning.

No idea how you'd handle machines coming and going, though - you'd have to dynamically resize the network somehow. There are still other unsolved problems before we could have a GPT@home.

15

Shevizzle t1_j67fv7x wrote

Ray would potentially be a good platform for this.
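Something like one actor per layer, maybe. A rough sketch - the LayerWorker class, layer sizes, and goodness loss are made up for illustration; only the Ray calls are standard API:

```python
import ray
import torch
import torch.nn.functional as F

ray.init()

@ray.remote
class LayerWorker:
    def __init__(self, d_in, d_out, lr=1e-3):
        self.layer = torch.nn.Linear(d_in, d_out)
        self.opt = torch.optim.Adam(self.layer.parameters(), lr=lr)

    def train_step(self, x_pos, x_neg, threshold=2.0):
        # Local goodness-based update; nothing is backpropagated to other actors.
        g_pos = F.relu(self.layer(x_pos)).pow(2).mean(dim=1)
        g_neg = F.relu(self.layer(x_neg)).pow(2).mean(dim=1)
        loss = (F.softplus(threshold - g_pos) + F.softplus(g_neg - threshold)).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        return (F.relu(self.layer(x_pos)).detach(),
                F.relu(self.layer(x_neg)).detach())

# Each actor can be scheduled on a different node; only activations cross the wire.
workers = [LayerWorker.remote(784, 256), LayerWorker.remote(256, 256)]
x_pos, x_neg = torch.randn(32, 784), torch.randn(32, 784)
for w in workers:
    x_pos, x_neg = ray.get(w.train_step.remote(x_pos, x_neg))
```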

1

master3243 t1_j67jwad wrote

Hinton says that it does not generalize as well on the toy problems he investigates. An algorithm not doing well even on toy problems is often not a good sign. I predict that unless someone discovers a breakthrough, it will stay worse than backprop despite running faster (since it avoids the bottlenecks you describe).

13

currentscurrents OP t1_j67lie8 wrote

I'm messing around with it, trying to scale it to a non-toy problem - maybe adapt it to one of the major architectures like CNNs or transformers. I'm not sitting on a ton of compute though; it's just me and my RTX 3060.

A variant paper, Predictive Forward-Forward, claims performance equal to backprop. They operate the model in a generative mode to create the negative data.
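For reference, the original paper's supervised setup makes negative data without any generative step: it just overlays a wrong label on the input. A quick sketch of that simpler recipe (function names are mine, not from either paper; PFF's generative version is more involved):

```python
import torch

def overlay_label(images, labels, num_classes=10):
    # images: (batch, 784) flattened MNIST; the first `num_classes` pixels carry a one-hot label
    x = images.clone()
    x[:, :num_classes] = 0.0
    x[torch.arange(len(labels)), labels] = images.max()
    return x

def make_pos_neg(images, labels, num_classes=10):
    x_pos = overlay_label(images, labels, num_classes)            # correct label -> positive
    wrong = (labels + torch.randint(1, num_classes, labels.shape)) % num_classes
    x_neg = overlay_label(images, wrong, num_classes)             # wrong label -> negative
    return x_pos, x_neg
```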

6

marcingrzegzhik t1_j67s9fp wrote

Forward-forward learning is a very interesting concept, and I think that in some cases it could definitely yield better results than distributed learning with backprop. It really depends on the size of the model, the latency of the connection, and the bandwidth of the slowest machine. I'm sure that in some cases it could be much faster, but I'm curious to know if there are any other advantages to using forward-forward learning over backprop for distributed learning.

2

theoryanddata t1_j69dx28 wrote

I remember reading about this type of concept, and iirc there does seem to be quite a bit of local learning in biological neural networks. But global convergence of the model seems like a challenge with this type of scheme. Maybe there's some way to incorporate a periodic global backprop to address that? Has anyone tried it? Or maybe you don't even need it and the problem will disappear with enough scale.
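One way that hybrid could look, purely as a sketch of the idea (layer sizes, optimizers, and the mixing schedule are arbitrary choices, not from any paper): run mostly local, FF-style goodness updates, and every so often do one end-to-end backprop pass to pull the stack toward global agreement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layers = nn.ModuleList([nn.Linear(784, 256), nn.Linear(256, 256), nn.Linear(256, 10)])
local_opts = [torch.optim.Adam(l.parameters(), lr=1e-3) for l in layers]
global_opt = torch.optim.Adam(layers.parameters(), lr=1e-4)

def local_step(x_pos, x_neg, threshold=2.0):
    # FF-style: each layer trains on detached inputs, so no gradient crosses layer boundaries.
    for layer, opt in zip(layers, local_opts):
        g_pos = F.relu(layer(x_pos)).pow(2).mean(dim=1)
        g_neg = F.relu(layer(x_neg)).pow(2).mean(dim=1)
        loss = (F.softplus(threshold - g_pos) + F.softplus(g_neg - threshold)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        x_pos = F.relu(layer(x_pos)).detach()
        x_neg = F.relu(layer(x_neg)).detach()

def global_step(x, y):
    # Occasional end-to-end backprop pass over the same layers.
    h = x
    for layer in layers[:-1]:
        h = F.relu(layer(h))
    loss = F.cross_entropy(layers[-1](h), y)
    global_opt.zero_grad(); loss.backward(); global_opt.step()

# e.g. many local steps for every global one; how to schedule this is the open question.
```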

2

Lord_of_Many_Memes t1_j69sr57 wrote

My general feeling is that even if it works, it will take more steps to reach the same loss as backprop, which in some sense cancels out the hardware advantage of the forward-forward setting. I tried it on GPT and WikiText and it just doesn't converge on real problems; maybe something crucial is still missing.

1