
Red-Portal t1_j673lux wrote

> If all your layers are on different machines connected by a high-latency internet connection, this will take a long time.

This is called model parallelism, and it's exactly why you don't want to do it... unless you're forced to. That is, at the scale of current large language monstrosities, the model might not fit on a single node. Other than that, model parallelism is well known to scale poorly, so people avoid it. Still, it's a recognized problem, and a lot of work has gone into making data parallelism scale further instead, through lock-free asynchronous updates like HOGWILD! or efficient synchronous all-reduce like Horovod, because we know data parallelism scales better.
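For concreteness, here's a minimal sketch of what HOGWILD!-style training looks like in PyTorch (my illustration, not from the paper; the model and data are toy placeholders). All workers update the same shared-memory parameters without any locking:

```python
# HOGWILD!-style lock-free data parallelism: several processes train the same
# shared-memory model concurrently, each on its own minibatches.
import torch
import torch.multiprocessing as mp
import torch.nn as nn


def train_worker(model, steps=100):
    # Each process has its own optimizer, but the parameters it updates live
    # in shared memory and are the same objects in every worker.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        x = torch.randn(32, 10)   # stand-in for a real minibatch
        y = torch.randn(32, 1)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()                # lock-free update to the shared weights


if __name__ == "__main__":
    model = nn.Linear(10, 1)
    model.share_memory()          # put parameters in shared memory (HOGWILD!)
    workers = [mp.Process(target=train_worker, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```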

19

currentscurrents OP t1_j674tf3 wrote

They have their downsides, though. HOGWILD! requires a single shared memory, and Horovod requires every machine to hold a copy of the entire model.
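To illustrate that second point, here's a rough sketch using Horovod's PyTorch bindings (my illustration; toy model and random data). Every worker builds a full replica of the model, and only gradients get averaged across machines each step:

```python
# Horovod-style data parallelism: the model is fully replicated on every
# worker, and the DistributedOptimizer all-reduces gradients before each update.
# Launch with e.g.: horovodrun -np 4 python train.py
import horovod.torch as hvd
import torch
import torch.nn as nn

hvd.init()

model = nn.Linear(10, 1)          # full model replicated on every worker
opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # same initial weights everywhere

loss_fn = nn.MSELoss()
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()                    # gradients averaged across workers before the step
```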

A truly local training method would mean the model could be as big as all the machines put together. That order-of-magnitude increase in size could outweigh the poorer performance of forward-forward learning.
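The appeal of forward-forward here is that each layer only needs a local objective, so no gradients ever have to cross a layer boundary (or a network link). A rough sketch of that layer-local idea, using my own toy version of the "goodness" objective rather than the paper's exact recipe:

```python
# Layer-local training in the spirit of forward-forward: each layer optimizes
# its own goodness loss, so in principle each layer could sit on a different
# machine and only pass activations forward.
import torch
import torch.nn as nn


class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.opt = torch.optim.SGD(self.linear.parameters(), lr=0.01)
        self.threshold = threshold

    def forward(self, x):
        return torch.relu(self.linear(x))

    def local_update(self, x_pos, x_neg):
        # Goodness = mean squared activation; push it above the threshold for
        # positive data and below it for negative data. Only this layer's
        # parameters are updated -- nothing backpropagates to earlier layers.
        g_pos = self.forward(x_pos).pow(2).mean(dim=1)
        g_neg = self.forward(x_neg).pow(2).mean(dim=1)
        loss = torch.log1p(torch.exp(torch.cat([
            self.threshold - g_pos,   # want positive goodness high
            g_neg - self.threshold,   # want negative goodness low
        ]))).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach before handing activations to the next layer (or next machine).
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()


layers = [FFLayer(784, 256), FFLayer(256, 256)]
# Random noise stands in for real positive/negative data in this toy example.
x_pos, x_neg = torch.randn(32, 784), torch.randn(32, 784)
for layer in layers:
    x_pos, x_neg = layer.local_update(x_pos, x_neg)
```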

No idea how you'd handle machines coming and going, though; you'd have to dynamically resize the network somehow. There are still other unsolved problems before we could have a GPT@home.

15