abhitopia OP t1_ixdnki4 wrote

u/maizeq - I have finished reading the Rosenbaum paper. It is certainly a very accessible and useful paper for understanding the details and nuances of the various PC implementations, so thank you for sharing it.

The author's objective seems to be to compare various versions of the algorithm and highlight their subtle differences, and the paper does a great job at it. It does not, however, exploit local synaptic plasticity in its implementation (it uses loops), which is exactly where I think the limitation of PyTorch, Jax, and TensorFlow lies.

For instance, one could imagine each node and each weight in a PC (non-FPA) MLP as a standalone process, communicating with the other node and weight processes only via message passing, so that the whole network runs completely asynchronously. Furthermore, we can limit the amount of computation by thresholding the value of the error nodes (so weight updates in the connected weight processes only happen when the error is large enough), in a sense enforcing sparsity.
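To make the idea concrete, here is a minimal single-node sketch in Python (the project itself would use Erlang/Rust actors; the class, the `ERROR_THRESHOLD` constant, and the learning rate are all hypothetical illustrations, not an actual implementation):

```python
ERROR_THRESHOLD = 1e-3  # hypothetical cutoff below which no error message is sent

class ValueNode:
    """One value node of a PC network; in the real system this would be
    a standalone process talking to its neighbours via messages."""

    def __init__(self, value):
        self.value = value
        self.error = 0.0  # local prediction error

    def receive_prediction(self, prediction, lr=0.1):
        # Purely local inference step: nudge the value toward the prediction.
        self.error = self.value - prediction
        self.value -= lr * self.error
        # Only propagate the error if it is large enough (enforced sparsity):
        # connected weight processes stay idle for sub-threshold errors.
        return self.error if abs(self.error) > ERROR_THRESHOLD else None

node = ValueNode(value=1.0)
messages = [node.receive_prediction(0.0) for _ in range(100)]
# Early errors get sent onward; once the node settles, it goes silent (None).
```

Note how nothing in the update touches any global state: the node could run on its own schedule and still converge.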

Maybe I am wrong, but I do not (yet) see why it should not be possible to add new nodes to this simple MLP in a hot fashion; for example, if the activity in any layer increases beyond a certain threshold, scale the layer up automatically while preserving 2% activity per layer.
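A hedged sketch of what such hot scaling could look like; the k-winners rule, the growth trigger, and the 1.5x growth factor are made-up illustrations, not a worked-out design:

```python
import numpy as np

TARGET_SPARSITY = 0.02  # keep ~2% of units active per layer
GROW_TRIGGER = 0.05     # hypothetical: grow when mean activity exceeds this

def k_winners(activations, sparsity=TARGET_SPARSITY):
    """Keep only the top-k activations in a layer, zeroing the rest."""
    k = max(1, int(len(activations) * sparsity))
    out = np.zeros_like(activations)
    top = np.argsort(activations)[-k:]
    out[top] = activations[top]
    return out

def maybe_grow(layer_size, mean_activity):
    """Hot-add units when the layer saturates; the fixed 2% budget then
    covers more absolute units, so capacity grows without retraining."""
    return int(layer_size * 1.5) if mean_activity > GROW_TRIGGER else layer_size
```

Because each node is its own process, growing a layer would just mean spawning new node and weight processes, not rebuilding a static tensor graph.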

Contrast this with GPU-based backward passes: a lot of wasteful computation can be prevented. At the very least, the backward pass doesn't need to wait for the forward pass in the EM-like learning algorithm that PC is.

P.S. - My motivation isn't to show PC == BP, but rather to ask whether PC can replace BP, and whether it is worth it.


abhitopia OP t1_ixdkoos wrote

Hey u/miguelstar98

> but you haven't really answered my questions, or explained the source of your confidence or perhaps I haven't fully grasped enough of the nuances of the problem to even have useful responses for you.

I am not sure which questions? Did you mean what you mentioned in your deleted post (which wasn't accessible to me)?

Anyways, I can see your original post now. Thanks for undeleting it.

>Software Designer's perspective:

I think the actor model just makes a lot of sense for asynchronous, concurrent computation. Having said that, since Erlang is slow, I am actually considering using the Actix library in Rust. (The first step for me is just to write pseudocode of the algorithm based on message passing.)
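Sketched with Python threads rather than Actix (which I haven't designed against yet), a per-weight actor might look like the following; the message format and the Hebbian-style update rule are assumptions for illustration:

```python
import queue
import threading

class WeightActor:
    """A single weight as an actor: it only ever sees local messages."""

    def __init__(self, w=0.5, lr=0.1):
        self.w = w
        self.lr = lr
        self.mailbox = queue.Queue()

    def run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # poison pill: shut down cleanly
                return
            pre_activity, error = msg
            # Local plasticity: the update needs only the presynaptic
            # activity and the postsynaptic error, no global backward pass.
            self.w += self.lr * error * pre_activity

actor = WeightActor()
t = threading.Thread(target=actor.run)
t.start()
actor.mailbox.put((1.0, 0.2))  # (presynaptic activity, postsynaptic error)
actor.mailbox.put(None)
t.join()
```

In Erlang or Actix each weight would be a real lightweight process/actor with its own mailbox; the point is only that the update loop never blocks on anything global.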


>From a hardware design perspective:

I am not sure what you want to say. The difference here is not the hardware but a change in algorithm (BP vs PC). Afaik, BP requires synchronised forward and backward passes.

>From the Biologist's perspective:

I am not sure again. The intention isn't to say that biological plausibility is superior or that we MUST imitate nature. It is rather something that current ML libraries don't do but that seems doable in light of new PC research.

>From my personal perspective: I hope you can help clear up my understanding but what is the difference between predictive coding and model ensembles? I know that probably sounds like a dumb question, but can’t we just take a bunch of models that are really good at particular tasks and have a software layer that controls when to use which model and then combine their outputs to solve any general problem? Also if I need fault tolerance or I need to run inference, can’t I just use a cluster computer, why not 2? Isn’t this a solved problem when training large language models?

Hmm. Model ensembles and the learning algorithms used to train those models are two different topics. The focus here is not on the "inference" (FP) part, which current libraries are really good at, but on the "learning" (BP) part. Not sure what else to say.
I highly recommend reading this tutorial on PC (and contrasting it against BP).


abhitopia OP t1_ixc6b1m wrote

Hey u/miguelstar98, OP here and still very enthusiastic. I have spent the last 2 weeks studying predictive coding and am still working through a lot of nuances. The more I think and read about it, the more confident I am about the utility of this project.

Btw, do you know what the comment was that got deleted by the moderator?

I shared the "neuroevolution through Erlang" reference in my original post too. Having read up on predictive coding, I still think coding this in Erlang is so much easier, to make it fully asynchronous and scalable, and to worry about optimisation only later (e.g. using Rust NIFs or CUDA kernels).

>"Probabilistic scripts for automating common-sense tasks" by Alexander Lew
"We Really Don't Know How to Compute!" - Gerald Sussman (2011)

Haven't watched these; do you mind sharing what you have in mind?


abhitopia OP t1_iw8xv1o wrote

You are right, neuromorphic hardware would be better. The reason right now is that everything runs on top of BEAM in Erlang, but I am hoping that we can use Rust to implement core functions as NIFs, as u/mardabx pointed out. https://discord.com/blog/using-rust-to-scale-elixir-for-11-million-concurrent-users

Having said that, I also do not think that speed is really the most critical problem to solve here. (For example, human brains are not even as fast as single BEAM threads.) Petaflops of compute are needed today because modern DL uses dense representations (unlike the brain) and needs to be retrained from scratch (it lacks continual learning). If a resilient and fault-tolerant system (say, written in Erlang/Elixir) existed that could learn continuously and be optimised (say, using sparse representations), it would eventually surpass the competition.


abhitopia OP t1_iw8udxk wrote

Thanks Mardabx for sharing your 3 cents. :) Very helpful.

The current ML systems lack scalability and fault tolerance, which in my mind are more critical than training speed. Remember that biological brains are not as fast either, but they are highly resilient and fault tolerant, and biological learning still surpasses some of the best AI trained on millions of human-equivalent lifetimes. This is the direction I want to go: a predictive-coding-based system that runs continually and is scaled on demand, but is never stopped.

Such a system would already be better than a biological brain in the sense that the brain is not scalable, while there is no such restriction on computer hardware.

Having said that, it is really impressive what performance gains can be had by using Rust (I didn't know it was even possible), and I am definitely open to using Rust to implement core functionality as NIFs (perhaps as an optimisation). Thanks again for sharing.


abhitopia OP t1_iw6lxjr wrote

I cannot agree more with you. In fact, I think a lot of CompSci techniques like LSH, Bloom filters, etc. can be used with sparse distributed representations (SDRs), which, besides being robust, also provide high capacity. And this is definitely one of the goals of the project as well.
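As a toy illustration of the kind of trick that composes nicely with SDRs, a random-hyperplane LSH (SimHash-style) gives near-duplicate vectors near-identical bit signatures; the dimensions and bit count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def simhash(vec, planes):
    """Random-hyperplane LSH: nearby vectors get nearby bit signatures."""
    return tuple(bool(b) for b in (planes @ vec) > 0)

dim, n_bits = 64, 16
planes = rng.standard_normal((n_bits, dim))  # shared random hyperplanes

x = rng.standard_normal(dim)
noisy = x + 0.001 * rng.standard_normal(dim)  # a near-duplicate of x
opposite = -x                                 # maximally dissimilar vector

ham = lambda a, b: sum(i != j for i, j in zip(a, b))
# ham(simhash(x), simhash(noisy)) is near 0; against `opposite` it is n_bits.
```

Signatures like these make approximate nearest-neighbour lookup cheap, which is exactly where sparse, high-dimensional representations tend to shine.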

Check out "Algorithmic Speedups via LSH and bio-inspired hashing".


abhitopia OP t1_iw6l77r wrote

Thanks for the response.

I am yet to read the work of Millidge, Tschantz, and Song in detail. I agree that this is not PC in the sense that came out of the neuroscience literature. I have only thoroughly read the Bogacz 2017 paper, and next on my list is "Can the Brain Do Backpropagation? — Exact Implementation of Backpropagation in Predictive Coding Networks" (also from Bogacz's group).

>If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation

The interesting bit for me is not the exact correspondence with PC (as described in neuroscience) but rather the property that makes it suitable for asynchronous parallelisation: local synaptic plasticity, which I believe still holds. The problem with backprop is not that it is inefficient; in fact, it is highly efficient. I just cannot see how backprop systems can be scaled, or how they can do online and continual learning.
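What I mean by local: the update to each weight needs only its own presynaptic activity and postsynaptic error, never a global gradient. A minimal sketch (the learning rate and shapes are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def local_update(W, x, eps, lr=0.01):
    """Delta W = lr * outer(eps, x): each entry W[j, i] changes using only
    the presynaptic activity x[i] and the postsynaptic error eps[j]."""
    return W + lr * np.outer(eps, x)

x = rng.standard_normal(4)    # presynaptic activities
eps = rng.standard_normal(3)  # postsynaptic prediction errors
W = np.zeros((3, 4))
W_new = local_update(W, x, eps)
```

Since no term in the rule depends on distant layers, each weight can update whenever its two local signals arrive, with no need to wait for a synchronised backward sweep.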

>In the case of backpropagation "a" corresponds to backpropagated errors, and the dynamical update equation corresponds to the recursive equations which defines backpropagation. I.e. we are assigning "a" to the value of dL/dt, for a loss L. (it's a little more than this, but I'm drunk so I'll leave that to you to discern). If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation because the error information still has to propagate backwards, albeit indirectly.

Can't we make a first-order approximation, like we do in any gradient descent algorithm? Again, I am emphasising that the issue is not only the speed of learning.

I will certainly check out the paper by Robert Rosenbaum; thanks for sharing it. I will comment more once I have read it.