Submitted by **abhitopia** t3_ytbky9
in **MachineLearning**

#
**liukidar**
t1_iwc2tbm wrote

Reply to comment by **maizeq** in **[Project] Erlang based framework to replace backprop using predictive coding** by **abhitopia**

Hello. Since it may be relevant to the conversation, I'd like to point out that the work by Song doesn't use FPA (except here, where they mathematically prove the identity between FPA PC and BP), and all the experimental results in his other papers are obtained via "normal" PC, where the prediction is updated at every iteration using gradient descent on the log joint probability (so, as far as my understanding of the theory is correct, it corresponds to MAP inference on a probabilistic model). I'm not 100% sure which papers by Millidge do and don't use it, but I'm quite confident that the majority don't (for example, here the predictions seem to be updated at every iteration; however, the paper cited by abhitopia apparently uses FPA). Unfortunately, I'm not familiar with the work by Tschantz, so I cannot comment on that.
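To make "normal" PC concrete, here is a minimal sketch, not code from any of the cited papers. It assumes a small layered network with unit-variance Gaussian layers, so the negative log joint reduces to a sum of squared prediction errors. The names `prediction_errors`, `energy` and `infer`, the tanh nonlinearity, the layer sizes and the step size are all illustrative assumptions. The hidden value nodes are updated by gradient descent on this energy, with the predictions recomputed from the current values at every iteration.

```python
import numpy as np

def f(x):
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def prediction_errors(weights, xs):
    # eps[l] is the error at layer l+1: x_{l+1} - W_l f(x_l)
    return [xs[l + 1] - weights[l] @ f(xs[l]) for l in range(len(weights))]

def energy(weights, xs):
    # Negative log joint up to a constant (assumes unit-variance Gaussians)
    return 0.5 * sum(float(e @ e) for e in prediction_errors(weights, xs))

def infer(weights, x_in, y, steps=50, lr=0.1):
    # Initialise hidden layers with a feed-forward pass, then clamp the output
    xs = [x_in]
    for W in weights:
        xs.append(W @ f(xs[-1]))
    xs[-1] = y
    history = [energy(weights, xs)]
    for _ in range(steps):
        eps = prediction_errors(weights, xs)  # recomputed at every iteration
        new_xs = list(xs)
        for k in range(1, len(weights)):
            # dF/dx_k = eps_k - f'(x_k) * (W_k^T eps_{k+1})
            grad = eps[k - 1] - df(xs[k]) * (weights[k].T @ eps[k])
            new_xs[k] = xs[k] - lr * grad
        xs = new_xs
        history.append(energy(weights, xs))
    return xs, history

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [0.5 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
x_in, y = rng.standard_normal(4), rng.standard_normal(2)
xs, history = infer(weights, x_in, y)
print(history[0], history[-1])  # energy before and after inference
```

Only the hidden layers move here; the input and the clamped output stay fixed, and a weight update would normally follow once inference has settled.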

#
**maizeq**
t1_iwcimaq wrote

Thanks for the reply, there was some nuance left out of my comment since it was getting long enough, but if you take a closer look you'll find they all more or less adopt similar assumptions to make the two equivalent, and all suffer from the same points.

To be more specific:

The Millidge paper, which most of the BP = PC literature is based on, uses the FPA and is not a descent on the log joint. (It also uses inverted models, as I mentioned.)

This paper by Song which was published in NeurIPS doesn't use the FPA-PC "directly", but achieves effectively the same thing by requiring the weight update to occur at a *precise* inference step, and requires that the modes are initialised to a feed-forward pass, and also requires the inference learning rate to be exactly 1. (All required for equivalence)

Does this sound familiar? That's right, this is literally computationally equivalent to backprop! (a forward pass and a sequential coordinated backward pass). This is intuitively obvious if you read the paper but you can see the Rosenbaum paper to see it play out experimentally also.
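The three conditions above can be checked numerically. The following is a simplified sketch under stated assumptions, not the implementation from the Song paper: a small tanh network, a squared-error loss, weights held fixed while the gradients are read out, and hypothetical function names (`backprop_grads`, `scheduled_pc_grads`). It compares the gradients read out under the schedule with ordinary backprop gradients.

```python
import numpy as np

def f(x):
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def forward(weights, x_in):
    # Feed-forward pass: x_{l+1} = W_l f(x_l)
    xs = [x_in]
    for W in weights:
        xs.append(W @ f(xs[-1]))
    return xs

def backprop_grads(weights, x_in, y):
    # Gradients of 0.5 * ||y - output||^2 with respect to each W_l
    xs = forward(weights, x_in)
    delta = y - xs[-1]
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = -np.outer(delta, f(xs[l]))
        delta = df(xs[l]) * (weights[l].T @ delta)
    return grads

def scheduled_pc_grads(weights, x_in, y, gamma=1.0):
    L = len(weights)
    xs = forward(weights, x_in)   # condition 1: feed-forward initialisation
    xs[-1] = y                    # clamp the output layer to the target
    grads = [None] * L
    for t in range(L):
        # eps[l] is the prediction error at layer l+1
        eps = [xs[l + 1] - weights[l] @ f(xs[l]) for l in range(L)]
        l = L - 1 - t             # condition 3: W_l is read out only at step t
        grads[l] = -np.outer(eps[l], f(xs[l]))
        new_xs = list(xs)
        for k in range(1, L):     # one synchronous inference step, hidden layers
            dF = eps[k - 1] - df(xs[k]) * (weights[k].T @ eps[k])
            new_xs[k] = xs[k] - gamma * dF   # condition 2: gamma must be 1
        xs = new_xs
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [0.5 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
x_in, y = rng.standard_normal(4), rng.standard_normal(2)

bp = backprop_grads(weights, x_in, y)
pc = scheduled_pc_grads(weights, x_in, y)
print(all(np.allclose(a, b) for a, b in zip(bp, pc)))
```

In this sketch the two sets of gradients agree up to floating-point tolerance. Changing `gamma` to, say, 0.5 rescales the errors that reach the lower layers, so the match is lost below the top layer.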

The Salvatori paper you linked uses the algorithm from the aforementioned Song paper, and so the same points apply. Note how they do not empirically evaluate "IL", which, in their terminology, corresponds to the actual PC algorithm.

Finally the Kinghorn paper you linked refers to standard uninverted (generative) PC, and isn't part of the BP=PC literature. (Note how label accuracy for MNIST is 80%, whereas in the inverted PC=BP models it can reach 97%).

From my practical experience implementing a PC library, the subpar performance of supervised generative PC for classification remains a difficulty. What's more, when using standard PC (in both inverted and uninverted settings), you have to be far more careful (vs. FPA) because the dynamics during inference are more complex: standard PC takes into account the current top-down beliefs at every time-step, something that is not done by the FPA.

As such you can easily experience divergence, or a failure to converge. This is likely why I haven't seen a single example of standard PC evaluated on a deep/complex inverted model. All the instances you see of "PC" evaluated on RNNs, CNNs, deep MLPs etc. are FPA-PC (or the alternatives I mentioned above).
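The difference between the two inference schemes can be sketched in a few lines. This is an illustrative toy under assumptions of my own (tanh units, unit-variance Gaussian layers, hypothetical names `bp_deltas` and `pc_errors`, a conservative step size), not code from any of the papers. With `fixed_prediction=True` the predictions and activation derivatives are frozen at their feed-forward values, and the errors settle at the backprop error signals. With `fixed_prediction=False` both are recomputed from the current beliefs at every step, and the errors settle somewhere else.

```python
import numpy as np

def f(x):
    return np.tanh(x)

def df(x):
    return 1.0 - np.tanh(x) ** 2

def forward(weights, x_in):
    xs = [x_in]
    for W in weights:
        xs.append(W @ f(xs[-1]))
    return xs

def bp_deltas(weights, x_in, y):
    # Backprop error signals (negative loss gradients) for layers 1..L
    L = len(weights)
    xs = forward(weights, x_in)
    deltas = [None] * (L + 1)
    deltas[L] = y - xs[L]
    for l in range(L - 1, 0, -1):
        deltas[l] = df(xs[l]) * (weights[l].T @ deltas[l + 1])
    return deltas[1:]

def pc_errors(weights, x_in, y, fixed_prediction, steps=500, lr=0.1):
    L = len(weights)
    xs_ff = forward(weights, x_in)
    xs = list(xs_ff)
    xs[-1] = y  # clamp the output layer to the target

    def errors_and_slopes(cur):
        if fixed_prediction:
            # FPA: predictions and derivatives stay at feed-forward values
            eps = [cur[l + 1] - xs_ff[l + 1] for l in range(L)]
            slopes = [df(x) for x in xs_ff]
        else:
            # Standard PC: both follow the current beliefs
            eps = [cur[l + 1] - weights[l] @ f(cur[l]) for l in range(L)]
            slopes = [df(x) for x in cur]
        return eps, slopes

    for _ in range(steps):
        eps, slopes = errors_and_slopes(xs)
        new_xs = list(xs)
        for k in range(1, L):
            step = eps[k - 1] - slopes[k] * (weights[k].T @ eps[k])
            new_xs[k] = xs[k] - lr * step
        xs = new_xs
    return errors_and_slopes(xs)[0]

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [0.5 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
x_in, y = rng.standard_normal(4), rng.standard_normal(2)

deltas = bp_deltas(weights, x_in, y)
fpa = pc_errors(weights, x_in, y, fixed_prediction=True)
std = pc_errors(weights, x_in, y, fixed_prediction=False)
print(all(np.allclose(a, b) for a, b in zip(fpa, deltas)))
print(all(np.allclose(a, b) for a, b in zip(std, deltas)))
```

Under the FPA the update for each layer is linear in the errors, which is why it is so well behaved. The standard branch is nonlinear in the states, so its step size and iteration count need tuning per model.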

#
**liukidar**
t1_iwcnuo2 wrote

Hello. Thank you for your reply. I will go into the details as well since I think we're creating a good review of PC that may help all different kinds of people that are interested.

I think we should divide the literature into two sets: FPA PC and PC. All the papers we cited (Salvatori, Song, Millidge) do indeed belong to the FPA PC set. The aim of those papers was basically to give a theoretical proof that PC is able to replicate BP in the brain (despite relying on a lot of assumptions about how this can be done).

However, note that the goal of the papers you have cited is to provide an equivalence or approximation between PC and BP, and not to use PC with FPA as a general-purpose algorithm. In fact, the same authors have since released several papers that do NOT use FPA and are applied to different machine learning tasks. I believe that the original idea of creating a general library to run these experiments is more focused on applications, and not on reimplementing the experiments that show equivalences and approximations of PC. Something interesting to replicate, still from the same authors, is the following: https://arxiv.org/pdf/2201.13180.pdf. I am not aware of any library that has implemented something similar in an efficient way.

In relation to the accuracy, I'm not sure about what was reported by Kinghorn, but already in Whittington 2017 you can see that they get 98% accuracy on MNIST with standard PC. So the performance of PC on those tasks is not to be doubted.

I agree there's a lack of evaluations on deeper and more complex architectures. However, here you can see an example of what the algorithm you called IL can do: https://arxiv.org/abs/2211.03481.

#
**maizeq**
t1_iwcv5uh wrote

Thanks, and yes I agree, this might be useful to others.

As an aside, I have no qualms with standard generative PC (such as the paper you linked, and any other papers they have released in that vein; indeed, I'm a fan!). However, the discussion in this thread is about equating BP with PC, and in this regard, arguing "PC approximates backpropagation" when you really mean "this other heavily modified algorithm that was inspired by PC approximates backprop" is misleading. It is akin to saying an apple looks like an orange, if you throw away the apple and buy another orange.

It feels particularly egregious, when it turns out this modified algorithm is computationally equivalent to backpropagation, and as such the various neuroscientific justifications one applies may no longer hold (e.g. generative modelling is more sample efficient, or cortical hierarchies in the brain are characterised by top-down non-linear effects).

>In relation to the accuracy, I'm not sure about what was reported by Kinghorn, but already in Whittington 2017 you can see that they get 98% accuracy on MNIST with standard PC. So the performance of PC on those tasks is not to be doubted.

Yes, this is the 97% value I referred to in my comment. If you look at the Whittington 2017 paper, you will see this refers to an *inverted* architecture, in this case a small ANN trained with standard PC without the FPA.

Again, it's important to distinguish between the BP=PC literature, which this thread is related to, and other PC literature. I have no doubt plenty of interesting papers and insights exist in the latter!

#
**Ambitious_Smile_981**
t1_iwdrmam wrote

I don't see the problem of differentiating inverted and non-inverted architectures, as they are both generative models. The difference lies in *what* you are generating. In one case, you generate the label, and give as prior information the image, in the other, you generate the image giving the label as prior information.

Both have their advantages and disadvantages, but I don't see why the 'inverted' one is not interesting.

As for the BP = PC literature, I think it is interesting to show that, by simply introducing a temporal schedule for the weight updates of PC, we are able to obtain exact BP. I agree that this variation of PC loses all the advantages that PC has over BP, but it is still important to know that it is possible to derive exact backprop from a variational free energy.

#
**BerenMillidge**
t1_iy814ur wrote

Hi, author of some of the papers linked here. Broadly, Maizeq is right to distinguish between FPA-PC and ‘standard PC’ (the ‘inverted’ vs ‘generative’ direction of the PC net is a separate, orthogonal distinction). The equivalence between PC and BP only holds exactly in the case with the FPA (or some equivalent set of assumptions; for instance, the original Whittington paper uses the precision ratio tending to 0). Of course, all of these limits are in some sense extreme and eliminate some (but not all) of the major advantages of PC (in some sense this was inevitable, since if they exactly equal BP then they must very roughly have the same advantages and disadvantages as it). The way to view these works, at least as I have come to view them, is as an idealised exploration of a specific limit of PC. In recent work (https://arxiv.org/pdf/2206.02629), we expand on this limit idea and show that all current EBM approximations to BP, such as PC, Equilibrium-prop and Contrastive Hebbian learning, can be expressed as a single ‘infinitesimal inference limit’.

Overall I disagree that the work in this vein is particularly misleading, although this is a subjective assessment. It is upfront about the assumptions you need to make to obtain equivalence to backprop, as well as how this departs from standard PC.

Of course, from a neuroscientific perspective, this limit is perhaps not the most realistic, so we are also exploring the ML performance of more ‘standard’ PC versions which are more biologically plausible and which don’t approximate backprop, as well as specifically understanding the special advantages and disadvantages of these algorithms. For instance, in a recent paper (https://www.biorxiv.org/content/biorxiv/early/2022/05/18/2022.05.17.492325.full.pdf), we propose a new understanding of standard PC as ‘prospective configuration’ and demonstrate how this version of PC can outperform backprop in a number of its properties. We also have a more theoretical analysis of standard PC (https://arxiv.org/pdf/2207.12316) where we show that although it differs from backprop, it can also converge to minima of a supervised loss function, and has close links to target-propagation and hence Gauss-Newton optimization. Our groups have also explored other potential advantages of PC over BP, including its ability to learn arbitrary recurrent computation graphs (https://arxiv.org/pdf/2201.13180), the fact that you can significantly speed it up with incremental variants, and that you can get PC to perform a mix of iterative and amortised inference (https://arxiv.org/pdf/2204.02169).

In terms of the hardware, I have also looked into this a little, and my feeling is that while PC has better parallelism properties than BP, it is unlikely to outperform BP on a GPU due to the need to iteratively perform the inference phase, while BP just has a sequential forward and backward pass. GPUs are now getting very highly optimised for the exact style of computations needed in BP for large-scale ANNs. PC does possess a much higher degree of parallelism and locality than BP and on a sufficiently distributed architecture may eventually prove better, especially once we start building proper ‘neuromorphic’ processor-in-memory architectures. However, this seems likely to be many years away. I haven’t read much about Erlang, so I’m not sure if it possesses the necessary degree of parallelism. One possibility is that Erlang with PC might allow you to move to a different point on the Pareto frontier of having lots of CPUs and developing learning algorithms comparable in performance with doing BP on a single GPU. I haven’t run any Fermi-style estimates of whether this is feasible or not. We have some calculations about this in a forthcoming paper, but this is on a highly abstract computation model of ‘parallel matrix multiplications’ and I haven’t figured out what the actual equivalent calculations for realistic hardware would look like.
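To illustrate the shape of this trade-off, here is a toy operation count. It is not the calculation from the forthcoming paper and not a measurement. Every modelling choice is an assumption made for illustration: each matrix multiplication costs one unit, communication is free, weight-gradient computation in BP overlaps with the backward pass, and the function names `bp_cost` and `pc_cost` are hypothetical.

```python
import math

def bp_cost(num_layers):
    # Forward pass: one matmul per layer, strictly sequential.
    # Backward pass: one matmul per layer for the error signal, also sequential.
    # Weight gradients are counted as work but assumed to add no depth.
    return {"sequential_steps": 2 * num_layers, "total_ops": 3 * num_layers}

def pc_cost(num_layers, inference_iters, workers):
    # Per inference iteration every layer needs a prediction (W f(x)) and a
    # feedback term (W^T eps); layers are independent within one iteration,
    # so they can be spread over the available workers.
    rounds = math.ceil(num_layers / workers)
    steps = inference_iters * 2 * rounds + rounds  # inference + weight update
    ops = inference_iters * 2 * num_layers + num_layers
    return {"sequential_steps": steps, "total_ops": ops}

layers, iters = 10, 20
print("BP            ", bp_cost(layers))
print("PC, 1 worker  ", pc_cost(layers, iters, workers=1))
print("PC, 10 workers", pc_cost(layers, iters, workers=10))
```

In this toy model, fully layer-parallel PC only beats BP on sequential depth when the number of inference iterations is smaller than roughly the number of layers, and it always does more total work. Realistic hardware adds memory bandwidth and communication costs that this count ignores.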
