maizeq t1_jd6kpnj wrote

> The Cerebras CS-2 is designed to accelerate unstructured sparsity, whereas GPUs are not.

Don't modern NVIDIA GPUs (Ampere onwards) have strong support for sparsity? Maximum theoretical FLOPs are doubled when doing sparse computation. From their documentation, the sparsity they support is fine-grained (a 2:4 pattern: two non-zero values kept in every contiguous group of four), which in practice seems close to unstructured pruning. Does the Cerebras chip achieve higher sparse FLOPs, or does the comparison not make sense?
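For anyone unfamiliar with what that 2:4 pattern means in practice, here is a toy sketch (my own illustration, not NVIDIA code): in every contiguous group of four weights, keep the two largest in magnitude and zero the rest.

```python
import numpy as np

def prune_2_4(w):
    """Prune a weight matrix to the 2:4 fine-grained sparsity pattern:
    in every contiguous group of 4 values along a row, keep the 2 with
    the largest magnitude and zero the rest. Assumes the row length is
    divisible by 4."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    flat = w.reshape(-1, 4)                            # groups of 4
    keep = np.argsort(-np.abs(flat), axis=1)[:, :2]    # top-2 indices per group
    rows = np.arange(flat.shape[0])[:, None]
    out_flat = out.reshape(-1, 4)                      # view into `out`
    out_flat[rows, keep] = flat[rows, keep]
    return out

w = np.arange(1, 9, dtype=float).reshape(2, 4)  # [[1,2,3,4],[5,6,7,8]]
print(prune_2_4(w))  # keeps [3, 4] and [7, 8] in each group of four
```

The pattern is "structured" in that exactly two of every four values survive, but within each group the surviving positions are arbitrary, which is why it tolerates fairly random pruning masks.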


maizeq t1_j7jwzai wrote

I can understand their (the Meta/Google engineers') frustration when perspectives like yours proliferate everywhere.

Transformers were invented at Google. OpenAI is overwhelmingly a net consumer of AI research, and incredibly closed off about the few innovations it has actually made. There is a graph somewhere of the research output of the various labs showing that, despite OpenAI's 300-400 or so employees, its publicly released open-access research is a ridiculously tiny fraction of that of the other labs. Consider the damage this might do if OpenAI's success convinces management at other labs to be more closed off with their own AI research, further concentrating the ownership of AI in the hands of a single, or a select few, corporations. In this sense OpenAI is actively harming the democratisation of AI, which, given the previously unseen productivity-generating effects AI will have, seems like a dangerous place to be.


maizeq t1_j69vuec wrote

Nice. How are you converting between dataset size and number of tokens?

Doesn't Common Crawl get deduplicated, and is that why the number of usable tokens decreases - or is it also curation? How much of that 380 TiB is actually usable?
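On converting dataset size to tokens: a common back-of-envelope heuristic (an assumption on my part, not something from this thread) is roughly 4 bytes per token for English text under a BPE tokenizer, scaled by whatever fraction survives dedup and filtering.

```python
def estimate_tokens(size_bytes, bytes_per_token=4.0, usable_fraction=1.0):
    """Back-of-envelope token count from raw corpus size.
    bytes_per_token ~4 is a rough heuristic for English text with a BPE
    tokenizer; usable_fraction accounts for dedup/curation losses."""
    return size_bytes * usable_fraction / bytes_per_token

TIB = 2**40
# e.g. 380 TiB of Common Crawl, assuming only 10% survives cleaning:
print(f"{estimate_tokens(380 * TIB, usable_fraction=0.10) / 1e12:.1f}T tokens")
```

Both knobs (bytes per token and usable fraction) swing the answer by large factors, which is presumably why published token counts vary so much between papers.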

Given the ostensibly impressive performance of the bilingual (Chinese+English) GLM-130B model that came out of Tsinghua University, that might very well be the case.


maizeq t1_j2w7p8k wrote

Is this following a pre-existing methodology from the literature, or is it something custom to your use case? I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space. How do you achieve something similar in your binary latent space?
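For anyone following along, here is a minimal sketch (my own, not the poster's code) of the continuous-space version being described: in Gaussian predictive coding, "attention" is often modelled as a learned precision (inverse variance) that scales each prediction-error unit.

```python
import numpy as np

# Toy "attention as precision" in Gaussian predictive coding: each error
# unit is weighted by a learned inverse variance (precision), so
# high-precision (attended) dimensions dominate inference and learning.
def weighted_error(x, mu, log_var):
    """Precision-weighted prediction error: pi * (x - mu), pi = exp(-log_var)."""
    return np.exp(-log_var) * (x - mu)

x = np.array([1.0, 1.0])
mu = np.array([0.0, 0.0])
log_var = np.array([0.0, 2.0])   # second dim is "unattended" (high variance)
print(weighted_error(x, mu, log_var))  # first dim dominates: [1.0, ~0.135]
```

Parameterising `log_var` (rather than the variance directly) keeps the precision positive under gradient optimisation, which is part of why the continuous formulation is convenient - and why a binary latent space seems like it would need a different mechanism.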

Sorry for all the questions!


maizeq t1_j1ckd3o wrote

I would be interested in helping. (Currently in AI research but not focussed on LLMs).

I don’t like the idea that the user feedback OpenAI is accumulating from ChatGPT is contributing to deepening their moat (I highly doubt they will release all that data publicly).

For a company founded on principles of openness to be working directly against the democratisation of AI, some serious criticism is warranted I think.

I could perhaps understand it if profitability were needed to cover the cost of their research, but the models they are commercialising are by and large based on the research of other labs, which are far more open about releasing their work. Their closed approach will simply incentivise other research labs to make their own research more closed as well, further increasing the likelihood of AI being concentrated in the hands of very few.


maizeq t1_iwcv5uh wrote

Thanks, and yes I agree, this might be useful to others.

As an aside, I have no qualms with standard generative PC (such as the paper you linked, and the other papers they have released in that vein - indeed, I'm a fan!). However, the discussion in this thread is about equating BP with PC, and in that regard, arguing "PC approximates backpropagation" when you really mean "this other heavily modified algorithm that was inspired by PC approximates backprop" is misleading. It is akin to saying an apple looks like an orange, after throwing away the apple and buying another orange.

It feels particularly egregious when it turns out this modified algorithm is computationally equivalent to backpropagation, since the various neuroscientific justifications one appeals to may then no longer hold (e.g. that generative modelling is more sample efficient, or that cortical hierarchies in the brain are characterised by top-down non-linear effects).

> In relation to the accuracy, I'm not sure about what reported by Kinghorn, but already in Whittington 2017, you can see that they get a 98% accuracy on MNIST with standard PC. So the performance of PC on those it's not to be doubted.

Yes, this is the 97% value I referred to in my comment; if you look at the Whittington 2017 paper you will see it refers to an inverted architecture - in this case a small ANN trained with standard PC, without the FPA assumption.

Again, it's important to distinguish between the BP=PC literature, which this thread is related to, and other PC literature. I have no doubt plenty of interesting papers and insights exist in the latter!


maizeq t1_iwcimaq wrote

Thanks for the reply. There was some nuance left out of my comment since it was already getting long, but if you take a closer look you'll find that they all adopt more or less the same assumptions to make the two equivalent, and all suffer from the same problems.

To be more specific:

The Millidge paper, on which most of the BP = PC literature is based, uses the FPA assumption and is therefore not a descent on the log joint. (It also uses inverted models, as I mentioned.)

The paper by Song, published in NeurIPS, doesn't use FPA-PC "directly", but achieves effectively the same thing by requiring the weight update to occur at a precise inference step, the value nodes to be initialised by a feed-forward pass, and the inference learning rate to be exactly 1. (All of these are required for the equivalence.)

Does this sound familiar? That's right: it is literally computationally equivalent to backprop (a forward pass followed by a sequential, coordinated backward pass). This is intuitively obvious if you read the paper, but you can also see it play out experimentally in the Rosenbaum paper.
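The collapse onto backprop under those three conditions can be checked numerically. Below is a toy sketch (my own, not code from the papers under discussion) for a one-hidden-layer network: with value nodes initialised by a feed-forward pass, the hidden error starts at zero, and a single inference step with rate exactly 1 "assigns" it precisely the backpropagated delta.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t = rng.normal(size=2), rng.normal(size=1)
f, fprime = np.tanh, lambda a: 1 - np.tanh(a) ** 2

# Feed-forward pass (doubles as the PC value-node initialisation).
a1 = W1 @ x
h, y = f(a1), W2 @ f(a1)

# Backprop deltas for L = 0.5 * ||y - t||^2.
delta_y = y - t
delta_h = (W2.T @ delta_y) * fprime(a1)

# PC errors under the equivalence conditions: feed-forward init makes the
# hidden error start at zero; clamping the output to the target makes the
# output error the loss residual; one inference step with rate 1 then
# assigns the hidden error its backprop value.
eps_y = y - t
eps_h = np.zeros(3)
eps_h = eps_h + 1.0 * ((W2.T @ eps_y) * fprime(a1) - eps_h)

print(np.allclose(eps_h, delta_h))  # True: identical to the backprop delta
```

The circularity of the check is the point: under these conditions the "inference dynamics" reduce to evaluating backprop's recursion once per layer, in sequence.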

The Salvatori paper you linked uses the algorithm from the aforementioned Song paper, so the same points apply. Note that they do not empirically evaluate "IL", which, in their terminology, corresponds to the actual PC algorithm.

Finally, the Kinghorn paper you linked concerns standard uninverted (generative) PC and isn't part of the BP=PC literature. (Note how its label accuracy on MNIST is 80%, whereas the inverted PC=BP models can reach 97%.)

From my practical experience implementing a PC library, the subpar classification performance of supervised generative PC remains a difficulty. What's more, when using standard PC (in both inverted and uninverted settings) you have to be far more careful (vs. FPA), because the inference dynamics are more complex: standard PC takes into account the current top-down beliefs at every time-step, which the FPA does not.

As such you can easily get divergence, or a failure to converge. This is likely why I haven't seen a single example of standard PC evaluated on a deep/complex inverted model. All the instances you see of "PC" evaluated on RNNs, CNNs, deep MLPs etc. are FPA-PC (or the alternatives I mentioned above).
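To make the contrast concrete, here is a toy sketch (mine, unit variances) of standard non-FPA inference on a three-layer chain: the free value node is relaxed against both the error in the prediction it receives and the error in the prediction it sends downward, at every step. It is this second, top-down coupling that the FPA freezes out.

```python
import numpy as np

rng = np.random.default_rng(0)
f, fp = np.tanh, lambda a: 1 - np.tanh(a) ** 2
W1 = rng.normal(size=(4, 3)) * 0.5   # top latent (3) -> hidden (4)
W2 = rng.normal(size=(3, 4)) * 0.5   # hidden (4) -> observation (3)
x2 = rng.normal(size=3)              # clamped top-level cause
x0 = rng.normal(size=3)              # clamped observation
x1 = f(W1 @ x2)                      # free hidden node, feed-forward init

def free_energy(x1):
    e1 = x1 - f(W1 @ x2)             # error in the prediction x1 receives
    e0 = x0 - f(W2 @ x1)             # error in the prediction x1 sends
    return 0.5 * (e1 @ e1 + e0 @ e0)

F_init = free_energy(x1)
lr = 0.05                            # too large a rate and this can diverge
for _ in range(500):
    e1 = x1 - f(W1 @ x2)
    e0 = x0 - f(W2 @ x1)
    x1 = x1 + lr * (-e1 + W2.T @ (e0 * fp(W2 @ x1)))  # descend F w.r.t. x1
print(free_energy(x1) < F_init)      # True: inference lowered F
```

Even in this tiny example the inference rate has to be chosen conservatively; in deep inverted models the interacting top-down and bottom-up terms are what make convergence fragile.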


maizeq t1_iw5mh0v wrote

I will save you a significant amount of wasted time and tell you now that predictive coding (as it has been described for 20 years or so in the neuroscience literature) is not equivalent to backpropagation in the way that Millidge, Tschantz, Song and co. have been suggesting for the last two years.

It is extremely disheartening to see them continue to make this claim when they are clearly using a heavily modified version of predictive coding (called FPA-PC, for fixed-prediction-assumption PC), which is so distinct from PC that it is a significant stretch to lend it the same name.

For one, predictive coding under the FPA no longer corresponds to MAP estimation on a probabilistic model (gradient descent on the log joint probability), so it loses its interpretation as a variational Bayes algorithm (something that, afaik, they have not explicitly mentioned thus far).
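To spell out the MAP point: in standard PC (hierarchical Gaussian model, unit variances for simplicity) both inference and learning descend a single objective, the negative log joint:

```latex
F = \sum_l \tfrac{1}{2}\,\lVert x_l - f(W_l\, x_{l+1}) \rVert^2,
\qquad
\dot{x}_l \propto -\frac{\partial F}{\partial x_l},
\qquad
\Delta W_l \propto -\frac{\partial F}{\partial W_l}.
```

Under the FPA, the predictions $f(W_l x_{l+1})$ inside the errors are frozen at their feed-forward values during inference, so the node updates are no longer the gradient of any single such objective.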

Secondly, if you spend any appreciable time on predictive coding you will realise that the computational complexity of FPA-PC is at best equal to that of backpropagation (and in most cases significantly worse).

Thirdly, FPA-PC requires "inverted" PC models in order to form this connection with backpropagation - models in which high-dimensional observations (such as images) parameterise the latent states, so that they are no longer generative models in the traditional sense.

FPA-PC can really be understood as just a dynamical implementation of backprop (with very little actual connection to predictive coding), and as an implementation of backpropagation it is in many ways practically inefficient and meaningless. An analogy makes this clearer: say you want to assign the value of f(x) to a variable a. You could simply do a = f(x), or you could set a to evolve under da/dt = f(x) - a, whose fixed point is a = f(x). But if you already have the value of f(x) in hand (say, 25), the latter is just a roundabout way of performing the assignment.
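The analogy can be run directly. A minimal sketch (names mine) of relaxing a variable toward a value you already hold, using the stable form of the dynamics, da/dt = f(x) - a:

```python
# Relax "a" toward a target instead of assigning it directly. The fixed
# point is a = target, but you pay many iterations for what a single
# assignment gives you for free.
def relax_to(target, a=0.0, rate=0.5, steps=50):
    for _ in range(steps):
        a += rate * (target - a)   # discretised da/dt = f(x) - a
    return a

fx = 25.0               # pretend f(x) has already been computed
print(relax_to(fx))     # converges to ~25.0 after the relaxation
```

The error shrinks geometrically (by a factor of 1 - rate per step), so convergence is fast but never free - which is the sense in which the dynamical route cannot beat the direct assignment.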

In the case of backpropagation, "a" corresponds to the backpropagated errors, and the dynamical update corresponds to the recursive equations that define backpropagation - i.e. we are assigning "a" the value of dL/dz for a loss L (it's a little more than this, but I'm drunk, so I'll leave the rest to you to discern). If you look at the equations closely, you find that it basically cannot be any more efficient than backpropagation, because the error information still has to propagate backwards, albeit indirectly. I would check out this paper by Robert Rosenbaum, which I think is quite fantastic if you want the nitty-gritty details, and which deflates a lot of the connections espoused between the two lines of work, particularly from a practical perspective.

I don't mean to be dismissive of the work of Millidge and co.! Indeed, I think the original 2017 paper by Whittington and Bogacz was extremely interesting and a true nugget of insight (in terms of how PC with certain variance relationships between layers can approximate backprop - something which makes complete sense when you think about it), but the flurry of subsequent work capitalising on this subtle relationship has been, in my honest opinion, very misleading.

Also, I would not take any of what I've said as a dismissal of predictive coding in general. PC for generative modelling (in the brain) is extremely interesting, and may yet prove promising.


maizeq t1_iujf76j wrote

The sampling method used with diffusion/score-based models is in fact a type of approximate MCMC. As another commenter mentioned, it is the result of discretising (hence "approximate") an SDE whose equilibrium distribution is the data distribution under the model (the distribution whose score - the gradient of the log probability - the network learns).

The advantage of Langevin-type samplers over a method like Metropolis-Hastings is efficiency (lower mixing time), because the gradient term reduces random-walk behaviour. They also scale better to high dimensionality.

What made modern diffusion/score-based models successful was combining this with a schedule of additive noise and conditioning the score model on the scale of that noise (the time-step). This solved various problems with the traditional score-matching objective (such as poor performance in low-density regions).
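As a toy illustration of annealed Langevin sampling (my own sketch; the step schedule is an assumption, loosely modelled on the usual eps * sigma^2 scaling): here the score of a standard normal is known analytically (score(x) = -x), whereas a diffusion model would supply a learned, noise-level-conditioned score network.

```python
import numpy as np

def annealed_langevin(score, x0, sigmas, steps_per_level=100, eps=0.01, seed=0):
    """Langevin dynamics run at a decreasing schedule of noise scales.
    `score(x, sigma)` plays the role of the (noise-conditioned) score model."""
    rng = np.random.default_rng(seed)
    x = x0
    for sigma in sigmas:                   # anneal noise scale high -> low
        step = eps * sigma ** 2
        for _ in range(steps_per_level):
            x = x + step * score(x, sigma) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

sigmas = np.geomspace(5.0, 0.1, num=10)
# Analytic score of N(0, 1); start far from the target at x = 8.
samples = annealed_langevin(lambda x, s: -x, np.full(5000, 8.0), sigmas)
print(samples.mean(), samples.std())       # approximately 0 and 1
```

The early, large-noise levels pull the samples out of the bad initialisation quickly (the random-walk suppression mentioned above), while the later, small-noise levels refine them - the role the noise schedule plays in the real models.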


maizeq t1_is57j81 wrote

Transformers aren't my field of expertise, so I don't know if this has been done before, but hah - neat derivation!

Though I would expect there to be no difference in loss in that case. Was the difference positive or negative? And do you think it can be chalked up to numerical-precision errors that accumulate with two matrix multiplications vs one? An easy test would be to compare K' and Wq(XWk)ᵀ and see how close they are throughout training for a particular sample.
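A sketch of that kind of check (assumptions mine - I don't have the parent derivation, so I compare the two algebraically identical ways of computing the attention logits, QKᵀ = (XWq)(XWk)ᵀ versus the merged matrix X(WqWkᵀ)Xᵀ, and measure how far they drift in float32):

```python
import numpy as np

rng = np.random.default_rng(0)
d, dk, n = 64, 64, 128
X = rng.normal(size=(n, d)).astype(np.float32)
Wq = rng.normal(size=(d, dk)).astype(np.float32)
Wk = rng.normal(size=(d, dk)).astype(np.float32)

two_matmuls = (X @ Wq) @ (X @ Wk).T    # standard Q K^T
merged = X @ (Wq @ Wk.T) @ X.T         # single merged matrix W = Wq Wk^T

# Relative discrepancy purely from float32 rounding / operation order.
err = np.abs(two_matmuls - merged).max() / np.abs(two_matmuls).max()
print(f"max relative discrepancy: {err:.2e}")
```

If the loss gap between the two parameterisations is on the order of this discrepancy, numerical precision is a plausible culprit; if it is much larger, something else (e.g. optimisation dynamics of the factored vs merged parameterisation) is going on.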