Submitted by starstruckmon t3_1027geh in MachineLearning

Paper : https://arxiv.org/abs/2301.00774

Abstract :

>We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
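
For readers unfamiliar with the 2:4 / 4:8 semi-structured patterns the abstract mentions, here is a minimal sketch (my own illustration, not code from the paper) of what the 2:4 constraint looks like: in every group of 4 consecutive weights, 2 are zeroed. SparseGPT itself chooses which weights to drop using a smarter saliency criterion than raw magnitude; this only shows the shape of the constraint.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in each consecutive group of 4.

    Assumes the number of weights is a multiple of 4 (illustration only).
    """
    w = weights.reshape(-1, 4).copy()
    # Indices of the 2 smallest-magnitude entries within each group of 4.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

row = np.array([0.3, -1.2, 0.05, 0.9, -0.4, 0.1, 2.0, -0.7])
print(prune_2_4(row))  # two entries zeroed in each group of four
```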

163

Comments


Taenk t1_j2sc1a2 wrote

So you need 5 RTX 3090s to run BLOOM-176B at home instead of 8.

57

bloc97 t1_j2s05hy wrote

It's curious that a 40% pruning of OPT-175 decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke but might warrant further investigation.

29

omniron t1_j2stl7w wrote

Just shows we have a huge amount to learn about how these systems actually work

22

mycall t1_j50h4l7 wrote

It's probably quite complicated. There are many DAGs that reach similar or repeating patterns, and some connections are suboptimal and thus never needed. How do you choose which to keep and which to delete?

1

learn-deeply t1_j2u53ek wrote

My unsubstantiated hypothesis: BLOOM is severely undertrained, so most neurons aren't contributing at all to the final result compared to OPT-175.

13

matth0x01 t1_j2u5rwm wrote

Sorry - What's meant by perplexity here?

3

prototypist t1_j2uskwt wrote

It's a metric comparing the model's generative probabilities / text predictions vs. the actual text.

4

matth0x01 t1_j2vxl6g wrote

Thanks! Hm, seems to be a measure of sharpness for the predicted words?

1

unkz t1_j2v9edv wrote

1

matth0x01 t1_j2vx7z4 wrote

Yes, I know the concept, but where's the connection to the pruning approach here?

2

unkz t1_j2wzgf3 wrote

Perplexity is one of the key evaluation metrics for how well a language model understands language. Pruning one model decreases perplexity (makes the model better), which is interesting.

1

matth0x01 t1_j2x49gm wrote

Thanks - I think I got it. Kind of new to me why language models use perplexity instead of log-likelihood which is a monotonic function of perplexity.

From Wikipedia it seems that perplexity is in unit "words" instead of "nats/bits", which might be more interpretable.

Are there other advantages I overlook?

1

unkz t1_j2x7ggd wrote

That’s basically it; cross-entropy (the average negative log-likelihood per token) and perplexity are related by

Perplexity = 2^(cross-entropy)

So the two main things are interpretability (perplexity is a measure of how many words the model is effectively choosing from at any point) and scale (small changes in cross-entropy result in large changes in perplexity).
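
For a concrete feel of that relationship, here is a minimal sketch with made-up token probabilities (nothing here comes from the paper). Cross-entropy is measured in nats here, so perplexity = exp(cross-entropy); that matches the 2^entropy form above when you work in bits instead.

```python
import math

# Hypothetical probabilities the model assigned to the actual next token
# at each position of a 4-token sequence (made-up numbers).
token_probs = [0.50, 0.10, 0.25, 0.05]

# Cross-entropy: average negative log-likelihood per token, in nats.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity: roughly how many equally likely words the model is
# effectively choosing between at each step.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}")
```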

1

EmmyNoetherRing t1_j2s64d8 wrote

There’s a lot of cognitive psychology research in how human brains forget things strategically, which I always found interesting. Another point of evidence that it’s computationally possible to learn how to process complex info without hanging onto everything you observed in the process of learning.

17

currentscurrents t1_j2srptn wrote

I've seen other research showing that pruning as a continual process during training can actually improve performance, which is interesting since that is what the brain does.

10

EmmyNoetherRing t1_j2ss8qe wrote

Learning is compression, sorta.

13

mycall t1_j50ibgp wrote

Not always. Imagination can be learning, which is an expansion from a steady state.

2

EmmyNoetherRing t1_j50q53i wrote

Huh, fair. Got a concrete example?

1

mycall t1_j51wahq wrote

I'm not exactly sure what it is or how it would manifest, but perhaps it is related to Emergent Abilities of Large Language Models.

2

EmmyNoetherRing t1_j51x98z wrote

> As an alternative evaluation, we measure cross-entropy loss, which is used in scaling laws for pre-training, for the six emergent BIG-Bench tasks, as detailed in Appendix A. This analysis follows the same experimental setup from BIG-Bench (2022) and affirms their conclusions for the six emergent tasks we consider. Namely, cross-entropy loss improves even for small model scales where the downstream metrics (exact match, BLEU, and accuracy) are close to random and do not improve, which shows that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. However, this analysis does not explain why downstream metrics are emergent or enable us to predict the scale at which emergence occurs. Overall, more work is needed to tease apart what enables scale to unlock emergent abilities.

Don't suppose you know what cross-entropy is?

1

mycall t1_j51xq1r wrote

Loss/cost functions are used to optimize the model during training; the objective is almost always to minimize the loss, and the lower the loss, the better the model. Cross-entropy is one of the most important loss functions. It is used to optimize classification models, and understanding cross-entropy hinges on understanding the softmax activation function.
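
As a toy illustration (my own, with made-up logits and a made-up target class), softmax plus cross-entropy for a single example might look like this; the "cross" is between the predicted distribution and the one-hot target distribution:

```python
import math

logits = [2.0, 0.5, -1.0]   # hypothetical raw scores for 3 classes
target = 0                  # hypothetical index of the correct class

# Softmax turns the logits into a probability distribution over classes.
exp_logits = [math.exp(x) for x in logits]
probs = [e / sum(exp_logits) for e in exp_logits]

# Cross-entropy loss: negative log-probability assigned to the correct
# class, i.e. comparing the predicted distribution against the
# one-hot target distribution.
loss = -math.log(probs[target])

print(f"probs: {[round(p, 3) for p in probs]}  loss: {loss:.3f}")
```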

2

EmmyNoetherRing t1_j522inn wrote

So I'm in a different flavor of data science, which means I've got the basic terminology, but not the specifics. I know what a loss function is and what entropy is. What role does "cross" play here? A cross between what?

1

EmmyNoetherRing t1_j5253a8 wrote

>Softmax activation function

OK, got it. Huh (on reviewing Wikipedia). So, to rephrase the quoted paragraph: they find that the divergence between the training and testing distributions (between the compressed versions of the training and testing data sets, in my analogy) starts decreasing smoothly as the scale of the model increases, long before the actual final task performance locks into place successfully.

Hm. That says more about task complexity (maybe a fundamental task complexity in some computability sense, which we don't have well defined for these kinds of tasks yet) than about imagination, I think. But I'm still with you on imagination being a factor, and of course the paper and the blog post both leave the cliff problem unsolved. Possibly there's a definition of imagination such that we can say degree X of it is needed to successfully complete those tasks.

1

gordonisadog t1_j2w9xpf wrote

Didn’t we already learn that with dropout, 10 years ago?

1

Purplekeyboard t1_j2s8it2 wrote

Bloom's not very good, pruned or not.

16

Taenk t1_j2sgndx wrote

Compared to what? I have been playing with it for a little bit via Petals and it performs decently, although ChatGPT certainly sets a very high bar of success. Personally, I think it is a shame that OpenAI gets exclusive access to the absolutely massive dataset of interactions with actual humans; models like BLOOM could certainly profit from having publicly accessible interactions.

3

nutpeabutter t1_j2snx76 wrote

From my personal interactions, it just gave off the vibe of having been trained on raw websites, whereas the GPT-3 models (both base and chat) felt much more natural. Something to do with having to learn too many languages?

4

C0hentheBarbarian t1_j2sl0n3 wrote

What about BLOOMZ? Isn’t it fine tuned in a similar way to GPT-3? Instruction fine tuned?

2

yahma t1_j2ssc01 wrote

I wasn't very impressed with BLOOMZ. Responses seemed short and optimized for Q/A-style output. Perhaps zero-shot and one-shot worked better than with BLOOM, but BLOOM seemed to produce better output for stories and writing in general.

I was only able to test the 6B models though, so I'm not sure how the 176B models compare.

1

thejuror8 t1_j2ruboc wrote

60% sparsity seems astounding

10

DigThatData t1_j2s71x9 wrote

I'd like to see this evaluated on more than just a single dataset

7

starstruckmon OP t1_j2sqwfb wrote

Personally, I'd like to see this tested on a Chinchilla scale model.

8

yahma t1_j2ss1ox wrote

So with pruning and 8-bit quantization, are we able to run BLOOM-176B on a single GPU yet?

4

artsybashev t1_j2suada wrote

An A100 can run about 75B parameters in 8-bit. With pruning that is doable, but it won't be quite the same perplexity.
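
A rough back-of-the-envelope sketch of that figure, assuming an 80 GB A100 and 1 byte per parameter in int8 (the numbers are my assumptions; activations, KV cache and other overhead eat into the total, which is roughly where the ~75B estimate comes from):

```python
# All numbers are assumptions for a rough estimate, not measurements.
GPU_MEMORY_GB = 80        # assumed A100 80GB variant
BYTES_PER_PARAM = 1       # int8-quantized weights

max_dense_params = GPU_MEMORY_GB * 1e9 / BYTES_PER_PARAM
print(f"dense int8 params that fit: ~{max_dense_params / 1e9:.0f}B")

# With 50% sparsity and a storage format that actually skips the zeros,
# roughly twice as many nominal parameters fit in the same memory.
sparsity = 0.5
max_sparse_params = max_dense_params / (1 - sparsity)
print(f"50%-sparse int8 params that fit: ~{max_sparse_params / 1e9:.0f}B")
```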

6

currentscurrents t1_j2trd40 wrote

If only it could run on a card that doesn't cost as much as a car.

I wonder if we will eventually hit a wall where more compute is required for further improvement, and we can only wait for GPU manufacturers. Similar to how they could never have created these language models in the 80s, no matter how clever their algorithms - they just didn't have enough compute power, memory, or the internet to use as a dataset.

5

artsybashev t1_j2v9lx2 wrote

If you believe in the singularity, at some point we reach a feedback loop where "AI" creates better methods to run the computations it uses to build better "AI". In a way that is already happening, but once that loop gets faster and more autonomous it can find a balance where development is "optimally" fast.

1

itsnotlupus t1_j2tbhzu wrote

Can you prune a pruned model? And then prune that again?

There's apparently no retraining needed here. Just loop over the matrices and shrink them (although it'd be nicer if there were a code repo to actually see that in action).

I get that each successive pruning is going to make things increasingly worse, but I'm wondering if this might mean you can take an OPT-175B model and shrink it down in size to fit on commodity hardware like OPT-6.7B while still being closer in performance to the larger initial model than to the natively smaller model.
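
For illustration only, a naive layer-wise magnitude-pruning loop shows the "loop over the matrices and mask the smallest entries" idea, including pruning an already-pruned model. Note this is not the SparseGPT algorithm, which uses approximate second-order information and adjusts the remaining weights; it is just a sketch of repeated pruning.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(weight), sparsity)
    return weight * (np.abs(weight) >= threshold)

# Hypothetical "model": a dict of layer name -> weight matrix.
rng = np.random.default_rng(0)
model = {"layer1": rng.normal(size=(8, 8)), "layer2": rng.normal(size=(8, 8))}

# Prune once to 50% sparsity, then prune the result again to 75% overall.
pruned_once = {name: magnitude_prune(w, 0.50) for name, w in model.items()}
pruned_twice = {name: magnitude_prune(w, 0.75) for name, w in pruned_once.items()}

for name, w in pruned_twice.items():
    print(name, "fraction zero:", np.mean(w == 0))
```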

2

cdsmith t1_j2uzks4 wrote

The idea is that there's an inflection point: at first you are mainly removing (masking with zeros) dimensions whose values are extremely small anyway and don't make much difference to the output, so you don't lose much accuracy. But after you've removed those dimensions, the remaining ones are specifically the ones that do matter, so you can't just go find more non-impactful dimensions again. They are already gone.

As far as what would happen if you over-pruned a model trained on a large number of parameters, I'd naively expect it to do much worse. If you train on more parameters and then zero out significant weights, then not only do you have a lower-dimensional space to model in (which is unavoidable), but you also lose out on the information that was correlated with the dimensions you've captured, because at training time your model relied on the parameters you have now zeroed out to capture that information.

4

visarga t1_j2yvpjs wrote

Recent papers showed that even small models under 10B parameters can benefit from training on multi-task data; learning to solve a large number of tasks works even when the model is not over 60B.

But no model comes even close to 50% of GPT-3's scores, not counting closed models.

1

drooobie t1_j2tgxnh wrote

It's probably approximately idempotent.

1

shawdys t1_j3xfbqn wrote

Has the SparseGPT code been published anywhere? I tried to look but couldn't find it.

2

johnrachwan t1_j4zz9vw wrote

I'm curious if results improve with some slight retraining

1

starstruckmon OP t1_j501y7y wrote

From the paper

>One natural avenue for future work would be to investigate fine-tuning mechanisms for such large-scale models, which would allow further accuracy recovery. We conjecture that this should be possible, and that probably at least 80-90% sparsity can be achieved with progressive pruning and fine-tuning.

So, that comes next. Though I doubt the 80-90% guesstimate.

1

mycall t1_j51zz0w wrote

I wonder how pruning to this level of sparsity affects the emergent abilities that appear as parameters scale.

1

Sylv__ t1_j2sukvs wrote

What's the TL;DR of the novelty here?

−1

chimp73 t1_j2vw251 wrote

I made a summary of the related work section with some help from ChatGPT:

> Pruning has been applied to smaller models, but has not been studied in large models like GPT with over 10 billion parameters. Previous pruning methods have required retraining the model after pruning, which is time-consuming and resource-intensive for large models like GPT. SparseGPT has been developed for pruning large GPT models without retraining. There has been significant research on post-training methods for quantizing GPT-scale models, which involve reducing the precision of the weights and activations in the model to reduce memory and computational requirements. The SparseGPT method can be used in conjunction with these quantization methods to further compress the model.

3