Submitted by starstruckmon t3_1027geh in MachineLearning

Paper : https://arxiv.org/abs/2301.00774

Abstract :

>We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
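For a concrete sense of what "pruned to 50% sparsity in one-shot, without any retraining" means at the tensor level, here is a minimal PyTorch sketch using plain magnitude pruning. This is not the SparseGPT algorithm itself (which solves a layer-wise weight-reconstruction problem using approximate second-order information); it only illustrates the end result of masking half of a layer's weights.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix.

    Not SparseGPT: just a minimal illustration of what "50% one-shot
    sparsity, no retraining" looks like for a single layer.
    """
    k = int(weight.numel() * sparsity)              # target number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold                 # keep only the larger weights
    return weight * mask

# Example: prune a random 4096x4096 layer to ~50% sparsity.
w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, 0.5)
print(f"sparsity: {(w_sparse == 0).float().mean().item():.2%}")
```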

163

Comments


bloc97 t1_j2s05hy wrote

It's curious that a 40% pruning of OPT-175B decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke, but it might warrant further investigation.

29

EmmyNoetherRing t1_j2s64d8 wrote

There’s a lot of cognitive psychology research on how human brains forget things strategically, which I've always found interesting. It's another point of evidence that it’s computationally possible to learn how to process complex info without hanging onto everything you observed in the process of learning.

17

DigThatData t1_j2s71x9 wrote

I'd like to see this evaluated on more than just a single dataset

7

Taenk t1_j2sc1a2 wrote

So you need 5 RTX 3090s to run BLOOM-176B at home instead of 8.
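Back-of-the-envelope arithmetic behind that kind of estimate, assuming 8-bit weights, 24 GB per RTX 3090, roughly half the weights kept after pruning, and some allowance for sparse-index metadata (activation memory and the KV cache are ignored here):

```python
import math

params = 176e9            # BLOOM-176B parameter count
bytes_per_weight = 1      # assuming 8-bit quantized weights
gpu_mem = 24e9            # RTX 3090: 24 GB of VRAM

dense_bytes = params * bytes_per_weight          # ~176 GB
sparse_bytes = dense_bytes * 0.5 * 1.25          # keep ~50% of weights, ~25% index overhead

print(math.ceil(dense_bytes / gpu_mem))          # 8 cards for the dense model
print(math.ceil(sparse_bytes / gpu_mem))         # 5 cards for the pruned model
```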

57

Taenk t1_j2sgndx wrote

Compared to what? I have been playing with it for a little bit via Petals and it performs decently, although ChatGPT certainly sets a very high bar of success. Personally, I think it is a shame that OpenAI gets exclusive access to the absolutely massive dataset of interactions with actual humans; models like BLOOM could certainly profit from having publicly accessible interactions.

3

nutpeabutter t1_j2snx76 wrote

From my personal interactions it just gave off this vibe of being trained on websites, unlike the GPT-3 models (both base and chat), which felt much more natural. Maybe something to do with having to learn too many languages?

4

yahma t1_j2ss1ox wrote

So with pruning and 8-bit quantization, are we able to run BLOOM-176B on a single GPU yet?

4

yahma t1_j2ssc01 wrote

I wasn't very impressed with BLOOMZ. Responses seem short and optimized for Q&A-style output. Perhaps its zero-shot and one-shot performance is better than BLOOM's, but BLOOM seemed to produce better output for stories or writing in general.

I was only able to test the 6B models though, so not sure how the 176B models compare.

1

Sylv__ t1_j2sukvs wrote

What's the TL;DR of the novelty here?

−1

itsnotlupus t1_j2tbhzu wrote

Can you prune a pruned model? And then prune that again?

There's apparently no retraining needed here. Just loop over the matrices and shrink them (although it'd be nicer if there were a code repo to actually see that in action).

I get that each successive pruning is going to make things increasingly worse, but I'm wondering if this means you could take an OPT-175B model and shrink it down to roughly the size of OPT-6.7B so it fits on commodity hardware, while still being closer in performance to the larger initial model than to the natively smaller one.
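No official code is linked in the thread, but a minimal sketch of "loop over the matrices and shrink them", applied repeatedly, might look like the following. It uses plain magnitude pruning as a stand-in for SparseGPT's layer-wise reconstruction; in practice you would re-measure perplexity after every pass to see how quickly quality degrades.

```python
import torch

def prune_step(weight: torch.Tensor, fraction: float = 0.5) -> torch.Tensor:
    """Zero out `fraction` of the remaining nonzero weights by magnitude
    (a simple stand-in for a real pruning criterion like SparseGPT's)."""
    nonzero = weight[weight != 0].abs()
    k = int(nonzero.numel() * fraction)
    if k == 0:
        return weight
    threshold = nonzero.kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Repeated pruning compounds: density goes 100% -> 50% -> 25% -> 12.5% ...
w = torch.randn(2048, 2048)
for step in range(3):
    w = prune_step(w, 0.5)
    density = (w != 0).float().mean().item()
    print(f"after pass {step + 1}: {density:.1%} of weights remain")
```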

2

currentscurrents t1_j2trd40 wrote

If only it could run on a card that doesn't cost as much as a car.

I wonder if we will eventually hit a wall where more compute is required for further improvement and we can only wait for GPU manufacturers. It would be similar to how they could never have created these language models in the 80s, no matter how clever their algorithms were: they just didn't have enough compute power, memory, or the internet to use as a dataset.

5

cdsmith t1_j2uzks4 wrote

The idea is that there's an inflection point: at first you are mainly removing (masking with zeros) dimensions whose values are extremely small anyway and don't make much difference in the response, so you don't lose much accuracy. But after you've removed those dimensions, the remaining ones are specifically the ones that do matter, so you can't just go find more non-impactful dimensions again. They are already gone.

As for what would happen if you over-pruned a model trained with a large number of parameters, I'd naively expect it to do much worse. If you train on more parameters and then zero out significant weights, then not only do you have a lower-dimensional space to model in (which is unavoidable), but you also lose the information that was encoded in the dimensions you removed, because at training time the model relied on the parameters you have now zeroed out to capture it.

4

artsybashev t1_j2v9lx2 wrote

If you believe in the singularity, at some point we reach a loop where "AI" creates better methods to run the calculations it then uses to build better "AI". In a way that is already happening, but once that loop gets faster and more autonomous it can settle at a point where development is "optimally" fast.

1

chimp73 t1_j2vw251 wrote

I made a summary of the related work section with some help from ChatGPT:

> Pruning has been applied to smaller models, but has not been studied in large models like GPT with over 10 billion parameters. Previous pruning methods have required retraining the model after pruning, which is time-consuming and resource-intensive for large models like GPT. SparseGPT has been developed for pruning large GPT models without retraining. There has been significant research on post-training methods for quantizing GPT-scale models, which involve reducing the precision of the weights and activations in the model to reduce memory and computational requirements. The SparseGPT method can be used in conjunction with these quantization methods to further compress the model.

3

unkz t1_j2wzgf3 wrote

Perplexity is one of the key evaluation metrics for how well a language model understands language. Pruning one model decreases perplexity (makes the model better), which is interesting.

1

matth0x01 t1_j2x49gm wrote

Thanks, I think I got it. It's kind of new to me that language models use perplexity instead of log-likelihood, which is a monotonic function of perplexity.

From Wikipedia it seems that perplexity is in units of "words" instead of nats/bits, which might be more interpretable.

Are there other advantages I'm overlooking?

1

unkz t1_j2x7ggd wrote

That’s basically it. Cross-entropy (the average per-token negative log-likelihood) and perplexity are related by

Perplexity = 2^(cross-entropy), with cross-entropy measured in bits.

So the two main things are interpretability (perplexity is a measure of how many words the model is choosing from at any point) and scale (small changes in cross-entropy result in large changes in perplexity).
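A quick worked example of both points (the base depends on whether the loss is measured in bits or nats; most frameworks report nats, in which case perplexity = exp(loss)):

```python
import math

ce_nats = 2.3                        # a hypothetical per-token cross-entropy in nats
ppl = math.exp(ce_nats)              # ~9.97: the model is "choosing among ~10 words"
print(ppl)

# Scale: a small change in cross-entropy is a large change in perplexity.
print(math.exp(2.3), math.exp(2.4))  # ~9.97 vs ~11.02
```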

1

visarga t1_j2yvpjs wrote

Recent papers showed that even small models under 10B parameters can benefit from training on multi-task data. Learning to solve a large number of tasks works even when the model isn't over 60B.

But no model even comes close to 50% of GPT-3's scores, not counting closed models.

1

shawdys t1_j3xfbqn wrote

Has the SparseGPT code been published anywhere? I tried to look but couldn't find it.

2

johnrachwan t1_j4zz9vw wrote

I'm curious if results improve with some slight retraining

1

starstruckmon OP t1_j501y7y wrote

From the paper:

>One natural avenue for future work would be to investigate fine-tuning mechanisms for such large-scale models, which would allow further accuracy recovery. We conjecture that this should be possible, and that probably at least 80-90% sparsity can be achieved with progressive pruning and fine-tuning.

So, that comes next. Though I doubt the 80-90% guesstimate.

1

mycall t1_j50h4l7 wrote

It's definitely complicated. There are many DAGs that reach similar or repeating patterns, or connections that are suboptimal and thus never needed. How do you choose which to keep and which to delete?

1

EmmyNoetherRing t1_j51x98z wrote

> As an alternative evaluation, we measure cross-entropy loss, which is used in scaling laws for pre-training, for the six emergent BIG-Bench tasks, as detailed in Appendix A. This analysis follows the same experimental setup from BIG-Bench (2022) and affirms their conclusions for the six emergent tasks we consider. Namely, cross-entropy loss improves even for small model scales where the downstream metrics (exact match, BLEU, and accuracy) are close to random and do not improve, which shows that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. However, this analysis does not explain why downstream metrics are emergent or enable us to predict the scale at which emergence occurs. Overall, more work is needed to tease apart what enables scale to unlock emergent abilities.

Don't suppose you know what cross-entropy is?

1

mycall t1_j51xq1r wrote

Loss/cost functions are used to optimize the model during training. The objective is almost always to minimize the loss function: the lower the loss, the better the model. Cross-entropy is one of the most important cost functions and is used to optimize classification models. Understanding cross-entropy hinges on understanding the softmax activation function.
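A tiny PyTorch illustration of that relationship: softmax turns the model's raw logits into a probability distribution, and cross-entropy is the negative log of the probability assigned to the correct class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw outputs for 3 classes, one sample
target = torch.tensor([0])                   # index of the true class

probs = F.softmax(logits, dim=-1)            # e.g. ~[0.79, 0.18, 0.04]
loss = F.cross_entropy(logits, target)       # equals -log(probs[0, 0])

print(probs)
print(loss.item(), -torch.log(probs[0, 0]).item())  # the two values match
```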

2

mycall t1_j51zz0w wrote

I wonder how pruning to this level of sparsity affects the emergent abilities that come with scaling parameters.

1

EmmyNoetherRing t1_j522inn wrote

So I'm in a different flavor of data science, which means I've got the basic terminology, but not the specifics. I know what a loss function is and what entropy is. What role does "cross" play here? A cross between what?

1

EmmyNoetherRing t1_j5253a8 wrote

>Softmax activation function

Ok, got it. Huh (on reviewing Wikipedia). So to rephrase the quoted paragraph: they find that the divergence between the training and testing distributions (between the compressed versions of the training and testing data sets, in my analogy) starts decreasing smoothly as the scale of the model increases, long before the actual final task performance locks into place successfully.

Hm. That says something more about task complexity (maybe, in some computability sense, a fundamental task complexity that we don't have well defined for those types of tasks yet?) rather than imagination, I think. But I'm still with you on imagination being a factor, and of course the paper and the blog post both leave the cliff problem unsolved. Possibly there's a definition of imagination such that we can say degree X of it is needed to successfully complete those tasks.

1