Submitted by KD_A t3_127pbst in MachineLearning

GitHub: https://github.com/kddubey/cappr

Docs: https://cappr.readthedocs.io/en/latest/index.html

PyPI: https://pypi.org/project/cappr/

What is CAPPr?

CAPPr is a Python package that performs zero-shot text classification by estimating the probability that a given completion comes after a given prompt. CAPPr = "Completion After Prompt Probability".

Why?

The standard zero-shot classification method using LLMs is to sample a completion given a prompt. For example, if you're classifying animals, you'd have the LLM generate text after

The biological class of a blue whale is

and then hope the LLM outputs Mammalia.

Sampling usually works well. The problem is that the string you get could be any plausible completion, not necessarily one in your list of classes. So you'll have to write custom post-processing code for every new classification task you solve.

CAPPr addresses this problem by reframing the task as a series of simple computations: for each class, estimate the probability that its completion follows the prompt, e.g.,

Pr(Mammalia | The biological class of a blue whale is)

and predict the class whose completion is most probable.
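
To make that computation concrete, here's a rough sketch of the estimate using a plain HuggingFace causal LM. This is not cappr's actual interface (see the docs linked above for that); the model choice, the leading spaces on the completions, and the average-log-probability scoring are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The biological class of a blue whale is"
completions = [" Mammalia", " Reptilia", " Aves"]  # note the leading spaces

def avg_log_prob(prompt: str, completion: str) -> float:
    """Average log-probability of the completion's tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)
    log_probs = logits[:, :-1].log_softmax(dim=-1)  # position i predicts token i+1
    # positions whose next token is a completion token
    positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    token_log_probs = [log_probs[0, i, input_ids[0, i + 1]].item() for i in positions]
    return sum(token_log_probs) / len(token_log_probs)

scores = {c: avg_log_prob(prompt, c) for c in completions}
print(max(scores, key=scores.get))  # hopefully " Mammalia"
```

Because the prediction is always one of the supplied completions, there's no free-form output to post-process.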

Read the (lengthy) motivation page of CAPPr's documentation if you're more curious.

Is it good?

I'm still trying to find out lol. I've evaluated CAPPr on a grand total of 2 datasets and a handful of examples. So if you're interested in using the cappr package, make sure to carefully evaluate it :-)

One interesting result is that on the Choice of Plausible Alternatives task, zero-shot text-curie-001 (a smaller GPT-3 model) is < 50% accurate when using sampling, but 80% accurate when using CAPPr. (Here's a link to the experiment notebook.) It would be cool to demonstrate that CAPPr squeezes more out of smaller or less heavily trained LLMs, as CAPPr's performance may depend more on raw next-token prediction ability than on instruction-following ability.

Feel free to install it and mess around, I'd be happy to hear what you think!

45

Comments


nbviewerbot t1_jef48jq wrote

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/kddubey/cappr/blob/main/demos/copa.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/kddubey/cappr/main?filepath=demos%2Fcopa.ipynb



10

PassingTumbleweed t1_jeg11bt wrote

Thanks for sharing! Can you explain the internals a bit more? How do you convert the user input into GPT prompt(s) and how do you turn the response(s) into a probability distribution?

4

KD_A OP t1_jegd2xh wrote

See my question on CrossValidated, which fully explains the method. You can just skip to the Example section in there :-)

I also did a cool little computational optimization for HuggingFace models: the prompt's keys and values are computed once and reused, so there's no repeated computation for the prompt across completions.
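
For anyone curious, here's roughly the kind of prompt caching I mean, sketched with a plain HuggingFace model. This isn't cappr's internal code; the model choice and the average-log-prob scoring are illustrative assumptions:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Run the prompt through the model once, caching its attention keys and values.
prompt_ids = tokenizer("The biological class of a blue whale is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_out = model(prompt_ids, use_cache=True)
prompt_cache = prompt_out.past_key_values
last_prompt_logits = prompt_out.logits[:, -1:]  # predicts the 1st completion token

def completion_avg_log_prob(completion: str) -> float:
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    # Copy the cache so one completion's forward pass can't mutate the shared cache.
    with torch.no_grad():
        out = model(completion_ids, past_key_values=copy.deepcopy(prompt_cache))
    # Logits predicting completion tokens 1..n-1 come from this pass; the logit
    # predicting completion token 0 comes from the prompt's last position.
    logits = torch.cat([last_prompt_logits, out.logits[:, :-1]], dim=1)
    log_probs = logits.log_softmax(dim=-1)
    token_log_probs = log_probs[0, torch.arange(completion_ids.shape[1]), completion_ids[0]]
    return token_log_probs.mean().item()

for completion in (" Mammalia", " Reptilia", " Aves"):
    print(completion, completion_avg_log_prob(completion))
```

The point is just that the prompt's forward pass happens once, no matter how many completions you score.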

4

PassingTumbleweed t1_jegfgxg wrote

If you assumed the classes are exactly one token long and equally common, then you could use the probability distribution $P(x_i \mid x_{1:i-1})$, exactly as returned by GPT APIs. Is that correct? And the rest of your work is to account for those two assumptions not being true?
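
(As a toy illustration of that single-token case, with a made-up prompt and a local HuggingFace model instead of the API:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "My favorite pet is my"
classes = [" cat", " dog", " bird"]  # each is exactly one GPT-2 token

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_probs = model(input_ids).logits[0, -1].softmax(dim=-1)

for c in classes:
    (token_id,) = tokenizer(c).input_ids  # errors out if the class isn't one token
    print(c, next_token_probs[token_id].item())
```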

2

KD_A OP t1_jeggt1k wrote

Yes, exactly. There's nothing else to it haha

I only wish the API had an interface to let you cache the prompt's keys and values. That'd save you money, and make CAPPr strictly cheaper than sampling for classification tasks.

2

PassingTumbleweed t1_jegonam wrote

Cool! I wonder if you've thought about synonyms. It seems like there might be a lot of cases where classes with more synonyms (or even cases like plurality, e.g. bird vs. birds) are at a disadvantage.

2

KD_A OP t1_jegsqe6 wrote

That's a good criticism. I'd guess that this issue is quite problem-dependent. And I'd hope that an LM is good enough to discriminate between the correct-but-many-synonyms class and the wrong-but-few-synonyms class. (We're using the word synonym, but we really mean "high-probability token path given the prompt".) It's hard for me to come up with examples where this problem arises in a real classification task. But they may be out there.

2

PassingTumbleweed t1_jegvhb5 wrote

What I was thinking is that some kind of hierarchical LLM taxonomy might be interesting, where you can re-jigger the conditional probability tree onto any arbitrary vocab of token sequences.

2

KD_A OP t1_jegxas8 wrote

Interesting, and I think I know what you mean. One naive idea is a "top-k tokens" system: for each completion, consider the top k highest-probability tokens (conditional on the previous ones) at each completion token position, and then sum the average likelihoods across all k^n paths (where n = # of completion tokens). That would be one way to address this synonym problem. But of course it results in way more computation.

Edit: actually, thinking a bit more, I think the synonym problem is more-or-less a non-issue for LMs trained to do next-token prediction.

2

PassingTumbleweed t1_jeh0p1j wrote

I'm curious to get your thoughts on a simple example where you have three classes: cat, dog, and bird. What happens if the top-1 prediction is "eagle"? Does that probability mass get discarded? Because it should probably go into the bird category.

1

KD_A OP t1_jeh0ygl wrote

Yup, it gets totally discarded. Hopefully the conditional probability of bird is higher than that of cat or dog.

2

PassingTumbleweed t1_jeh1248 wrote

One thing I've seen with these LLMs is that you can prompt them with the classes using a sort of multiple-choice style. It would be interesting to experiment with whether this can stabilize the outputs and reduce the number of out-of-vocabulary predictions you get.
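
Roughly what that multiple-choice style could look like (a made-up example, not from the package):

```python
classes = ["cat", "dog", "bird"]
letters = ["A", "B", "C"]

prompt = (
    "Which animal is an eagle most similar to?\n"
    + "\n".join(f"{letter}. {cls}" for letter, cls in zip(letters, classes))
    + "\nAnswer with a single letter.\nAnswer:"
)
print(prompt)
# The model's output is then parsed as one of A/B/C and mapped back to a class,
# instead of being free-form text like "eagle".
```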

2

planetofthemapes15 t1_jeg1iqc wrote

Cool, I had a mental model very similar to this, which I was planning to implement next week. I'll just try yours, and if I make an improvement I'll submit a PR.

4

KD_A OP t1_jeghvnn wrote

Yeah, I was surprised that this wasn't already coded up; it's been 3 years since we found out that sampling from GPT-3 is a good zero-shot text classifier.

While benchmarking this method on the infamous Winograd Schema Challenge, I ended up finding a 2018 paper^1 w/ pretty much the same idea as CAPPr. The only difference is that CAPPr typically transposes that probability, and it naively incorporates a prior.

  1. Trinh, Trieu H., and Quoc V. Le. “A simple method for commonsense reasoning.” arXiv preprint arXiv:1806.02847 (2018).
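
(By "incorporates a prior" above, I mean roughly the following scoring rule, where $\Pr(c)$ is a user-supplied prior over the classes; this is a paraphrase of the idea, not the exact formula from the docs:)

$$\text{score}(c) \;\propto\; \Pr(\text{completion}_c \mid \text{prompt}) \cdot \Pr(c)$$

with the scores normalized over the classes.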
3

nbviewerbot t1_jeghww5 wrote

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/kddubey/cappr/blob/main/demos/wsc.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/kddubey/cappr/main?filepath=demos%2Fwsc.ipynb



2

Jean-Porte t1_jeg5xpd wrote

How does this compare to HuggingFace zero-shot NLI pipelines, e.g. https://huggingface.co/sileod/deberta-v3-base-tasksource-nli ?

3

KD_A OP t1_jegfh7i wrote

Great question! I have no idea lol.

More seriously, it depends on what you mean by "compare". CAPPr w/ powerful GPT-3+ models is likely gonna be more accurate. But you need to pay to hit OpenAI endpoints, so it's not a fair comparison IMO.

If you can't pay to hit OpenAI endpoints, then a fairer comparison would be CAPPr + GPT-2 (specifically, the smallest one in HuggingFace, or whatever's closest in inference speed to something like bart-large-mnli). But another issue that pops up is that GPT-2 was not explicitly trained on the NLI/MNLI task in the way bart-large-mnli was. So I'd need to finetune GPT-2 (small) on MNLI to make the comparison fairer.

If I had a bunch of compute and time, I'd like to benchmark (or find benchmarks for) the following text classification approaches, varying the amount of training data if feasible, and ideally on tasks that are more realistic than SuperGLUE:

  • similarity embeddings
    • S-BERT
    • GPT-3+ (they claim their ada model is quite good)
  • sampling
  • MNLI-trained models
  • CAPPr
1