Submitted by bradenjh t3_z26fui in MachineLearning

This post describes a case study where several different large language models (GPT-3, FLAN, Cohere, AI21) were used to label training data for a dramatically smaller model (RoBERTa) that gets the same score on a tough benchmark task, but is 1000x cheaper to deploy. It's interesting to note that using just one of the large language models to label the training data leaves quite a few points on the table; best results come from combining their various proposed labels. So it's not just model distillation—it's classic weak supervision (combining multiple noisy sources of signal to produce higher quality labels in large quantities). Has anyone else tried something similar?
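
For anyone who wants to try it, here's a minimal sketch of the combination step (not the exact pipeline from the post): treat each LLM as one noisy labeling source over an unlabeled pool and aggregate their votes with Snorkel's LabelModel. The label matrix and hyperparameters below are toy placeholders.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Each column is one LLM's proposed label for an unlabeled example
# (-1 = that source abstained). Toy matrix of shape (n_examples, n_sources).
L_train = np.array([
    [1, 1, -1, 0],
    [0, 0, 0, -1],
    [1, -1, 1, 1],
    [0, 1, 0, 0],
])

# Learn per-source accuracies and produce denoised labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)

probs = label_model.predict_proba(L=L_train)  # soft labels for the small model
preds = label_model.predict(L=L_train)        # or hard labels
```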

23

Comments


learn-deeply t1_ixew0do wrote

tl;dr: a thinly disguised ad comparing a zero-shot model with a fine-tuned model; of course the fine-tuned model is going to be better. the lack of intellectual honesty really encourages me to try snorkel

also, /u/bradenjh, good job pretending that you have no affiliation with the company

49

bradenjh OP t1_ixeyajh wrote

Ha! If I were trying to pretend I had no affiliation, u/learn-deeply, I probably wouldn't have a username literally matching the author string of the post?

You may also want to give it another read: the GPT-3 models are fine-tuned; that's the point! (The GPT-3 zero-shot baseline that I assume you're referencing is mentioned once as a curiosity, but isn't compared against beyond that.) You can even look at the full cross-product of fine-tuning RoBERTa vs. GPT-3 on GT labels vs. weak labels. With the larger training sets (the distilled and combined set of ~60k), they score essentially identically (within 0.1 point). In other words, you simply don't need all that GPT-3 capacity; all you need is the relevant information it has for your problem.
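
For reference, the RoBERTa side is just ordinary fine-tuning on whichever labels you feed it, GT or weak. A rough Hugging Face sketch (`texts` and `weak_labels` are placeholders, and the model size and hyperparameters here are illustrative rather than our exact setup):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# texts: list[str], weak_labels: list[int] -- placeholders for the weakly
# labeled training set produced by combining the LLM votes.
dataset = Dataset.from_dict({"text": texts, "label": weak_labels})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-weak-labels",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```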

3

bradenjh OP t1_ixf191u wrote

Microsoft had a paper a few months back that was pretty good and quite relevant. They also reported seeing smaller models outperform larger ones post-distillation:

"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."

Not the same approach of combining multiple sources, but a similar flavor.
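
If it helps, the self-training flavor they describe amounts to something like this sketch (all names are placeholders; the teacher's soft predictions stand in for GPT-3's labels, and the confidence filter is one common variant rather than their exact recipe):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# teacher_probs: (n, n_classes) soft predictions from the big model on an
# unlabeled pool; unlabeled_texts: the matching inputs. Both are placeholders.
confident = teacher_probs.max(axis=1) >= 0.9           # keep confident pseudo-labels
pseudo_labels = teacher_probs[confident].argmax(axis=1)

# A small "in-house" student trained only on the teacher's pseudo-labels.
features = TfidfVectorizer().fit_transform(np.asarray(unlabeled_texts)[confident])
student = LogisticRegression(max_iter=1000).fit(features, pseudo_labels)
```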

4

Acceptable-Cress-374 t1_ixg7d43 wrote

TBF, the article is pretty SEO-y and leans heavily on bolded words that repeat throughout.

The research part is top-notch, tho, and opens up a lot of avenues for further training based on the (unusable at the amateur level) LLMs available now. Great work and thanks for sharing!

17

visarga t1_ixggjfm wrote

> Has anyone else tried something similar?

Trying it right now, but instead of using GPT-3 I'm splitting the data cross-validation-style and training ensembles of models. Ensemble disagreement ≈ error rate.
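
Roughly this sketch (X, y, and X_unlabeled are placeholder numpy arrays with integer class labels; any base classifier works):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

# Train one ensemble member per CV-style split of the labeled data (X, y),
# then use disagreement on the unlabeled pool as a rough error estimate.
members = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    members.append(GradientBoostingClassifier().fit(X[train_idx], y[train_idx]))

preds = np.stack([m.predict(X_unlabeled) for m in members])    # (n_models, n_examples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
disagreement = (preds != majority).mean()                      # ~ error rate
```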

2

_Arsenie_Boca_ t1_ixgjhjp wrote

As interesting as weak supervision is, the main takeaway is that using LLM few-shot predictions as labels to train a small model is a great way to save labeling costs. Using snorkel on top means you have to query multiple LLMs and carry snorkel as additional complexity, yielding only a few extra points. Perhaps those extra points could also have been achieved by letting the LLM label a few more samples, or by giving it a few more shots to get better labels.
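
i.e. something along these lines, where the few-shot prompt itself does the labeling and you just parse the completion (the task, examples, and llm_complete call are all made up; the real call depends on whichever LLM API you use):

```python
# Hypothetical few-shot labeling prompt; swap in your own task and examples.
FEW_SHOT_TEMPLATE = """Label each movie review as positive or negative.

Review: The plot dragged and the acting was wooden.
Label: negative

Review: A sharp, funny script with a terrific cast.
Label: positive

Review: {text}
Label:"""

def build_prompt(text: str) -> str:
    return FEW_SHOT_TEMPLATE.format(text=text)

# Pseudocode for the labeling loop -- llm_complete() is whatever API you call:
# labels = [llm_complete(build_prompt(t)).strip() for t in unlabeled_texts]
```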

2

farmingvillein t1_ixoyqey wrote

What is "GT"? The article does not appear to define it.

1