Submitted by bradenjh t3_z26fui in MachineLearning
This post describes a case study where several large language models (GPT-3, FLAN, Cohere, AI21) were used to label training data for a dramatically smaller model (RoBERTa) that matches their score on a tough benchmark task while being 1000x cheaper to deploy. Interestingly, using just one of the large language models to label the training data leaves quite a few points on the table; the best results come from combining their various proposed labels. So it's not just model distillation: it's classic weak supervision (combining multiple noisy sources of signal to produce higher-quality labels in large quantities). Has anyone else tried something similar?
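To make the weak-supervision step concrete, here's a rough sketch using Snorkel's LabelModel. The label matrix values, cardinality, and training settings below are toy placeholders for illustration, not the actual setup from the case study:

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy label matrix: one row per unlabeled example, one column per
# labeling source (imagine the columns came from prompting GPT-3,
# FLAN, Cohere, and AI21). Entries are class ids; -1 means that
# source abstained. These values are made up purely for illustration.
L = np.array([
    [ 1,  1, -1,  0],
    [ 0,  0,  0, -1],
    [ 1, -1,  1,  1],
    [ 0,  1,  0,  0],
    [ 1,  1,  1, -1],
    [ 0, -1,  0,  0],
])

# Fit a generative label model that estimates each source's accuracy
# from how the sources agree and disagree, then combines the noisy
# votes into a single label per example.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=42)

# Probabilistic labels you could use to train the small end model
# (e.g. RoBERTa) instead of trusting any single LLM's labels.
probs = label_model.predict_proba(L)
preds = label_model.predict(L)
print(preds)
```

The advantage over a simple majority vote is that each source's accuracy is estimated from the overlaps, so a consistently better labeler gets more weight when the sources disagree.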
learn-deeply t1_ixew0do wrote
tl;dr: a thinly disguised ad comparing a zero-shot model with a fine-tuned model; of course the fine-tuned model is going to be better. the lack of intellectual honesty really discourages me from trying snorkel
also, /u/bradenjh, good job pretending that you have no affiliation with the company