Submitted by bradenjh t3_z26fui in MachineLearning
This post describes a case study where several large language models (GPT-3, FLAN, Cohere, AI21) were used to label training data for a dramatically smaller model (RoBERTa) that matches their score on a tough benchmark task while being 1000x cheaper to deploy. Interestingly, using just one of the large language models to label the training data leaves quite a few points on the table; the best results come from combining their various proposed labels. So it's not just model distillation: it's classic weak supervision (combining multiple noisy sources of signal to produce higher-quality labels in large quantities). Has anyone else tried something similar?
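To make the weak-supervision step concrete, here's a rough sketch using Snorkel's LabelModel. The label matrix values, cardinality, and training settings below are toy placeholders for illustration, not the actual setup from the case study:

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy label matrix: one row per unlabeled example, one column per
# labeling source (imagine the columns came from prompting GPT-3,
# FLAN, Cohere, and AI21). Entries are class ids; -1 means that
# source abstained. These values are made up purely for illustration.
L = np.array([
    [ 1,  1, -1,  0],
    [ 0,  0,  0, -1],
    [ 1, -1,  1,  1],
    [ 0,  1,  0,  0],
    [ 1,  1,  1, -1],
    [ 0, -1,  0,  0],
])

# Fit a generative label model that estimates each source's accuracy
# from how the sources agree and disagree, then combines the noisy
# votes into a single label per example.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=42)

# Probabilistic labels you could use to train the small end model
# (e.g. RoBERTa) instead of trusting any single LLM's labels.
probs = label_model.predict_proba(L)
preds = label_model.predict(L)
print(preds)
```

The advantage over a simple majority vote is that each source's accuracy is estimated from the overlaps, so a consistently better labeler gets more weight when the sources disagree.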
learn-deeply t1_ixew0do wrote
tl;dr: a thinly disguised ad comparing a zero-shot model with a fine-tuned model; of course the fine-tuned model is going to be better. the lack of intellectual honesty really discourages me from trying snorkel
also, /u/bradenjh, good job pretending that you have no affiliation with the company