Submitted by bradenjh t3_z26fui in MachineLearning
bradenjh OP t1_ixf191u wrote
Microsoft had a paper a few months back that was pretty good and quite relevant here. They also reported seeing smaller models outperform larger ones post-distillation:
"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."
Not the same approach of combining multiple sources, but a similar flavor.
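The self-training effect the quote describes can be sketched in a toy setup: a noisy "teacher" (standing in for GPT-3) pseudo-labels an unlabeled pool, a small "in-house" student is trained only on those pseudo-labels, and because the student averages over many noisy labels it can end up more accurate than the teacher itself. Everything here (the synthetic data, the `teacher_label` noise model, the tiny logistic-regression student) is illustrative, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_label(x):
    # Ground-truth concept: which side of the line x1 + x2 = 0 a point is on
    return (x[:, 0] + x[:, 1] > 0).astype(int)

def teacher_label(x):
    # Hypothetical noisy teacher (stand-in for GPT-3): mostly correct,
    # but flips labels near the decision boundary
    return (x[:, 0] + x[:, 1] + rng.normal(0, 0.3, len(x)) > 0).astype(int)

# Unlabeled pool, pseudo-labeled by the teacher
X_unlabeled = rng.normal(size=(2000, 2))
y_pseudo = teacher_label(X_unlabeled)

def train_logreg(X, y, lr=0.1, epochs=300):
    # Plain gradient descent on logistic loss; the "in-house" student
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

w, b = train_logreg(X_unlabeled, y_pseudo)

# Compare teacher and student against clean ground truth on held-out data.
# The student, trained purely on noisy pseudo-labels, can denoise them.
X_test = rng.normal(size=(1000, 2))
y_true = true_label(X_test)
student_acc = (((X_test @ w + b) > 0).astype(int) == y_true).mean()
teacher_acc = (teacher_label(X_test) == y_true).mean()
print(f"teacher acc: {teacher_acc:.3f}  student acc: {student_acc:.3f}")
```

The student typically scores above the teacher here because the label noise is independent per example, so fitting a smooth model over 2000 pseudo-labels acts like the regularization the quote mentions.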
ayse_ww t1_ixgbva3 wrote
This is quite interesting. Is such a self-training scheme similar to a recurrent network?