
bradenjh OP t1_ixf191u wrote

Microsoft had a relevant paper a few months back that was pretty good. They also reported seeing smaller models outperform larger ones post-distillation:

"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."

Not the same approach as combining multiple sources, but a similar flavor. Roughly, the recipe is: pseudo-label in-domain unlabeled data with the big model, then train a small in-house model on those labels; a minimal sketch of that loop, with a hypothetical `query_teacher` standing in for the GPT-3 call and a toy scikit-learn classifier as the student, is below.
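
    # Minimal sketch of the GPT-3-as-labeler distillation flavor quoted above.
    # Hypothetical names throughout: `query_teacher` stands in for a call to a
    # large teacher model (e.g. GPT-3); the "student" is a small in-house model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def query_teacher(texts):
        """Placeholder for the expensive teacher (e.g. a GPT-3 API call).
        Returns a toy label so the sketch runs end to end."""
        return ["positive" if "good" in t else "negative" for t in texts]

    # 1. Pseudo-label a pool of unlabeled, in-domain text with the teacher.
    unlabeled_texts = [
        "the product is good and arrived quickly",
        "terrible support, would not buy again",
        "good value for the price",
        "the device stopped working after a week",
    ]
    pseudo_labels = query_teacher(unlabeled_texts)

    # 2. Train a small student on the teacher's labels. The teacher's (noisy)
    #    predictions act as a self-training signal, which is the effect the
    #    quoted paper attributes the regularization benefit to.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(unlabeled_texts)
    student = LogisticRegression(max_iter=1000)
    student.fit(X, pseudo_labels)

    # 3. The cheap student is what gets deployed and, per the paper, can
    #    sometimes beat the raw teacher on the target distribution.
    print(student.predict(vectorizer.transform(["good battery life"])))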


ayse_ww t1_ixgbva3 wrote

This is quite interesting. Is such a self-training scheme similar to a recurrent network?
