
bradenjh OP t1_ixf191u wrote

Microsoft had a relevant paper a few months back that was pretty good. They also reported seeing smaller models outperform larger ones post-distillation:

"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."

Not the same approach as combining multiple sources, but a similar flavor. Roughly, the recipe is: pseudo-label in-domain unlabeled data with the big model, then train a small in-house model on those labels; a minimal sketch of that loop, with a hypothetical `query_teacher` standing in for the GPT-3 call and a toy scikit-learn classifier as the student, is below.
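
    # Minimal sketch of the GPT-3-as-labeler distillation flavor quoted above.
    # Hypothetical names throughout: `query_teacher` stands in for a call to a
    # large teacher model (e.g. GPT-3); the "student" is a small in-house model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def query_teacher(texts):
        """Placeholder for the expensive teacher (e.g. a GPT-3 API call).
        Returns a toy label so the sketch runs end to end."""
        return ["positive" if "good" in t else "negative" for t in texts]

    # 1. Pseudo-label a pool of unlabeled, in-domain text with the teacher.
    unlabeled_texts = [
        "the product is good and arrived quickly",
        "terrible support, would not buy again",
        "good value for the price",
        "the device stopped working after a week",
    ]
    pseudo_labels = query_teacher(unlabeled_texts)

    # 2. Train a small student on the teacher's labels. The teacher's (noisy)
    #    predictions act as a self-training signal, which is the effect the
    #    quoted paper attributes the regularization benefit to.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(unlabeled_texts)
    student = LogisticRegression(max_iter=1000)
    student.fit(X, pseudo_labels)

    # 3. The cheap student is what gets deployed and, per the paper, can
    #    sometimes beat the raw teacher on the target distribution.
    print(student.predict(vectorizer.transform(["good battery life"])))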


ayse_ww t1_ixgbva3 wrote

This is quite interesting. Is such a self-training scheme similar to a recurrent network?
