say_wot_again t1_itn823q wrote

From the abstract, it seems very similar to common self-supervised techniques in computer vision. The difference is that in vision SSL, you use the model's confident outputs on normal data to improve its performance on heavily augmented data, whereas here you use the model's outputs on "chain of thought" prompts to improve its performance on normal prompts. But either way, the principle of "use the model's high-confidence outputs on easy examples to train it on hard examples" stays the same. It's always cool to see this sort of cross-pollination between vision and NLP, though the title seems designed to conjure up images of Westworld or Ex Machina.

Edit: it appears one massive difference is that in vision, the augmentations come from the modeler, whereas here the chains of thought actually come from the model's own outputs. So it's leveraging the inherent randomness in LLM outputs to generate new training data, relying on the idea that answers that appear frequently across sampled outputs are more likely to be correct. This IS pretty cool, and meaningfully different from the vision SSL case insofar as it requires much less manual intervention.
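To make the "frequent answers are likelier to be correct" idea concrete, here's a rough sketch of the filtering step as I understand it from the abstract. This is not the paper's code: the function name, the `min_agreement` threshold, and the `(reasoning, answer)` tuple format are all my own assumptions for illustration.

```python
from collections import Counter

def select_self_training_examples(samples, min_agreement=0.5):
    """Keep the reasoning chains that agree with the majority answer.

    `samples` is a list of (reasoning, answer) pairs sampled from the
    model for ONE question. The name, the tuple layout, and the
    `min_agreement` cutoff are hypothetical, not taken from the paper.
    """
    if not samples:
        return None, []
    # Count how often each final answer appears across the samples.
    counts = Counter(answer for _, answer in samples)
    top_answer, top_count = counts.most_common(1)[0]
    # Treat the agreement rate as a confidence proxy; skip the question
    # entirely if the model is too inconsistent on it.
    if top_count / len(samples) < min_agreement:
        return None, []
    # The surviving chains become new training data for this question.
    kept = [(r, a) for r, a in samples if a == top_answer]
    return top_answer, kept

samples = [
    ("3 boxes of 6 is 18", "18"),
    ("6 + 6 + 6 = 18", "18"),
    ("3 * 6 = 20", "20"),
    ("six times three is 18", "18"),
]
answer, kept = select_self_training_examples(samples)
```

Here three of the four sampled chains agree, so those three would be kept as training examples and the outlier would be dropped, all without any human labels.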

61

DeezNUTSampler t1_itq1l2d wrote

Can you link works in Computer Vision SSL which incorporate this principle of “use the model’s high-confidence outputs on easy examples to train it on hard examples”? It is not obvious to me how this would work. For example, in contrastive learning the objective is to learn view-invariant representations. Two views of an object, augmented differently, are pulled together in representation space by using the distance between them as the loss to minimize. Which one would constitute the easy/hard example here?
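For reference, this is roughly the kind of objective I mean, written as a minimal pure-Python sketch of a one-directional InfoNCE-style loss. The function names and the temperature value are my own choices for illustration, and real implementations would use a tensor library and usually symmetrize the loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(views_a, views_b, temperature=0.1):
    """Average InfoNCE-style loss over a batch of paired embeddings.

    views_a[i] and views_b[i] are embeddings of two augmentations of the
    same image (the positive pair); every views_b[j] with j != i acts as
    a negative. The temperature of 0.1 is an assumed, illustrative value.
    """
    losses = []
    for i, za in enumerate(views_a):
        logits = [cosine(za, zb) / temperature for zb in views_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each view matches its partner
swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners deliberately mismatched
loss_aligned = info_nce(views_a, aligned)
loss_swapped = info_nce(views_a, swapped)
```

Both views are treated symmetrically as inputs here, which is why I don't see an obvious easy/hard split in this setup.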

5

say_wot_again t1_itrmhsx wrote

Here's an example of what I had in mind. Pseudolabels for unlabeled data are generated on the clean images, but the student model is trained on a strongly augmented version of each image. It's not contrastive learning, since the objective is still explicitly object detection; instead, the easy vs. hard distinction is the original image vs. the strongly augmented one.
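In very rough code, the data flow looks something like the sketch below. It is simplified to a single class label per input rather than detection boxes, and `teacher_predict`, `strong_augment`, and the 0.9 threshold are hypothetical stand-ins I made up, not anything from the linked work.

```python
def make_pseudo_labeled_batch(unlabeled, teacher_predict, strong_augment,
                              threshold=0.9):
    """Build (hard_input, pseudo_label) pairs for training the student.

    `teacher_predict(x)` is assumed to return (label, confidence) for a
    clean input, and `strong_augment(x)` is assumed to return a heavily
    distorted copy of it. Both callables and the threshold are
    illustrative assumptions.
    """
    batch = []
    for x in unlabeled:
        # Easy case: the teacher only ever sees the clean input.
        label, confidence = teacher_predict(x)
        # Only high-confidence predictions become pseudolabels.
        if confidence >= threshold:
            # Hard case: the student gets the augmented input, but is
            # supervised with the label predicted on the clean one.
            batch.append((strong_augment(x), label))
    return batch

# Toy stand-ins: "images" are plain numbers so the sketch runs as-is.
def toy_teacher(x):
    return ("pos" if x > 0 else "neg", min(1.0, abs(x)))

def toy_augment(x):
    return x * 0.5

batch = make_pseudo_labeled_batch([2.0, 0.3, -1.5], toy_teacher, toy_augment)
```

The middle input gets dropped because the toy teacher isn't confident about it, which mirrors how low-confidence pseudolabels are filtered out before the student ever trains on them.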

3