Submitted by Lajamerr_Mittesdine t3_ycipui in MachineLearning
Paper: https://arxiv.org/abs/2210.11610
Abstract:
>Large Language Models (LLMs) have achieved excellent performance in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
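A minimal sketch of the loop the abstract describes: sample several Chain-of-Thought completions per unlabeled question, keep questions where a large fraction of sampled answers agree (self-consistency), and use the agreeing rationales as fine-tuning targets. The function names, sample count, and confidence threshold below are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

def build_self_training_set(
    questions: List[str],
    sample_rationales: Callable[[str, int], List[Tuple[str, str]]],  # question -> [(rationale, answer), ...]
    num_samples: int = 32,
    confidence_threshold: float = 0.7,
) -> List[Tuple[str, str]]:
    """Collect (question, rationale + answer) pairs whose majority answer is 'high-confidence'."""
    training_examples: List[Tuple[str, str]] = []
    for q in questions:
        samples = sample_rationales(q, num_samples)          # temperature-sampled CoT outputs
        votes = Counter(answer for _, answer in samples)      # self-consistency: vote on final answers
        majority_answer, count = votes.most_common(1)[0]
        if count / num_samples >= confidence_threshold:       # keep only "high-confidence" questions
            # every rationale that reaches the majority answer becomes a fine-tuning target
            training_examples.extend(
                (q, f"{rationale} The answer is {majority_answer}.")
                for rationale, answer in samples
                if answer == majority_answer
            )
    return training_examples
```

The resulting (question, self-generated solution) pairs would then be used as ordinary supervised fine-tuning data, with no ground-truth labels involved.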
say_wot_again t1_itn823q wrote
From the abstract, it seems very similar to common self supervised techniques in computer vision. The difference is that in the case of computer vision SSL, you use the model's confident outputs on normal data to train its performance on heavily augmented data, whereas here you use the model's performance on "chain of thought" prompts to train its performance on normal prompts. But either way, the principle of "use the model's high confidence outputs on easy examples to train it on hard examples" stays the same. It's always cool to see this sort of cross pollination between vision and NLP, though the title seems designed to conjure up images of Westworld or Ex Machina.
Edit: it appears one massive difference is that in vision, the augmentations come from the modeler, whereas here the chains of thought actually come from the model's outputs. So it's leveraging the inherent randomness in LLM outputs to generate new training data, relying on the idea that answers that appear frequently across samples are likelier to be correct. This IS pretty cool, and meaningfully different from the vision SSL case insofar as it requires much less manual intervention.
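For contrast, the vision-SSL pattern the comment refers to looks roughly like FixMatch-style pseudo-labeling, where the easy/hard split comes from modeler-chosen augmentations rather than from sampled chains of thought. This is only a sketch of that pattern; `model`, `weak_images`, `strong_images`, and the 0.95 threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fixmatch_style_loss(model, weak_images, strong_images, threshold=0.95):
    """Vision-SSL counterpart: pseudo-label from the weakly augmented view,
    then train the model on the strongly augmented view of the same images."""
    with torch.no_grad():
        probs = F.softmax(model(weak_images), dim=-1)       # confident, "easy" view
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = confidence >= threshold                       # keep only high-confidence pseudo-labels
    logits_strong = model(strong_images)                     # hard, heavily augmented view
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_example * mask.float()).mean()
```

The structural parallel is the confidence filter: in vision the high-confidence predictions on the easy (weakly augmented) view supervise the hard view, while in the LLM case the majority-voted answers from sampled chains of thought supervise fine-tuning.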