
visarga OP t1_j1ad4ku wrote

Reply to comment by Ne_Nel in Will we run out of data? by visarga

We're just beginning the journey of generating language data; until two years ago it was unthinkable. Today we have a bunch of generated datasets for math, code, multi-task tuning, and rule-based behaviour. The trick is to validate whatever is generated.
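The generate-then-validate idea can be sketched in a few lines. This is a toy example, not any team's actual pipeline: for generated code, one cheap validity check is simply whether the sample parses (real pipelines go further and run tests). The function name and sample strings are made up for illustration.

```python
import ast

def validate_python_snippet(src: str) -> bool:
    # Hypothetical filter: keep a generated code sample only if it parses.
    # Real systems also execute unit tests against the generated code.
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

generated = [
    "def add(a, b):\n    return a + b",  # parses -> kept
    "def broken(:\n    pass",            # syntax error -> discarded
]
kept = [s for s in generated if validate_python_snippet(s)]
print(len(kept))  # 1
```

The point is that the validator, not the generator, is what keeps the dataset from filling up with junk.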

2

visarga OP t1_j1dh3br wrote

You can easily generate junk data, but it is hard to generate quality data. Human text is diverse and interesting. In the last 1-2 years, though, many teams have been generating data - math, code, diverse prompted tasks - and not just solutions, but sometimes new problems and tests as well.

For example, it used to be necessary to label thousands of responses to tasks in order to train the human feedback model used to fine-tune GPT-3. So only OpenAI had a very good dataset, developed in-house, and for that reason GPT-3 ruled.
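For context, those labeled comparisons are typically used to train a reward model with a pairwise (Bradley-Terry style) loss: the score of the response the labeler preferred is pushed above the score of the rejected one. A minimal sketch of that loss, with plain floats standing in for model scores:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    # Shrinks as the chosen response's score exceeds the rejected one's.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ranked pair: small loss. Mis-ranked pair: large loss.
print(preference_loss(2.0, 0.5) < preference_loss(0.5, 2.0))  # True
```

Collecting enough of these human-labeled pairs is exactly the expensive step the comment is pointing at.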

But more recently, "Constitutional AI" takes a list of behavioural rules, the so-called constitution, and uses them to generate its own feedback data, achieving almost the same effect as a human-labeled feedback model. So it is automating AI alignment.
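The core loop can be sketched like this. Everything here is a simplification: `ai_preference` stands in for asking the model itself which response better follows a rule (in the toy version a trivial heuristic keeps the sketch runnable), and the rules and pairs are invented examples.

```python
RULES = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]

def ai_preference(rule: str, resp_a: str, resp_b: str) -> str:
    # Placeholder judge: a real system prompts the LLM to pick which
    # response better follows the rule. Toy stand-in: prefer the
    # shorter response so this sketch actually runs.
    return resp_a if len(resp_a) <= len(resp_b) else resp_b

def build_feedback_dataset(rules, pairs):
    # Each (prompt, resp_a, resp_b) pair yields one labeled comparison
    # per rule - feedback data with no human annotator in the loop.
    data = []
    for rule in rules:
        for prompt, a, b in pairs:
            chosen = ai_preference(rule, a, b)
            rejected = b if chosen is a else a
            data.append({"prompt": prompt, "chosen": chosen,
                         "rejected": rejected})
    return data

pairs = [("Is the stove safe to touch?",
          "No, it may be hot.",
          "Sure, touch it and find out.")]
dataset = build_feedback_dataset(RULES, pairs)
print(len(dataset))  # 2
```

The resulting comparisons are then used to train the feedback model, replacing the thousands of human labels mentioned above.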

1