visarga t1_j39xs2x wrote

No, this concept is older; it predates Google. Hinton was working on it in 1986 and Schmidhuber in the 1990s. By the way, "next token prediction" is not necessarily state of the art. The UL2 paper showed it is better to train on a mix of masked-span objectives.
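To make the difference between the two objectives concrete, here is a toy sketch in plain Python. It is not UL2's actual recipe (UL2 mixes several denoisers with different span lengths and corruption rates); the function names, the hand-picked spans, and the `<extra_id_k>` sentinel strings are all placeholders chosen for illustration.

```python
def next_token_pairs(tokens):
    """Causal LM objective: for each position, predict the next token
    from everything to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_spans(tokens, spans):
    """Span-corruption objective: each (start, length) span is replaced by a
    sentinel in the input; the target lists each sentinel followed by the
    tokens it hid. Assumes the spans do not overlap."""
    inputs, targets = [], []
    pos = 0
    for k, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{k}>"  # hypothetical sentinel naming
        inputs.extend(tokens[pos:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        pos = start + length
    inputs.extend(tokens[pos:])
    return inputs, targets


tokens = "the cat sat on the mat".split()

# Next-token prediction: one (context, target) pair per position.
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

# Masked spans: hide "cat sat" and the second "the", then reconstruct them.
inputs, targets = mask_spans(tokens, [(1, 2), (4, 1)])
print(inputs)
print(targets)
```

With next-token prediction the model only ever sees left context; with masked spans it sees both sides of each gap and has to fill in several tokens at once. A mixed objective trains on both kinds of example.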

If you follow the new papers, there are a thousand ideas floating around: how to make models learn better, how to make them smaller, how to teach the network to compose separate skills, why training on code improves reasoning skills, how to generate problem solutions as training data... We just don't know which are going to matter down the line. It takes a lot of time to try them out.

Here's a weird new idea: "StitchNet: Composing Neural Networks from Pre-Trained Fragments" (link). People try anything and everything.

Or this one: "Massive Language Models Can Be Accurately Pruned in One-Shot" (link). Maybe it means we will be able to run GPT-3-size models on a gaming desktop instead of a $150,000 computer.
