Submitted by These-Assignment-936 t3_10y2mu0 in MachineLearning
currentscurrents t1_j7wf3u0 wrote
>What is the standard modeling approach to these kinds of problems?
The standard approach is reinforcement learning. It works, but it's not very sample-efficient and takes many iterations to train.
LLMs are probably so good at this because of their strong meta-learning abilities: during pretraining they learn not only the training task itself but also general strategies for learning new tasks.
This has some really interesting implications. Pretraining seems to drastically improve sample efficiency even when the pretraining was on a very different task. Maybe we could pretrain on a very large amount of synthetic, generated data before doing the real training on our limited-size real datasets.
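Not endorsing any particular setup, but a minimal sketch of that "pretrain on synthetic, then fine-tune on real" idea could look something like this (PyTorch; `make_synthetic_batch` and `real_loader` are hypothetical stand-ins for a synthetic data generator and a small real dataset):

```python
# Sketch: pretrain on cheap generated data, then fine-tune on a small real dataset.
# make_synthetic_batch() and real_loader are hypothetical placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()

# Phase 1: pretrain on synthetic data (many cheap steps).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100_000):
    x, y = make_synthetic_batch()          # hypothetical (inputs, labels) generator
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: fine-tune on the limited real dataset (few epochs, lower learning rate).
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(3):
    for x, y in real_loader:               # hypothetical DataLoader over real data
        loss = loss_fn(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
```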
avocadoughnut t1_j7xvd0p wrote
Makes me wonder if pretraining makes the model converge on essentially a more efficient architecture that we could be using instead. I'm hoping this thought has already been explored; it would be interesting to read about.
Sm0oth_kriminal t1_j7y6wv6 wrote
This is probably only the case when there's a very low “compression ratio” of model parameters to learned entropy.
Basically, if the model has “too many” parameters it can be distilled, but we've found empirically that, until that point is hit, transformers scale extremely well and are generally better than any other known architecture.
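For what it's worth, a rough sketch of what distilling an over-parameterized model into a smaller student looks like (PyTorch; `teacher`, `student`, and `loader` are hypothetical placeholders):

```python
# Sketch of distillation: a smaller "student" is trained to match the softened
# output distribution of an over-parameterized "teacher".
# teacher, student, and loader are hypothetical placeholders.
import torch
import torch.nn.functional as F

T = 2.0  # temperature for softening the logits
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for x, _ in loader:
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    opt.zero_grad(); loss.backward(); opt.step()
```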
Another topic is sparsification, which takes a trained model, cuts out some percentage of the weights that have minimal effect on the output, and then fine-tunes the pruned model. You can check out Neural Magic online and their associated work… they can run models on CPUs that would normally require GPUs.
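A minimal sketch of that prune-then-fine-tune loop, using PyTorch's built-in pruning utilities (not Neural Magic's actual tooling; `model` and `loader` are placeholders):

```python
# Sketch of magnitude pruning + fine-tuning; model and loader are placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out the 90% of weights with the smallest magnitude in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Fine-tune briefly so the remaining weights compensate for what was removed.
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for x, y in loader:
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Make the pruning permanent (folds the masks back into the weight tensors).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```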
avocadoughnut t1_j7yaq8w wrote
I'm considering a higher-level idea. There's no way that transformers are the be-all and end-all of model architectures. By identifying the mechanisms that large models are learning, I'm hoping a better architecture can be found that reduces the total number of multiplications and samples needed for training. It's like feature engineering.
nikgeo25 t1_j7yjicm wrote
Know any papers related to their work? Magic sounds deceptive...