
Nameless1995 t1_ix4vylx wrote

tl;dr: what you are suggesting is much, much harder to do than just letting an LM go brrr on large-scale internet text; moreover, you are probably overestimating the benefits your suggestion would provide. Research goes in the direction of whatever is easier to set up ATM.

For passive training, you can just take a language model and feed it gigabytes of internet text.

It's much harder to do the same in a more interactive setting where some expert provides real-time online feedback based on what the language model is doing. Where do you find such experts? Do you get humans in the loop? You may, but a bunch of humans in a loop, even over years, can't hope to match the scale of data you can train on passively.

It's also very likely language models wouldn't learn that well (at least if current Transformers are started from blank-slate random initializations) from a simple "human in the loop" setting, even if the model were nurtured like a baby for years. First, it would be lacking many other forms of multimodal interactive signals that a human gets in similar settings. Implementing a fully multimodally grounded model efficiently is not trivial, if not impossible (much less trivial than making a simple language model do its thing), and it's an area that will probably require more research -- although there is progress, like PaLM-SayCan and GATO. Second, humans may possess better inductive biases from the get-go, potentially for evolutionary reasons, making them more sample-efficient than randomly initialized language models.

Both of those limitations may be partially counteracted by large-scale internet text training (which would also be much faster than training the model like a human baby, since you aren't limited by the slowness of human trainers).

Moreover, it's not really an either-or. You can train a model passively to "initialize" it, and then fine-tune it in a human-in-the-loop setting or with RL: https://arxiv.org/abs/2203.02155
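A rough sketch of that two-stage idea (toy PyTorch code, not the InstructGPT recipe itself): assume stage 1 (passive pretraining) already gave you the weights, then fine-tune on feedback about the model's own outputs. The `reward` function here is a hypothetical stand-in for a human rater or a learned reward model.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32

# Toy "language model"; in practice this would be a Transformer whose weights
# come from stage 1: passive pretraining on internet text.
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def reward(token_id: int) -> float:
    # Placeholder for human-in-the-loop or reward-model feedback on the output.
    return 1.0 if token_id % 2 == 0 else -1.0

# Stage 2: a simple REINFORCE-style update on the model's sampled outputs.
for step in range(100):
    prompt = torch.randint(0, vocab_size, (1,))
    logits = model(prompt)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                    # model's "response"
    r = reward(action.item())                 # feedback on that response
    loss = -dist.log_prob(action) * r         # push up probability of rewarded outputs
    optimizer.zero_grad()
    loss.mean().backward()
    optimizer.step()
```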

Moreover, you can think of "passive learning" as a form of interaction with the environment too. The model predicts the future state of the environment (the next word), and the environment returns the "true word" (although independently of the model's actions). A "reward" is calculated from the cross-entropy loss between the model's predicted distribution (the agent's action, i.e. its prediction) and the "true word" the environment provides -- except in this case there isn't a live environment, but pre-recorded offline demonstrations of past environmental dynamics (human communications).
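A minimal sketch of that framing (assuming PyTorch and a toy token stream standing in for internet text; the model here is a placeholder, not a real LM): the "agent" predicts the next token, the pre-recorded "environment" reveals the true token, and the cross-entropy loss plays the role of a negative reward.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32

# Toy next-token predictor; a real LM would condition on the full context.
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# "Pre-recorded environment": a stream of token ids (stand-in for internet text).
corpus = torch.randint(0, vocab_size, (1000,))

for t in range(len(corpus) - 1):
    state, true_next = corpus[t].unsqueeze(0), corpus[t + 1].unsqueeze(0)
    logits = model(state)              # agent's "action": a predicted distribution
    loss = loss_fn(logits, true_next)  # environment's feedback; -loss acts like a reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```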


blazejd OP t1_ix7jmyo wrote

Love this response; a lot of it coincides with my earlier thoughts. I will refer to it in my general response.
