Submitted by blazejd t3_yzzxa2 in MachineLearning
Think about how children learn their native language. At the very beginning, they listen to the language adults use. Later on, they start to produce very basic forms of communication, using just a handful of single words. Only over time do they come up with longer sentences and correct grammar. Most importantly, they continuously interact with people who already speak the language (the environment) and receive real-time feedback: their mistakes are corrected by others. This sounds very similar to reinforcement learning.
On the other hand, current large language models "learn the language" by passively reading huge amounts of text and trying to predict the next word. While the results are impressive, it is not the most intuitive approach, and reinforcement learning feels like a more natural fit.
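For concreteness, the "predict the next word" objective boils down to something like this minimal PyTorch sketch (the tiny model, vocabulary size, and random data are toy placeholders, not a real LM):

```python
import torch
import torch.nn as nn

# Toy next-token prediction setup (hypothetical sizes, not a real LM config)
vocab_size, d_model = 1000, 64

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),  # a real LM has a transformer in between
)

# A batch of token ids, standing in for a tokenized text corpus
tokens = torch.randint(0, vocab_size, (8, 33))  # (batch, seq_len + 1)
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets are inputs shifted by one

logits = model(inputs)  # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # an optimizer step would follow
```

The key point is that every token of raw text is "free" supervision; no human has to label or correct anything.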
Why do you think the general research trend didn't go in this direction?
HateRedditCantQuitit t1_ix4d4sx wrote
You can scale self-supervised learning much more easily, cheaply, and safely than you can scale human-in-the-loop RL. It's similar to why we don't train self-driving cars by putting them on real roads and letting them learn by RL.
If we could put a language model in a body and let it learn safely through human tutoring in a more time-effective and cost-effective way, maybe it would be worthwhile. Today, it doesn't seem to be the time-effective or cost-effective solution.
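To make the bottleneck concrete, a naive human-in-the-loop setup would look something like this sketch (a toy model with a REINFORCE-style policy-gradient update; every name and number here is illustrative, not how any production system works):

```python
import torch
import torch.nn as nn

# Toy "tutoring" loop: the model emits one token, a human scores it.
vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

prompt = torch.randint(0, vocab_size, (1, 16))  # stand-in for a tokenized prompt
logits = model(prompt)[:, -1, :]                # next-token distribution
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                          # the model's "reply" (one token)

# The bottleneck: a real person has to score every single interaction.
reward = float(input("Rate the model's reply (-1 to 1): "))

loss = -reward * dist.log_prob(action)          # REINFORCE policy gradient
optimizer.zero_grad()
loss.sum().backward()
optimizer.step()
```

One human judgment buys you one (noisy, scalar) training signal, versus millions of next-token labels per dollar from plain text. That asymmetry is the whole argument.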
And while I'm on my podium: once LMs are commercially deployed in a loop talking to people at scale, I expect this will be a huge topic.
Tangentially, check out this short story/novella that kinda explores the idea from a fictional perspective. It's incredibly well written and interesting, by a favorite author of mine: "The Lifecycle of Software Objects" by Ted Chiang. https://web.archive.org/web/20130306030242/http://subterraneanpress.com/magazine/fall_2010/fiction_the_lifecycle_of_software_objects_by_ted_chiang