AlexGrinch
AlexGrinch t1_ix49fzx wrote
Reply to [D] Why do we train language models with next word prediction instead of some kind of reinforcement learning-like setup? by blazejd
LMs do not aim to “learn” language; they just approximate the probability distribution of a given corpus of text. One way to do this is to factorize that probability via the chain rule and then assume conditional independence between words/tokens that are far apart.
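Concretely (my notation, not from the original comment), that factorization and truncation look like:

$$
p(w_1,\dots,w_T) \;=\; \prod_{t=1}^{T} p(w_t \mid w_1,\dots,w_{t-1}) \;\approx\; \prod_{t=1}^{T} p(w_t \mid w_{t-n+1},\dots,w_{t-1}),
$$

where the exact equality is the chain rule and the approximation is the Markov (conditional independence) assumption of an n-gram model; a neural LM just keeps the full prefix conditioning instead of truncating it.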
Predicting the next word is not some weird heuristic; it's just an optimization problem that tries to approximate the joint probability of sequences of words. I believe it can be equivalently formulated as an MDP with a reward of 1 for (previous words, next word) transitions present in the dataset.
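A minimal sketch of that optimization problem (PyTorch-style; the `model` interface and names here are my assumptions, not something from the thread): maximizing the joint probability under the chain-rule factorization is the same as minimizing the next-token cross-entropy.

```python
import torch
import torch.nn.functional as F

def next_token_nll(model, tokens):
    """Negative log-likelihood of the observed next tokens.

    `model` is assumed to be any autoregressive LM mapping (batch, seq) token
    ids to (batch, seq, vocab) logits; `tokens` is a (batch, seq) id tensor.
    """
    logits = model(tokens[:, :-1])   # predict token t from tokens < t
    targets = tokens[:, 1:]          # the observed "next word" at each position
    # Cross-entropy is -log p(w_t | w_<t), averaged over positions; minimizing
    # it maximizes the chain-rule joint probability of the training corpus.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```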
At the end of the day, training any model is an optimization problem. What objective would the approach you find more intuitive optimize, other than simply fitting the distribution of the data you observe?
AlexGrinch t1_ix4tgps wrote
Reply to comment by Cheap_Meeting in [D] Why do we train language models with next word prediction instead of some kind of reinforcement learning-like setup? by blazejd
I would like to disagree with you. LM was a niche topic because we did not have the necessary tools to build models complex enough to capture even a fraction of the complexity and richness of natural language. Starting from Shannon’s experiments with simple N-gram LMs, researchers returned to language modeling again and again. Finally they got the tools to capture the underlying distribution (which is insanely complex and multimodal) really well.
If you manage to perfectly model the distribution of, for example, comments on threads in the ML subreddit, you can easily run it to debate with me, and I will not be able to tell the difference.