Submitted by blacklemon67 t3_11misax in MachineLearning

Hey everyone!

A quick Fermi estimate shows that if a person encountered 50,000 tokens a day (an extremely high estimate, roughly a novel per day assuming 1 token = 1 word), then by the time they were 20 they would have encountered 365 million tokens.

Obviously this person would be VERY well read. However, if we fed a transformer language model the same number of tokens, then according to scaling laws it would come out worse than GPT-2 (which was trained on a dataset about an order of magnitude larger).
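If you want to sanity-check the arithmetic, here's the back-of-envelope version in Python (the corpus sizes in the comments are rough, commonly cited figures, not exact numbers):

```python
# Back-of-envelope version of the estimate above (plain Python, no dependencies).
tokens_per_day = 50_000              # roughly a novel per day, assuming 1 token ~= 1 word
years = 20
lifetime_tokens = tokens_per_day * 365 * years
print(f"{lifetime_tokens:,}")        # 365,000,000

# For comparison: GPT-2's WebText corpus was on the order of billions of tokens,
# and current LLMs are trained on hundreds of billions to trillions.
print(lifetime_tokens / 1e9)         # ~0.365 (in billions of tokens)
```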

So the question is, why do language models need so many tokens? Does anyone know of any review papers/blog posts discussing this observation?

My theory is that we haven't yet found the most efficient architecture for language, and that transformers' ability to excel at many different tasks means you need to give them a lot of data to force them to come up with the right neural circuits for the job.

TLDR: Humans need substantially fewer tokens than transformer language models. What's the current understanding for why this is?

12

Comments


harharveryfunny t1_jbjhmif wrote

Humans don't learn by locking themselves in a room at birth with a set of encyclopedias, or a print-out of the internet. We learn by interaction with the world - perceive/generalize/theorize/experiment, learn from feedback, etc.

It's impressive how well these LLMs perform given what is really a very tough task: build an accurate world model given only "predict next word" feedback. But it's hardly surprising that they need massive amounts of data to compensate for the task being so tough.
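For concreteness, the "predict next word" feedback really is just a cross-entropy loss on the next token. A minimal PyTorch sketch (the tiny model and random batch are purely illustrative, not any real LLM's setup):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32             # toy sizes, purely illustrative
model = nn.Sequential(                    # stand-in for a real transformer
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (4, 16))    # fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position

logits = model(inputs)                            # (batch, seq-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
# That scalar loss is the only signal the model ever gets about the world.
```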

23

harharveryfunny t1_jbjk9nb wrote

Just to follow up, the reason the "interact with the world" approach is way more efficient is that it's largely curiosity-driven: we proactively try to fill gaps in our knowledge rather than just reading a set of encyclopedias and hoping they cover what we need to know. We learn in a much more targeted fashion.

10

visarga t1_jbn5g3w wrote

On the other hand, an LLM has broad knowledge of every topic, a true dilettante. We can't keep up at that level.

2

mckirkus t1_jbkx54l wrote

Helen Keller is an interesting example of what we are capable of without visual or aural inputs.

3

currentscurrents t1_jbnandw wrote

I think this is the wrong way to think about what LLMs are doing. They aren't modeling the world; they're modeling human intelligence.

The point of generative AI is to model the function that created the data. For language, that's us. You need all these tokens and parameters because modeling how humans think is very hard.

As LLMs get bigger, they can model us more accurately, and that's where all these human-like emergent abilities come from. They build a world model because it's useful for predicting text written by humans who have a world model. Same thing for why they're good at RL and task decomposition, can convincingly fake emotions, and inherit our biases.

2

bivouac0 t1_jbjk79f wrote

Truthfully, this has not been sufficiently researched, and looking into it might yield improvements to LLMs. However, it's also not completely surprising. Consider...

For humans, something like 80% of a conversation is non-verbal (there are actual studies on this). This means that people get the meaning of words through other cues such as expression, tone, etc., and thus our conversational inputs are much "richer" than simply a bunch of tokens.

You also need to consider that our verbal communication is augmented by a lot of other sensory input (i.e., visual). You learn what a "ball" is largely by seeing it, not by hearing about it.

Also realize that LLMs generally use a very low learning rate (on the order of 1e-3 or lower), so a large number of tokens must be presented. It's not completely clear how this works in people, but we do completely memorize some inputs (effectively LR = 1) and almost completely ignore others. This in itself could be an entire area of research: it would be good to understand why some phrases are "catchy" and others are forgettable. Obviously, AI today doesn't do this.
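A toy way to see the learning-rate point (just scalar gradient descent, nothing LLM-specific; the numbers are made up for illustration):

```python
# Driving one parameter to a "memorized" target value with small vs. large steps.
target, w = 1.0, 0.0

# Small learning rate (as in LLM pretraining): many presentations needed.
lr, steps = 1e-3, 0
while abs(target - w) > 0.01:
    w += lr * (target - w)    # gradient step on squared error
    steps += 1
print(steps)                  # ~4603 updates to get within 1% of the target

# "One-shot memorization" (LR = 1): a single presentation suffices.
w = 0.0
w += 1.0 * (target - w)
print(w)                      # 1.0 exactly
```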

I'd also point out that LLMs are not exactly memorizing information. Studies have demonstrated their ability to learn facts, but this is not purposeful knowledge retention. People are better at this, and I suspect AI needs to develop a method to separate knowledge retention and language pattern modeling. Think about learning the state capitals. A person quickly learns to say "the capital of X is Y" and then can substitute in different memorized facts. AI learns the facts and the sentence patterns all in the same manner.

People can also use "thought" (i.e., search, hypothesis, etc.) to understand the meaning of sentences and to form responses. Let's face it, at this point LLMs are just brute-force pattern matchers. There's nothing "intelligent" here.

8

endless_sea_of_stars t1_jbmda5p wrote

> develop a method to separate knowledge retention and language pattern modeling. Think about learning the state capitals. A person quickly learns to say "the capital of X is Y" and then can substitute in different memorized facts. AI learns the facts and the sentence patterns all in the same manner.

This sounds like a problem Toolformer is supposed to address. Instead of learning all the state capitals, the model learns to call a tool: "The capital of Indiana is [QA(Indiana, capital)]."
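Roughly the idea, as a hypothetical sketch (the QA function and fact table are made up here; this is not Toolformer's actual implementation, which learns to emit the API calls during training):

```python
import re

# Hypothetical fact store standing in for an external QA tool / database.
CAPITALS = {"Indiana": "Indianapolis", "Ohio": "Columbus"}

def qa(entity: str, relation: str) -> str:
    if relation == "capital":
        return CAPITALS[entity]
    raise KeyError(relation)

def expand_tool_calls(text: str) -> str:
    # Replace "[QA(Indiana, capital)]" markers with the looked-up answer,
    # so the sentence pattern and the fact come from different places.
    return re.sub(
        r"\[QA\((\w+),\s*(\w+)\)\]",
        lambda m: qa(m.group(1), m.group(2)),
        text,
    )

print(expand_tool_calls("The capital of Indiana is [QA(Indiana, capital)]."))
# -> "The capital of Indiana is Indianapolis."
```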

1

EmbarrassedHelp t1_jbjqy4o wrote

Human brains have structural components / shapes that likely help them learn languages more easily:

https://en.wikipedia.org/wiki/Wernicke%27s_area
https://en.wikipedia.org/wiki/Broca%27s_area

Human brains also start off with way more parameters than needed, and language is most effectively learned before synaptic pruning reduces that number.

6

harharveryfunny t1_jbjxolz wrote

The LLM name for things like GPT-3 seems to have stuck, which IMO is a bit unfortunate since it's rather misleading. They certainly wouldn't need the amount of data they do if the goal were merely a language model, nor would we have needed to progress past smaller models like GPT-1. The "predict next word" training/feedback may not have changed, but the capabilities people are hoping to induce in these larger/ginormous models are now way beyond language and into the realms of world models, semantics, and thought.

2

CaptainLocoMoco t1_jbmeg1l wrote

You shouldn't directly compare LLM training to human learning. LLMs are spawned with totally random weights; apart from the design choices baked into the architecture, the only learning signal they ever receive is the training data. Humans are born with billions of years of information baked into them by evolution. Comparing the two doesn't really make sense. I think this becomes way more obvious when you think about fine motor control instead of language modeling, i.e., a robot isn't going to learn to walk as well as a human after the same amount of "training" time.

4

Acrobatic-Name5948 t1_jbjdcae wrote

If anyone knew this, we would have created AGI already. Probably scale issues plus some new ideas on top of deep learning.

2

Striking-Travel-6649 t1_jbjkstq wrote

I think you're on the money. Once we develop more novel network and system structures that are really good at what they do while still generalizing, it will be game over. I think the current models that ML engineers have created are not complex or nuanced enough to extract the kind of value that humans can out of a "small" number of tokens. The human brain is great at having centralized control, coordination across systems, and effective interconnection, and each subsystem can do its "tasks" extremely well and can generalize across tasks too. With that in mind, we are going to need much more complex systems to achieve AGI.

−2

IntelArtiGen t1_jbjg4wk wrote

>Humans need substantially fewer tokens than transformer language models.

We don't use tokens the same way. In theory you could build a model with a vocabulary of 10,000 billion tokens, including one for each number up to some limit. Obviously humans can't and don't do that. We're probably closer to a model that does "characters of a word => embedding". Some models do that, but they also do "token => embedding" because it improves results and is easier for the model to learn. The people who build these models may not really care about model size if they have the right machine to train it and just want the best results on a task, without constraints on size efficiency.

Most NLP models aren't efficient with respect to their size, though I'm not sure there's currently a way to keep getting the best possible results without doing things like this. If I ask you "what happened in 2018?", you need some idea of what "2018" means and that it's not just a random number. Either (1) you know it's a year because you've learned this number like all the other tokens (and you have to do that for many numbers / weird words, so you end up with a big model); or (2) you treat it as a random number, so you don't need one token per number and your model is much smaller, but you can't answer these questions precisely; or (3) you can rebuild an embedding for 2018 knowing it's 2-0-1-8, because you have an accurate "characters => embedding" model.

I don't think we have a perfect solution for (3), so we usually do (1) & (3). But doing just (3) is the way to go for smaller NLP models... or putting much more weight on (3) and much less on (1).

So the size of NLP models doesn't really mean much: you could build a model with 100,000B parameters, but 99.999% of those parameters won't improve the results a lot and are only required to answer very specific questions. If we care about the size of NLP models, we should focus on building better "characters => embedding" models and on ways to compress word embeddings (easier said than done).
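A toy sketch of the difference between (1) and (3) (the GRU composer and tiny vocabularies are made up for illustration; real subword tokenizers work differently):

```python
import torch
import torch.nn as nn

d = 16  # toy embedding size

# Option (1): one learned embedding per surface form -> the vocabulary explodes
# with every distinct number / rare word you want to cover.
word_vocab = {"what": 0, "happened": 1, "in": 2, "2018": 3, "2019": 4}
word_emb = nn.Embedding(len(word_vocab), d)

# Option (3): build the embedding of "2018" from its characters, so every
# number is covered by ten digit embeddings plus a small composer model.
char_vocab = {c: i for i, c in enumerate("0123456789")}
char_emb = nn.Embedding(len(char_vocab), d)
composer = nn.GRU(d, d, batch_first=True)    # toy "characters => embedding" model

def char_embedding(word: str) -> torch.Tensor:
    ids = torch.tensor([[char_vocab[c] for c in word]])
    _, h = composer(char_emb(ids))
    return h[0, 0]                            # one vector for the whole string

print(word_emb(torch.tensor(word_vocab["2018"])).shape)  # dedicated token
print(char_embedding("2018").shape)                      # composed from 2-0-1-8
```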

1

Origin_of_Mind t1_jbn2m6d wrote

If you look at studies of how children acquire language, for example "First Verbs" by Michael Tomasello, the gist is that children understand quite a bit of their daily routine and actively participate in it -- well before they begin to understand and produce language. Language acquisition in children occurs in an already very capable nervous system, one that "gets" a lot of what is going on around it. Language gets tied into all that.

Our artificial neural networks do not have anything comparable. So, to use extremely simple architectures, we have to train them on a superhuman amount of input, allowing them, by simple statistics, to converge on interesting machinery that to some extent "gets" not just the surface of language but also discovers some of the deeper connections. Multi-modal systems should be able to see even more of the relevant underlying structure of the world, getting one step closer to what humans do.

1

Safe-Celebration-220 t1_jbuswo7 wrote

I think it's because humans have multiple neural networks that connect together. Humans have a neural network for sight, smell, sound, touch, and taste, all combined into one interconnected network. If you took a human brain that had no experience of anything, could not process any of the 5 senses, and was only able to process language, then it would take that brain billions of texts before it could write sentences down. If GPT-3 had a neural network for all 5 senses and you taught it information based on all of those senses, then it could make connections that were previously impossible. GPT-3 will take a word and connect it with other words to see how it fits into the context of a sentence, but humans will take a word and see how it connects with each and every sense, and that takes less information to learn. A human can learn a language faster by taking the things they see and connecting them with the language they are learning. If a human could not connect their sight with language, then learning that language would become much, much harder. So the challenge we face right now is learning to connect neural networks with other neural networks in the same way a human connects theirs.

1