Submitted by blacklemon67 t3_11misax in MachineLearning
IntelArtiGen t1_jbjg4wk wrote
>Humans need substantially fewer tokens than transformer language models.
We don't use tokens the same way. In theory you could build a model with a vocabulary of 10,000 billion tokens, including one token for each number up to some limit. Obviously humans can't and don't do that. We're probably closer to a model that does "characters of a word => embedding". Some models do that, but they also do "token => embedding" because it improves results and is easier for the model to learn. The people who build these models may not really care about model size if they have the right machine to train it and just want the best results on a task, with no constraint on size efficiency.
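A rough back-of-the-envelope sketch of what one-token-per-number costs (the vocabulary limit and embedding width below are assumed, just for illustration):

```python
# Hypothetical: reserve one vocabulary entry per integer up to 1,000,000,
# with a GPT-2-ish embedding width of 768. The embedding table alone then
# spends this many parameters just on number tokens, most of which are
# rarely ever used.
vocab_numbers = 1_000_000   # assumed: one token per integer up to 1e6
d_model = 768               # assumed embedding width
params_for_numbers = vocab_numbers * d_model
print(f"{params_for_numbers:,} parameters just for number tokens")
# => 768,000,000 parameters
```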
Most NLP models aren't efficient with respect to their size, though I'm not sure there's currently a way to keep getting the best possible results without doing things like this. If I ask you "what happened in 2018?", you need some idea of what "2018" means and that it's not just a random number. Either:

1. you know it's a year because you've learned this number like any other token (but then you have to do that for many numbers / weird words, and you end up with a big model), or
2. you treat it as a random number, so you don't need one token per number and your model is much smaller, but you can't answer these questions precisely, or
3. you can rebuild an embedding for 2018 knowing it's 2-0-1-8, which requires an accurate "character => embedding" model.
I don't think we have a perfect solution for (3), so we usually do (1) & (3). But doing only (3) is the way to go for smaller NLP models... or putting much more weight on (3) and much less on (1).
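A minimal sketch of what (3) could look like, with an assumed GRU-based character encoder and made-up sizes (nothing here is tied to a specific model):

```python
import torch
import torch.nn as nn

# Option (3): build a word embedding from its characters instead of
# looking the word up in a huge token table. All sizes are illustrative.
class CharToWordEmbedding(nn.Module):
    def __init__(self, n_chars=128, d_char=64, d_word=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)  # tiny table: one row per character
        self.encoder = nn.GRU(d_char, d_word, batch_first=True)

    def forward(self, char_ids):          # char_ids: (batch, word_len)
        x = self.char_emb(char_ids)       # (batch, word_len, d_char)
        _, h = self.encoder(x)            # final hidden state summarizes the word
        return h.squeeze(0)               # (batch, d_word)

model = CharToWordEmbedding()
# "2018" is just the characters 2-0-1-8; no dedicated "2018" row is needed.
ids = torch.tensor([[ord(c) for c in "2018"]])
word_vec = model(ids)                     # a (1, 768) embedding built from the digits
```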
So the size of NLP models doesn't really mean much: you could build a model with 100,000B parameters, but 99.999% of those parameters won't improve the results a lot and are only needed to answer very specific questions. If we care about the size of NLP models, we should focus on building better "character => embedding" models and on ways to compress word embeddings (easier said than done).
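One known way to compress a word embedding table is low-rank factorization (the idea used in ALBERT); a minimal sketch with assumed vocabulary and dimension sizes:

```python
import torch
import torch.nn as nn

# Factorize the V x d embedding table into V x k and k x d with k << d.
# Sizes below are illustrative.
V, d, k = 50_000, 768, 128
full = nn.Embedding(V, d)               # 50,000 * 768 = 38.4M params
factored = nn.Sequential(
    nn.Embedding(V, k),                 # 50,000 * 128 = 6.4M
    nn.Linear(k, d, bias=False),        # 128 * 768   ~= 0.1M
)                                       # ~= 6.5M params, roughly 6x smaller

ids = torch.tensor([3, 17])
vec = factored(ids)                     # (2, 768), same shape as full(ids)
```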