
CKtalon t1_j625s3n wrote

The tokenizer just saw a predominantly English corpus, so it naturally assigned single tokens to most common English words and left words from other languages split into sub-word pieces.

They could increase the vocabulary size to something like 250,000 from the current 30+k, but that would require retraining.
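
A quick way to see that split in practice is to run a few words through a byte-level BPE tokenizer. A minimal sketch, assuming the `tiktoken` package is installed (its GPT-2 encoding has a ~50k vocabulary trained mostly on English text):

```python
# Sketch: compare how many sub-word tokens common English words get
# versus words in other scripts, under a mostly-English BPE vocabulary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # ~50k-token, English-heavy encoding
for word in ["hello", "government", "정부", "правительство"]:
    tokens = enc.encode(word)
    print(f"{word!r}: {len(tokens)} tokens -> {tokens}")

# Typical result: the English words come out as one or two tokens, while the
# Korean and Russian words fall back to many byte-level pieces.
```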


madmax_br5 OP t1_j62anqr wrote

What would be the practical impacts of a larger vocabulary? There seems to ultimately be no way around this if you want a truly multilingual model; your vocabulary needs to be at least as large as the full set of symbols across all the languages in the corpus. But the computational cost of this would seem to be limited to the very beginning and very end of the model, which looks insignificant compared to the attention layers that operate in vector space. In fact, doesn't a larger input vocabulary result in fewer net tokens to vectorize in the first place? If the vector space of the embedding has a fixed dimensionality (which I believe it does in the case of GPT-3), then isn't each token the same mathematical size once embedded?
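
For example, a quick comparison of two BPE encodings of different sizes (a sketch, assuming the `tiktoken` package is available; `gpt2` is a ~50k-token encoding and `cl100k_base` a ~100k-token one):

```python
# Sketch: the same non-English text under a smaller and a larger BPE vocabulary.
import tiktoken

small = tiktoken.get_encoding("gpt2")         # ~50k tokens
large = tiktoken.get_encoding("cl100k_base")  # ~100k tokens

text = "안녕하세요, 오늘 날씨가 정말 좋네요."    # arbitrary Korean sample
print("smaller vocab:", len(small.encode(text)), "tokens")
print("larger vocab:", len(large.encode(text)), "tokens")

# The larger vocabulary covers more sub-word pieces, so the same text becomes
# fewer tokens; and once embedded, every token is a vector of the same fixed
# dimension regardless of how big the vocabulary is.
```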


suflaj t1_j63bf1q wrote

Well for starters, it would probably have worse performance due to so many redundant features, and it would be much slower.

Remember that the embedding layer carries loads of overhead, as we're talking V × d matrices. So for a vocabulary of 250k and an embedding dimension of 768, for example, we're talking about 192M parameters just for the embedding layer. Maybe you can save some space by having a sparse embedder, but find me a free implementation of sparse layers that works as well as dense ones. Other than that, those 192M parameters take up roughly 768 MB in fp32 before any compression techniques. And that's just the weights in memory; the gradient, unless sparsified, will be another 768 MB PER BATCH.
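
As a back-of-the-envelope check on those numbers (just arithmetic, assuming fp32, i.e. 4 bytes per parameter):

```python
# Embedding table size for a 250k vocabulary and 768-dim embeddings.
vocab_size = 250_000
embed_dim = 768

params = vocab_size * embed_dim   # 192,000,000 parameters
mem_mb = params * 4 / 1e6         # 4 bytes per fp32 parameter
print(f"{params / 1e6:.0f}M parameters, ~{mem_mb:.0f} MB in fp32")

# A dense gradient for this layer has the same shape, so every backward pass
# materializes roughly another 768 MB unless the gradient is kept sparse.
```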

This is without mentioning that you would likely need to increase the embedding dim to account for the roughly 8-times-bigger vocabulary.


CKtalon t1_j62c6t5 wrote

GPT can already model multiple languages with a ~30k vocabulary, just at the cost of a high token count per (non-English) word. So increasing it to 200k would ease most of that burden. It definitely won't bring other languages fully to parity with English, though, since there's ultimately a hard limit on each language's corpus.


HateRedditCantQuitit t1_j647xm6 wrote

I'm not sure how long you've been around, but before BPE came along, large vocabularies were actually quite a pain in the ass. You can find lots of literature around it from before maybe 2016 (can't remember the exact dates to look up and I'm feeling lazy).

IIRC, a big issue was the final prediction layer. Say you're predicting a sequence 4k tokens long. Then you have 4k times vocab-size predictions. With a 50k-token vocab, that's 200M predictions in memory (roughly 1 gig with floats). Let's say we want to equally compress 20x more languages, so we get a 1M-token vocab (speaking super duper roughly), which means nearly 20GB just to represent the logits. If we wanted to handle a 40k-long sequence, it's the difference between 20GB and 200GB of logits.
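
The same arithmetic as a sketch (fp32 logits, 4 bytes each):

```python
# Memory needed just to hold the output logits for one sequence.
def logit_gib(seq_len, vocab_size, bytes_per_float=4):
    return seq_len * vocab_size * bytes_per_float / 2**30

print(logit_gib(4_000, 50_000))       # ~0.7 GiB  (4k sequence, 50k vocab)
print(logit_gib(4_000, 1_000_000))    # ~15 GiB   (4k sequence, 1M vocab)
print(logit_gib(40_000, 1_000_000))   # ~150 GiB  (40k sequence, 1M vocab)
```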

That said, BPE just takes in sequences of simpler tokens. If you want to feed it Unicode, go ahead. If you want to feed it something else, that will work too. It seems like you're mostly frustrated that LLM investment is focused on English right now, which is valid. Tech investment in general has a strong Silicon Valley bias, and a zillion people want to recreate that elsewhere. But that's a very hard economic question.
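
To make the "sequences of simpler tokens" point concrete, here is a minimal byte-level BPE training sketch (illustrative only, not any particular library's implementation): the algorithm only ever sees integers, so any Unicode text works once it's UTF-8 encoded.

```python
# Minimal byte-level BPE training: repeatedly merge the most frequent adjacent
# pair of symbols into a new symbol. Base vocabulary = the 256 byte values.
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent pair of symbols, or None if too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_train(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    seq = list(text.encode("utf-8"))
    merges, next_id = [], 256          # new symbol ids start above the byte range
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges.append((pair, next_id))
        merged, i = [], 0
        while i < len(seq):            # replace every occurrence of `pair`
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq, next_id = merged, next_id + 1
    return merges, seq

merges, encoded = bpe_train("tokenization 예시 example пример", num_merges=10)
print(len(merges), "merge rules learned;", len(encoded), "symbols left")
```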


visarga t1_j67pv49 wrote

It's also the fact that content in English dwarfs content in other languages. Languages more similar to English benefit as well, but not languages with different scripts and fewer cognates.
