
madmax_br5 OP t1_j62b2jq wrote

Yes, this is my point: the tokenizer OpenAI uses is optimized for European languages, since it's an alphabetic BPE tokenizer designed around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem has to be solved eventually for multilingual models to have similar cost and capabilities across languages.

So the real question is: what is the best tokenization approach for a truly multilingual model, and why?
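
To make the cost asymmetry concrete, here is a minimal sketch (not part of the original comment) that counts tokens for the same sentence in English and Chinese, assuming the `tiktoken` package; the choice of the GPT-2-style `"gpt2"` encoding and the sample sentences are my own illustrative assumptions.

```python
# Sketch: compare how many BPE tokens the same sentence costs in two scripts.
# Assumes the tiktoken package; "gpt2" stands in for an OpenAI BPE vocabulary
# trained mostly on English/European text.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

samples = {
    "English": "The cat sat on the mat.",
    "Chinese": "猫坐在垫子上。",  # roughly the same sentence in a logographic script
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    # Characters absent from the learned BPE merges fall back to byte-level
    # pieces, so a single logographic character can cost several tokens.
    print(f"{lang}: {len(text)} characters -> {len(tokens)} tokens")
```

Running something like this typically shows the Chinese sentence costing several times more tokens per character than the English one, which is the pricing/context-length penalty being described.
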


visarga t1_j67q45m wrote

The solution is to put more text from the other languages into the corpus and re-train the tokeniser; it will adapt to the larger corpus by assigning more tokens to those languages.
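
A rough sketch of that retraining idea, using the Hugging Face `tokenizers` library as one possible tool (not necessarily what's meant here); the corpus file names and the `vocab_size` value are placeholder assumptions.

```python
# Sketch: retrain a byte-level BPE tokenizer on a corpus that mixes
# European and logographic-language text, with a larger vocabulary so
# CJK symbols get whole-character (or multi-character) tokens.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=200_000,                 # placeholder: larger than GPT-style ~50k
    special_tokens=["[UNK]", "[PAD]"],
)

# Placeholder files: the language mix controls which merges get learned.
files = ["corpus_en.txt", "corpus_zh.txt", "corpus_ja.txt"]
tokenizer.train(files, trainer)
tokenizer.save("multilingual-bpe.json")
```

The point of the sketch is that BPE itself isn't the blocker: if the training corpus and vocabulary budget give non-European scripts enough weight, the learned merges will spend tokens on them too.
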
