
madmax_br5 OP t1_j62b2jq wrote

Yes, this is my point: the tokenizer OpenAI uses is optimized for European languages, since it's an alphabetic BPE tokenizer designed around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem has to be solved eventually for multilingual models to have similar cost and capabilities across languages.

So the real question is: what is the best tokenization approach for a truly multilingual model, and why?
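
To make the cost asymmetry concrete, here is a minimal sketch (not part of the original comment) that counts tokens for the same sentence in English and Chinese, assuming the `tiktoken` package; the choice of the GPT-2-style `"gpt2"` encoding and the sample sentences are my own illustrative assumptions.

```python
# Sketch: compare how many BPE tokens the same sentence costs in two scripts.
# Assumes the tiktoken package; "gpt2" stands in for an OpenAI BPE vocabulary
# trained mostly on English/European text.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

samples = {
    "English": "The cat sat on the mat.",
    "Chinese": "猫坐在垫子上。",  # roughly the same sentence in a logographic script
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    # Characters absent from the learned BPE merges fall back to byte-level
    # pieces, so a single logographic character can cost several tokens.
    print(f"{lang}: {len(text)} characters -> {len(tokens)} tokens")
```

Running something like this typically shows the Chinese sentence costing several times more tokens per character than the English one, which is the pricing/context-length penalty being described.
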


visarga t1_j67q45m wrote

The solution is to put more text from the other languages into the corpus and re-train the tokeniser; it will adapt to the larger corpus by assigning more tokens to those languages.
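
A rough sketch of that retraining idea, using the Hugging Face `tokenizers` library as one possible tool (not necessarily what's meant here); the corpus file names and the `vocab_size` value are placeholder assumptions.

```python
# Sketch: retrain a byte-level BPE tokenizer on a corpus that mixes
# European and logographic-language text, with a larger vocabulary so
# CJK symbols get whole-character (or multi-character) tokens.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=200_000,                 # placeholder: larger than GPT-style ~50k
    special_tokens=["[UNK]", "[PAD]"],
)

# Placeholder files: the language mix controls which merges get learned.
files = ["corpus_en.txt", "corpus_zh.txt", "corpus_ja.txt"]
tokenizer.train(files, trainer)
tokenizer.save("multilingual-bpe.json")
```

The point of the sketch is that BPE itself isn't the blocker: if the training corpus and vocabulary budget give non-European scripts enough weight, the learned merges will spend tokens on them too.
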
