
float16 t1_j62agci wrote

Isn't this just the result of using certain tokenizers? Using Chinese as an example, no reasonable tokenizer developed with Chinese in mind would give you 17 tokens. You'd have maybe 6 to 8:

  1. 你好 ("hello")
  2. 高个子 ("tall person")

...depending on whether it thinks 你好 and 高个子 should be split.
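If you want to see the gap concretely, here's a rough sketch with OpenAI's tiktoken library; the sentences are made-up examples and the exact counts depend on which encoding you load:

```python
# Rough token-count comparison for an English sentence and a Chinese
# counterpart using OpenAI's BPE encodings. Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by newer OpenAI models

samples = {
    "en": "Hello, are you tall?",   # made-up example sentence
    "zh": "你好，你是高个子吗？",      # rough Chinese counterpart
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    print(f"{lang}: {len(text)} characters -> {len(tokens)} tokens")
```

The Chinese string comes out far longer in tokens relative to its character count, which is exactly the asymmetry being discussed.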

2

madmax_br5 OP t1_j62b2jq wrote

Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, since it is an alphabetic tokenizer designed around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capabilities across languages.

So the real question is: what is the best tokenization approach for a truly multilingual model, and why?
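For a rough sense of what per-symbol coverage would cost in vocabulary size, here's a back-of-the-envelope sketch; the Unicode block ranges are standard, everything else is just illustration:

```python
# Back-of-the-envelope vocabulary cost of giving every symbol in the major
# logographic/syllabic scripts its own token (Unicode block ranges).
blocks = {
    "CJK Unified Ideographs":        (0x4E00, 0x9FFF),  # core Han characters
    "CJK Unified Ideographs Ext. A": (0x3400, 0x4DBF),
    "Hangul Syllables":              (0xAC00, 0xD7A3),
    "Hiragana":                      (0x3040, 0x309F),
    "Katakana":                      (0x30A0, 0x30FF),
}

total = 0
for name, (lo, hi) in blocks.items():
    count = hi - lo + 1
    total += count
    print(f"{name}: {count:,} code points")

print(f"Total: {total:,} extra vocabulary entries")
# For comparison, GPT-style BPE vocabularies are on the order of 50k-100k
# tokens, so per-symbol coverage of these scripts is a comparable budget.
```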

0

visarga t1_j67q45m wrote

The solution is to put more text in the other languages and re-train the tokeniser; it will adapt to the larger corpus by assigning more tokens to those languages.
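Something like this with the Hugging Face `tokenizers` library, as a minimal sketch; the corpus file names and vocab size are placeholders:

```python
# Minimal sketch: re-train a byte-level BPE tokenizer on a corpus with more
# non-English text so that frequent Chinese sequences earn their own merges.
# Requires: pip install tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()  # no whitespace assumption

trainer = trainers.BpeTrainer(
    vocab_size=100_000,                                    # placeholder budget
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # all 256 byte symbols
)

# Hypothetical corpus files; the language mix is the knob that matters.
tokenizer.train(files=["corpus_en.txt", "corpus_zh.txt"], trainer=trainer)

print(tokenizer.encode("你好，你是高个子吗？").tokens)
```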

1