
gradientpenalty t1_j6278gc wrote

It's not a problem with Unicode but with the tokenization method they're using, BPE. I don't foresee a solution any time soon, because there aren't many high-paying customers for it.

TL;DR: English uses the fewest tokens because it gets the highest compression ratio, measured in bytes per token.
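
To see it concretely, here's a quick sketch using the tiktoken library with the GPT-2/3-era vocabulary (the non-English sentences are rough translations, and exact counts depend on the tokenizer version):

```python
# Count bytes and BPE tokens for roughly the same sentence in different scripts.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2/GPT-3-era byte-level BPE

samples = {
    "English": "Machine learning models are trained on large text corpora.",
    "Hindi": "मशीन लर्निंग मॉडल बड़े टेक्स्ट कॉर्पोरा पर प्रशिक्षित होते हैं।",
    "Chinese": "机器学习模型是在大型文本语料库上训练的。",
}

for lang, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    print(f"{lang:8s} bytes={n_bytes:3d} tokens={n_tokens:3d} "
          f"bytes/token={n_bytes / n_tokens:.2f}")
```

English lands at several bytes per token; the other scripts often end up near one byte per token or worse, i.e. little to no compression.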

9

Luminite2 t1_j62kcmp wrote

Your tl;dr is a bit circular. English has the highest compression ratio because the tokenizer was trained to optimize compression on mostly English data. One could train a BPE-based tokenizer that compresses some other language really well but works poorly on English if that made sense for the intended application.
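
For example, a minimal sketch of that idea with the Hugging Face tokenizers library (the corpus path, vocabulary size, and special token are placeholders):

```python
# Fit a byte-level BPE vocabulary on a non-English corpus; the resulting
# tokenizer compresses that language well and English comparatively poorly.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()  # operate on raw bytes, like GPT-2/3

trainer = BpeTrainer(
    vocab_size=50_000,                      # placeholder
    initial_alphabet=ByteLevel.alphabet(),  # cover every possible byte
    special_tokens=["<|endoftext|>"],
)
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)  # placeholder path

print(tokenizer.encode("मशीन लर्निंग").tokens)      # compresses to few tokens
print(tokenizer.encode("machine learning").tokens)  # falls back to short pieces
```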

6

madmax_br5 OP t1_j629re3 wrote

Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but a larger total symbol set). I suppose I don't see why efficiency at this step is worth obsessing over, or why it's necessary at all. What is the relationship between vocabulary size and the model's computational requirements? If the model input is ultimately an embedding with a fixed number of dimensions, does the token vocabulary size really make much practical difference?
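
To put rough numbers on the question (illustrative figures only, borrowing GPT-2-small's embedding width; the large vocabulary is hypothetical):

```python
# Vocabulary size mainly shows up in the embedding table (and the tied output
# softmax), not in the per-token compute inside the transformer blocks.
d_model = 768          # embedding width, fixed regardless of vocabulary
vocab_small = 50_257   # GPT-2/3-style BPE vocabulary
vocab_large = 250_000  # hypothetical multilingual vocabulary

def embedding_params(vocab_size: int, width: int) -> int:
    return vocab_size * width  # one d_model-sized row per token id

for v in (vocab_small, vocab_large):
    print(f"vocab={v:>7,}  embedding params={embedding_params(v, d_model):>12,}")
```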

−3