ww3ace t1_j624na0 wrote

I don’t think any modern SOTA language model uses Unicode for tokenization.

1

madmax_br5 OP t1_j625fr2 wrote

The token counts in my example were copied directly from OpenAI's tokenizer, so even if it isn't Unicode-based, it still represents logographs very inefficiently.
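
For illustration only, here is a rough sketch of one likely reason, assuming a byte-level BPE tokenizer in the style of GPT-2. Such a tokenizer starts from UTF-8 bytes rather than Unicode code points, and a common CJK character takes 3 bytes in UTF-8, so a logograph with few learned merges can cost up to 3 tokens. The helper name `utf8_stats` and the sample strings are made up for this example; the real token count depends on the learned merges, which this does not model.

```python
# Sketch: compare character count with UTF-8 byte count.
# Assumption: a byte-level BPE tokenizer works on UTF-8 bytes, so the byte
# count is the worst-case token count (no merges applied at all).

def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

# Hypothetical sample strings, chosen only for illustration.
samples = {
    "English": "hello world",
    "Chinese": "你好世界",
}

for name, text in samples.items():
    chars, nbytes = utf8_stats(text)
    print(f"{name}: {chars} chars, {nbytes} bytes, {nbytes / chars:.1f} bytes/char")
```

English text usually ends up well under one token per character because frequent byte sequences get merged, while rarer logographs may stay close to the 3-bytes-per-character ceiling.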

1