
Updated the data with popular models and switched the efficiency calculation from characters-per-token to tokens-per-text. Script: https://github.com/korchasa/tldr/tree/master/llm/tokens-size
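The two metrics compared below (average tokens per text, and characters per token) can be sketched as follows. The whitespace tokenizer here is only a stand-in; an actual run would plug in each model's own tokenizer, as the linked script does.

```python
from statistics import mean

def token_metrics(texts, tokenize):
    """Return (average tokens per text, average characters per token).

    `tokenize` is any callable mapping a string to a token list.
    A whitespace split stands in for a real model tokenizer below.
    """
    token_counts = [len(tokenize(t)) for t in texts]
    char_counts = [len(t) for t in texts]
    avg_tokens = mean(token_counts)                       # tokens per text
    chars_per_token = sum(char_counts) / sum(token_counts)  # packing efficiency
    return avg_tokens, chars_per_token

# Toy corpus, illustrative only.
texts = ["the cat sat", "on the mat"]
avg_tokens, cpt = token_metrics(texts, str.split)
```

Lower tokens per text means the same content costs fewer tokens; higher characters per token means the tokenizer packs the script more densely.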
Unexpected takeaways
• English has the lowest average token count (≈260.9) and the highest characters per token (≈4.26).
• Ukrainian has the highest average token count (≈447.1).
• Korean has the lowest characters per token (≈1.58), i.e. the tokenizer packs characters least efficiently.
• meta-llama/llama-4-maverick is best on both metrics; microsoft/phi-4-reasoning-plus is worst on both.
• meta-llama/llama-4-maverick is a clear outlier for Korean: 302 tokens and ≈2.08 chars/token, significantly better than other models for that language.
• Across models, Ukrainian > Russian in tokens for the same content (≈447.1 vs ≈377.2).
Hidden patterns
• Ranking stability: English is the lowest-token language in 11/11 models; the highest-token language per model is Ukrainian in 7/11, Finnish in 2/11, Korean in 2/11.
• Variability is driven by language: the standard deviation of tokens by language ranges from ≈21.7 to ≈89.4, dominating variation within a model.
• Script effect: alphabetic Latin/Cyrillic languages pack more characters per token than Korean, across all models.
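The variability claim above comes from grouping token counts by language and taking the spread across models. A minimal sketch, with made-up token counts (the real numbers come from the benchmark data):

```python
from statistics import pstdev

# Hypothetical per-model token counts, grouped by language — illustrative only.
counts = {
    "English":   [255, 262, 266],
    "Ukrainian": [440, 447, 455],
}

# Population standard deviation of token counts across models, per language.
by_language = {lang: pstdev(vals) for lang, vals in counts.items()}
```

If these per-language spreads exceed the spread of a single model's counts across languages of the same script, language (not model choice) dominates the variance.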
Summary — most and least efficient
• Overall efficiency (fewer tokens):
  • Language — most efficient: English; least: Ukrainian
  • Model — most efficient: meta-llama/llama-4-maverick; least: microsoft/phi-4-reasoning-plus
• Tokenizer efficiency (more characters per token):
  • Language — most efficient: English; least: Korean
  • Model — most efficient: meta-llama/llama-4-maverick; least: microsoft/phi-4-reasoning-plus