
Updated the data with popular models and switched the efficiency calculation from characters-per-token to tokens-per-text. Script: https://github.com/korchasa/tldr/tree/master/llm/tokens-size
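The two metrics compared below (average tokens per text, and characters per token) can be sketched as follows. The whitespace tokenizer here is only a stand-in; an actual run would plug in each model's own tokenizer, as the linked script does.

```python
from statistics import mean

def token_metrics(texts, tokenize):
    """Return (average tokens per text, average characters per token).

    `tokenize` is any callable mapping a string to a token list.
    A whitespace split stands in for a real model tokenizer below.
    """
    token_counts = [len(tokenize(t)) for t in texts]
    char_counts = [len(t) for t in texts]
    avg_tokens = mean(token_counts)                       # tokens per text
    chars_per_token = sum(char_counts) / sum(token_counts)  # packing efficiency
    return avg_tokens, chars_per_token

# Toy corpus, illustrative only.
texts = ["the cat sat", "on the mat"]
avg_tokens, cpt = token_metrics(texts, str.split)
```

Lower tokens per text means the same content costs fewer tokens; higher characters per token means the tokenizer packs the script more densely.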
Unexpected takeaways
• English has the lowest average token count (≈260.9) and the highest characters per token (≈4.26).
• Ukrainian has the highest average token count (≈447.1).
• Korean has the lowest characters per token (≈1.58), i.e. the tokenizer packs characters least efficiently.
• meta-llama/llama-4-maverick is best on both metrics; microsoft/phi-4-reasoning-plus is worst on both.
• meta-llama/llama-4-maverick is a clear outlier for Korean: 302 tokens and ≈2.08 chars/token, significantly better than other models for that language.
• Across models, Ukrainian > Russian in tokens for the same content (≈447.1 vs ≈377.2).
Hidden patterns
• Ranking stability: English is the lowest-token language in 11/11 models; the highest-token language per model is Ukrainian in 7/11, Finnish in 2/11, Korean in 2/11.
• Variability is driven by language: the standard deviation of tokens by language ranges from ≈21.7 to ≈89.4, dominating variation within a model.
• Script effect: alphabetic Latin/Cyrillic languages pack more characters per token than Korean, across all models.
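The variability claim above comes from grouping token counts by language and taking the spread across models. A minimal sketch, with made-up token counts (the real numbers come from the benchmark data):

```python
from statistics import pstdev

# Hypothetical per-model token counts, grouped by language — illustrative only.
counts = {
    "English":   [255, 262, 266],
    "Ukrainian": [440, 447, 455],
}

# Population standard deviation of token counts across models, per language.
by_language = {lang: pstdev(vals) for lang, vals in counts.items()}
```

If these per-language spreads exceed the spread of a single model's counts across languages of the same script, language (not model choice) dominates the variance.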
Summary — most and least efficient
• Overall efficiency (fewer tokens):
  • Language — most efficient: English; least: Ukrainian
  • Model — most efficient: meta-llama/llama-4-maverick; least: microsoft/phi-4-reasoning-plus
• Tokenizer efficiency (more characters per token):
  • Language — most efficient: English; least: Korean
  • Model — most efficient: meta-llama/llama-4-maverick; least: microsoft/phi-4-reasoning-plus