Symbols per token in different LLMs for different languages
Each cell shows symbols per token / words per token.

| Language / Model | GPT-4o (2024-08-06) | GPT-4o-mini (2024-07-18) | GPT-4 | Claude-3-5-Sonnet (2024-06-20) |
|---|---|---|---|---|
| English | 4.63 / 0.77 | 4.60 / 0.77 | 4.77 / 0.79 | 4.41 / 0.74 |
| French | 3.65 / 0.71 | 3.65 / 0.71 | 3.29 / 0.64 | 2.81 / 0.55 |
| Romanian | 3.75 / 0.59 | 3.75 / 0.59 | 3.36 / 0.53 | 3.34 / 0.52 |
| Russian | 3.93 / 0.55 | 3.90 / 0.55 | 2.57 / 0.36 | 2.74 / 0.39 |
| Ukrainian | 2.59 / 0.45 | 2.59 / 0.45 | 1.64 / 0.28 | 2.12 / 0.37 |
While preparing an internal presentation, I had to analyze the current "expressiveness" of tokens across languages. The conclusions are as follows:
- Responses in some languages can cost up to three times more than in English (e.g., GPT-4 packs 4.77 symbols into a token for English but only 1.64 for Ukrainian).
- Comparing models by price per token is only possible for the English language. For other languages, the cost needs to be recalculated into the price per character, word, or sentence.
- In some cases, it might be more cost-effective to translate the request and response into English. However, this needs to be carefully calculated, especially for the responses.
- It is difficult to predict which language is “more expensive” and which is “cheaper.” Calculations are necessary.
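The recalculation mentioned above is simple arithmetic: divide the price per token by the symbols-per-token ratio to get an effective price per character. A minimal sketch (the price below is a placeholder, not real pricing; the ratios are GPT-4o's values from the table):

```python
# Convert a per-token price into an effective per-character price,
# using measured symbols-per-token ratios.

def cost_per_million_chars(price_per_million_tokens: float,
                           symbols_per_token: float) -> float:
    """Tokens needed per character = 1 / symbols_per_token,
    so cost per character = price per token / symbols per token."""
    return price_per_million_tokens / symbols_per_token

# Symbols-per-token ratios for GPT-4o, taken from the table above.
ratios = {"English": 4.63, "French": 3.65, "Russian": 3.93, "Ukrainian": 2.59}
price = 10.0  # hypothetical $ per 1M tokens

for lang, spt in ratios.items():
    print(f"{lang}: ${cost_per_million_chars(price, spt):.2f} per 1M characters")
```

At the same per-token price, Ukrainian text ends up roughly 4.63 / 2.59 ≈ 1.8× more expensive per character than English on GPT-4o.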
These conclusions will help us better understand and optimize the cost of using different languages with these models.
- Token efficiency and linguistic structure: LLMs process English markedly more efficiently than Russian or Ukrainian, as the higher symbols-per-token ratios show.
- Impact of model design on multilingual processing: Claude-3-5-Sonnet shows lower ratios than GPT-4o across these languages, suggesting it may tokenize text into smaller units.
- Consequences for cost and speed: a lower symbol-to-token ratio means more tokens per message, and therefore higher cost and latency for Cyrillic-script languages.
- Alphabet and tokenization mechanisms: current tokenizers handle Cyrillic characters less efficiently than Latin ones.
- Morphological complexity vs. tokenization: languages with simpler morphological structure tokenize more efficiently.
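The ratios in the table can be reproduced by tokenizing a sample text and dividing character and word counts by the token count. A minimal sketch follows; the `tokenize` function here is a naive whitespace stand-in, so for real measurements you would swap in the model's actual tokenizer (e.g., tiktoken for the GPT family):

```python
# Measure "expressiveness" of tokens for a sample text:
# symbols per token and words per token.

def tokenize(text: str) -> list[str]:
    # Placeholder: splits on whitespace. A real BPE tokenizer produces
    # sub-word tokens, so the measured ratios will differ per model.
    return text.split()

def expressiveness(text: str) -> tuple[float, float]:
    tokens = tokenize(text)
    symbols_per_token = len(text) / len(tokens)
    words_per_token = len(text.split()) / len(tokens)
    return symbols_per_token, words_per_token

spt, wpt = expressiveness("The quick brown fox jumps over the lazy dog")
print(f"symbols/token: {spt:.2f}, words/token: {wpt:.2f}")
# symbols/token: 4.78, words/token: 1.00
```

With a whitespace tokenizer, words per token is always 1.0; real tokenizers split words into several sub-word tokens, which is exactly why the table's word ratios fall below 1.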