---
title: "opus 4.7: tokenizer changes and rising costs"
date: 2026-04-17
draft: false
---
Anthropic updated the tokenizer in opus 4.7. The official announcement states that inputs grow by 1.0–1.35× in tokens. I was hoping they had lowered the language tax and improved expressiveness for non-Latin languages, but it looks like this is just a vocabulary reduction.
## Test results
Comparison of Opus 4.6, 4.7 and Haiku 4.5 across 53 languages and 12 data types:
- Unchanged: Russian, Arabic, Hebrew, Hindi, CJK languages, digits, whitespace and JSON.
- Token count growth:
  - English prose: +31%
  - Source code: +22%
  - Markdown: +21%
  - camelCase identifiers (e.g. getUserByEmail): +51%
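A toy greedy longest-match tokenizer makes the mechanism visible. This is a simplified stand-in for BPE inference, and both vocabularies are hypothetical; the point is only that dropping a long merge forces the same string into more tokens:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization (toy stand-in for BPE inference)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabularies: the "old" one contains a long merge,
# the "new" one only keeps the short sub-word pieces.
old_vocab = {"getUserByEmail", "get", "User", "By", "Email"}
new_vocab = {"get", "User", "By", "Email"}

print(tokenize("getUserByEmail", old_vocab))  # ['getUserByEmail'] — 1 token
print(tokenize("getUserByEmail", new_vocab))  # ['get', 'User', 'By', 'Email'] — 4 tokens
```

Same text, four times the tokens: that is the +51% camelCase growth in miniature.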
It looks like long BPE merges spanning multiple English words were removed from the vocabulary. The model now emits several short tokens where it previously used one long one.
Since no language or popular format showed improvements, the reason lies elsewhere. Hypotheses:
- Multimodality: reserving slots for image processing (input expanded 3×) and audio.
- Architecture: reducing physical vocabulary size to optimize model weights.
## Bottom line
- Higher costs: working with English text and code became 21–31% more expensive. The price per 1M tokens stayed the same ($5/$25), but more tokens are now needed for the same text.
- Context compression: English text needs about 20% more tokens, so what used to fit in 200,000 tokens now takes about 240,000, shrinking the effective context window by roughly 17%.
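The cost arithmetic behind these points, as a quick sketch (the $5/1M input price is from the post; the 200,000 → 240,000 token counts are the illustrative figures above):

```python
PRICE_INPUT = 5.00  # $ per 1M input tokens, unchanged per the announcement

def input_cost(tokens: int, price_per_m: float = PRICE_INPUT) -> float:
    """Dollar cost of sending `tokens` input tokens at `price_per_m` $/1M."""
    return tokens / 1_000_000 * price_per_m

old = input_cost(200_000)  # $1.00 under the old tokenizer
new = input_cost(240_000)  # $1.20 for the same text under the new one
print(f"+{new / old - 1:.0%}")  # +20% cost for identical input text
```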
I've also updated the tokenizer benchmarks at https://tokenizers.korchasa.dev/