Matthew Prince (Cloudflare CEO) published a post‑mortem (https://blog.cloudflare.com/cloudflare-incident-november-18-2025/) about yesterday’s outage.
On November 18, 2025, Cloudflare suffered a major outage after a ClickHouse permissions change caused the Bot Management feature file to grow unexpectedly. This caused 5xx errors and authentication issues. Primary recovery was complete by 14:30 UTC, full recovery by 17:06 UTC. This was not an attack.
- Cause: a ClickHouse permissions change made a metadata query return duplicate rows, so the ML feature file for the bot model suddenly grew past the 200-feature limit. As a result the newer FL2 proxy crashed and returned 5xx, while the legacy FL proxy kept running but assigned a bot score of 0 to every request (a sketch of this failure mode follows the list below).
- Not a cyberattack: Cloudflare's externally hosted status page happened to go down at the same time, which initially pointed the investigation toward an attack.
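Below is a minimal Rust sketch of that failure mode; the names, file format, and limit handling are hypothetical illustrations, not Cloudflare's actual code. The point is how a feature file whose rows get duplicated blows past a hard limit, and how unwrapping that error instead of handling it turns a bad configuration file into a process-wide panic:

```rust
// Hypothetical illustration of the failure mode; names and structures are
// assumptions, not Cloudflare's actual code.
const MAX_FEATURES: usize = 200; // hard limit on features in the bot model

#[derive(Debug)]
struct FeatureFile {
    features: Vec<String>,
}

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, limit: usize },
}

// Parse the propagated feature file and enforce the hard limit.
fn load_feature_file(lines: &[&str]) -> Result<FeatureFile, LoadError> {
    let features: Vec<String> = lines.iter().map(|s| s.to_string()).collect();
    if features.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            got: features.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(FeatureFile { features })
}

fn main() {
    // A permissions change makes the generating query return each row twice,
    // so the file roughly doubles in size and blows past the limit.
    let duplicated: Vec<String> = (0..150)
        .flat_map(|i| vec![format!("feature_{i}"), format!("feature_{i}")])
        .collect();
    let lines: Vec<&str> = duplicated.iter().map(|s| s.as_str()).collect();

    // The fatal step: unwrapping the error turns a bad config file into a
    // panic that takes down the whole worker instead of a rejected rollout.
    let file = load_feature_file(&lines).unwrap();
    println!("loaded {} features", file.features.len());
}
```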
Timeline (UTC):
- 11:05: DB access change applied.
- 11:28: outage starts; first 5xx.
- 11:32–13:05: investigation; attempts to stabilize Workers KV.
- 13:05: workarounds deployed for Workers KV and Access (falling back to the previous proxy version); impact reduced.
- 14:24: stopped generating/distributing the bad file; validated the old version.
- 14:30: deployed the correct file globally; primary recovery.
- 17:06: all services restored.
Next steps:
- Treat internal configuration files as user input and validate them accordingly (see the sketch after this list).
- Add global kill-switches for features.
- Prevent core dumps and error reports from exhausting system resources.
- Revisit fault tolerance across all core proxy modules.
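To make the first two items concrete, here is a minimal Rust sketch; the names and the kill-switch flag are assumptions for illustration, not Cloudflare's design. The feature file is validated and deduplicated like untrusted input, a rejected file leaves the last known-good configuration in place, and a global kill-switch plus an explicit "unavailable" verdict lets the proxy fail open instead of panicking:

```rust
// Hypothetical sketch of the mitigations listed above; names, limits, and the
// kill-switch mechanism are assumptions, not Cloudflare's design.
use std::collections::BTreeSet;

const MAX_FEATURES: usize = 200;

struct BotModel {
    features: Vec<String>,
}

enum BotVerdict {
    Scored(u8),  // normal path: the model produced a score
    Unavailable, // degraded path: module disabled or config rejected
}

// Treat the internal feature file like user input: deduplicate, bound its
// size, and return an error instead of panicking on bad data.
fn validate_feature_file(raw: &[String]) -> Result<BotModel, String> {
    let unique: BTreeSet<&String> = raw.iter().collect();
    if unique.len() > MAX_FEATURES {
        return Err(format!(
            "feature file rejected: {} unique features exceeds the limit of {}",
            unique.len(),
            MAX_FEATURES
        ));
    }
    Ok(BotModel {
        features: unique.into_iter().cloned().collect(),
    })
}

// Global kill-switch plus fail-open behavior: if the module is disabled or no
// valid config is loaded, keep serving traffic with an explicit "unavailable"
// verdict instead of crashing the proxy or silently scoring everything as 0.
fn evaluate_request(kill_switch_on: bool, model: Option<&BotModel>) -> BotVerdict {
    if kill_switch_on {
        return BotVerdict::Unavailable;
    }
    match model {
        Some(m) => BotVerdict::Scored((m.features.len() % 100) as u8), // placeholder scoring
        None => BotVerdict::Unavailable,
    }
}

fn main() {
    // Simulate a bad (duplicated) file: validation dedupes it back under the
    // limit, so the rollout is accepted rather than crashing consumers.
    let raw: Vec<String> = (0..150)
        .flat_map(|i| vec![format!("feature_{i}"), format!("feature_{i}")])
        .collect();

    let model = match validate_feature_file(&raw) {
        Ok(m) => Some(m),
        Err(e) => {
            eprintln!("keeping last known-good config: {e}");
            None
        }
    };

    match evaluate_request(false, model.as_ref()) {
        BotVerdict::Scored(score) => println!("bot score: {score}"),
        BotVerdict::Unavailable => println!("bot management unavailable; failing open"),
    }
}
```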
The most important part of this incident is how long the diagnosis took, not the fix. The fix itself took 6 minutes, but it took about 3 hours to realize this was a configuration mistake, not a DDoS. And “process” is exactly what’s missing from the post‑mortem.
Was there a way to confirm or rule out an attack with partners? Were all plausible hypotheses about the cause enumerated? Were they explored by other people in parallel with the main investigation?