# Cloudflare Incident
Matthew Prince (Cloudflare's CEO) published a post-mortem (https://blog.cloudflare.com/cloudflare-incident-november-18-2025/) of yesterday's outage.
## TL;DR
On November 18, 2025, Cloudflare suffered a major outage after a change in ClickHouse permissions caused the Bot Management feature file to grow unexpectedly. The result was widespread 5xx errors and authentication issues. Primary recovery came at 14:30 UTC, with full restoration by 17:06 UTC. It was not an attack.
- Cause: A ClickHouse permissions change caused a query to return duplicate columns, so the feature file for the ML bot model suddenly grew past its 200-feature limit. The new FL2 proxy crashed with 5xx errors, while the old FL proxy assigned a bot score of 0 to every request (a minimal sketch of this failure mode follows this list).
- Not a cyberattack: Cloudflare's status page happened to be unavailable at the same time, which initially suggested an attack.
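A minimal sketch of that failure mode, assuming (beyond what the summary above states) that the generating query began seeing each column once per visible database and that the consuming proxy treated an oversized file as a fatal error; all names and numbers below are illustrative:

```python
FEATURE_LIMIT = 200  # hypothetical hard cap, mirroring the limit mentioned above


def build_feature_list(column_rows):
    # Naive generation step: one feature per returned metadata row, no deduplication.
    # If the query suddenly returns each column once per visible database,
    # the list silently doubles in size.
    return [name for name, _database in column_rows]


def load_features(features):
    # The consuming side treats an oversized file as a fatal error.
    if len(features) > FEATURE_LIMIT:
        raise RuntimeError(f"feature file too large: {len(features)} > {FEATURE_LIMIT}")
    return features


# Before the permissions change: each column is visible through one database.
before = [(f"feat_{i}", "main_db") for i in range(120)]
# After: the same columns also become visible through an underlying database,
# so the metadata query returns every row twice.
after = before + [(name, "underlying_db") for name, _ in before]

print(len(load_features(build_feature_list(before))))  # 120 features -> accepted
try:
    load_features(build_feature_list(after))            # 240 features -> rejected
except RuntimeError as err:
    print(err)  # the real proxy crashed here instead of degrading gracefully
```

The point of the sketch is that the generator and the consumer each trusted the other's invariants, so a benign-looking permissions change propagated all the way into a crash.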
## Timeline (UTC)
- 11:05: DB access change applied.
- 11:28: Outage began; first 5xx errors.
- 11:32–13:05: Investigation; attempts to stabilize Workers KV.
- 13:05: Workarounds for Workers KV and Access implemented (falling back to the previous proxy version); impact reduced.
- 14:24: Generation/distribution of the bad file halted; old version verified.
- 14:30: Correct file deployed globally; major recovery.
- 17:06: All services restored.
## Next Steps
- Strengthen validation of internally generated configuration files, treating them like user input (see the sketch after this list).
- Add global kill-switches for features.
- Protect resources from being overloaded by error dumps/reports.
- Review fault tolerance across all core proxy modules.
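As a rough illustration of the first two items, here is a defensive parser with a global kill-switch and a last-known-good fallback. The JSON format, function names, and the 200-feature cap are assumptions for the sketch, not Cloudflare's actual implementation:

```python
import json

FEATURE_LIMIT = 200  # assumed cap, as in the sketch above


def validate_feature_file(raw: str) -> list[str]:
    # Treat the internally generated file like untrusted user input:
    # parse defensively and enforce every invariant before accepting it.
    features = json.loads(raw)
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        raise ValueError("malformed feature file")
    if len(features) != len(set(features)):
        raise ValueError("duplicate features")
    if len(features) > FEATURE_LIMIT:
        raise ValueError(f"too many features: {len(features)} > {FEATURE_LIMIT}")
    return features


def apply_feature_file(raw: str, last_known_good: list[str], kill_switch: bool = False) -> list[str]:
    # Global kill-switch plus graceful degradation: if the feature is switched
    # off or the new file fails validation, keep serving with the previous version.
    if kill_switch:
        return last_known_good
    try:
        return validate_feature_file(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return last_known_good
```

Rejecting the oversized file at generation or distribution time would keep a bad version from ever reaching the proxies, and a kill-switch gives operators a fast way to disable a misbehaving feature without a global rollback.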
The most significant part of this incident is how long diagnosis took. The fix itself took 6 minutes, but it took roughly 3 hours to realize the cause was a configuration error rather than a DDoS. Improving that part of the process is notably absent from the post-mortem.
Was there an opportunity to confirm or rule out an attack with partners? Were all plausible hypotheses about the cause generated, and were they explored by other people in parallel with the main investigation?