A short summary of Richard Cook’s essay on how complex systems fail:
# How Complex Systems Fail
## 1. All complex systems are hazardous
Transportation, medicine, energy, and other critical systems inevitably carry risk. We can sometimes reduce how often the hazards are encountered, but we can’t eliminate them entirely. That’s why defenses are built.
## 2. Such systems have multiple layers of defense
To prevent accidents, we build technical, human, and organizational safeguards: redundant equipment, training, procedures, rules, etc.
## 3. Disasters require multiple failures at once
A single failure rarely causes an accident. A disaster is usually a chain of small failures that individually seem insignificant. Most of these chains get stopped in time.
## 4. The system always contains latent flaws
Complex systems can’t be perfect. Small issues always exist, and they change with new technologies and new ways of working.
## 5. Systems run in a “broken” state
Systems often keep working despite flawed components, thanks to built-in slack and the effort of the people running them. Before an accident there have usually been similar incidents that almost ended badly.
## 6. The accident is always nearby
Complex systems can fail at any moment. You can’t fully prevent this.
## 7. There is no single “root cause”
A catastrophe is the result of many factors coming together. Singling out one “root cause” is misleading; it says more about the urge to find someone to blame than about how the failure actually happened.
## 8. Knowing the outcome biases analysis
After an accident the warning signs seem obvious. That is hindsight bias: before the outcome was known, the same signals looked very different.
## 9. Operators both produce and protect the system
People in the system simultaneously get work done and prevent failures. It’s a constant balancing act.
## 10. All actions are bets
Operators act under uncertainty. Successes and failures are outcomes of such bets. After an accident it’s easy to forget that.
## 11. Ambiguity is resolved on the ground
Management often passes down conflicting demands: produce more, but with minimal risk. People on the ground must decide how to act in each case. After an accident it’s easy to judge their decisions.
## 12. People are the main adaptive element
Operators continuously adapt: they redistribute resources, prepare fallback paths, notice changes, and respond to them.
## 13. Expertise in the system is constantly changing
Workers’ knowledge and skills change as technology changes and as experienced people leave and newcomers arrive. The system must keep developing expertise and apply it where it’s needed most.
## 14. New technologies can bring new catastrophes
Technology can remove small problems but create conditions for rare, large failures. These risks are often invisible at first.
## 15. “Cause” thinking gets in the way of improvement
After an accident people try to eliminate “human errors”. But such measures rarely prevent new incidents and often make the system more complex, adding new weaknesses.
## 16. Safety is a property of the whole system
Safety is not a separate component, device, or person. It emerges from the interaction of all elements and constantly changes.
## 17. People create safety while doing the work
Every day operators keep the system from failing through their actions. It’s often invisible, but it’s what makes the system reliable.
## 18. To avoid failure, you need experience with it
The better operators sense the boundary between normal operation and the danger zone, the better they manage risk. Hands-on experience with failure is what builds that sense.