Rule 01
Control Is Always Partial
No matter how good you are, reliability depends on upstream providers, hardware supply chains, networks you don’t own, client behavior, budget constraints, and change velocity outside your control. An SLA converts partial control into full liability.
Rule 02
Ownership Is Singular
Every system has one owner when it breaks. Committees do not debug outages. Clear ownership reduces hesitation, shortens incidents, and makes trade-offs explicit. Single ownership trades throughput for accountability. Under failure, accountability wins.
Rule 03
Failure Has Borders
Outages stop somewhere on purpose. Failure is inevitable; propagation is optional. Well-designed systems define boundaries early so faults degrade locally instead of cascading. Unrelated services stay boring, operators keep access, and recovery remains possible.
Rule 04
Scale Does Not Eliminate Failure
Even large providers experience outages. Size reduces some risks, but visibility increases pressure. What matters is how failure is communicated, contained, and resolved.
Rule 05
Monitoring Tells the Truth
If dashboards lie, decisions will too. Monitoring exists to reduce arguments, not to provide comfort. Partial failure must be visible, even when it is inconvenient.
Rule 06
Recovery Is Designed
Hope is not a recovery strategy. Backups, restores, and procedures are part of the system design, not an afterthought added once things hurt.
Rule 07
Systems Drift by Default
Stability requires continuous correction. Access accumulates. Assumptions age. Documentation decays. Drift is normal; ignoring it is not.
Rule 08
Documentation Reflects Reality
If it is not written, it is folklore. Documentation must describe what actually exists, not what was intended, imagined, or promised.
Rule 09
Access Is a Liability
Credentials age badly. Access should be deliberate, reviewed, and revocable. Forgotten keys are a common cause of quiet disasters.
Rule 10
Traffic Is Separated
Control paths are boring on purpose. Management traffic does not compete with production traffic. When things fail, operators still need a way in.
Rule 11
Silence Is a Signal
Missing data still means something. Silence can indicate stability, or broken visibility. Systems must distinguish between the two.
Rule 12
Restraint Compounds
Right-sizing is the default. Smaller systems use less of everything: less power, less attention, less surface area for things to go wrong. Spare capacity is intentional, not insurance against unclear thinking. Lower energy use, fewer alerts, fewer migrations: all follow from the same discipline.