Calm Incident Response for Linux & Unix Systems

Description

This is not a fire drill.

§ 1

Incident handling is the practice of responding to unexpected behavior in Linux and Unix-like production environments in a way that stabilizes the system and preserves future options.

It is not about reacting as fast as possible. It is about maintaining control under incomplete, unreliable, and sometimes contradictory information.

Systems rarely fail in isolation. Incidents involve interacting failures: resource exhaustion, cascading timeouts, dependency issues, and incorrect assumptions compounding under pressure.

The three steps

§ 2

01

Stabilize First

The first priority is to stop the situation from getting worse. This may involve isolating components, disabling automation, reducing load, or reverting recent changes deliberately. Partial degradation is often preferable to uncontrolled failure.

02

Understand the Failure

Once stable, attention shifts to understanding what failed and why. Reconstruct events from logs, metrics, and recent changes, validating assumptions against observed system behavior.

03

Make It Boring Again

Corrective actions are chosen to restore a predictable state without introducing new risk. Permanent fixes, refactors, or improvements are often deferred until the system is calm and decisions can be made deliberately.

Human judgment matters

§ 3

Incident handling cannot be fully automated. Deciding whether to roll back, fail over, restore from backup, rebuild, or wait requires context and experience. Many severe outages are made worse by well-intentioned actions taken too early.

Communication during incidents

§ 4

During an incident, communication is factual, measured, and focused on what is known and unknown.

After resolution, incidents are documented: what happened, what was done, what assumptions failed, and what should change. This documentation feeds back into monitoring, backups, disaster recovery planning, and system design.

Scope and limits

§ 5

Incident handling is not a promise of availability. Response depends on prior access, system familiarity, and existing operational commitments.

For a small number of long-term retainer clients, a dedicated contact channel may be explicitly agreed for rare, high-severity incidents. This channel exists only to support short-term stabilization. It does not imply continuous availability, guaranteed response times, or priority handling for non-critical work.

All other communication continues through standard channels. These boundaries are written down in advance. No obligation exists to respond without prior agreement. Misuse of this channel may result in its withdrawal.

Incidents start earlier

§ 6

Effective incident handling relies on work done long before the incident: monitoring designed to surface real signals and trends, backups that can actually be restored, and conservative system design.
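One way to check that a backup "can actually be restored" is to restore it somewhere harmless and compare the result against known checksums. The following Python sketch illustrates the idea under stated assumptions: the backup is a tar archive from a trusted source, and a manifest of expected SHA-256 digests was recorded when the backup was taken. The names `verify_restore` and `manifest` are hypothetical, chosen for illustration only.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(archive_path, manifest):
    """Restore a tar archive into a scratch directory and check its contents.

    `manifest` maps relative paths to expected SHA-256 digests (an assumed
    format). Returns the paths that are missing or whose content differs.
    """
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        # Only extract archives you trust: extraction writes whatever
        # paths the archive contains.
        with tarfile.open(archive_path) as archive:
            archive.extractall(scratch)
        for name, expected in manifest.items():
            restored = Path(scratch) / name
            if not restored.is_file() or sha256_of(restored) != expected:
                failures.append(name)
    return failures
```

An empty result means every listed file was restored with the expected content. It says nothing about files missing from the manifest, or about whether the restored data is usable by the application that depends on it.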

Much of the effort an incident demands is determined by decisions made months earlier. That is where the real work happens.

Emergency support

§ 7

For organizations not currently under Operational Stewardship, emergency support provides short-term stabilization during active outages or severe system instability.


Tools

§ 8

Incident handling involves working directly on production systems under uncertain conditions, where understanding current state matters more than tooling choice.

Resource exhaustion is analyzed using standard system utilities, allowing CPU, memory, and I/O pressure to be evaluated in real time. Disk-related failures, including full filesystems and unexpected growth, are investigated through direct inspection of filesystem usage and process activity.
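As a rough illustration of the kind of read-only snapshot these utilities provide, the following Python sketch collects load, memory, and disk pressure in one pass. It assumes a Linux host with /proc/meminfo and a kernel that reports MemAvailable. The function names and the choice of fields are illustrative, not a prescribed procedure.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_memory_fraction(meminfo):
    """Fraction of RAM the kernel estimates is available without swapping."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def disk_used_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def snapshot(path="/"):
    """Collect one read-only snapshot of load, memory, and disk pressure."""
    with open("/proc/meminfo") as handle:  # Linux-specific
        meminfo = parse_meminfo(handle.read())
    load_1m, _load_5m, _load_15m = os.getloadavg()
    return {
        "load_per_cpu_1m": load_1m / (os.cpu_count() or 1),
        "mem_available": available_memory_fraction(meminfo),
        "disk_used": disk_used_fraction(path),
    }


if __name__ == "__main__":
    for name, value in snapshot().items():
        print(f"{name}: {value:.2f}")
```

A single snapshot is only a starting point. The numbers need to be read against what is normal for the system in question, and the disk figure may differ slightly from the percentage df prints, since df accounts for reserved blocks differently.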

Network issues and service disruptions are examined at the socket and packet level, rather than inferred indirectly through external systems. Logs are read in their original form to reconstruct timelines, validate assumptions, and understand system behavior as it actually occurred.
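Reconstructing a timeline usually means interleaving lines from several sources by time. The Python sketch below shows one way to do that, assuming each line starts with an ISO 8601 timestamp that includes a UTC offset, such as the output of `journalctl -o short-iso`. The source names and the tuple layout are hypothetical.

```python
from datetime import datetime

# Matches timestamps such as 2024-05-01T12:00:05+0000 or ...+00:00.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_line(source, line):
    """Split 'TIMESTAMP rest-of-line' into (datetime, source, rest)."""
    stamp, _, rest = line.strip().partition(" ")
    return datetime.strptime(stamp, TIMESTAMP_FORMAT), source, rest


def build_timeline(sources):
    """Merge {source_name: [lines]} into one chronologically sorted list.

    Lines without a leading timestamp (continuation lines, journalctl
    headers) are skipped here; in a real investigation they may still
    be worth reading in the original log.
    """
    events = []
    for source, lines in sources.items():
        for line in lines:
            try:
                events.append(parse_line(source, line))
            except ValueError:
                continue
    # Offset-aware datetimes compare correctly across time zones.
    return sorted(events, key=lambda event: event[0])
```

Because the timestamps carry their offsets, a deploy logged in local time and a service error logged in UTC end up in the right order. Clock skew between hosts is not corrected, so events that appear close together should be treated with caution.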

Tools include top, vmstat, lsof, tcpdump, and journalctl, but the emphasis is on interpretation rather than tooling. Actions are taken deliberately, based on evidence rather than assumption.

In practice, this means

§ 9

When something breaks, the first action is containment, not an immediate fix.
During an incident, communication is factual: what is known, what is not, what is being done.
Actions are deliberate. Well-intentioned changes that worsen the situation are avoided.
After resolution, a short written record captures what happened, what assumptions failed, and what should change.

FAQ

§ 10

What is the first thing you do during an incident?

Stabilize before diagnosing. The first priority is to stop the situation from getting worse and restore a safe baseline, even before the root cause is fully understood. Investigation follows once the system is no longer degrading.

How do you communicate during an outage?

Clearly and without speculation. Updates describe what is known, what is being done, and what remains uncertain. Silence and guesswork both erode trust during an incident, so both are avoided.

When does an incident actually end?

Not when the symptom disappears. An incident ends when the system is stable, the cause is understood well enough to prevent immediate recurrence, and a short written record captures what happened and what should change.
