
Calm Incident Handling for Linux & Unix Systems

Incidents happen.
The work is deciding what to do next, and what not to make worse.

Last reviewed: March 2026

This Is Not a Fire Drill

Incident handling is the practice of responding to unexpected behavior in Linux and Unix-like production environments in a way that stabilizes the system and preserves future options.

It is not about reacting as fast as possible. It is about maintaining control under incomplete, unreliable, and sometimes contradictory information.

Failures rarely occur in isolation.
Incidents usually involve several failures interacting: resource exhaustion, cascading timeouts, dependency issues, and incorrect assumptions compounding under pressure.

Stabilize First

The first priority is to stop the situation from getting worse.

This may involve isolating components, disabling automation, reducing load, or reverting recent changes deliberately. Partial degradation is often preferable to uncontrolled failure.

Understand the Failure

Once the system is stable, attention shifts to understanding what failed and why.

This involves reconstructing events from logs, metrics, and recent changes, while validating assumptions against observed system behavior.

Make It Boring Again

Corrective actions are chosen to restore a predictable operating state without introducing new risk.

Permanent fixes, refactors, or improvements are often deferred until the system is calm and decisions can be made deliberately.

Human Judgment Matters

Incident handling cannot be fully automated.

Deciding whether to roll back, fail over, restore from backup, rebuild, or wait requires context and experience. Many severe outages are made worse by well-intentioned actions taken too early.

Communication During Incidents

During an incident, communication is factual, measured, and focused on what is known and unknown.

After resolution, incidents are documented: what happened, what was done, what assumptions failed, and what should change. This documentation feeds back into monitoring, backups, disaster recovery planning, and system design.

Scope and Limits

Incident handling is not a promise of availability.

Response depends on prior access, system familiarity, and existing operational commitments.

For a small number of long-term retainer clients, a dedicated contact channel may be explicitly agreed for rare, high-severity incidents.

This channel exists to support short-term stabilization during exceptional incidents. It does not imply continuous availability, guaranteed response times, or priority handling for non-critical work.

All other communication continues through standard channels.

These boundaries are written down in advance. No obligation exists to respond without prior agreement.

Misuse of this channel may result in its withdrawal.

Incidents Start Earlier

Effective incident handling relies on work done long before the incident: monitoring designed to surface real signals and trends, backups that can actually be restored, and conservative system design.

Decisions made months earlier reduce most of the effort an incident would otherwise demand. That is where the real work happens.

Emergency Support

For organizations not currently under Operational Stewardship, emergency support provides short-term stabilization during active outages or severe system instability.

Learn More →

Systems and Tools Used

Incident handling involves working directly on production systems under uncertain conditions, where understanding current state matters more than tooling choice.

Resource exhaustion is analyzed using standard system utilities, allowing CPU, memory, and I/O pressure to be evaluated in real time.
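
As a rough illustration of reading memory pressure from the system itself, the sketch below parses text in the format of Linux's /proc/meminfo and reports how much memory the kernel considers available. The function names are hypothetical, and it assumes a Linux kernel recent enough to expose the MemAvailable field; other Unix-like systems expose this information differently.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts with no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel estimates is available.

    Assumes the MemAvailable field is present (an assumption about
    the kernel version, not something this code can guarantee).
    """
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file. Linux only."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

On a Linux host, available_fraction(read_meminfo()) gives a single number to compare against what top or vmstat show. A low value is a prompt to look closer, not a diagnosis.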

Disk-related failures, including full filesystems and unexpected growth, are investigated through direct inspection of filesystem usage and process activity.
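
A minimal sketch of that kind of direct inspection, using only the Python standard library: it reports how full a filesystem is and lists the largest regular files under a directory. The function names are hypothetical, and the walk is deliberately simple. It uses apparent file size, and it does not stop at mount points.

```python
import heapq
import os
import shutil
import stat


def usage_fraction(path):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def largest_files(root, count=10):
    """Return up to `count` (size_bytes, path) pairs, largest first.

    Sizes are apparent sizes (st_size), so sparse files may look
    larger than the space they actually occupy.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                sizes.append((info.st_size, path))
    return heapq.nlargest(count, sizes)
```

A directory walk cannot see files that have been deleted while a process still holds them open. When reported usage and visible files disagree, that is usually where a tool such as lsof becomes necessary.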

Network issues and service disruptions are examined at the socket and packet level, rather than inferred indirectly through external systems.
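
To illustrate what looking at the socket level can mean in practice, the sketch below counts TCP connections by state from text in the format of Linux's /proc/net/tcp. The function name is hypothetical and the example is Linux-specific; it covers IPv4 only, since IPv6 sockets live in a separate file.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(text):
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        # Fields: slot, local address, remote address, state, ...
        state = TCP_STATES.get(fields[3].upper(), "UNKNOWN")
        counts[state] += 1
    return counts
```

The counts are only a starting point. A steadily growing CLOSE_WAIT total, for example, often suggests an application that is not closing its sockets, but that reading still has to be checked against the process itself.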

Logs are read in their original form to reconstruct timelines, validate assumptions, and understand system behavior as it actually occurred.
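
Timeline reconstruction can be sketched as a merge of several logs into one ordered sequence. The example below assumes, purely for illustration, that every line starts with an ISO 8601 timestamp followed by a space, that each log is already in time order, and that all timestamps use the same timezone convention. Real logs rarely agree this neatly, and the function names are hypothetical.

```python
import heapq
from datetime import datetime


def parse_line(source, line):
    """Split 'TIMESTAMP message' into (datetime, source, message)."""
    stamp, _, message = line.strip().partition(" ")
    return datetime.fromisoformat(stamp), source, message


def _entries(source, lines):
    """Yield parsed entries for one log, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(source, line)


def merge_timeline(streams):
    """Merge time-ordered logs into a single ordered timeline.

    `streams` maps a source name to an iterable of lines. Each
    iterable must already be sorted by timestamp; heapq.merge
    relies on that and does not re-sort.
    """
    iterators = [_entries(source, lines) for source, lines in streams.items()]
    return list(heapq.merge(*iterators))
```

The merged timeline shows ordering, not causation. It is most useful for testing an assumption such as "the database failed first" against what the systems actually recorded.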

These situations may involve tools such as top, vmstat, lsof, tcpdump, and journalctl, but the emphasis is on interpretation rather than tooling.

Tools are used to observe and verify system state. Actions are taken deliberately, based on evidence rather than assumption.

Related Services

Incident handling is most effective when it is part of long-term operational responsibility, not an isolated or emergency-only service.

Operational Stewardship →