Incident handling involves working directly on production systems under uncertain
conditions, where understanding current state matters more than tooling choice.
Resource exhaustion is analyzed with standard system utilities that show CPU, memory,
and I/O pressure in real time. Disk-related failures, including full
filesystems and unexpected growth, are investigated through direct inspection of
filesystem usage and process activity.
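
As an illustration only, the kind of first-look snapshot described above can be sketched in Python. This is not the workflow's own tooling, which relies on the utilities named in this text; the function names (parse_meminfo, disk_used_fraction, snapshot) are hypothetical, and reading /proc/meminfo assumes a Linux host. Note that the used/total ratio here can differ slightly from the percentage df reports, since df accounts for reserved blocks.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def disk_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def snapshot(path="/"):
    """Collect a coarse CPU / memory / disk picture (hypothetical helper)."""
    snap = {"disk_used_fraction": disk_used_fraction(path)}
    try:
        snap["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        snap["load_1m"] = None  # load average not available on this platform
    try:
        with open("/proc/meminfo") as fh:  # Linux-only assumption
            snap["mem_available_kb"] = parse_meminfo(fh.read()).get("MemAvailable")
    except OSError:
        snap["mem_available_kb"] = None
    return snap


if __name__ == "__main__":
    print(snapshot("/"))
```

A snapshot like this only says where to look next; the numbers still have to be interpreted against what the system normally looks like.
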
Network issues and service disruptions are examined at the socket and packet level,
rather than inferred indirectly through external systems. Logs are read in their
original form to reconstruct timelines, validate assumptions, and understand system
behavior as it actually occurred.
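
Timeline reconstruction from raw logs can be sketched the same way. The example below is a minimal sketch, assuming each line begins with an ISO 8601 timestamp with a numeric UTC offset, as produced by journalctl -o short-iso; the names parse_line and build_timeline are hypothetical. Lines without a leading timestamp, such as stack-trace continuations, are skipped here, which a real investigation may not want.

```python
from datetime import datetime

# Timestamp layout assumed for every log line, e.g. 2024-05-01T10:00:05+0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_line(line):
    """Return (timestamp, message), or None if the line has no leading timestamp."""
    stamp, _, message = line.partition(" ")
    try:
        return datetime.strptime(stamp, TS_FORMAT), message.rstrip("\n")
    except ValueError:
        return None


def build_timeline(*sources):
    """Merge (name, lines) pairs into one list of events ordered by time.

    Timestamps carry their UTC offset, so logs written in different
    time zones are compared on the same clock.
    """
    events = []
    for name, lines in sources:
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None:
                events.append((parsed[0], name, parsed[1]))
    events.sort(key=lambda event: event[0])
    return events


if __name__ == "__main__":
    app = ["2024-05-01T10:00:05+0000 app: request timeout"]
    kernel = ["2024-05-01T12:00:01+0200 kernel: Out of memory"]
    for when, source, message in build_timeline(("app", app), ("kernel", kernel)):
        print(when.isoformat(), source, message)
```

Ordering events from several sources on one clock makes it easier to check whether an assumed cause actually preceded the symptom.
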
Tools include top, vmstat, lsof, tcpdump, and journalctl, but the emphasis is on
interpreting their output rather than on the tools themselves. Actions are taken
deliberately, based on evidence rather than assumption.