Glossary
Definitions of terms and concepts used in ongoing infrastructure management.
Designed to clarify expectations, roles, and operational practices.
Purpose
This glossary provides a shared vocabulary for discussions about systems, operations, monitoring, backups, disaster recovery, and infrastructure management. Clear definitions help reduce misunderstandings and set expectations.
Infrastructure & Monitoring
Critical Systems
Components or services whose failure has a significant operational or business impact. Prioritized during recovery and incident response.
Health Check
A diagnostic check of a host, service, or system to verify availability, performance, and operational assumptions.
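As an illustration only, a very small availability check can be sketched as a timed TCP connection attempt. The names `HealthResult` and `check_tcp` are hypothetical and not part of any particular monitoring tool. A real health check would usually also verify performance and operational assumptions, not just reachability.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthResult:
    reachable: bool
    latency_ms: Optional[float]  # None when the connection failed
    detail: str


def check_tcp(host: str, port: int, timeout: float = 2.0) -> HealthResult:
    """Try to open a TCP connection and report whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return HealthResult(True, elapsed_ms, "connected")
    except OSError as exc:
        return HealthResult(False, None, f"{type(exc).__name__}: {exc}")
```

The result keeps the failure detail instead of reducing it to a boolean, so that a person reviewing the check has context to work with.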
Monitoring
The process of observing infrastructure, services, and selected application signals to detect anomalies early and preserve context. Designed for prevention, clarity, and informed response, not constant alerting.
Unix
A family of operating systems sharing common design principles. In modern infrastructure, Linux is the most common member, while other Unix systems typically appear in legacy, networking, or specialized environments.
Centralized Logging
Aggregation of logs from multiple systems into a single location to provide durable evidence of events, support analysis, and reconstruct system behavior during incidents.
Operational Metrics
Key measurements and signals collected to understand system health and trends, supporting decision-making and incident response rather than guaranteeing uptime.
Threshold Drift
The gradual misalignment of alert thresholds or metrics due to changing system behavior or workload, which can reduce the relevance of automated monitoring without regular review.
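One possible way to notice drift, sketched here under stated assumptions, is to compare the 95th percentile of a metric when a threshold was chosen against the 95th percentile of recent samples. The function names and the 20% tolerance are hypothetical choices for illustration, not a recommended standard.

```python
from statistics import quantiles


def p95(samples):
    """95th percentile: the last of the 19 cut points that split the data into 20 groups."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return quantiles(samples, n=20)[-1]


def workload_shift(baseline_samples, recent_samples):
    """Relative change of the recent p95 compared with the baseline p95."""
    base = p95(baseline_samples)
    if base <= 0:
        raise ValueError("baseline p95 must be positive to compute a relative shift")
    return (p95(recent_samples) - base) / base


def threshold_needs_review(baseline_samples, recent_samples, tolerance=0.2):
    """Flag the threshold for human review when the workload moved more than `tolerance`."""
    return abs(workload_shift(baseline_samples, recent_samples)) > tolerance
```

The sketch only flags a threshold for review. Whether to change it remains a human decision.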
False Urgency
Situations where automated alerts, dashboards, or monitoring systems create unnecessary alarm, prompting unproductive or harmful responses. Treated as a failure mode to minimize operational noise.
Backup & Recovery
Backup
A copy of data or system state intended to support recovery if the original becomes unavailable or corrupted. Backups are only useful if they can be restored under realistic failure conditions.
Restore Testing
The process of validating backups or recovery procedures by performing controlled restores to confirm that recovery works under realistic conditions.
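A minimal sketch of a controlled restore, assuming the backup is a gzip-compressed tar archive of a single directory that you created yourself: restore it into a scratch directory and compare file checksums with the source. The helper names are hypothetical, and a realistic test would also restore onto separate hardware or into a separate environment.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def create_archive(source_dir: Path, archive: Path) -> None:
    """Write the contents of source_dir into a .tar.gz with paths relative to source_dir."""
    with tarfile.open(archive, "w:gz") as tar:
        for child in sorted(source_dir.iterdir()):
            tar.add(child, arcname=child.name)


def file_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_matches_source(archive: Path, source_dir: Path) -> bool:
    """Restore the archive into a temporary directory and compare it with source_dir."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            # Assumption: the archive is trusted because it was created locally.
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

Comparing against the live source only makes sense right after the backup is taken. For older backups, digests recorded at backup time would be the reference instead.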
Recovery Strategy
A documented approach to restoring systems after a failure, weighing trade-offs between recovery time, data loss, complexity, and operational risk. Strategies include restoring from backups, activating standby systems, and partial service recovery.
Retention Policy
Rules defining how long backups are kept, balancing recovery needs, storage costs, and long-term operational responsibility.
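As a hedged example, a retention rule can be written as a function that decides which backup dates to keep. The rule below (keep every backup from the last 7 days, plus the newest backup of each ISO week within the last 4 weeks) and the name `select_backups` are assumptions chosen for illustration, not a policy taken from this glossary.

```python
from datetime import date, timedelta


def select_backups(backup_dates, today, daily_days=7, weekly_weeks=4):
    """Return the sorted backup dates to keep under a simple daily-plus-weekly rule."""
    daily_cutoff = today - timedelta(days=daily_days)
    weekly_cutoff = today - timedelta(weeks=weekly_weeks)
    keep = set()
    newest_per_week = {}
    for d in backup_dates:
        if d > today:
            continue  # ignore dates in the future (clock or naming errors)
        if d > daily_cutoff:
            keep.add(d)  # recent backups are all kept
        elif d > weekly_cutoff:
            week = tuple(d.isocalendar()[:2])  # (ISO year, ISO week number)
            if week not in newest_per_week or d > newest_per_week[week]:
                newest_per_week[week] = d
    keep.update(newest_per_week.values())
    return sorted(keep)
```

Everything not returned would be a candidate for deletion. Keeping the selection separate from the deletion step makes the rule easy to review before anything is removed.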
Standby Infrastructure
Secondary or alternate systems maintained to improve recovery outcomes. Operated separately from production and activated deliberately when needed.
Disaster Recovery Plan (DR Plan)
A documented set of procedures, decision paths, and recovery options used when systems fail beyond normal incident handling. It prioritizes clarity, realism, and operational feasibility over exhaustive completeness.
Disaster
An event that exceeds the assumptions of normal operations and routine incident handling. Examples include total loss of a primary environment, severe data corruption, or human error with irreversible impact.
Failover
The deliberate switch to a standby system or backup resource to maintain continuity during a failure or outage. Failover is a recovery method, not a guarantee of service continuity.
Recovery Path
A predefined sequence of steps, decisions, and options used to bring a system back to operational state after a failure, consistent with priorities and available resources.
Private Mesh Network
A dedicated, isolated network used to transport monitoring and backup traffic across multiple regions or providers. It is intended to provide redundancy and operational control and to reduce single points of failure.
Time-Series Metrics
Historical measurements of system performance (CPU, memory, disk, latency, etc.) over time, used to detect trends, gradual degradation, or unusual behavior.
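To illustrate how a trend might be read from such data, the sketch below fits a least-squares line through (timestamp, value) pairs and returns its slope. The function name `trend_slope` and the use of seconds as the time unit are assumptions. Real metrics often need smoothing or seasonality handling that this sketch omits.

```python
def trend_slope(points):
    """Least-squares slope of (timestamp_seconds, value) pairs, in value units per second."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return cov / var_t
```

A slope that stays positive across review periods, for example disk usage growing a little each hour, is the kind of gradual degradation that a single point-in-time check would miss.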
Responsibility & Security
Operational Responsibility
The accountability for maintaining, monitoring, backing up, and recovering infrastructure over the long term, including clear decision-making, documentation, and controlled risk management.
Operational responsibility within an active engagement includes adherence to NDAs and confidentiality agreements, ensuring client information is protected throughout the engagement.
Operational Assumptions
Explicit statements about system state, access, dependencies, data integrity, and external services assumed to be true during normal operations and recovery.
When assumptions no longer hold, recovery paths may change or become unavailable.
Incident Handling
The deliberate process of diagnosing, mitigating, and resolving operational issues, favoring reversible actions. Focuses on calm, human-driven decisions rather than reactive heroics.
Change Management
A deliberate, documented process for deploying updates, configuration changes, and migrations that prioritizes reversibility, risk reduction, and predictability.
Ransomware Mitigation
Measures, including isolated backups and recovery strategies, designed to reduce the operational impact of ransomware incidents, recognizing that no system can fully guarantee protection.
Confidential Information
Any client data, configuration details, operational procedures, access credentials, or system documentation that is not publicly available. Sharing or using confidential information outside the agreed engagement or NDA terms is prohibited.
NDA (Non-Disclosure Agreement)
A contractual agreement or formal understanding that restricts disclosure of sensitive information related to systems, operations, and client data. NDAs ensure that operational details, credentials, architecture, and incident information remain confidential and are only shared with authorized personnel. Compliance with NDAs is part of responsible operational practice.
For a lighter take on operational flow, chaos, and recovery, see: Infrastructure Feng Shui - Weird Infrastructure Incidents.
Related Resources
Understanding these terms supports better collaboration, incident response, and operational clarity.
Operational Stewardship →