Linux Infrastructure Glossary
Definitions of terms and concepts used in ongoing infrastructure management. Designed to clarify expectations, roles, and operational practices.
Purpose
This glossary provides a shared vocabulary for discussions about systems, operations, monitoring, backups, disaster recovery, and infrastructure management. Clear definitions help reduce misunderstandings and set expectations.
Infrastructure & Monitoring
- Critical Systems
- Components or services whose failure has a significant operational or business impact. Prioritized during recovery and incident response.
- Health Check
- A diagnostic probe of a host, service, or system that verifies availability, performance, and operational assumptions.
- Monitoring
- The process of observing infrastructure, services, and selected application signals to detect anomalies early and preserve context. Designed for prevention, clarity, and informed response, not constant alerting.
- Centralized Logging
- Aggregation of logs from multiple systems into a single location to provide durable evidence of events, support analysis, and reconstruct system behavior during incidents.
- Operational Metrics
- Key measurements and signals collected to understand system health and trends, supporting decision-making and incident response rather than guaranteeing uptime.
- Time-Series Metrics
- Historical measurements of system performance (CPU, memory, disk, latency, etc.) over time, used to detect trends, gradual degradation, or unusual behavior.
- Threshold Drift
- The gradual misalignment between alert thresholds and actual system behavior as workloads change. Without regular review, this drift reduces the relevance of automated monitoring.
- False Urgency
- Situations where automated alerts, dashboards, or monitoring systems create unnecessary alarm, prompting unproductive or harmful responses. Treated as a failure mode in its own right, so that operational noise is kept to a minimum.
- Unix
- A family of operating systems sharing common design principles. In modern infrastructure, Linux (strictly a Unix-like system) is the most common member, while other Unix systems typically appear in legacy, networking, or specialized environments.
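The Threshold Drift entry above can be illustrated with a small review helper. This is a minimal sketch under stated assumptions, not a description of any particular monitoring tool: the function name `review_threshold`, the "mean plus `k` standard deviations" baseline, and the `tolerance` parameter are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def review_threshold(samples, threshold, k=3.0, tolerance=0.25):
    """Compare a static alert threshold with one derived from recent samples.

    samples:   recent metric values, e.g. CPU percent (hypothetical input)
    threshold: the static alert threshold currently configured
    k:         standard deviations above the mean treated as unusual (assumption)
    tolerance: relative gap beyond which the threshold counts as drifted (assumption)
    """
    if not samples:
        raise ValueError("samples must not be empty")

    # Baseline-derived threshold: recent mean plus k population standard deviations.
    suggested = mean(samples) + k * pstdev(samples)

    # The static threshold has drifted if it sits far from the suggested value.
    drifted = abs(threshold - suggested) > tolerance * abs(threshold)
    return suggested, drifted
```

With recent CPU samples of `[40, 42, 38, 40]`, a static threshold of 90 would be flagged as drifted, while a threshold of 45 would not. The result is meant to prompt a human review of the threshold, not to adjust alerts automatically.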
Backup & Recovery
- Backup
- A copy of data or system state intended to support recovery if the original becomes unavailable or corrupted. Backups are only useful if they can be restored under realistic failure conditions.
- Restore Testing
- The process of validating backups or recovery procedures by performing controlled restores, ensuring that recovery works under realistic conditions.
- Recovery Strategy
- A documented approach to restoring systems after a failure, considering trade-offs between recovery time, data loss, complexity, and operational risk. Strategies include restoring from backups, activating standby systems, and recovering services partially.
- Retention Policy
- Rules defining how long backups are kept, balancing recovery needs, storage costs, and long-term operational responsibility.
- Standby Infrastructure
- Secondary or alternate systems maintained to improve recovery outcomes. Operated separately from production and activated deliberately when needed.
- Disaster Recovery Plan
- A documented set of procedures, decision paths, and recovery options used when systems fail beyond normal incident handling. Prioritizes clarity, realism, and operational feasibility over exhaustive completeness.
- Disaster
- An event that exceeds the assumptions of normal operations and routine incident handling. Examples: total loss of a primary environment, severe data corruption, or human error with irreversible impact.
- Failover
- The deliberate switch to a standby system or backup resource to maintain continuity during a failure or outage. A recovery method, not a guarantee.
- Recovery Path
- A predefined sequence of steps, decisions, and options used to bring a system back to an operational state after a failure, consistent with priorities and available resources.
- Private Mesh Network
- A dedicated, isolated network used to transport monitoring and backup traffic across multiple regions or providers, providing redundancy and operational control while reducing single points of failure.
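The Retention Policy entry above can be made concrete with a simple tiered rule. The sketch below is one possible policy, assumed purely for illustration: keep every backup from the last `daily_days` days, plus the oldest backup of each calendar month within roughly `monthly_months` months. The function name `backups_to_keep` and both parameters are hypothetical, and a month is approximated as 31 days.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=7, monthly_months=6):
    """Return the set of backup dates a simple tiered policy would retain."""
    daily_cutoff = today - timedelta(days=daily_days)
    # Approximation: treat a month as 31 days for the monthly window.
    monthly_cutoff = today - timedelta(days=31 * monthly_months)

    keep = set()
    oldest_per_month = {}
    for d in sorted(backup_dates):
        if d > daily_cutoff:
            # Inside the daily window: keep every backup.
            keep.add(d)
        elif d >= monthly_cutoff:
            # Inside the monthly window: keep only the oldest backup per month.
            oldest_per_month.setdefault((d.year, d.month), d)
        # Anything older than the monthly window is not retained.

    keep.update(oldest_per_month.values())
    return keep
```

Anything not returned is a candidate for deletion. Given the Backup entry above, a real policy would also confirm that the retained backups can actually be restored before older ones are removed.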
Responsibility & Security
- Operational Responsibility
- The accountability for maintaining, monitoring, backing up, and recovering infrastructure over the long term, including clear decision-making, documentation, and controlled risk management. Within an active engagement, this includes adherence to NDAs and confidentiality obligations.
- Operational Assumptions
- Explicit statements about system state, access, dependencies, data integrity, and external services assumed to be true during normal operations and recovery. When assumptions no longer hold, recovery paths may change or become unavailable.
- Incident Handling
- The deliberate process of diagnosing, mitigating, and resolving operational issues, favoring reversible actions. Focuses on calm, human-driven decisions rather than reactive heroics.
- Change Management
- A deliberate, documented process for deploying updates, configuration changes, and migrations that prioritizes reversibility, risk reduction, and predictability.
- Ransomware Mitigation
- Measures, including isolated backups and recovery strategies, designed to reduce the operational impact of ransomware incidents, recognizing that no system can fully guarantee protection.
- Confidential Information
- Any client data, configuration details, operational procedures, access credentials, or system documentation that is not publicly available. Sharing it outside the agreed engagement or NDA terms is prohibited.
- NDA (Non-Disclosure Agreement)
- A contractual agreement or formal understanding that restricts disclosure of sensitive information related to systems, operations, and client data. Compliance with NDAs is part of responsible operational practice.
See also
For a lighter take on operational flow, chaos, and recovery, see Calm Infrastructure and Weird Incidents.
Understanding these terms supports better collaboration, incident response, and operational clarity.