Operational Runbooks
Purpose: Production runbooks for operators, covering common incident scenarios organized by category and severity.
Runbook Format
Each runbook follows a standard structure:
| Section | Content |
|---|---|
| Symptoms | Observable indicators (alerts, error messages, user reports) |
| Impact | What is affected and severity |
| Diagnosis | Commands to confirm root cause |
| Resolution | Step-by-step fix procedure |
| Verification | How to confirm the issue is resolved |
| Prevention | Actions to prevent recurrence |
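The six-section structure above can be stamped out as an empty skeleton so that every new runbook starts from the same layout. The following is a minimal sketch: the function name `new_runbook`, the `TODO` placeholder, and the plain-text heading layout are assumptions for illustration, not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an empty runbook skeleton containing the six
# standard sections. Title and severity are supplied by the caller.
new_runbook() {
  local title="$1" severity="$2" section
  printf '%s (%s)\n' "$title" "$severity"
  for section in Symptoms Impact Diagnosis Resolution Verification Prevention; do
    # One heading line followed by a placeholder the author replaces.
    printf '\n%s\nTODO\n' "$section"
  done
}
```

For example, `new_runbook "etcd Quorum Loss" SEV-1 > etcd-quorum-loss.md` would write a skeleton file (the file name is illustrative).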
Severity Levels
| Level | Response Time | Criteria | Examples |
|---|---|---|---|
| SEV-1 Critical | Immediate (< 15 min) | Cluster unreachable, data loss risk, full service outage | etcd quorum loss, API server down, all nodes NotReady |
| SEV-2 High | < 30 min | Degraded service, partial outage, single component failure | Control plane node down, FluxCD stuck, ingress certificate expired |
| SEV-3 Medium | < 2 hours | Performance degradation, non-critical component failure | High memory pressure, backup failure, drift detected |
| SEV-4 Low | Next business day | Cosmetic issues, warnings, planned maintenance | Log volume growing, non-critical alert noise |
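The response-time targets in the table can be encoded as a small lookup so that scripts or chat-ops hooks can check whether an acknowledgement is overdue. This is a sketch under assumptions: the function names are hypothetical, and SEV-4 is treated as having no minute-level target because the table only says "next business day".

```shell
#!/usr/bin/env bash
# Map a severity level to its response-time target in minutes (from the table).
sev_response_minutes() {
  case "$1" in
    SEV-1) echo 15 ;;
    SEV-2) echo 30 ;;
    SEV-3) echo 120 ;;
    SEV-4) echo "next-business-day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# response_breached SEVERITY ELAPSED_MINUTES
# Returns 0 when the elapsed time has reached the target, 1 when it has not,
# and 2 when the severity is not recognised.
response_breached() {
  local target
  target="$(sev_response_minutes "$1")" || return 2
  # Assumption: SEV-4 has no minute-level target, so it never counts as breached here.
  [ "$target" = "next-business-day" ] && return 1
  [ "$2" -ge "$target" ]
}
```

A caller might use `response_breached SEV-2 "$minutes_since_alert" && echo "page the next tier"`, where `minutes_since_alert` is whatever the alerting pipeline provides.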
Runbook Categories
Cluster Issues
| Runbook | Severity | Trigger |
|---|---|---|
| etcd Quorum Loss | SEV-1 | etcdMembersDown alert, API server 5xx |
| Control Plane Node Failure | SEV-2 | KubeNodeNotReady on CP node |
| Worker Node Failure | SEV-3 | KubeNodeNotReady on worker |
| API Server Unresponsive | SEV-1 | KubeAPIDown alert |
| Kubelet Crash Loop | SEV-2 | Node flapping Ready/NotReady |
Networking
| Runbook | Severity | Trigger |
|---|---|---|
| Pod-to-Pod Connectivity Loss | SEV-1 | Calico/CNI failure |
| DNS Resolution Failure | SEV-2 | CoreDNS pods down |
| Load Balancer Unhealthy | SEV-2 | External traffic blocked |
| Ingress Certificate Expired | SEV-2 | CertManagerCertNotReady alert |
Storage
| Runbook | Severity | Trigger |
|---|---|---|
| PersistentVolume Stuck | SEV-3 | PVC in Pending state |
| etcd Disk Full | SEV-1 | etcdBackendQuotaLowSpace alert |
| Backup Failure | SEV-3 | VeleroBackupFailure alert |
| Volume Snapshot Failure | SEV-3 | CSI snapshot errors |
GitOps
| Runbook | Severity | Trigger |
|---|---|---|
| FluxCD Reconciliation Failure | SEV-2 | FluxReconciliationFailure alert |
| SOPS Decryption Failure | SEV-2 | Kustomize controller errors |
| Git Source Unreachable | SEV-2 | GitRepository not ready |
| Helm Release Failed | SEV-3 | HelmRelease stuck |
Security
| Runbook | Severity | Trigger |
|---|---|---|
| Certificate Expired | SEV-1 | TLS errors, API server cert invalid |
| Compromised Credentials | SEV-1 | Suspicious activity, key leak |
| SOPS Key Compromise | SEV-1 | Key material exposed |
| Unauthorized Access Detected | SEV-2 | Audit log anomalies |
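The category tables above double as a routing table from alert name to runbook. A sketch of that lookup follows; it covers only the triggers that name a specific alert, and the function name, the `SEV|Runbook` output format, and the `role` argument used to split `KubeNodeNotReady` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# runbook_for_alert ALERT_NAME [NODE_ROLE]
# Prints "SEVERITY|Runbook title" for alerts named in the category tables.
runbook_for_alert() {
  local alert="$1" role="${2:-worker}"
  case "$alert" in
    etcdMembersDown)           echo "SEV-1|etcd Quorum Loss" ;;
    KubeAPIDown)               echo "SEV-1|API Server Unresponsive" ;;
    etcdBackendQuotaLowSpace)  echo "SEV-1|etcd Disk Full" ;;
    KubeNodeNotReady)
      # The same alert maps to different runbooks depending on the node role.
      if [ "$role" = "control-plane" ]; then
        echo "SEV-2|Control Plane Node Failure"
      else
        echo "SEV-3|Worker Node Failure"
      fi ;;
    CertManagerCertNotReady)   echo "SEV-2|Ingress Certificate Expired" ;;
    FluxReconciliationFailure) echo "SEV-2|FluxCD Reconciliation Failure" ;;
    VeleroBackupFailure)       echo "SEV-3|Backup Failure" ;;
    *) echo "no runbook mapped for alert: $alert" >&2; return 1 ;;
  esac
}
```

Triggers described only in prose (for example "CSI snapshot errors" or "Audit log anomalies") are deliberately left out, since the tables do not give an alert name for them.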
On-Call Escalation
┌──────────────────────────────────────────────┐
│  Alert fires (Prometheus → Alertmanager)     │
└─────────────────────┬────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│  L1: On-call operator (< 15 min response)    │
│  • Acknowledge alert                         │
│  • Execute runbook                           │
│  • Escalate if unresolved in 30 min          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L2: Platform engineer (< 30 min response)   │
│  • Deep diagnosis                            │
│  • Infrastructure-level fixes                │
│  • Escalate if unresolved in 1 hour          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L3: Architecture / vendor support           │
│  • Root cause analysis                       │
│  • Vendor engagement if needed               │
└──────────────────────────────────────────────┘
Escalation Criteria
| Escalation | Trigger | Condition |
|---|---|---|
| L1 → L2 | 30 min unresolved, or SEV-1 | Runbook steps exhausted, infrastructure issue suspected |
| L2 → L3 | 1 hour unresolved, or data loss | Requires architectural change or vendor support |
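The escalation criteria can be expressed as a single predicate. This is a sketch only: the function name is hypothetical, and it assumes the 30-minute and 1-hour clocks are counted from the moment the current tier took over, which the table does not state explicitly.

```shell
#!/usr/bin/env bash
# should_escalate CURRENT_TIER SEVERITY MINUTES_AT_TIER [DATA_LOSS]
# Returns 0 when the incident should move to the next tier, 1 otherwise.
# Assumption: MINUTES_AT_TIER is measured since the current tier took over.
should_escalate() {
  local tier="$1" severity="$2" minutes="$3" data_loss="${4:-no}"
  case "$tier" in
    # L1 -> L2: any SEV-1, or 30 minutes without resolution.
    L1) [ "$severity" = "SEV-1" ] || [ "$minutes" -ge 30 ] ;;
    # L2 -> L3: data loss, or 1 hour without resolution.
    L2) [ "$data_loss" = "yes" ] || [ "$minutes" -ge 60 ] ;;
    # L3 is the last tier; there is nowhere further to escalate.
    *)  return 1 ;;
  esac
}
```

For example, `should_escalate L1 SEV-3 35 && echo "hand over to L2"` prints the hand-over message because the 30-minute limit has passed.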
Quick Reference: Common Commands
# Cluster status
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
flux get kustomizations -A
# Logs
kubectl logs -n <namespace> deploy/<name> --since=10m
flux logs --level=error --since=10m
journalctl -u kubelet --since="10 minutes ago" # on node
# Restart
kubectl rollout restart deployment/<name> -n <namespace>
flux reconcile kustomization <name> --with-source
# Drain/cordon
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Force reconciliation
flux reconcile source git flux-system
flux reconcile kustomization flux-system --with-source
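During an incident it can help to capture the status and log commands above in one step, so the output can be attached to the ticket. The wrapper below is a sketch: the function name `triage_snapshot`, the output file names, and the default directory naming are assumptions, and it runs only commands already listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save first-look cluster state into one directory.
# triage_snapshot [OUTPUT_DIR]
triage_snapshot() {
  local out_dir="${1:-triage-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1
  # Failures are captured in the output files instead of aborting the
  # snapshot, since the API server may itself be what is down.
  kubectl get nodes > "$out_dir/nodes.txt" 2>&1 || true
  kubectl get pods -A \
    --field-selector='status.phase!=Running,status.phase!=Succeeded' \
    > "$out_dir/unhealthy-pods.txt" 2>&1 || true
  flux get kustomizations -A > "$out_dir/kustomizations.txt" 2>&1 || true
  flux logs --level=error --since=10m > "$out_dir/flux-errors.txt" 2>&1 || true
  # Print the directory so the caller can reference it in the ticket.
  echo "$out_dir"
}
```

Calling `triage_snapshot` with no argument writes into a timestamped directory under the current working directory.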
Incident Response Template
Use this template when opening an incident ticket:
## Incident: [Title]
**Severity:** SEV-[1-4]
**Detected:** [timestamp]
**Resolved:** [timestamp]
**Duration:** [minutes]
### Summary
[One sentence describing the incident]
### Impact
[What was affected, how many users/services impacted]
### Timeline
- HH:MM — Alert fired
- HH:MM — Operator acknowledged
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service restored
### Root Cause
[Technical explanation]
### Resolution
[Steps taken to resolve]
### Action Items
- [ ] Prevention measure 1
- [ ] Prevention measure 2
- [ ] Runbook update
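The header fields of the template can be filled in mechanically once the detection and resolution times are known. The following sketch assumes both times are recorded as Unix epoch seconds (for example with `date +%s`) and rounds the duration up to whole minutes; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# incident_duration_minutes DETECTED_EPOCH RESOLVED_EPOCH
# Prints the incident duration in minutes, rounded up.
incident_duration_minutes() {
  local detected="$1" resolved="$2"
  if [ "$resolved" -lt "$detected" ]; then
    echo "resolved time precedes detected time" >&2
    return 1
  fi
  echo $(( (resolved - detected + 59) / 60 ))
}

# incident_header TITLE SEVERITY DETECTED_LABEL RESOLVED_LABEL DURATION_MINUTES
# Prints the first five lines of the incident template.
incident_header() {
  printf '## Incident: %s\n' "$1"
  printf '**Severity:** %s\n' "$2"
  printf '**Detected:** %s\n' "$3"
  printf '**Resolved:** %s\n' "$4"
  printf '**Duration:** %s minutes\n' "$5"
}
```

A typical call computes the duration first, then passes it together with human-readable timestamps: `incident_header "API server down" SEV-1 "10:00 UTC" "10:42 UTC" "$(incident_duration_minutes "$start" "$end")"`, where `start` and `end` are the recorded epoch values.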
Related Docs
- Health Checks — Proactive cluster verification
- Disaster Recovery — Full recovery procedures
- Troubleshooting: FluxCD Reconciliation
- Troubleshooting: Secrets Decryption
- Troubleshooting: Networking Issues
Runbook Details
etcd Quorum Loss
Runbook content planned — SEV-1 response procedure for etcd quorum loss.
Control Plane Node Failure
Runbook content planned — SEV-2 response procedure for control plane node failure.
Worker Node Failure
Runbook content planned — SEV-3 response procedure for worker node failure.
API Server Unresponsive
Runbook content planned — SEV-1 response procedure for unresponsive API server.
Kubelet Crash Loop
Runbook content planned — SEV-2 response procedure for kubelet crash looping.
Pod-to-Pod Connectivity Loss
Runbook content planned — SEV-1 response procedure for pod networking failure.
DNS Resolution Failure
Runbook content planned — SEV-2 response procedure for DNS resolution failure.
Load Balancer Unhealthy
Runbook content planned — SEV-2 response procedure for unhealthy load balancer.
Ingress Certificate Expired
Runbook content planned — SEV-2 response procedure for expired ingress certificate.
PersistentVolume Stuck
Runbook content planned — SEV-3 response procedure for stuck PersistentVolumes.
etcd Disk Full
Runbook content planned — SEV-1 response procedure for etcd disk full.
Backup Failure
Runbook content planned — SEV-3 response procedure for backup failure.
Volume Snapshot Failure
Runbook content planned — SEV-3 response procedure for volume snapshot failure.
FluxCD Reconciliation Failure
Runbook content planned — SEV-2 response procedure for FluxCD reconciliation failure.
SOPS Decryption Failure
Runbook content planned — SEV-2 response procedure for SOPS decryption failure.
Git Source Unreachable
Runbook content planned — SEV-2 response procedure for unreachable git source.
Helm Release Failed
Runbook content planned — SEV-3 response procedure for failed Helm release.
Certificate Expired
Runbook content planned — SEV-1 response procedure for expired certificates.
Compromised Credentials
Runbook content planned — SEV-1 response procedure for compromised credentials.
SOPS Key Compromise
Runbook content planned — SEV-1 response procedure for SOPS key compromise.
Unauthorized Access Detected
Runbook content planned — SEV-2 response procedure for unauthorized access.