Operational Runbooks

Purpose: For operators, provides production runbooks for common incident scenarios organized by category and severity.

Runbook Format

Each runbook follows a standard structure:

Section	Content
Symptoms	Observable indicators (alerts, error messages, user reports)
Impact	What is affected and severity
Diagnosis	Commands to confirm root cause
Resolution	Step-by-step fix procedure
Verification	How to confirm the issue is resolved
Prevention	Actions to prevent recurrence

Severity Levels

Level	Response Time	Criteria	Examples
SEV-1 Critical	Immediate (< 15 min)	Cluster unreachable, data loss risk, full service outage	etcd quorum loss, API server down, all nodes NotReady
SEV-2 High	< 30 min	Degraded service, partial outage, single component failure	Control plane node down, FluxCD stuck, certificate expired
SEV-3 Medium	< 2 hours	Performance degradation, non-critical component failure	High memory pressure, backup failure, drift detected
SEV-4 Low	Next business day	Cosmetic issues, warnings, planned maintenance	Log volume growing, non-critical alert noise

Runbook Categories

Cluster Issues

Runbook	Severity	Trigger
etcd Quorum Loss	SEV-1	`etcdMembersDown` alert, API server 5xx
Control Plane Node Failure	SEV-2	`KubeNodeNotReady` on CP node
Worker Node Failure	SEV-3	`KubeNodeNotReady` on worker
API Server Unresponsive	SEV-1	`KubeAPIDown` alert
Kubelet Crash Loop	SEV-2	Node flapping Ready/NotReady

Networking

Runbook	Severity	Trigger
Pod-to-Pod Connectivity Loss	SEV-1	Calico/CNI failure
DNS Resolution Failure	SEV-2	CoreDNS pods down
Load Balancer Unhealthy	SEV-2	External traffic blocked
Ingress Certificate Expired	SEV-2	`CertManagerCertNotReady` alert

Storage

Runbook	Severity	Trigger
PersistentVolume Stuck	SEV-3	PVC in `Pending` state
etcd Disk Full	SEV-1	`etcdBackendQuotaLowSpace` alert
Backup Failure	SEV-3	`VeleroBackupFailure` alert
Volume Snapshot Failure	SEV-3	CSI snapshot errors

GitOps

Runbook	Severity	Trigger
FluxCD Reconciliation Failure	SEV-2	`FluxReconciliationFailure` alert
SOPS Decryption Failure	SEV-2	Kustomize controller errors
Git Source Unreachable	SEV-2	GitRepository not ready
Helm Release Failed	SEV-3	HelmRelease stuck

Security

Runbook	Severity	Trigger
Certificate Expired	SEV-1	TLS errors, API server cert invalid
Compromised Credentials	SEV-1	Suspicious activity, key leak
SOPS Key Compromise	SEV-1	Key material exposed
Unauthorized Access Detected	SEV-2	Audit log anomalies

On-Call Escalation

┌──────────────────────────────────────────────┐
│  Alert fires (Prometheus → Alertmanager)     │
└─────────────────────┬────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────┐
│  L1: On-call operator (< 15 min response)    │
│  • Acknowledge alert                         │
│  • Execute runbook                           │
│  • Escalate if unresolved in 30 min          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L2: Platform engineer (< 30 min response)   │
│  • Deep diagnosis                            │
│  • Infrastructure-level fixes                │
│  • Escalate if unresolved in 1 hour          │
└─────────────────────┬────────────────────────┘
                      │ (unresolved)
                      ▼
┌──────────────────────────────────────────────┐
│  L3: Architecture / vendor support           │
│  • Root cause analysis                       │
│  • Vendor engagement if needed               │
└──────────────────────────────────────────────┘

Escalation Criteria

From	To	When
L1 → L2	30 min unresolved, or SEV-1	Runbook steps exhausted, infrastructure issue suspected
L2 → L3	1 hour unresolved, or data loss	Requires architectural change or vendor support

Quick Reference: Common Commands

# Cluster status
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
flux get kustomizations -A

# Logs
kubectl logs -n <namespace> deploy/<name> --since=10m
flux logs --level=error --since=10m
journalctl -u kubelet --since="10 minutes ago"  # on node

# Restart
kubectl rollout restart deployment/<name> -n <namespace>
flux reconcile kustomization <name> --with-source

# Drain/cordon
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Force reconciliation
flux reconcile source git flux-system
flux reconcile kustomization flux-system --with-source

Incident Response Template

Use this template when opening an incident ticket:

## Incident: [Title]

**Severity:** SEV-[1-4]
**Detected:** [timestamp]
**Resolved:** [timestamp]
**Duration:** [minutes]

### Summary
[One sentence describing the incident]

### Impact
[What was affected, how many users/services impacted]

### Timeline
- HH:MM — Alert fired
- HH:MM — Operator acknowledged
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service restored

### Root Cause
[Technical explanation]

### Resolution
[Steps taken to resolve]

### Action Items
- [ ] Prevention measure 1
- [ ] Prevention measure 2
- [ ] Runbook update

Health Checks — Proactive cluster verification
Disaster Recovery — Full recovery procedures
Troubleshooting: FluxCD Reconciliation
Troubleshooting: Secrets Decryption
Troubleshooting: Networking Issues

Runbook Details

etcd Quorum Loss

Runbook content planned — SEV-1 response procedure for etcd quorum loss.

Control Plane Node Failure

Runbook content planned — SEV-2 response procedure for control plane node failure.

Worker Node Failure

Runbook content planned — SEV-3 response procedure for worker node failure.

API Server Unresponsive

Runbook content planned — SEV-1 response procedure for unresponsive API server.

Kubelet Crash Loop

Runbook content planned — SEV-2 response procedure for kubelet crash looping.

Pod-to-Pod Connectivity Loss

Runbook content planned — SEV-1 response procedure for pod networking failure.

DNS Resolution Failure

Runbook content planned — SEV-2 response procedure for DNS resolution failure.

Load Balancer Unhealthy

Runbook content planned — SEV-2 response procedure for unhealthy load balancer.

Ingress Certificate Expired

Runbook content planned — SEV-2 response procedure for expired ingress certificate.

PersistentVolume Stuck

Runbook content planned — SEV-3 response procedure for stuck PersistentVolumes.

etcd Disk Full

Runbook content planned — SEV-1 response procedure for etcd disk full.

Backup Failure

Runbook content planned — SEV-3 response procedure for backup failure.

Volume Snapshot Failure

Runbook content planned — SEV-3 response procedure for volume snapshot failure.

FluxCD Reconciliation Failure

Runbook content planned — SEV-2 response procedure for FluxCD reconciliation failure.

SOPS Decryption Failure

Runbook content planned — SEV-2 response procedure for SOPS decryption failure.

Git Source Unreachable

Runbook content planned — SEV-2 response procedure for unreachable git source.

Helm Release Failed

Runbook content planned — SEV-3 response procedure for failed Helm release.

Certificate Expired

Runbook content planned — SEV-1 response procedure for expired certificates.

Compromised Credentials

Runbook content planned — SEV-1 response procedure for compromised credentials.

SOPS Key Compromise

Runbook content planned — SEV-1 response procedure for SOPS key compromise.

Unauthorized Access Detected

Runbook content planned — SEV-2 response procedure for unauthorized access.

Runbook Format​

Severity Levels​

Runbook Categories​

Cluster Issues​

Networking​

Storage​

GitOps​

Security​

On-Call Escalation​

Escalation Criteria​

Quick Reference: Common Commands​

Incident Response Template​

Related Docs​

Runbook Details​

etcd Quorum Loss​

Control Plane Node Failure​

Worker Node Failure​

API Server Unresponsive​

Kubelet Crash Loop​

Pod-to-Pod Connectivity Loss​

DNS Resolution Failure​

Load Balancer Unhealthy​

Ingress Certificate Expired​

PersistentVolume Stuck​

etcd Disk Full​

Backup Failure​

Volume Snapshot Failure​

FluxCD Reconciliation Failure​

SOPS Decryption Failure​

Git Source Unreachable​

Helm Release Failed​

Certificate Expired​

Compromised Credentials​

SOPS Key Compromise​

Unauthorized Access Detected​