Disaster Recovery

Purpose: Documents disaster recovery procedures for operators of openCenter clusters, including RTO/RPO targets, recovery tiers, and step-by-step restoration.

RTO/RPO Targets

| Tier | Scenario | RPO | RTO | Strategy |
|------|----------|-----|-----|----------|
| 1 | Single worker node failure | 0 | 5 min | Auto-healing via node replacement |
| 2 | Single control plane node failure | 0 | 15 min | etcd quorum maintained, node replacement |
| 3 | Control plane quorum loss | Last etcd snapshot | 30–60 min | etcd restore from backup |
| 4 | Full cluster loss | Last Velero backup | 2–4 hours | Full re-provision + restore |
| 5 | Site disaster (region loss) | Last offsite backup | 4–8 hours | New region provision + restore |

Prerequisites

  • etcd snapshots available in S3 (configured via etcd-backup service)
  • Velero backups in S3-compatible storage
  • GitOps repository accessible (contains cluster configuration and manifests)
  • openCenter CLI with cluster configuration
  • Access to cloud provider APIs

Recovery Tier 1: Single Worker Node

Worker nodes are stateless. Kubernetes reschedules workloads automatically.

Automated Recovery

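If the node stays NotReady or unreachable for 5 minutes (the default pod toleration of 300 seconds), Kubernetes evicts its pods: they show as Terminating, and their controllers create replacements on healthy nodes. No operator action is required for stateless workloads.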

Manual Node Replacement

# 1. Cordon and drain the failed node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s

# 2. Remove the node from the cluster
kubectl delete node <node-name>

# 3. Provision a replacement (via openCenter)
opencenter cluster deploy <cluster-name> --from-step opentofu-apply

# 4. Verify new node joins
kubectl get nodes -w
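
Instead of watching the node list interactively, a script can filter it for nodes that cannot yet accept pods. The following is a minimal sketch, not an openCenter feature: `nodes_not_ready` is a hypothetical helper, and it assumes the default `kubectl get nodes --no-headers` column layout (NAME first, STATUS second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Input: output of `kubectl get nodes --no-headers` on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are reported too, because they
# cannot accept new pods.
nodes_not_ready() {
  awk 'NF >= 2 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers | nodes_not_ready
```

An empty result means every listed node reports Ready and is schedulable.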

Recovery Tier 2: Single Control Plane Node

With three or more control plane nodes, the cluster keeps etcd quorum after losing one.
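
The arithmetic behind this is that etcd needs a majority of its members, floor(n/2) + 1, to accept writes. The sketch below makes that explicit; `quorum_size` and `failure_tolerance` are hypothetical helper names for illustration, not part of etcd or openCenter.

```shell
# Hypothetical helpers illustrating etcd majority arithmetic.
# quorum_size N        -> members required for a majority of N
# failure_tolerance N  -> members that can fail while quorum is kept
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Compare common control plane sizes.
for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(failure_tolerance "$n")"
done
```

A four-member cluster tolerates one failure, the same as three members, which is why odd member counts are the usual choice.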

Recovery Steps

# 1. Verify etcd quorum is maintained
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://<healthy-cp-node>:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# 2. Remove failed etcd member
ETCDCTL_API=3 etcdctl member remove <member-id> \
--endpoints=https://<healthy-cp-node>:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# 3. Remove failed node from Kubernetes
kubectl delete node <failed-cp-node>

# 4. Provision replacement control plane node
# Update inventory and run Kubespray's cluster.yml playbook
# (scale.yml only adds worker nodes; it cannot add control plane or etcd members)
ansible-playbook -i inventory/hosts.yaml cluster.yml --become

# 5. Verify cluster health
kubectl get nodes
kubectl get pods -n kube-system
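
Step 2 above needs the failed member's ID. As a sketch, the ID can be looked up by member name from the default `etcdctl member list` output, which prints comma-separated fields in the order ID, status, name, peer URLs, client URLs, learner flag. The helper name `etcd_member_id` and the member names in the comment are hypothetical.

```shell
# Hypothetical helper: print the etcd member ID for a given member name.
# Input: default `etcdctl member list` output on stdin, for example:
#   8e9e05c52164694d, started, cp-1, https://10.0.0.11:2380, https://10.0.0.11:2379, false
etcd_member_id() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (not executed here), with the same TLS flags as step 1:
#   ETCDCTL_API=3 etcdctl member list ... | etcd_member_id <failed-cp-node>
```

If the lookup prints nothing, the name did not match any member; check the name column of the full listing before removing anything.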

Recovery Tier 3: etcd Restore

When etcd quorum is lost (majority of control plane nodes down), restore from snapshot.
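
The snapshot chosen for the restore determines the effective RPO, so it is worth selecting it deliberately. As a sketch, the newest object can be picked from `aws s3 ls` output by sorting on the listing's date and time columns. `latest_snapshot_key` is a hypothetical helper, and it assumes object keys contain no spaces.

```shell
# Hypothetical helper: print the key of the most recently modified object.
# Input: `aws s3 ls s3://<bucket>/` output on stdin, where object lines look like
#   2024-05-02 02:00:00    1048576 snapshot.db
# Prefix lines ("PRE archive/") and blank lines are skipped.
latest_snapshot_key() {
  awk 'NF >= 4 && $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { print $1 "T" $2, $4 }' \
    | sort \
    | tail -n 1 \
    | awk '{ print $2 }'
}

# Usage (not executed here):
#   aws s3 ls s3://<cluster>-etcd-backups/ | latest_snapshot_key
```

An empty result means the listing contained no objects, in which case the restore should stop rather than continue with an empty file name.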

Restore Procedure

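# 1. Stop kube-apiserver on all remaining control plane nodes
# (If kube-apiserver runs as a static pod rather than a systemd unit, move
# /etc/kubernetes/manifests/kube-apiserver.yaml out of the manifests directory
# instead, and move it back in step 5.)
ssh <cp-node> "sudo systemctl stop kube-apiserver"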

# 2. Download latest etcd snapshot from S3
LATEST=$(aws s3 ls s3://<cluster>-etcd-backups/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://<cluster>-etcd-backups/${LATEST}" /tmp/etcd-snapshot.db

# 3. Restore etcd on the first control plane node
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restore \
--name=<node-name> \
--initial-cluster=<node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls=https://<node-ip>:2380

# 4. Replace etcd data directory
sudo systemctl stop etcd
sudo mv /var/lib/etcd /var/lib/etcd.bak
sudo mv /var/lib/etcd-restore /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd
sudo systemctl start etcd

# 5. Start kube-apiserver
sudo systemctl start kube-apiserver

# 6. Verify cluster state
kubectl get nodes
kubectl get pods -A

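For a multi-node etcd restore, repeat steps 3 and 4 on each control plane node using the same snapshot file. Set --name and --initial-advertise-peer-urls for that node, and give every node the same --initial-cluster value listing all members.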
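
Because every node's restore needs the full member list, it can help to build the `--initial-cluster` value once and reuse it. A sketch follows; `initial_cluster` is a hypothetical helper, and the node names and addresses in the example are placeholders.

```shell
# Hypothetical helper: build the --initial-cluster value from name=ip pairs.
# Each argument has the form <member-name>=<peer-ip>; peer URLs use port 2380,
# matching the restore command in step 3.
initial_cluster() {
  out=""
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    ip=${pair#*=}      # text after the first "="
    out="${out:+$out,}${name}=https://${ip}:2380"
  done
  echo "$out"
}

# Example with placeholder members:
initial_cluster cp-1=10.0.0.11 cp-2=10.0.0.12 cp-3=10.0.0.13
```

The printed string is passed unchanged to `--initial-cluster` on every node, while `--name` and `--initial-advertise-peer-urls` differ per node.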

Verify etcd Health Post-Restore

ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

Recovery Tier 4: Full Cluster Restore

Complete cluster loss requires re-provisioning infrastructure and restoring application state.

Step 1: Re-provision Infrastructure

# Re-deploy cluster from existing configuration
opencenter cluster deploy <cluster-name> --restart

This runs through the full bootstrap: OpenTofu infrastructure provisioning, Kubespray K8s installation, and FluxCD bootstrap.

Step 2: GitOps Re-bootstrap

FluxCD reconciles the cluster state from Git automatically after bootstrap. Verify:

# Check FluxCD is reconciling
flux get kustomizations -A

# Wait for all services to deploy
watch kubectl get pods -A

Step 3: Restore Application Data (Velero)

# Verify Velero can access backups
velero backup-location get
velero backup get

# Restore from the latest backup
velero restore create full-restore \
--from-backup <latest-backup-name> \
--include-namespaces="*" \
--exclude-namespaces=velero,flux-system,kube-system \
--wait

# Check restore status
velero restore describe full-restore --details

Step 4: Restore Persistent Volumes

# Restore PVs separately if needed
velero restore create pv-restore \
--from-backup <latest-backup-name> \
--include-resources=persistentvolumes,persistentvolumeclaims \
--wait

Step 5: Verify

# All nodes ready
kubectl get nodes

# Platform services: list pods that are not Running or Succeeded (expect none)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Flux: list kustomizations that are not ready (expect none)
flux get kustomizations -A --status-selector ready=false

# Application health
kubectl get deployments -A

Recovery Tier 5: Site Disaster (New Region)

Step 1: Prepare New Region Configuration

# Clone the cluster config for a new region
opencenter cluster init <new-cluster-name> \
--provider openstack \
--organization <org>

# Update configuration with new region parameters
opencenter cluster edit <new-cluster-name>

Update: region, network IDs, image IDs, floating IP pools, and DNS settings.

Step 2: Deploy to New Region

opencenter cluster generate <new-cluster-name>
opencenter cluster deploy <new-cluster-name>

Step 3: Restore Data

Follow the Velero restore procedure from Tier 4 (Steps 3–5). This requires that the backups live in a bucket outside the failed region and that the new cluster's Velero backup storage location points at that bucket.

Step 4: Update DNS

Point DNS records to the new cluster's ingress/load balancer IPs.

DR Testing

Test disaster recovery procedures quarterly. Procedure:

  1. Create a test cluster in a non-production environment
  2. Simulate failure (delete nodes, corrupt etcd, destroy infrastructure)
  3. Execute recovery following this runbook
  4. Measure RTO (time from failure detection to service restoration)
  5. Verify RPO (compare restored data with expected state)
  6. Document findings and update procedures
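
Step 4 can be scripted by recording a timestamp at failure detection and another at service restoration, then comparing the difference with the tier's target from the RTO/RPO table. The sketch below assumes epoch-second timestamps; `rto_result` is a hypothetical helper name.

```shell
# Hypothetical helper: compare a measured recovery time with an RTO target.
# Arguments: <detected-epoch-seconds> <restored-epoch-seconds> <target-minutes>
# Prints PASS or FAIL with the elapsed minutes, rounded up.
rto_result() {
  elapsed_s=$(( $2 - $1 ))
  elapsed_min=$(( (elapsed_s + 59) / 60 ))
  if [ "$elapsed_min" -le "$3" ]; then
    echo "PASS ${elapsed_min}m (target ${3}m)"
  else
    echo "FAIL ${elapsed_min}m (target ${3}m)"
  fi
}

# Usage in a drill (not executed here):
#   detected=$(date +%s)
#   ...run the recovery procedure...
#   restored=$(date +%s)
#   rto_result "$detected" "$restored" 60   # Tier 3 upper bound: 60 min
```

Rounding up keeps the check conservative: a recovery that takes 60 minutes and one second is reported as 61 minutes and fails a 60-minute target.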

Automated DR Test Script

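#!/bin/bash
# Run in a test environment only
set -euo pipefail

CLUSTER="dr-test-cluster"
# Select the most recent backup by creation time (list order is not guaranteed to be chronological)
BACKUP=$(velero backup get -o json | jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

echo "Cluster: $CLUSTER"
echo "Testing restore from backup: $BACKUP"
START=$(date +%s)
echo "Start time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

velero restore create "dr-test-$(date +%s)" \
--from-backup "$BACKUP" \
--include-namespaces=production \
--wait

echo "Restore complete: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "Elapsed: $(( $(date +%s) - START )) seconds"

# Verify critical services: list pods that are not Running (expect none)
kubectl get pods -n production --field-selector=status.phase!=Running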

openCenter Backup Commands

# Create an on-demand cluster backup (config, keys, state)
opencenter cluster backup create <cluster-name>

# List available backups
opencenter cluster backup list <cluster-name>

# Restore from a backup
opencenter cluster backup restore <backup-id> --passphrase <passphrase>

# Schedule periodic backups
opencenter cluster backup schedule <cluster-name> --interval 6h --retention 720h
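
The schedule flags imply both a worst-case gap between backups and a steady-state backup count. A sketch of that arithmetic follows, assuming whole-hour values and that each scheduled run produces exactly one backup that is kept until the retention period expires; how openCenter actually prunes backups is not described here, and `retained_backups` is a hypothetical helper.

```shell
# Hypothetical helper: number of backups held at steady state.
# Arguments: <interval-hours> <retention-hours>
retained_backups() {
  echo $(( $2 / $1 ))
}

# With --interval 6h --retention 720h (30 days):
echo "max gap between backups: 6h, backups retained: $(retained_backups 6 720)"
```

Shortening the interval narrows the gap but raises the retained count proportionally, which matters when sizing backup storage.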