Backup and Restore :: openCenter Community Documentation

Purpose: Shows operators how to configure automated etcd backups and Velero application backups, and how to restore from them for disaster recovery.

This guide covers configuring automated etcd backups, Velero for application backups, and complete disaster recovery procedures.

Prerequisites

Existing openCenter cluster
S3-compatible storage (AWS S3, MinIO, Ceph, etc.)
S3 credentials (access key, secret key)
Basic understanding of Kubernetes resources

Task Summary

Configure automated backups for cluster state (etcd) and application data (persistent volumes, resources) to enable disaster recovery and cluster migration.

Backup Strategy

openCenter provides two complementary backup solutions:

etcd Backup: Cluster state (API objects, configurations)
Velero Backup: Application data (persistent volumes, resources)

Why both:

etcd backup: Fast cluster state recovery
Velero backup: Application-level backup with PV snapshots
Together: Complete disaster recovery capability

Part 1: Configure etcd Backup

1. Edit Cluster Configuration

Enable etcd backup service:

opencenter cluster edit my-cluster

2. Configure etcd Backup

Add etcd backup configuration:

opencenter:
  services:
    etcd-backup:
      enabled: true

      # S3 configuration
      s3_endpoint: "s3.amazonaws.com"  # Or your S3-compatible endpoint
      s3_bucket: "my-cluster-etcd-backups"
      s3_region: "us-east-1"
      s3_access_key: "AKIAIOSFODNN7EXAMPLE"  # Will be encrypted with SOPS
      s3_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"  # Will be encrypted

      # Backup schedule (cron format)
      schedule: "0 2 * * *"  # Daily at 2 AM

      # Retention policy
      retention_days: 30  # Keep backups for 30 days

      # Backup compression
      compression: true

Configuration options:

s3_endpoint: S3 API endpoint
s3_bucket: S3 bucket name (must exist)
s3_region: S3 region
s3_access_key: S3 access key (encrypted with SOPS)
s3_secret_key: S3 secret key (encrypted with SOPS)
schedule: Cron schedule for automated backups
retention_days: Number of days to keep backups
compression: Enable gzip compression
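A typo in schedule or retention_days is easy to miss until the first backup fails to run. The following is a minimal sketch of a local sanity check, not part of openCenter itself; the function names are hypothetical, and the cron check only counts fields rather than validating each value.

```shell
# Return 0 if the argument has exactly five whitespace-separated cron fields.
# Shape check only: it does not verify that each field is a legal cron value.
# The body runs in a subshell so disabling globbing does not leak to the caller.
has_five_cron_fields() (
  set -f        # keep "*" from being expanded against local file names
  set -- $1     # split the schedule on whitespace
  [ "$#" -eq 5 ]
)

# Return 0 if the argument is a positive whole number (for retention_days).
is_positive_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}
```

For example, `has_five_cron_fields "0 2 * * *" && is_positive_integer 30` succeeds for the values shown above, while a four-field schedule or a value such as `30d` is rejected.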

Evidence: internal/config/defaults.go:308-314 etcd-backup service

3. Apply Configuration

Render and apply the configuration:

# Render configuration
opencenter cluster generate my-cluster

# Commit to Git
cd ~/my-cluster-gitops
git add .
git commit -m "Enable etcd backup"
git push

# FluxCD will reconcile automatically (5-15 minutes)
# Or force reconciliation:
flux reconcile kustomization etcd-backup-base
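Because the S3 access and secret keys end up in Git, it can be worth confirming that they were actually encrypted before running git push. The sketch below assumes the rendered file keeps the s3_access_key and s3_secret_key field names and that SOPS has replaced each value with its usual ENC[...] envelope; the function name and the file path you pass to it are hypothetical.

```shell
# Report s3_access_key / s3_secret_key lines whose value is not a SOPS
# "ENC[...]" envelope. Only line numbers are printed, never the values.
# Returns 0 when nothing is in plaintext, 1 when something is, 2 on read errors.
find_plaintext_s3_keys() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  hits=$(grep -nE '^[[:space:]]*s3_(access|secret)_key:' "$1" \
    | grep -v 'ENC\[' | cut -d: -f1) || true
  [ -z "$hits" ] && return 0
  for n in $hits; do
    echo "line $n: credential is not SOPS-encrypted"
  done
  return 1
}
```

A possible use is `find_plaintext_s3_keys <rendered-file> && git push`, so that the push only happens when no plaintext credential is found.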

4. Verify etcd Backup

Verify etcd backup is running:

# Check etcd-backup pods
kubectl get pods -n kube-system -l app=etcd-backup

# Check backup CronJob
kubectl get cronjob -n kube-system etcd-backup

# Check recent backups in S3
aws s3 ls s3://my-cluster-etcd-backups/

# Expected object names (aws s3 ls also prints a date, time, and size before each name):
# 2026-02-17-02-00-00-etcd-snapshot.db.gz
# 2026-02-16-02-00-00-etcd-snapshot.db.gz
# 2026-02-15-02-00-00-etcd-snapshot.db.gz
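When the bucket holds many objects, picking out the newest snapshot by eye gets tedious. Here is a small sketch that assumes the object names follow the YYYY-MM-DD-hh-mm-ss-etcd-snapshot.db.gz pattern shown above; both function names are hypothetical.

```shell
# Read `aws s3 ls` output on stdin and print the newest etcd snapshot name.
# The names start with a zero-padded timestamp, so a plain sort is chronological.
latest_snapshot() {
  awk '{ print $NF }' \
    | grep -E '^[0-9]{4}(-[0-9]{2}){5}-etcd-snapshot\.db(\.gz)?$' \
    | sort | tail -n 1
}

# Convert a snapshot name into a "YYYY-MM-DD hh:mm:ss" timestamp string.
snapshot_timestamp() {
  echo "$1" | sed -E \
    's/^([0-9]{4}-[0-9]{2}-[0-9]{2})-([0-9]{2})-([0-9]{2})-([0-9]{2})-.*/\1 \2:\3:\4/'
}
```

Typical use would be `aws s3 ls s3://my-cluster-etcd-backups/ | latest_snapshot`, followed by `snapshot_timestamp` on the result to see when that snapshot was taken.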

5. Test etcd Backup

Trigger manual backup:

# Create manual backup job
kubectl create job --from=cronjob/etcd-backup etcd-backup-manual -n kube-system

# Watch backup progress
kubectl logs -n kube-system job/etcd-backup-manual -f

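# Verify backup in S3: a new object should appear at the end of the listing
# (assuming manual runs use the same timestamp naming as scheduled ones)
aws s3 ls s3://my-cluster-etcd-backups/ | sort | tail -1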

Part 2: Configure Velero Backup

1. Enable Velero Service

Enable Velero in cluster configuration:

opencenter cluster edit my-cluster

opencenter:
  services:
    velero:
      enabled: true

      # S3 configuration
      s3_bucket: "my-cluster-velero-backups"
      s3_region: "us-east-1"
      s3_endpoint: "s3.amazonaws.com"

      # Backup schedule
      backup_schedule: "0 3 * * *"  # Daily at 3 AM

      # Retention policy
      retention_days: 30

      # Volume snapshot location (provider-specific)
      volume_snapshot_location:
        provider: aws  # or openstack, vsphere
        config:
          region: us-east-1

Evidence: internal/config/defaults.go:371-376 velero service

2. Apply Configuration

# Render configuration
opencenter cluster generate my-cluster

# Commit to Git
cd ~/my-cluster-gitops
git add .
git commit -m "Enable Velero backup"
git push

# FluxCD reconciles automatically; to force reconciliation:
flux reconcile kustomization velero-base

3. Verify Velero Installation

# Check Velero pods
kubectl get pods -n velero

# Expected output:
# NAME                      READY   STATUS    RESTARTS   AGE
# velero-7d9c4c9f9d-abcde   1/1     Running   0          5m

# Check Velero backup location
velero backup-location get

# Expected output:
# NAME      PROVIDER   BUCKET/PREFIX                PHASE       LAST VALIDATED
# default   aws        my-cluster-velero-backups    Available   2026-02-17 10:00:00

4. Create Backup Schedule

Create automated backup schedule:

# Create daily backup schedule
velero schedule create daily-backup \
  --schedule="0 3 * * *" \
  --ttl 720h0m0s  # 30 days retention

# Create weekly full backup
velero schedule create weekly-full-backup \
  --schedule="0 1 * * 0" \
  --ttl 2160h0m0s  # 90 days retention

# List schedules
velero schedule get
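The --ttl flag takes a Go-style duration, which has no day unit, so retention expressed in days has to be converted to hours. This is a tiny sketch of that arithmetic; the function name is hypothetical.

```shell
# Convert a whole number of days into the hour-based duration used by --ttl.
days_to_ttl() {
  echo "$(( $1 * 24 ))h0m0s"
}
```

With this helper, `--ttl "$(days_to_ttl 30)"` expands to the 720h0m0s used for the daily schedule, and `days_to_ttl 90` gives the 2160h0m0s used for the weekly one.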

5. Create Manual Backup

Create manual backup for testing:

# Backup entire cluster
velero backup create manual-backup-$(date +%Y%m%d)

# Backup specific namespace
velero backup create app-backup \
  --include-namespaces my-app

# Backup with volume snapshots
velero backup create full-backup \
  --snapshot-volumes=true

# Watch backup progress
velero backup describe manual-backup-20260217 --details

# Check backup status
velero backup get
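In a script, it helps to block until a backup reaches a final phase instead of re-running describe by hand. The sketch below polls any command that prints the current phase; the function names are hypothetical, and it assumes Velero's Backup objects live in the velero namespace used elsewhere in this guide.

```shell
# Poll a phase-reporting command until it prints a terminal Velero phase.
# Usage: wait_for_phase <max_attempts> <sleep_seconds> <command...>
# Prints the final phase; returns 0 on Completed, 1 on failure, 2 on timeout.
wait_for_phase() {
  max_attempts=$1; sleep_seconds=$2; shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    phase=$("$@" 2>/dev/null) || phase=""
    case "$phase" in
      Completed) echo "$phase"; return 0 ;;
      Failed|PartiallyFailed|FailedValidation) echo "$phase"; return 1 ;;
    esac
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  echo "Timeout"
  return 2
}

# Read the phase of a Velero Backup object straight from the API.
backup_phase() {
  kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}
```

For example, `wait_for_phase 60 10 backup_phase manual-backup-20260217` checks every 10 seconds for up to 10 minutes. For interactive use, passing --wait to velero backup create is a simpler way to get the same blocking behavior.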

Part 3: Restore from Backup

Scenario 1: Restore etcd (Cluster State)

Use case: Cluster state corrupted, need to restore API objects

Steps:

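# 1. Stop Kubernetes API server (on all control plane nodes)
#    If kube-apiserver runs as a static pod rather than a systemd unit, move its
#    manifest out of /etc/kubernetes/manifests instead of using systemctl.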
ssh ubuntu@<control-plane-1>
sudo systemctl stop kube-apiserver

# 2. Download etcd backup from S3
aws s3 cp s3://my-cluster-etcd-backups/2026-02-17-02-00-00-etcd-snapshot.db.gz /tmp/
gunzip /tmp/2026-02-17-02-00-00-etcd-snapshot.db.gz

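# 3. Restore etcd snapshot (repeat on every etcd member, using the same snapshot file)
#    On etcd 3.5 and later, `etcdutl snapshot restore` replaces this deprecated etcdctl subcommand.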
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/2026-02-17-02-00-00-etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --name=<node-name> \
  --initial-cluster=<cluster-config> \
  --initial-advertise-peer-urls=<peer-url>

# 4. Replace etcd data directory
sudo systemctl stop etcd
sudo mv /var/lib/etcd /var/lib/etcd.backup
sudo mv /var/lib/etcd-restore /var/lib/etcd
sudo systemctl start etcd

# 5. Start Kubernetes API server
sudo systemctl start kube-apiserver

# 6. Verify cluster state
kubectl get nodes
kubectl get pods -A

⚠️ WARNING: etcd restore is a destructive operation. Test the procedure in a non-production environment first.
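The <cluster-config> placeholder passed to --initial-cluster is a comma-separated list of name=peer-URL pairs, one per etcd member. The following sketch builds that string, assuming HTTPS peers on etcd's default peer port 2380; the function name, node names, and IP addresses are hypothetical.

```shell
# Build the --initial-cluster value from "name=ip" arguments.
# Assumes every member serves peer traffic over HTTPS on port 2380.
initial_cluster() {
  result=""
  for member in "$@"; do
    name=${member%%=*}    # text before the first "="
    ip=${member#*=}       # text after the first "="
    result="${result:+$result,}${name}=https://${ip}:2380"
  done
  echo "$result"
}
```

For example, `initial_cluster cp-1=10.0.0.11 cp-2=10.0.0.12 cp-3=10.0.0.13` prints a value that can be passed unchanged as --initial-cluster on each member, with --name and --initial-advertise-peer-urls set to that member's own entry.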

Scenario 2: Restore Application (Velero)

Use case: Application deleted, need to restore resources and data

Steps:

# 1. List available backups
velero backup get

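# 2. Restore from backup (name the restore explicitly so step 3 can reference it;
#    without a name, Velero generates <backup-name>-<timestamp>)
velero restore create manual-backup-20260217-restore \
  --from-backup manual-backup-20260217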

# 3. Watch restore progress
velero restore describe manual-backup-20260217-restore --details

# 4. Verify restored resources
kubectl get all -n my-app

# 5. Verify persistent volumes
kubectl get pvc -n my-app
kubectl get pv

Scenario 3: Restore Specific Namespace

Use case: Single namespace deleted or corrupted

Steps:

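# Restore specific namespace
# (scheduled backups are named <schedule>-<YYYYMMDDhhmmss>; copy the exact name from `velero backup get`)
velero restore create app-restore \
  --from-backup daily-backup-20260217030000 \
  --include-namespaces my-app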

# Verify restore
velero restore describe app-restore
kubectl get all -n my-app
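Backups created by a schedule are named after the schedule plus a 14-digit timestamp, so the exact name changes every day. This sketch selects the newest backup for a given schedule from a list of names; the function name is hypothetical, and it assumes the backups live in the velero namespace.

```shell
# Read backup names on stdin (optionally prefixed like "backup.velero.io/")
# and print the newest one created by the schedule named in $1.
latest_scheduled_backup() {
  sed 's|^.*/||' | grep -E "^$1-[0-9]{14}\$" | sort | tail -n 1
}
```

One way to feed it is `kubectl get backups.velero.io -n velero -o name | latest_scheduled_backup daily-backup`, and the result can then be passed to --from-backup.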

Scenario 4: Restore to Different Cluster

Use case: Migrate application to new cluster

Steps:

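# 1. Install Velero on target cluster with same S3 configuration
# (Already done if using openCenter)
# Consider setting the backup location to read-only on the target cluster
# so the restore cannot modify or delete the source cluster's backups.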

# 2. Verify backup location
velero backup-location get

# 3. List backups from source cluster
velero backup get

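# 4. Restore to target cluster
# (use the exact backup name listed in step 3; scheduled backups end in a 14-digit timestamp)
velero restore create migration-restore \
  --from-backup daily-backup-20260217030000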

# 5. Verify resources in target cluster
kubectl get all -A

Verification

Verify backup and restore capabilities:

# 1. Verify etcd backup schedule
kubectl get cronjob -n kube-system etcd-backup

# 2. Verify recent etcd backups
aws s3 ls s3://my-cluster-etcd-backups/ | tail -5

# 3. Verify Velero installation
velero version

# 4. Verify Velero backup location
velero backup-location get

# 5. Verify Velero schedules
velero schedule get

# 6. Verify recent Velero backups
velero backup get

# 7. Test restore (in dev/staging)
velero restore create test-restore --from-backup <backup-name>
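Running the checks above one by one is fine interactively, but a single pass/fail summary is handier for routine verification. Below is a minimal sketch of a wrapper that runs each check quietly and counts failures; the function and variable names are hypothetical.

```shell
failures=0

# Run a command quietly and print PASS or FAIL next to a description.
# Usage: run_check "<description>" <command...>
run_check() {
  description=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $description"
  else
    echo "FAIL  $description"
    failures=$((failures + 1))
  fi
}

# Example calls (uncomment on a machine with cluster access):
# run_check "etcd backup CronJob exists" kubectl get cronjob -n kube-system etcd-backup
# run_check "Velero client and server respond" velero version
# run_check "Velero backup location is Available" \
#   sh -c 'velero backup-location get | grep -q Available'
```

After the calls, a non-zero `$failures` can be used as the script's exit status so that a scheduler or CI job notices the problem.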

Troubleshooting

etcd Backup Fails

Symptom: etcd backup CronJob fails

Diagnosis:

# Check CronJob logs
kubectl logs -n kube-system job/etcd-backup-<timestamp>

# Common errors:
# - S3 authentication failed
# - S3 bucket doesn't exist
# - Insufficient permissions

Solution:

# Verify S3 credentials
aws s3 ls s3://my-cluster-etcd-backups/ \
  --profile my-cluster

# Create S3 bucket if missing
aws s3 mb s3://my-cluster-etcd-backups

# Update S3 credentials in configuration
opencenter cluster edit my-cluster
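The three common errors listed in the diagnosis usually surface in the job log as standard S3 error codes. The sketch below maps a log line to one of those causes; the function name and the category labels are hypothetical, and the exact wording of the backup job's log lines is an assumption.

```shell
# Map a log line to one of the common causes, based on S3 error codes.
classify_s3_error() {
  case "$1" in
    *InvalidAccessKeyId*|*SignatureDoesNotMatch*) echo "authentication" ;;
    *NoSuchBucket*)                               echo "missing-bucket" ;;
    *AccessDenied*)                               echo "permissions" ;;
    *)                                            echo "unknown" ;;
  esac
}
```

For example, feeding it the last line of `kubectl logs` for the failed job indicates whether to rotate credentials, create the bucket, or review the access policy.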

Velero Backup Fails

Symptom: Velero backup stuck in InProgress or Failed

Diagnosis:

# Check backup status
velero backup describe <backup-name> --details

# Check Velero logs
kubectl logs -n velero deployment/velero

Common causes:

S3 authentication failed: Invalid credentials
Volume snapshot failed: Provider plugin not configured
Resource too large: Backup timeout

Solution:

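# Fix S3 credentials (values in the Secret are base64-encoded; if FluxCD manages
# this Secret, update the cluster configuration instead so the change persists)
kubectl edit secret -n velero cloud-credentials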

# Install volume snapshot plugin
velero plugin add velero/velero-plugin-for-aws:v1.9.0

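# Increase backup timeout (velero backup create has no generic --timeout flag;
# --csi-snapshot-timeout bounds CSI snapshot creation, --item-operation-timeout bounds plugin operations)
velero backup create large-backup --csi-snapshot-timeout 2h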

Restore Fails

Symptom: Velero restore fails or is incomplete

Diagnosis:

# Check restore status
velero restore describe <restore-name> --details

# Check for errors
velero restore logs <restore-name>

Common causes:

Resource conflicts: Resources already exist
PV not available: Volume snapshots not restored
Namespace not created: Target namespace missing

Solution:

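# Delete conflicting resources (destructive: removes everything in the namespace,
# so do this only when the backup contains everything you need from it).
# Alternatively, add --existing-resource-policy update to the restore command
# to let Velero update resources that already exist.
kubectl delete namespace my-app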

# Restore with namespace mapping
velero restore create --from-backup <backup> \
  --namespace-mappings old-ns:new-ns

# Restore PVs separately
velero restore create pv-restore \
  --from-backup <backup> \
  --include-resources persistentvolumes,persistentvolumeclaims

Best Practices

Test restores regularly: Verify backups are restorable (monthly)
Multiple backup locations: Use different S3 buckets/regions for redundancy
Separate etcd and Velero backups: Different schedules and retention
Monitor backup status: Alert on backup failures
Document restore procedures: Step-by-step runbooks
Encrypt backups: Use S3 server-side encryption
Offsite backups: Store backups in different region/provider
Backup before changes: Manual backup before major changes
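For the monitoring practice above, a missing backup is as important to detect as a failed one. This is a small sketch of a staleness check that compares the time of the last successful backup with the current time; the function name is hypothetical, and both times are expected as Unix epoch seconds (for example from `date +%s`).

```shell
# Return 0 (stale) if the last backup is older than max_age_hours.
# Usage: backup_is_stale <last_backup_epoch> <now_epoch> <max_age_hours>
backup_is_stale() {
  last_epoch=$1; now_epoch=$2; max_age_hours=$3
  [ $(( now_epoch - last_epoch )) -gt $(( max_age_hours * 3600 )) ]
}
```

A daily backup could be checked with a threshold slightly above 24 hours, so that normal run-time jitter does not raise an alert while a skipped run does.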

Backup Schedule Recommendations

Development:

etcd: Daily, 7-day retention
Velero: Daily, 7-day retention

Staging:

etcd: Daily, 14-day retention
Velero: Daily, 14-day retention

Production:

etcd: Every 6 hours, 30-day retention
Velero: Daily, 90-day retention
Velero weekly: Weekly, 1-year retention
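The recommendations above can be captured in one lookup so that every environment gets consistent values. The sketch below is one way to express them; the function name and the pipe-separated output format are hypothetical, and the 2 AM time for daily etcd backups is borrowed from the earlier example rather than prescribed.

```shell
# Print "<etcd cron>|<etcd retention days>|<Velero daily retention days>"
# for the given environment name.
backup_policy() {
  case "$1" in
    development) echo "0 2 * * *|7|7" ;;
    staging)     echo "0 2 * * *|14|14" ;;
    production)  echo "0 */6 * * *|30|90" ;;   # etcd every 6 hours
    *)           echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

For example, `backup_policy production` prints `0 */6 * * *|30|90`, which can be split on the pipe character to fill in schedule and retention_days in the cluster configuration.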

Related Topics

Upgrade Kubernetes (upgrade-kubernetes.md): Backup before upgrades
Migrate Clusters (migrate-clusters.md): Use backups for migration
Troubleshoot Deployment (troubleshoot-deployment.md): Restore from backup
Platform Services (../reference/platform-services.md): etcd-backup and Velero configuration

Evidence

This guide is based on:

etcd backup service: internal/config/defaults.go:308-314
Velero service: internal/config/defaults.go:371-376
Backup configuration: Ecosystem.md infrastructure services
Velero documentation: Official Velero docs
etcd restore: Kubernetes etcd documentation
