SRE / Operator Learning Path
Purpose: A guided reading order for SREs and operators, focused on day-2 operations, monitoring, upgrades, and incident response.
Reading Order
| # | Phase | Topic | Link | Time |
|---|---|---|---|---|
| 1 | Foundations | Architecture overview | Architecture | 10 min |
| 2 | Foundations | Platform services (20+ services, versions) | Service Catalog | 10 min |
| 3 | Foundations | CLI commands reference | CLI Commands | 10 min |
| 4 | Observability | Monitoring stack (Prometheus + Grafana + Alertmanager) | Stack Overview | 10 min |
| 5 | Observability | Loki for logs | Loki | 10 min |
| 6 | Observability | Tempo for traces | Tempo | 10 min |
| 7 | Observability | Dashboards & alerts | Dashboards | 15 min |
| 8 | Operations | Day 2 overview | Day 2 | 10 min |
| 9 | Operations | Health checks (`opencenter cluster doctor`) | Health | 10 min |
| 10 | Operations | Drift detection (`opencenter cluster drift detect`) | Drift | 10 min |
| 11 | Upgrades | Kubernetes upgrades | K8s Upgrades | 15 min |
| 12 | Upgrades | Service upgrades (gitops-base tag pinning) | Service Upgrades | 10 min |
| 13 | Reliability | Backup & restore with Velero | Backup | 15 min |
| 14 | Reliability | Disaster recovery | DR | 10 min |
| 15 | Scaling | Add worker pools (`--server-pool` flag) | Workers | 10 min |
| 16 | Scaling | Node replacement | Replace | 10 min |
| 17 | Secrets | Key lifecycle (check, rotate, sync, validate) | Key Rotation | 10 min |
| 18 | Troubleshooting | FluxCD reconciliation issues | FluxCD | 10 min |
| 19 | Troubleshooting | Networking issues | Network | 10 min |
| 20 | Troubleshooting | CLI errors | CLI Errors | 10 min |
Daily Operations CLI Commands
```bash
# Health & status
opencenter cluster status my-cluster
opencenter cluster doctor my-cluster
opencenter cluster drift detect my-cluster

# Secrets lifecycle
opencenter secrets keys check              # Shows days until expiration
opencenter secrets validate my-cluster     # Detect drift
opencenter secrets sync my-cluster         # Re-encrypt after changes

# Service management
opencenter cluster service status          # All service states
opencenter cluster service enable <svc>    # Enable a service
opencenter cluster service disable <svc>   # Disable a service

# Backup
opencenter cluster backup create my-cluster
opencenter cluster backup restore <id>

# FluxCD
flux get kustomizations                    # Check reconciliation
flux reconcile source git flux-system      # Force source refresh
flux reconcile kustomization <name>        # Force kustomization apply
```
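
The three health and status commands above can be chained into a single routine check. This is a minimal sketch, not part of the opencenter CLI: `daily_health_check` is a hypothetical helper name, and the script assumes each command exits non-zero when it finds a problem, which should be confirmed against the CLI Commands reference.

```bash
#!/usr/bin/env bash
# Sketch: run the read-only health commands from this page in sequence and
# count failures. ASSUMPTION: each opencenter command returns a non-zero
# exit code when it detects a problem.

daily_health_check() {
  local cluster="$1"
  local failures=0
  local check

  # Each entry is the subcommand portion of a command listed on this page.
  for check in "cluster status" "cluster doctor" "cluster drift detect"; do
    echo "==> opencenter ${check} ${cluster}"
    # ${check} is intentionally unquoted so it splits into separate words.
    # shellcheck disable=SC2086
    if ! opencenter ${check} "${cluster}"; then
      echo "FAILED: opencenter ${check} ${cluster}" >&2
      failures=$((failures + 1))
    fi
  done

  echo "checks failed: ${failures}"
  # Return success only when every check passed.
  [ "${failures}" -eq 0 ]
}
```

Call it as `daily_health_check my-cluster`. Because the function returns non-zero when any check fails, it can gate a cron job or a pipeline step.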
Observability Stack (from openCenter-gitops-base)
| Service | Version | Purpose |
|---|---|---|
| kube-prometheus-stack | 77.6.0 | Prometheus, Grafana, Alertmanager |
| Loki | 6.45.2 | Log aggregation |
| Mimir | 6.0.3 | Long-term metrics storage |
| Tempo | 1.55.0 | Distributed tracing |
| OpenTelemetry | 0.11.1 | Telemetry collection pipeline |
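
To check a running cluster against the pinned versions in this table, one option is to compare them with what Flux reports. The sketch below rests on several assumptions that are not confirmed by this page: that each service is deployed as a Flux HelmRelease, that the table lists Helm chart versions, and that the release names match the lowercase names used in `expected_versions` (these names are hypothetical, so adjust them to your cluster). `expected_versions` and `compare_versions` are illustrative helper names.

```bash
#!/usr/bin/env bash
# Sketch: compare deployed HelmRelease revisions with the versions in the
# table above. Release names below are ASSUMED, not taken from the platform.

expected_versions() {
  cat <<'EOF'
kube-prometheus-stack 77.6.0
loki 6.45.2
mimir 6.0.3
tempo 1.55.0
opentelemetry 0.11.1
EOF
}

# Reads `flux get helmreleases --all-namespaces` output on stdin
# (columns: NAMESPACE NAME REVISION ...) and prints one line per expected
# service: OK, MISMATCH, or MISSING.
compare_versions() {
  local deployed name want got
  deployed="$(cat)"
  expected_versions | while read -r name want; do
    # Pick the REVISION column of the first row whose NAME matches.
    got="$(printf '%s\n' "${deployed}" | awk -v n="${name}" '$2 == n { print $3; exit }')"
    if [ -z "${got}" ]; then
      echo "MISSING  ${name} (expected ${want})"
    elif [ "${got}" = "${want}" ]; then
      echo "OK       ${name} ${got}"
    else
      echo "MISMATCH ${name} expected ${want}, deployed ${got}"
    fi
  done
}
```

A typical invocation would be `flux get helmreleases --all-namespaces | compare_versions`. The comparison is an exact string match, so a revision that carries a suffix (for example a digest) would be reported as a mismatch.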
Runbook Index
After completing this path, familiarize yourself with the Runbooks for standardized incident response procedures.