Health Checks
Purpose: Documents health check procedures for operators, covering cluster components, CLI commands, kubectl checks, and Prometheus alert rules.
Prerequisites
- kubectl access to the target cluster
- opencenter CLI installed
- flux CLI installed (for FluxCD checks)
- Prometheus stack deployed (for alert-based monitoring)
Quick Health Check
Run the openCenter CLI validation to check overall cluster health:
opencenter cluster doctor <cluster-name>
This checks local tools, credentials, and provider readiness in a single pass.
Component Checks
etcd
# Check etcd member health (from a control plane node)
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
# Check etcd member list and leader
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
--write-out=table
# Check etcd database size
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
--write-out=table
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| DB size | < 4 GB | 4–6 GB | > 6 GB |
| Leader elections (last 1h) | 0 | 1–2 | > 2 |
| Peer RTT | < 10 ms | 10–50 ms | > 50 ms |
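The DB size thresholds in the table can be applied mechanically. The following is a minimal sketch, assuming the size is supplied in bytes and that 1 GB is counted as 1024^3 bytes; classify_etcd_db_size is a hypothetical helper name, not part of etcdctl.

```shell
#!/bin/sh
# Hypothetical helper: map an etcd DB size in bytes to the table's status.
# Assumption: 1 GB is treated as 1024^3 bytes.
classify_etcd_db_size() {
  local bytes gib
  bytes=$1
  gib=$((1024 * 1024 * 1024))
  if [ "$bytes" -lt $((4 * gib)) ]; then
    echo "healthy"      # below 4 GB
  elif [ "$bytes" -le $((6 * gib)) ]; then
    echo "warning"      # 4 to 6 GB
  else
    echo "critical"     # above 6 GB
  fi
}

# Possible use (assumes jq is installed and that the JSON output of
# "etcdctl endpoint status --write-out=json" exposes .[0].Status.dbSize):
#   classify_etcd_db_size "$(etcdctl endpoint status ... --write-out=json | jq '.[0].Status.dbSize')"
```

The boundaries follow the table: exactly 4 GB and exactly 6 GB both report as warning.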
Kubernetes API Server
# Basic API server reachability
kubectl cluster-info
# API server health endpoint
kubectl get --raw /healthz
# Detailed component status
kubectl get --raw '/healthz?verbose'
# API server readiness endpoint (reports readiness to serve requests, not latency)
kubectl get --raw /readyz
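The verbose form of these endpoints lists one check per line, prefixed with [+] for passing checks and [-] for failing ones. As a rough sketch, the failing check names can be pulled out with awk; failed_health_checks is a hypothetical helper name, and it assumes the verbose check list arrives unmodified on stdin.

```shell
#!/bin/sh
# Hypothetical helper: print the names of failing checks from verbose
# /healthz or /readyz output read on stdin.
failed_health_checks() {
  awk '/^\[-\]/ {
    sub(/^\[-\]/, "")      # drop the "[-]" marker
    sub(/ failed.*$/, "")  # drop the trailing "failed: <reason>" text
    print
  }'
}

# Possible use:
#   kubectl get --raw '/readyz?verbose' | failed_health_checks
```

An empty result means no line was marked as failing in the input that was read.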
Kubelet
# Node status (all nodes should be Ready)
kubectl get nodes -o wide
# Check kubelet conditions on a specific node (-A7 covers the two header rows plus the five conditions)
kubectl describe node <node-name> | grep -A7 Conditions
# Check for NotReady nodes
kubectl get nodes | grep -v " Ready"
| Condition | Expected |
|---|---|
| Ready | True |
| MemoryPressure | False |
| DiskPressure | False |
| PIDPressure | False |
| NetworkUnavailable | False |
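The expected values in the table can be checked in a script rather than by eye. This is a sketch only: check_node_conditions is a hypothetical helper that reads TYPE=STATUS lines on stdin and reports any condition that differs from the table.

```shell
#!/bin/sh
# Hypothetical helper: compare node conditions (TYPE=STATUS, one per line)
# against the expected values from the table. Returns 1 on any mismatch.
check_node_conditions() {
  local failed type status expected
  failed=0
  while IFS='=' read -r type status; do
    case "$type" in
      Ready) expected="True" ;;
      MemoryPressure|DiskPressure|PIDPressure|NetworkUnavailable) expected="False" ;;
      *) continue ;;   # ignore conditions the table does not list
    esac
    if [ "$status" != "$expected" ]; then
      echo "UNEXPECTED: $type=$status (expected $expected)"
      failed=1
    fi
  done
  return "$failed"
}

# Possible use:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | check_node_conditions
```

A status of Unknown is reported as a mismatch as well, since it matches neither expected value.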
FluxCD
# Verify Flux controllers are running
kubectl get pods -n flux-system
# Check all GitRepository sources
flux get sources git -A
# Check all Kustomizations
flux get kustomizations -A
# Check all HelmReleases
flux get helmreleases -A
# Look for reconciliation failures
flux get kustomizations -A --status-selector ready=false
A healthy Flux setup shows READY=True for every source and Kustomization, and each Kustomization message reports an applied revision.
Platform Services
# Check all pods across namespaces (look for non-Running)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Check specific services
kubectl get pods -n monitoring # Prometheus, Grafana, Alertmanager
kubectl get pods -n cert-manager # cert-manager
kubectl get pods -n velero # Velero
kubectl get pods -n logging # Loki
kubectl get pods -n auth # Keycloak
# Check HelmRelease health for all services
flux get helmreleases -A --status-selector ready=false
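The phase-based field selector does not catch everything: a pod in CrashLoopBackOff, or one whose containers are not all ready, usually still has phase Running. One way to surface those is to filter the tabular output instead. The sketch below assumes the default column order of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, ...); unhealthy_pods is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "kubectl get pods -A --no-headers" output on stdin
# and print pods that are not Running, or Running with unready containers.
unhealthy_pods() {
  awk '{
    if ($4 == "Completed") next            # finished jobs are fine
    split($3, ready, "/")                  # READY column, e.g. "1/2"
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $4, $3              # namespace/name STATUS READY
  }'
}

# Possible use:
#   kubectl get pods -A --no-headers | unhealthy_pods
```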
CLI Validation Commands
Full Cluster Validation
# Validate configuration, secrets, and service readiness
opencenter cluster validate <cluster-name>
This runs schema validation, business rules, and service secret checks. Output includes:
- Service reports (pass/fail per enabled service)
- GitOps structure validation
- Missing secrets detection
- Stub/placeholder secret detection
Drift Detection
# Detect infrastructure drift
opencenter cluster drift detect <cluster-name>
# Show only critical drift
opencenter cluster drift detect <cluster-name> --severity=critical
# Output as JSON for automation
opencenter cluster drift detect <cluster-name> --output=json
Secrets Validation
# Validate SOPS key configuration
opencenter secrets keys validate
# Check key expiration
opencenter secrets keys check
Prometheus Alert Rules
The kube-prometheus-stack service deploys standard alert rules. Key alerts for cluster health:
| Alert | Severity | Condition |
|---|---|---|
| etcdMembersDown | critical | Any etcd member unreachable for 3m |
| etcdNoLeader | critical | etcd cluster has no leader for 1m |
| etcdHighNumberOfLeaderChanges | warning | > 3 leader changes in 1h |
| etcdDatabaseHighFragmentationRatio | warning | Fragmentation > 50% |
| KubeNodeNotReady | warning | Node NotReady for 15m |
| KubeNodeUnreachable | warning | Node unreachable for 15m |
| KubePodCrashLooping | warning | Pod restart > 5 times in 15m |
| KubePodNotReady | warning | Pod not ready for 15m |
| KubeAPIDown | critical | API server unreachable for 5m |
| KubeControllerManagerDown | critical | Controller manager down for 5m |
| KubeSchedulerDown | critical | Scheduler down for 5m |
| FluxReconciliationFailure | warning | FluxCD reconciliation failing for 30m |
| CertManagerCertExpirySoon | warning | Certificate expires within 21 days |
| CertManagerCertNotReady | critical | Certificate not ready for 10m |
| VeleroBackupFailure | warning | Velero backup failed |
Access alerts via Alertmanager:
# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# Check firing alerts via API
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
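The jq command above prints one quoted alert name per firing alert instance, so the same name can repeat many times. A small sketch for collapsing that list into counts follows; summarize_alerts is a hypothetical helper name and uses only tr, awk, and sort.

```shell
#!/bin/sh
# Hypothetical helper: read alert names (one per line, optionally quoted)
# and print "<count> <alertname>", most frequent first.
summarize_alerts() {
  tr -d '"' |
    awk 'NF { count[$0]++ } END { for (name in count) print count[name], name }' |
    sort -k1,1nr -k2,2
}

# Possible use:
#   curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | summarize_alerts
```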
Automated Health Probes
Liveness and Readiness
All openCenter platform services include Kubernetes liveness and readiness probes. Check probe status:
# List pods with failing probes
kubectl get events -A --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20
Scheduled Health Checks
Combine checks into a script for periodic execution:
#!/bin/bash
set -euo pipefail
CLUSTER=${1:?Usage: health-check.sh <cluster-name>}
echo "=== Node Status ==="
kubectl get nodes
echo "=== Flux Status ==="
flux get kustomizations -A --status-selector ready=false
echo "=== Failed Pods ==="
kubectl get pods -A --field-selector=status.phase=Failed
echo "=== CrashLooping Pods ==="
kubectl get pods -A | grep CrashLoopBackOff || echo "None"
echo "=== Certificate Expiry ==="
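# Quote the column spec so the shell does not interpret the brackets,
# and select the Ready condition by type rather than by position
kubectl get certificates -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,EXPIRY:.status.notAfter'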
echo "=== Velero Backups ==="
velero backup get --output=json | jq '.items[-1] | {name: .metadata.name, phase: .status.phase, started: .status.startTimestamp}'
echo "=== Cluster Validation ==="
opencenter cluster validate "$CLUSTER"
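Because the script sets set -e, the first command that exits non-zero ends the run and the remaining checks are skipped. If a full report is preferred, one option is a wrapper that records failures and keeps going. The sketch below is an assumption about how that could look; run_check and failures are hypothetical names, and the script using it would drop the -e flag.

```shell
#!/bin/sh
# Hypothetical helper: run one labelled check, count failures, keep going.
failures=0

run_check() {
  local label
  label=$1
  shift
  echo "=== $label ==="
  if ! "$@"; then
    echo "CHECK FAILED: $label" >&2
    failures=$((failures + 1))
  fi
}

# Possible use inside the script (without set -e):
#   run_check "Node Status" kubectl get nodes
#   run_check "Cluster Validation" opencenter cluster validate "$CLUSTER"
#   exit "$((failures > 0))"
```

Exiting with a non-zero status when any check failed lets a scheduler such as cron or a CI job flag the run.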
Troubleshooting
- Node NotReady: Check kubelet logs on the node with journalctl -u kubelet -f. Common causes: disk pressure, network plugin crash, or certificate expiration.
- Flux reconciliation failure: Run flux logs --level=error, then inspect the specific kustomization with kubectl describe kustomization <name> -n flux-system.
- etcd high latency: Check disk I/O on control plane nodes. etcd requires low-latency storage (SSD recommended, < 10 ms fsync).
- API server 5xx errors: Check API server logs and etcd connectivity. May indicate etcd quorum loss or resource exhaustion.