Health Checks

Purpose: Documents health check procedures for operators, covering cluster components through CLI commands, kubectl checks, and Prometheus alert rules.

Prerequisites

kubectl access to the target cluster
opencenter CLI installed
flux CLI installed (for FluxCD checks)
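Prometheus stack deployed (for alert-based monitoring)
curl, jq, and the velero CLI installed (for the Alertmanager query and the scheduled health check script)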

Quick Health Check

Run the openCenter CLI doctor command as a first-pass check before moving on to the component checks:

opencenter cluster doctor <cluster-name>

This checks local tools, credentials, and provider readiness in a single pass.

Component Checks

etcd

# Check etcd member health (from a control plane node)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# List etcd members (ID, status, peer and client URLs)
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
  --write-out=table

# Check etcd database size and current leader (DB SIZE and IS LEADER columns)
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
  --write-out=table

Metric	Healthy	Warning	Critical
DB size	< 4 GB	4–6 GB	> 6 GB
Leader elections (last 1h)	0	1–2	> 2
Peer RTT	< 10 ms	10–50 ms	> 50 ms
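
The DB size thresholds above can be applied mechanically. The following is a minimal sketch, not part of the openCenter CLI: `classify_etcd_db_size` is a hypothetical helper, and it assumes that "GB" in the table means 1024^3 bytes and that the size arrives as a byte count (for example, from the `dbSize` field that `etcdctl endpoint status --write-out=json` is expected to report).

```shell
# Hypothetical helper: map an etcd DB size in bytes to the status bands
# from the table above. Assumes 1 GB = 1024^3 bytes.
classify_etcd_db_size() {
  bytes=$1
  gib=$((1024 * 1024 * 1024))
  if [ "$bytes" -lt $((4 * gib)) ]; then
    echo "healthy"      # < 4 GB
  elif [ "$bytes" -le $((6 * gib)) ]; then
    echo "warning"      # 4-6 GB
  else
    echo "critical"     # > 6 GB
  fi
}

# Example: a 5 GiB database falls in the warning band
classify_etcd_db_size $((5 * 1024 * 1024 * 1024))
```

Returning a single word keeps the helper easy to use in a larger script, for example as the condition of a `case` statement.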

Kubernetes API Server

# Basic API server reachability
kubectl cluster-info

# API server health endpoint (deprecated since Kubernetes 1.16 in favor of /livez and /readyz)
kubectl get --raw /healthz

# Per-check health detail (quoted so the shell does not treat "?" as a glob)
kubectl get --raw '/healthz?verbose'

# Check API server readiness (prints "ok" when ready; this does not measure latency)
kubectl get --raw /readyz
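
The verbose form of these endpoints prints one line per check, prefixed with `[+]` for a passing check and `[-]` for a failing one. As a rough sketch, the hypothetical helper `failed_health_checks` below pulls out only the failing check names. How kubectl surfaces the response body when the endpoint returns a failure status can vary, so treat the pipeline in the usage comment as an assumption to verify on your cluster.

```shell
# Hypothetical helper: read verbose health output on stdin and print the
# name of each failing check (lines that start with "[-]").
failed_health_checks() {
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}

# Usage against a live cluster (the endpoint is quoted so "?" is not globbed):
#   kubectl get --raw '/readyz?verbose' 2>&1 | failed_health_checks
```

Empty output means no check was reported as failing.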

Kubelet

# Node status (all nodes should be Ready)
kubectl get nodes -o wide

# Check kubelet conditions on a specific node
# (-A8 covers the two header lines plus all five conditions)
kubectl describe node <node-name> | grep -A8 Conditions

# Check for NotReady nodes
kubectl get nodes | grep -v " Ready"

Condition	Expected
Ready	True
MemoryPressure	False
DiskPressure	False
PIDPressure	False
NetworkUnavailable	False
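
For scripting, the Ready expectation in the table can be checked without reading the full node list. This is a sketch only: `not_ready_nodes` is a hypothetical helper that parses the STATUS column of `kubectl get nodes --no-headers`, and it does not inspect the pressure conditions, which still need `kubectl describe node`.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose STATUS does not begin with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Matching on the column value rather than grepping the whole line avoids false hits when a node name happens to contain "Ready".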

FluxCD

# Verify Flux controllers are running
kubectl get pods -n flux-system

# Check all GitRepository sources
flux get sources git -A

# Check all Kustomizations
flux get kustomizations -A

# Check all HelmReleases
flux get helmreleases -A

# Look for reconciliation failures
flux get kustomizations -A --status-selector ready=false

A healthy Flux setup shows READY=True for every source, Kustomization, and HelmRelease. For Kustomizations, the MESSAGE column reads "Applied revision: <revision>".

Platform Services

# Check all pods across namespaces (look for non-Running)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Check specific services
kubectl get pods -n monitoring       # Prometheus, Grafana, Alertmanager
kubectl get pods -n cert-manager     # cert-manager
kubectl get pods -n velero           # Velero
kubectl get pods -n logging          # Loki
kubectl get pods -n auth             # Keycloak

# Check HelmRelease health for all services
flux get helmreleases -A --status-selector ready=false
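
When many namespaces are involved, a per-namespace count is quicker to scan than the raw pod list. The sketch below assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...); `unhealthy_pods_by_namespace` is a hypothetical helper name. A pod in Running status can still be failing its readiness probe, so this complements the READY column rather than replacing it.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print "<namespace> <count>" for pods whose STATUS column is neither
# Running nor Completed.
unhealthy_pods_by_namespace() {
  awk '$4 != "Running" && $4 != "Completed" { count[$1]++ }
       END { for (ns in count) print ns, count[ns] }' | sort
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods_by_namespace
```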

CLI Validation Commands

Full Cluster Validation

# Validate configuration, secrets, and service readiness
opencenter cluster validate <cluster-name>

This runs schema validation, business rules, and service secret checks. Output includes:

Service reports (pass/fail per enabled service)
GitOps structure validation
Missing secrets detection
Stub/placeholder secret detection

Drift Detection

# Detect infrastructure drift
opencenter cluster drift detect <cluster-name>

# Show only critical drift
opencenter cluster drift detect <cluster-name> --severity=critical

# Output as JSON for automation
opencenter cluster drift detect <cluster-name> --output=json

Secrets Validation

# Validate SOPS key configuration
opencenter secrets keys validate

# Check key expiration
opencenter secrets keys check

Prometheus Alert Rules

The kube-prometheus-stack service deploys standard alert rules. Key alerts for cluster health:

Alert	Severity	Condition
`etcdMembersDown`	critical	Any etcd member unreachable for 3m
`etcdNoLeader`	critical	etcd cluster has no leader for 1m
`etcdHighNumberOfLeaderChanges`	warning	> 3 leader changes in 1h
`etcdDatabaseHighFragmentationRatio`	warning	Fragmentation > 50%
`KubeNodeNotReady`	warning	Node NotReady for 15m
`KubeNodeUnreachable`	warning	Node unreachable for 15m
`KubePodCrashLooping`	warning	Pod restart > 5 times in 15m
`KubePodNotReady`	warning	Pod not ready for 15m
`KubeAPIDown`	critical	API server unreachable for 5m
`KubeControllerManagerDown`	critical	Controller manager down for 5m
`KubeSchedulerDown`	critical	Scheduler down for 5m
`FluxReconciliationFailure`	warning	FluxCD reconciliation failing for 30m
`CertManagerCertExpirySoon`	warning	Certificate expires within 21 days
`CertManagerCertNotReady`	critical	Certificate not ready for 10m
`VeleroBackupFailure`	warning	Velero backup failed

Access alerts via Alertmanager:

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# Check firing alerts via API
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
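
The query above prints every alert name, including silenced and low-severity alerts. As a sketch, the hypothetical helper `active_critical_alerts` narrows the same JSON to alerts that are active and labelled `severity=critical`. It assumes the Alertmanager v2 response shape, where each alert carries `labels` and `status.state`, and that the rules set a `severity` label as in the table above.

```shell
# Hypothetical helper: read Alertmanager /api/v2/alerts JSON on stdin and
# print the unique names of alerts that are active (not suppressed) and
# labelled severity=critical.
active_critical_alerts() {
  jq -r '[.[]
          | select(.status.state == "active" and .labels.severity == "critical")
          | .labels.alertname]
         | unique
         | .[]'
}

# Usage with the port-forward above still running:
#   curl -s http://localhost:9093/api/v2/alerts | active_critical_alerts
```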

Automated Health Probes

Liveness and Readiness

All openCenter platform services include Kubernetes liveness and readiness probes. Check probe status:

# List the 20 most recent probe failure events (reason=Unhealthy)
kubectl get events -A --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20

Scheduled Health Checks

Combine checks into a script for periodic execution:

#!/bin/bash
set -euo pipefail

CLUSTER=${1:?Usage: health-check.sh <cluster-name>}

echo "=== Node Status ==="
kubectl get nodes

echo "=== Flux Status ==="
flux get kustomizations -A --status-selector ready=false

echo "=== Failed Pods ==="
kubectl get pods -A --field-selector=status.phase=Failed

echo "=== CrashLooping Pods ==="
kubectl get pods -A | grep CrashLoopBackOff || echo "None"

echo "=== Certificate Expiry ==="
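# Select the Ready condition by type rather than by list position;
# the expression is quoted so the shell does not interpret the brackets
kubectl get certificates -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,EXPIRY:.status.notAfter'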

echo "=== Velero Backups ==="
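# Show the most recently started backup (assumes the output is a list with an items array)
velero backup get --output=json | jq '.items | sort_by(.status.startTimestamp) | last | {name: .metadata.name, phase: .status.phase, started: .status.startTimestamp}'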

echo "=== Cluster Validation ==="
opencenter cluster validate "$CLUSTER"
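
Because the script sets `set -e`, the first command that exits non-zero ends the run, and the remaining sections never print. One possible variation, sketched here with a hypothetical `run_check` helper, records each failure and keeps going so that a single report covers every section. The stand-in commands `true` and `false` are used only to show the behavior.

```shell
# Hypothetical helper: run one named check, keep going on failure, and
# count how many checks failed.
failures=0
run_check() {
  name=$1
  shift
  echo "=== $name ==="
  if ! "$@"; then
    failures=$((failures + 1))
    echo "FAILED: $name" >&2
  fi
}

# Stand-in commands show the behavior; in a real script they would be
# calls such as: run_check "Node Status" kubectl get nodes
run_check "Always passes" true
run_check "Always fails" false
echo "checks failed: $failures"
```

A scheduler such as cron could then act on the result if the script ends with a non-zero exit status whenever `failures` is greater than zero.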

Troubleshooting

Node NotReady — Check kubelet logs on the node: journalctl -u kubelet -f. Common causes: disk pressure, network plugin crash, or certificate expiration.
Flux reconciliation failure — Run flux logs --level=error and check the specific kustomization: kubectl describe kustomization <name> -n flux-system.
etcd high latency — Check disk I/O on control plane nodes. etcd requires low-latency storage (SSD recommended, < 10ms fsync).
API server 5xx errors — Check API server logs and etcd connectivity. May indicate etcd quorum loss or resource exhaustion.
