
Health Checks

Purpose: Documents health check procedures for operators, covering CLI commands, kubectl checks, and Prometheus alert rules for cluster components.

Prerequisites

  • kubectl access to the target cluster
  • opencenter CLI installed
  • flux CLI installed (for FluxCD checks)
  • Prometheus stack deployed (for alert-based monitoring)

Quick Health Check

Run the openCenter CLI validation to check overall cluster health:

opencenter cluster doctor <cluster-name>

This checks local tools, credentials, and provider readiness in a single pass.

Component Checks

etcd

# Check etcd member health (from a control plane node)
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# List etcd members (the leader is shown by "endpoint status", not by "member list")
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
--write-out=table

# Check etcd database size and which endpoint is the leader
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem \
--write-out=table

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| DB size | < 4 GB | 4–6 GB | > 6 GB |
| Leader elections (last 1h) | 0 | 1–2 | > 2 |
| Peer RTT | < 10 ms | 10–50 ms | > 50 ms |
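
The DB size thresholds above can be turned into a small classifier for scripted checks. The following is a minimal sketch, not part of the openCenter CLI: the function name classify_db_size is hypothetical, and it assumes that "GB" means GiB (1024^3 bytes) and that the size is passed in bytes.

```shell
#!/usr/bin/env bash
# Classify an etcd database size (in bytes) as healthy, warning, or critical.
# Assumption: the "GB" thresholds are interpreted as GiB (1024^3 bytes).
classify_db_size() {
  local bytes=$1
  local gib=$((1024 * 1024 * 1024))
  if [ "$bytes" -gt $((6 * gib)) ]; then
    echo critical          # above 6 GiB
  elif [ "$bytes" -ge $((4 * gib)) ]; then
    echo warning           # between 4 and 6 GiB inclusive
  else
    echo healthy           # below 4 GiB
  fi
}

# Example usage on a control plane node (same TLS flags as the commands above).
# The jq path assumes the usual JSON layout of "endpoint status":
#   size=$(etcdctl endpoint status ... --write-out=json | jq '.[0].Status.dbSize')
#   classify_db_size "$size"
```

A result of warning is usually a prompt to review compaction and defragmentation before the database approaches the configured quota.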

Kubernetes API Server

# Basic API server reachability
kubectl cluster-info

# API server health endpoint
kubectl get --raw /healthz

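# Detailed per-check status (quote the path so the shell does not expand the "?")
kubectl get --raw '/healthz?verbose'

# Check API server readiness (this reports whether the server can serve requests; it does not measure latency)
kubectl get --raw /readyz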

Kubelet

# Node status (all nodes should be Ready)
kubectl get nodes -o wide

# Check kubelet conditions on a specific node
kubectl describe node <node-name> | grep -A5 Conditions

# Check for NotReady nodes
kubectl get nodes | grep -v " Ready"

| Condition | Expected |
|---|---|
| Ready | True |
| MemoryPressure | False |
| DiskPressure | False |
| PIDPressure | False |
| NetworkUnavailable | False |
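
To compare every node against the expected conditions above in one pass, the condition list can be filtered with awk. This is a rough sketch under stated assumptions: the function name unexpected_conditions is hypothetical, and the input is one "node condition status" triple per line, produced here with jq.

```shell
#!/usr/bin/env bash
# Print node conditions that deviate from the expected values.
# Input on stdin: one "<node> <condition> <status>" triple per line.
unexpected_conditions() {
  awk '
    # Ready must be True; anything else (False, Unknown) is reported.
    $2 == "Ready" { if ($3 != "True") print; next }
    # The pressure and network conditions must be False.
    $2 == "MemoryPressure" || $2 == "DiskPressure" ||
    $2 == "PIDPressure" || $2 == "NetworkUnavailable" {
      if ($3 != "False") print
    }
  '
}

# Example usage against a live cluster:
#   kubectl get nodes -o json \
#     | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | "\($n) \(.type) \(.status)"' \
#     | unexpected_conditions
```

Empty output means every reported condition matches the table. Conditions that a node does not report at all are not flagged by this sketch.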

FluxCD

# Verify Flux controllers are running
kubectl get pods -n flux-system

# Check all GitRepository sources
flux get sources git -A

# Check all Kustomizations
flux get kustomizations -A

# Check all HelmReleases
flux get helmreleases -A

# Look for reconciliation failures
flux get kustomizations -A --status-selector ready=false

A healthy Flux setup shows READY=True for every source, Kustomization, and HelmRelease, and each Kustomization's status message reports an applied revision.

Platform Services

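# List pods in any namespace that are neither Running nor Succeeded
# (pods in CrashLoopBackOff usually still report phase Running, so check those separately)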
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Check specific services
kubectl get pods -n monitoring # Prometheus, Grafana, Alertmanager
kubectl get pods -n cert-manager # cert-manager
kubectl get pods -n velero # Velero
kubectl get pods -n logging # Loki
kubectl get pods -n auth # Keycloak

# Check HelmRelease health for all services
flux get helmreleases -A --status-selector ready=false

CLI Validation Commands

Full Cluster Validation

# Validate configuration, secrets, and service readiness
opencenter cluster validate <cluster-name>

This runs schema validation, business rules, and service secret checks. Output includes:

  • Service reports (pass/fail per enabled service)
  • GitOps structure validation
  • Missing secrets detection
  • Stub/placeholder secret detection

Drift Detection

# Detect infrastructure drift
opencenter cluster drift detect <cluster-name>

# Show only critical drift
opencenter cluster drift detect <cluster-name> --severity=critical

# Output as JSON for automation
opencenter cluster drift detect <cluster-name> --output=json

Secrets Validation

# Validate SOPS key configuration
opencenter secrets keys validate

# Check key expiration
opencenter secrets keys check

Prometheus Alert Rules

The kube-prometheus-stack service deploys standard alert rules. Key alerts for cluster health:

| Alert | Severity | Condition |
|---|---|---|
| etcdMembersDown | critical | Any etcd member unreachable for 3m |
| etcdNoLeader | critical | etcd cluster has no leader for 1m |
| etcdHighNumberOfLeaderChanges | warning | > 3 leader changes in 1h |
| etcdDatabaseHighFragmentationRatio | warning | Fragmentation > 50% |
| KubeNodeNotReady | warning | Node NotReady for 15m |
| KubeNodeUnreachable | warning | Node unreachable for 15m |
| KubePodCrashLooping | warning | Pod restart > 5 times in 15m |
| KubePodNotReady | warning | Pod not ready for 15m |
| KubeAPIDown | critical | API server unreachable for 5m |
| KubeControllerManagerDown | critical | Controller manager down for 5m |
| KubeSchedulerDown | critical | Scheduler down for 5m |
| FluxReconciliationFailure | warning | FluxCD reconciliation failing for 30m |
| CertManagerCertExpirySoon | warning | Certificate expires within 21 days |
| CertManagerCertNotReady | critical | Certificate not ready for 10m |
| VeleroBackupFailure | warning | Velero backup failed |

Access alerts via Alertmanager:

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# Check firing alerts via API
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
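
The raw alert list can be long, so it often helps to narrow it to alerts that are actually firing at a given severity. The sketch below assumes the standard Alertmanager v2 response shape (an array of alerts with labels and a status.state field) and that alerts carry a severity label, as in the table above. The function name active_alerts_by_severity is hypothetical.

```shell
#!/usr/bin/env bash
# Print the names of active alerts at the given severity, de-duplicated.
# Reads the JSON array returned by /api/v2/alerts on stdin.
# Alerts that are silenced or inhibited have state "suppressed" and are skipped.
active_alerts_by_severity() {
  local severity=$1
  jq -r --arg sev "$severity" \
    '.[] | select(.status.state == "active" and .labels.severity == $sev) | .labels.alertname' \
    | sort -u
}

# Example usage with the port-forward above still running:
#   curl -s http://localhost:9093/api/v2/alerts | active_alerts_by_severity critical
```

Empty output for critical is the expected state of a healthy cluster.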

Automated Health Probes

Liveness and Readiness

All openCenter platform services include Kubernetes liveness and readiness probes. Check probe status:

# Show the 20 most recent probe-failure (Unhealthy) events across all namespaces
kubectl get events -A --field-selector reason=Unhealthy --sort-by='.lastTimestamp' | tail -20

Scheduled Health Checks

Combine checks into a script for periodic execution:

#!/bin/bash
set -euo pipefail

CLUSTER=${1:?Usage: health-check.sh <cluster-name>}

echo "=== Node Status ==="
kubectl get nodes

echo "=== Flux Status ==="
flux get kustomizations -A --status-selector ready=false

echo "=== Failed Pods ==="
kubectl get pods -A --field-selector=status.phase=Failed

echo "=== CrashLooping Pods ==="
kubectl get pods -A | grep CrashLoopBackOff || echo "None"

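echo "=== Certificate Expiry ==="
# Quote the column spec so the shell does not treat the brackets as a glob,
# and select the Ready condition by type instead of relying on its position.
kubectl get certificates -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,'\
'NAME:.metadata.name,'\
'READY:.status.conditions[?(@.type=="Ready")].status,'\
'EXPIRY:.status.notAfter'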

echo "=== Velero Backups ==="
velero backup get --output=json | jq '.items[-1] | {name: .metadata.name, phase: .status.phase, started: .status.startTimestamp}'

echo "=== Cluster Validation ==="
opencenter cluster validate "$CLUSTER"
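
Because the script above uses set -e, the first command that exits non-zero stops the run and the remaining checks never execute. For unattended runs, one possible variation is to record failures and continue, then exit non-zero at the end. The following is a sketch only: run_check and FAILED are hypothetical names, not part of any CLI used on this page.

```shell
#!/bin/bash
# Sketch: run each check, record failures, and keep going.
# Deliberately no -e here, so a failing check does not stop the run.
set -uo pipefail

FAILED=0

# run_check <label> <command...>
# Runs the command, prints a section header, and counts a non-zero exit as a failure.
run_check() {
  local label=$1
  shift
  echo "=== ${label} ==="
  if ! "$@"; then
    echo "CHECK FAILED: ${label}" >&2
    FAILED=$((FAILED + 1))
  fi
}

# Example wiring, using commands from this page:
#   run_check "Node Status" kubectl get nodes
#   run_check "Cluster Validation" opencenter cluster validate "$CLUSTER"
#   exit "$((FAILED > 0))"   # 0 if every check passed, 1 otherwise
```

A scheduler such as cron or a Kubernetes CronJob can then alert on the script's exit status rather than on its output.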

Troubleshooting

  • Node NotReady — Check kubelet logs on the node: journalctl -u kubelet -f. Common causes: disk pressure, network plugin crash, or certificate expiration.
  • Flux reconciliation failure — Run flux logs --level=error and check the specific kustomization: kubectl describe kustomization <name> -n flux-system.
  • etcd high latency — Check disk I/O on control plane nodes. etcd requires low-latency storage (SSD recommended, < 10ms fsync).
  • API server 5xx errors — Check API server logs and etcd connectivity. May indicate etcd quorum loss or resource exhaustion.