Kubernetes Upgrades

Purpose: Shows operators how to perform Kubernetes upgrades with pre-flight checks, rolling updates, and post-upgrade validation.

Prerequisites

  • SSH access to cluster nodes (via bastion or direct)
  • Current kubeconfig for the target cluster
  • Kubespray inventory in infrastructure/clusters/<cluster>/inventory/
  • A Velero backup taken before starting (see Backup & Restore)
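The pre-upgrade backup can be sketched as below. The `velero backup create --wait` and `velero backup describe` commands are standard Velero CLI; the timestamped naming scheme is illustrative, so adapt it to your conventions.

```shell
# Sketch: build a timestamped backup name, then take and verify the backup.
# The name format is an assumption, not a required convention.
BACKUP_NAME="pre-upgrade-$(date +%Y%m%d-%H%M)"
echo "$BACKUP_NAME"

# Run against the live cluster, then confirm it completed:
# velero backup create "$BACKUP_NAME" --wait
# velero backup describe "$BACKUP_NAME"
```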

Pre-Flight Checks

Before upgrading, verify the cluster is healthy and the target version is supported:

# Check current Kubernetes version
kubectl get nodes -o wide

# Verify all nodes are Ready
kubectl get nodes

# Check for restrictive PodDisruptionBudgets that could block node drains
kubectl get pdb -A

# Confirm no ongoing FluxCD reconciliation failures
flux get kustomizations -A

Kubernetes supports upgrading one minor version at a time (e.g., 1.28 → 1.29, not 1.28 → 1.30). If you need to jump multiple versions, plan sequential upgrades.
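The one-minor-version rule can be enforced with a small helper before any playbook run. This is a hedged sketch: `minor` and `check_skew` are hypothetical helper names, and the commented `kubectl version`/`jq` invocation assumes those tools are available on your workstation.

```shell
# Sketch: reject upgrades that skip a minor version. Pure string logic,
# so it can be checked before touching the cluster.
minor() { echo "$1" | sed 's/^v//' | cut -d. -f2; }

check_skew() {
  current="$1"; target="$2"
  skew=$(( $(minor "$target") - $(minor "$current") ))
  if [ "$skew" -gt 1 ]; then
    echo "ERROR: $current -> $target skips a minor version; upgrade sequentially"
    return 1
  fi
  echo "OK: $current -> $target is a single-minor upgrade"
}

# Compare the live server version to the planned kube_version:
# check_skew "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" v1.30.2

check_skew v1.28.9 v1.29.4           # single minor: OK
check_skew v1.28.9 v1.30.2 || true   # skips 1.29: ERROR
```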

Update the Kubespray Version Variable

Edit the Kubespray inventory to set the target Kubernetes version:

# infrastructure/clusters/<cluster>/inventory/group_vars/k8s_cluster/k8s-cluster.yml
kube_version: v1.30.2

Commit this change to a branch and open a PR:

git checkout -b upgrade/k8s-v1.30.2
git add infrastructure/clusters/<cluster>/inventory/group_vars/k8s_cluster/k8s-cluster.yml
git commit -m "chore: upgrade Kubernetes to v1.30.2"
git push origin upgrade/k8s-v1.30.2

After PR approval and merge, proceed with the Kubespray run.

Run the Upgrade

Kubespray performs a rolling upgrade — control plane nodes first, then workers. Each node is cordoned, drained, upgraded, and uncordoned automatically.

cd infrastructure/clusters/<cluster>/inventory/

ansible-playbook -i inventory.yaml \
  -b --become-user=root \
  upgrade-cluster.yml \
  -e kube_version=v1.30.2

For large clusters, you can limit the upgrade to control plane first, then workers:

# Control plane only
ansible-playbook -i inventory.yaml \
  -b --become-user=root \
  upgrade-cluster.yml \
  -e kube_version=v1.30.2 \
  --limit=kube_control_plane

# Workers only (after control plane is confirmed healthy)
ansible-playbook -i inventory.yaml \
  -b --become-user=root \
  upgrade-cluster.yml \
  -e kube_version=v1.30.2 \
  --limit=kube_node

Post-Upgrade Validation

# Verify all nodes report the new version
kubectl get nodes -o wide

# Check system pods are running
kubectl get pods -n kube-system

# Verify etcd cluster health (run on an etcd/control-plane node)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# Confirm FluxCD is reconciling normally
flux get kustomizations -A

# Run drift detection to confirm no unexpected changes
opencenter cluster drift <cluster-name>
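The node-version check above can be made fail-fast with a small filter. This is a sketch: `check_versions` is a hypothetical helper, and the commented `kubectl` jsonpath invocation assumes working cluster access; the self-contained example uses sample data.

```shell
# Sketch: print any node still on an old kubelet version and exit non-zero
# if one is found. Reads "name version" lines on stdin; $1 is the target.
check_versions() {
  awk -v t="$1" '$2 != t { print "stale:", $1, $2; bad = 1 } END { exit bad }'
}

# Against a live cluster (assumes kubectl access):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
#   | check_versions v1.30.2

# Self-contained example with sample data:
printf 'node-a v1.30.2\nnode-b v1.29.4\n' | check_versions v1.30.2 || echo "upgrade incomplete"
```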

Troubleshooting

  • Node stuck in NotReady — SSH to the node and check kubelet logs: journalctl -u kubelet -f. Common cause: container runtime version incompatibility.
  • Drain timeout — A PodDisruptionBudget may be blocking eviction. Check kubectl get pdb -A and adjust if safe.
  • etcd issues after upgrade — Verify etcd member list: etcdctl member list. If a member is unhealthy, check disk I/O and network connectivity.
  • API deprecations — Run kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis to find workloads using removed APIs before upgrading.
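The deprecated-API check in the last bullet can be turned into a readable report. This is a sketch: `deprecated_apis` is a hypothetical filter over the real `apiserver_requested_deprecated_apis` metric (whose labels include `group` and `version`), shown here against a sample metric line.

```shell
# Sketch: extract which deprecated group/versions are still being requested,
# dropping zero-count series. Pure text filter over apiserver metrics output.
deprecated_apis() {
  grep '^apiserver_requested_deprecated_apis' \
    | grep -v ' 0$' \
    | sed -n 's/.*group="\([^"]*\)".*version="\([^"]*\)".*/\1\/\2/p' \
    | sort -u
}

# Against a live cluster:
# kubectl get --raw /metrics | deprecated_apis

# Self-contained example with a sample metric line:
echo 'apiserver_requested_deprecated_apis{group="policy",removed_release="1.25",resource="podsecuritypolicies",version="v1beta1"} 3' \
  | deprecated_apis
```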