# Node Replacement

**Purpose:** For operators. Shows how to drain, remove, and re-provision failed control plane or worker nodes.
## Prerequisites

- `kubectl` access to the cluster
- SSH access to cluster nodes (via bastion or direct)
- Kubespray inventory in `infrastructure/clusters/<cluster>/inventory/`
- For control plane nodes: at least 3 control plane nodes (replacing one at a time maintains quorum)
## Step 1: Identify the Failed Node

```sh
# Check node status
kubectl get nodes -o wide

# Look for NotReady nodes
kubectl get nodes | grep NotReady

# Check node conditions for details
kubectl describe node <node-name> | grep -A5 Conditions
```
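If a machine-readable check is preferred, the `Ready` condition can also be pulled directly with JSONPath; this is a sketch, and `<node-name>` remains a placeholder as above:

```sh
# Print each node's name alongside its Ready condition status (True/False/Unknown)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```

Nodes reporting `Unknown` have typically stopped heartbeating, which usually indicates a kubelet or network failure.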
## Step 2: Cordon and Drain

Cordoning prevents new pods from being scheduled. Draining evicts existing workloads.
```sh
# Cordon the node (no new scheduling)
kubectl cordon <node-name>

# Drain the node (evict workloads, respect PDBs)
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s
```
If the node is completely unreachable, drain with `--force`:
```sh
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --timeout=120s
```
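After the drain completes (or is forced), it is worth confirming that nothing beyond DaemonSet-managed pods is still bound to the node. A minimal check, with `<node-name>` as above:

```sh
# List any pods still scheduled to the node; only DaemonSet pods should remain
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o wide
```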
## Step 3: Remove the Node from Kubernetes

```sh
# Delete the node object from the cluster
kubectl delete node <node-name>
```
For control plane nodes, also remove the etcd member:
```sh
# List etcd members (run on a healthy control plane node)
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# Remove the failed member by ID
ETCDCTL_API=3 etcdctl member remove <member-id> \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
```
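The TLS flags are repetitive; `etcdctl` also reads them from environment variables, so one option is to export them once per shell session. The certificate paths below assume the Kubespray defaults shown above:

```sh
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/node-$(hostname).pem
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# Subsequent commands need no TLS flags
etcdctl member list
```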
## Step 4: Provision a Replacement VM

Create a new VM through Terraform or your infrastructure provider. Update the Kubespray inventory to replace the old node entry with the new one:
```yaml
# infrastructure/clusters/<cluster>/inventory/inventory.yaml
all:
  hosts:
    worker-04:            # Replacement node
      ansible_host: 192.168.12.27
      ip: 192.168.12.27
  children:
    kube_node:
      hosts:
        worker-04: {}
```
Remove the old node entry and commit the inventory change via PR.
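Before running any playbooks, the edited inventory can be sanity-checked with `ansible-inventory`, assuming Ansible is available on the runner:

```sh
# Render the group tree and confirm worker-04 appears under kube_node
ansible-inventory -i inventory.yaml --graph
```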
## Step 5: Run Kubespray to Join the New Node

```sh
cd infrastructure/clusters/<cluster>/inventory/

# Add the new node using Kubespray's scale playbook
ansible-playbook -i inventory.yaml \
  -b --become-user=root \
  scale.yml \
  --limit=worker-04
```
For control plane replacements, use the full `cluster.yml` playbook instead of `scale.yml`.
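If the playbook fails early, first confirm that the runner can actually reach the new node over SSH using Ansible's `ping` module:

```sh
# Confirm the runner can SSH to the new node and execute Python
ansible -i inventory.yaml worker-04 -m ping
```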
## Step 6: Verify

```sh
# Confirm the new node is Ready
kubectl get nodes -o wide

# Check pods are scheduling on the new node
kubectl get pods -A -o wide | grep worker-04

# Verify etcd health (for control plane replacements)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem
```
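Rather than polling `kubectl get nodes` by hand, the readiness check can be scripted with `kubectl wait` (the node name here is illustrative):

```sh
# Block until the replacement node reports Ready, up to 10 minutes
kubectl wait --for=condition=Ready node/worker-04 --timeout=10m
```

This is useful in automation wrappers, since it exits non-zero on timeout.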
## Troubleshooting

- **Drain hangs** — A PodDisruptionBudget is blocking eviction. Check `kubectl get pdb -A` and assess whether it is safe to use `--force`.
- **New node fails to join** — Verify SSH connectivity from the Kubespray runner to the new node. Check that the join token has not expired.
- **etcd quorum lost** — If two of three control plane nodes fail simultaneously, etcd loses quorum. Restore from etcd backup (see Backup & Restore).
- **Persistent volumes on failed node** — Longhorn replicates data across nodes. If the failed node held the only replica, data recovery depends on the storage backend.
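For the Longhorn case, volume health can be inspected through Longhorn's CRDs; this sketch assumes Longhorn is installed in its default `longhorn-system` namespace:

```sh
# Show Longhorn volumes with their state and robustness (healthy/degraded/faulted)
kubectl -n longhorn-system get volumes.longhorn.io
```

Volumes reporting `degraded` are rebuilding replicas; `faulted` volumes may require recovery from backup.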