Resize Control Plane

Purpose: Shows operators how to change CPU and memory allocations for control plane VMs using a rolling update procedure.

Prerequisites

  • At least 3 control plane nodes (to maintain quorum during rolling restarts)
  • Terraform configuration for the cluster infrastructure
  • SSH access to control plane nodes
  • A Velero backup or etcd snapshot taken before starting
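Before touching any node, take the etcd snapshot the prerequisites call for. A minimal sketch, run on one control plane node; the cert paths match the ones used later in this doc, and the `/var/backups` directory is an assumption — adjust both for your cluster:

```shell
# Snapshot path with a timestamp so repeated runs don't overwrite each other.
SNAP="/var/backups/etcd-$(date +%Y%m%d-%H%M).db"

# Guarded so the sketch is safe to paste anywhere; etcdctl only exists on
# control plane nodes.
if command -v etcdctl >/dev/null; then
  ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/ssl/etcd/ssl/ca.pem \
    --cert="/etc/ssl/etcd/ssl/node-$(hostname).pem" \
    --key="/etc/ssl/etcd/ssl/node-$(hostname)-key.pem"
else
  echo "etcdctl not found; run this on a control plane node"
fi
```

Keep the snapshot somewhere off the node being resized — a snapshot stored on the VM you are about to shut down is not a backup.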

When to Resize

Resize control plane nodes when:

  • API server response times increase under load
  • etcd latency exceeds 100ms (check via Prometheus: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
  • Cluster grows beyond the original sizing (more nodes, more pods, more CRDs)
  • Monitoring shows CPU or memory pressure on control plane nodes
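The etcd latency check above can be run against the Prometheus HTTP API directly. A sketch, assuming an in-cluster Prometheus at the URL below (substitute your own endpoint):

```shell
# PROM_URL is an assumption -- point it at your Prometheus instance.
PROM_URL="${PROM_URL:-http://prometheus.monitoring.svc:9090}"

# The same p99 WAL fsync query as above; 100ms is the threshold this doc uses.
QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# --data-urlencode handles the parentheses and brackets in the PromQL expression.
curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  || echo "Prometheus not reachable from this host"
```

The response is JSON; the value to compare against 0.1 (seconds) is in `data.result[].value[1]`.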

Step 1: Update Terraform Configuration

Change the VM flavor or resource allocation:

# infrastructure/clusters/<cluster>/main.tf
# OpenStack example
resource "openstack_compute_instance_v2" "control_plane" {
  count       = 3
  name        = "cp-${count.index + 1}"
  flavor_name = "m1.xlarge" # Changed from m1.large
  # ...
}

For VMware:

resource "vsphere_virtual_machine" "control_plane" {
  count    = 3
  name     = "cp-${count.index + 1}"
  num_cpus = 8     # Changed from 4
  memory   = 32768 # 32 GB, changed from 16384
  # ...
}
</resource>

Commit the change to a branch and open a PR.
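Before opening the PR, it can be worth running the standard local checks against the changed configuration. A sketch of a helper for that; the function name is an assumption, and `terraform -chdir` requires Terraform 0.14 or newer:

```shell
# terraform_precheck: run fmt/validate/plan in the given cluster directory.
# Each step is chained with && so the first failure stops the sequence.
terraform_precheck() {
  dir="$1"
  terraform -chdir="$dir" fmt -check &&   # fail if the edit broke formatting
  terraform -chdir="$dir" validate &&     # catch HCL syntax errors
  terraform -chdir="$dir" plan -out=resize.tfplan
}

# Example (path layout from this doc):
# terraform_precheck infrastructure/clusters/<cluster>
```

The plan output should show only in-place changes to the three control plane VMs; anything marked for destroy/recreate deserves a second look before merging.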

Step 2: Rolling Resize (One Node at a Time)

Resize control plane nodes one at a time to maintain etcd quorum and API server availability. Never resize more than one control plane node simultaneously.

For each control plane node:

# 1. Cordon the node
kubectl cordon cp-1

# 2. Drain workloads (control plane nodes typically run only system pods)
kubectl drain cp-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s

# 3. Shut down the VM and apply the resize
# OpenStack:
openstack server resize <server-id> --flavor m1.xlarge
openstack server resize confirm <server-id>

# VMware: Terraform apply targets the specific node
terraform apply -target='vsphere_virtual_machine.control_plane[0]'

# 4. Wait for the node to come back
kubectl get nodes -w # Watch until cp-1 shows Ready

# 5. Uncordon the node
kubectl uncordon cp-1

# 6. Verify etcd health before proceeding to the next node
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp-1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp-1-key.pem

Repeat for cp-2 and cp-3, waiting for each node to fully rejoin before proceeding.
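The per-node sequence above can be sketched as a loop. With `DRY_RUN=1` (the default here) it only prints each command so the sequence can be reviewed; set `DRY_RUN=0` to execute. Node names match this doc; the VMware Terraform target is shown, so substitute the OpenStack commands if that is your platform:

```shell
DRY_RUN="${DRY_RUN:-1}"

# run: echo the command in dry-run mode, execute it otherwise.
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# resize_node <node-name> <terraform-index>
resize_node() {
  node="$1"; index="$2"
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  run terraform apply -target="vsphere_virtual_machine.control_plane[$index]"
  # kubectl wait blocks until the node reports Ready (or the timeout expires).
  run kubectl wait --for=condition=Ready "node/$node" --timeout=600s
  run kubectl uncordon "$node"
}

for i in 0 1 2; do
  resize_node "cp-$((i + 1))" "$i"
  # Stop here and run the etcd health check (item 6 above) before continuing.
done
```

Keeping the dry-run default means an accidental paste prints the plan instead of cordoning a control plane node.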

Step 3: Verify

# All control plane nodes show Ready with new resources
kubectl get nodes -o wide

# Check allocatable resources reflect the new sizing
kubectl describe node cp-1 | grep -A5 "Allocatable"

# Verify etcd cluster is healthy (all members)
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://192.168.12.20:2379,https://192.168.12.21:2379,https://192.168.12.22:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem

# Confirm API server is responsive
kubectl get --raw /healthz

# Check FluxCD reconciliation resumed normally
flux get kustomizations -A
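The node and API checks above can be rolled into a single pass/fail gate, which is convenient at the end of an automated run. A sketch; the `node-role.kubernetes.io/control-plane` label is the upstream default and an assumption about your cluster:

```shell
# verify_control_plane: returns non-zero if any control plane node is not
# Ready or the API server health endpoint does not answer "ok".
verify_control_plane() {
  kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers \
    | awk '$2 != "Ready" { bad++ } END { exit (bad > 0) }' || return 1
  [ "$(kubectl get --raw /healthz)" = "ok" ] || return 1
}

# Example:
# verify_control_plane && echo "control plane healthy"
```

This does not replace the per-member etcd check above; it only confirms the Kubernetes-visible state.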

Troubleshooting

  • Node does not come back after resize — SSH to the node and check kubelet: journalctl -u kubelet -f. The resize may have changed the network interface name or IP.
  • etcd member unhealthy after restart — Check etcd logs: journalctl -u etcd -f. Disk I/O may be slow during first boot after resize. Wait 2-3 minutes.
  • API server errors during resize — Expected if only 2 of 3 control plane nodes are available. Clients with retries will recover. Verify kube-vip or load balancer is routing to healthy nodes.