Resize Control Plane
Purpose: Shows operators how to change CPU/memory allocations for control plane VMs using a rolling update procedure.
Prerequisites
- At least 3 control plane nodes (to maintain quorum during rolling restarts)
- Terraform configuration for the cluster infrastructure
- SSH access to control plane nodes
- A Velero backup or etcd snapshot taken before starting
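The prerequisite etcd snapshot can be taken with etcdctl. A minimal sketch, assuming the certificate paths used later in this runbook and a /var/backups target directory:

```shell
# Hedged sketch: take an etcd snapshot before starting. The cert paths and
# backup directory are assumptions; match them to your cluster's layout.

snapshot_path() {  # timestamped snapshot file under the given directory
  echo "$1/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
}

SNAP=$(snapshot_path /var/backups)
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-cp-1.pem \
  --key=/etc/ssl/etcd/ssl/node-cp-1-key.pem
```

Store the snapshot off the control plane nodes themselves, since those are the VMs being resized.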
When to Resize
Resize control plane nodes when:
- API server response times increase under load
- etcd latency exceeds 100ms (check the p99 WAL fsync duration via Prometheus: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
- Cluster grows beyond the original sizing (more nodes, more pods, more CRDs)
- Monitoring shows CPU or memory pressure on control plane nodes
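The etcd latency check above can be scripted against Prometheus. A hedged sketch, where PROM_URL and the jq dependency are assumptions about your monitoring setup:

```shell
# Hedged sketch: query Prometheus for the etcd WAL fsync p99 and compare it
# to the 100ms threshold. PROM_URL is an assumption; adjust for your setup.
PROM_URL="${PROM_URL:-http://prometheus.monitoring:9090}"
QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# Returns 0 (resize suggested) when the observed p99 exceeds the threshold.
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

p99=$(curl -sG --max-time 5 "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

if exceeds_threshold "$p99" 0.1; then
  echo "etcd p99 WAL fsync ${p99}s exceeds 100ms - consider resizing"
fi
```

Alerting on this expression continuously is preferable to ad hoc checks, but the one-shot form is useful when deciding whether a resize is warranted.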
Step 1: Update Terraform Configuration
Change the VM flavor or resource allocation:
# infrastructure/clusters/<cluster>/main.tf
# OpenStack example
resource "openstack_compute_instance_v2" "control_plane" {
count = 3
name = "cp-${count.index + 1}"
flavor_name = "m1.xlarge" # Changed from m1.large
# ...
}
For VMware:
resource "vsphere_virtual_machine" "control_plane" {
count = 3
name = "cp-${count.index + 1}"
num_cpus = 8 # Changed from 4
memory = 32768 # 32 GB, changed from 16384
# ...
}
Commit the change to a branch and open a PR.
Step 2: Rolling Resize (One Node at a Time)
Resize control plane nodes one at a time to maintain etcd quorum and API server availability. Never resize more than one control plane node simultaneously.
For each control plane node:
# 1. Cordon the node
kubectl cordon cp-1
# 2. Drain workloads (control plane nodes typically run only system pods)
kubectl drain cp-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s
# 3. Shut down the VM and apply the resize
# OpenStack: resize, then confirm once the server status reaches VERIFY_RESIZE
openstack server resize <server-id> --flavor m1.xlarge
openstack server show <server-id> -f value -c status   # wait for VERIFY_RESIZE
openstack server resize confirm <server-id>
# VMware: Terraform apply targets the specific node
terraform apply -target='vsphere_virtual_machine.control_plane[0]'
# 4. Wait for the node to come back
kubectl get nodes -w # Watch until cp-1 shows Ready
# 5. Uncordon the node
kubectl uncordon cp-1
# 6. Verify etcd health before proceeding to the next node
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/node-cp-1.pem \
--key=/etc/ssl/etcd/ssl/node-cp-1-key.pem
Repeat for cp-2 and cp-3, waiting for each node to fully rejoin before proceeding.
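The per-node procedure above can be sketched as a loop. Node names, the flavor, and the DRY_RUN default are assumptions, and the etcd health check omits the cert flags shown in step 6:

```shell
#!/usr/bin/env bash
# Hedged sketch of the per-node rolling resize. DRY_RUN=1 (the default here)
# prints each command instead of executing it; set DRY_RUN=0 to run for real.
set -euo pipefail

NODES=(cp-1 cp-2 cp-3)   # assumed node names
FLAVOR="m1.xlarge"       # assumed target flavor
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

for node in "${NODES[@]}"; do
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  run openstack server resize "$node" --flavor "$FLAVOR"
  run openstack server resize confirm "$node"
  run kubectl wait --for=condition=Ready "node/$node" --timeout=600s
  run kubectl uncordon "$node"
  # Do not continue until etcd reports healthy on this member
  # (add the --cacert/--cert/--key flags from step 6).
  run etcdctl endpoint health --endpoints=https://127.0.0.1:2379
done
```

Run it in dry-run mode first and compare the printed commands against the manual steps before executing anything.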
Step 3: Verify
# All control plane nodes show Ready with new resources
kubectl get nodes -o wide
# Check allocatable resources reflect the new sizing
kubectl describe node cp-1 | grep -A5 "Allocatable"
# Verify etcd cluster is healthy (all members)
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://192.168.12.20:2379,https://192.168.12.21:2379,https://192.168.12.22:2379 \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \
--key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem
# Confirm API server is responsive
kubectl get --raw /healthz
# Check FluxCD reconciliation resumed normally
flux get kustomizations -A
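The allocatable check can also be done programmatically rather than by reading kubectl describe output. A hedged sketch, assuming the node names above and kubelet's usual Ki-denominated memory quantities:

```shell
# Hedged sketch: confirm each control plane node reports the new allocatable
# memory. Node names are assumptions; adjust for your cluster.

to_gib() {  # convert a kubelet quantity in Ki to whole GiB
  awk -v ki="${1%Ki}" 'BEGIN { printf "%d", ki / 1048576 }'
}

for node in cp-1 cp-2 cp-3; do
  mem=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.memory}')
  echo "$node allocatable: $mem (~$(to_gib "$mem") GiB)"
done
```

Allocatable will be somewhat below the VM's total memory because of kube-reserved and system-reserved carve-outs, so expect a value a little under the new sizing.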
Troubleshooting
- Node does not come back after resize — SSH to the node and check kubelet: journalctl -u kubelet -f. The resize may have changed the network interface name or IP.
- etcd member unhealthy after restart — Check etcd logs: journalctl -u etcd -f. Disk I/O may be slow during the first boot after a resize; wait 2-3 minutes.
- API server errors during resize — Expected while only 2 of 3 control plane nodes are available; clients with retries will recover. Verify that kube-vip or the load balancer is routing to healthy nodes.