You have a 200-node production cluster running Kubernetes 1.28. You need to upgrade to 1.30. The cluster hosts critical workloads: databases, message queues, real-time APIs. You have 4 weeks and a 6-hour weekly maintenance window. Walk through your full upgrade strategy.
Kubernetes upgrades are sequential — one minor version at a time — so 1.28 → 1.30 means two full passes over the cluster. Budget at least three of the four weeks:
Week 1: Control Plane
1. Backup etcd first (run on a control-plane node, using the standard kubeadm cert paths):
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/pre-upgrade-1.28.db
2. Verify addon compatibility (CNI, ingress, storage plugins support 1.30)
3. Upgrade one control plane node at a time (maintain HA during upgrade):
# On first master node:
apt-get update && apt-get install kubeadm=1.29.x-00 kubectl=1.29.x-00
# Drain the node (control-plane nodes run mostly static pods, but drain any regular pods first)
kubectl drain master-1 --ignore-daemonsets
kubeadm upgrade plan
# Test upgrade (dry run)
kubeadm upgrade apply v1.29.x --dry-run
# Actually upgrade
kubeadm upgrade apply v1.29.x
# Upgrade kubelet, then uncordon
apt-get install kubelet=1.29.x-00
systemctl restart kubelet
kubectl uncordon master-1
# Verify healthy (kubectl get componentstatuses is deprecated; check the static pods instead)
kubectl get nodes
kubectl get pods -n kube-system -l tier=control-plane
4. Repeat for the remaining master nodes, using kubeadm upgrade node there (kubeadm upgrade apply runs only on the first control-plane node). Keep at least one master up during each upgrade.
5. After all 3 masters are on 1.29, repeat the entire process for 1.30 — kubeadm cannot skip minor versions.
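The sequencing above can be sketched as a tiny helper — because kubeadm cannot skip minor versions, 1.28 → 1.30 always expands into two complete passes over every control-plane node (the version numbers are placeholders):

```shell
#!/usr/bin/env bash
# Sketch: compute the sequence of one-minor-version hops required for a
# kubeadm upgrade (kubeadm cannot skip minors). Patch levels are omitted.
upgrade_path() {
  local from_minor=$1 to_minor=$2
  local hops=()
  for ((m = from_minor + 1; m <= to_minor; m++)); do
    hops+=("1.$m")
  done
  printf '%s\n' "${hops[@]}"
}

# 1.28 -> 1.30: two hops, i.e., two full control-plane upgrade rounds
upgrade_path 28 30
```

Each emitted version is one full round of "first master with upgrade apply, remaining masters with upgrade node, then all kubelets."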
Week 2-3: Worker Nodes (Cordoned Rolling Upgrade)
6. Create a maintenance window. For 200 nodes, break into 5 groups of 40 nodes per maintenance window:
# Group 1: Nodes 1-40
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --grace-period=30
# Upgrade on the node
ssh node-1
apt-get update && apt-get install kubeadm=1.29.x-00 kubelet=1.29.x-00
kubeadm upgrade node
systemctl restart kubelet
# Uncordon
kubectl uncordon node-1
# Verify Ready
kubectl get nodes node-1
7. Monitor workload distribution during each drain. Verify PDBs allow safe eviction:
kubectl get pdb -A
# Ensure minAvailable/maxUnavailable are set for critical workloads
8. Between groups, verify cluster metrics: CPU, memory, request latency. If any spike, halt and investigate.
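The batching in step 6 can be sketched as a small helper; the node names here are hypothetical, and the real loop would wrap each batch in the drain/upgrade/uncordon sequence plus the health checks from step 8:

```shell
#!/usr/bin/env bash
# Sketch: split a node list into fixed-size batches for the rolling upgrade.
# Each emitted line is one maintenance window's worth of nodes.
batch_nodes() {
  local size=$1; shift
  printf '%s\n' "$@" | xargs -n "$size" echo
}

# Hypothetical inventory: node-1 .. node-200, in 5 groups of 40
nodes=$(seq -f 'node-%g' 1 200)
batch_nodes 40 $nodes | head -1   # first window's batch: node-1 ... node-40
```

In practice you would feed the real inventory from `kubectl get nodes -o name` and iterate over each batch line between windows.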
Rollback Plan: If a catastrophic issue is detected in the first window (e.g., API incompatibility), reinstall the 1.28 control-plane packages and restore the pre-upgrade etcd snapshot. Restoring etcd discards every object written after the snapshot, so this is only viable early; once workers are upgraded and new-version state accumulates, rollback is effectively impossible.
Follow-up: How would you detect if a third-party addon (e.g., service mesh, CNI) is incompatible with 1.30 BEFORE you upgrade the whole cluster?
During your control plane upgrade from 1.28 to 1.29, the APIServer fails to start on the first master. It crashes with "unknown field: spec.rules[0].priority" in a CustomResourceDefinition. Your monitoring system can't connect to the API. You have 30 minutes before your maintenance window expires. What do you do?
This is a data incompatibility issue. The new APIServer version doesn't recognize a field from an old CRD:
1. Stay calm. You still have quorum on the other 2 masters (they're still on 1.28). Roll this node back immediately — kubeadm has no revert command, but kubeadm upgrade apply backs up the previous static pod manifests before replacing them:
apt-get install kubeadm=1.28.x-00 kubectl=1.28.x-00 kubelet=1.28.x-00
# Restore the pre-upgrade manifests kubeadm saved under /etc/kubernetes/tmp/
cp /etc/kubernetes/tmp/kubeadm-backup-manifests-*/kube-apiserver.yaml /etc/kubernetes/manifests/
systemctl restart kubelet
2. While that node is recovering, investigate the CRD issue from a healthy master — find which CRD schema carries the offending field (searching CRD names won't find it):
kubectl get crd -o json | jq -r '.items[] | select((.spec | tostring) | contains("priority")) | .metadata.name'
3. Identify the problematic CRD owner (likely a custom controller or operator):
kubectl describe crd xxx-crd
# Check labels and annotations for ownership
4. Before retrying upgrade, verify the CRD is compatible with 1.29. Check the operator/controller documentation or upgrade it first:
kubectl set image deployment/my-controller-manager controller=my-controller:v2.1-k8s1.29 -n kube-system
5. Update the CRD to remove deprecated fields (if you own it):
kubectl edit crd xxx-crd
# Remove spec.rules[0].priority if it's deprecated
6. Retry the control plane upgrade after CRD fix
Prevention: Before any upgrade, check breaking changes in release notes and test custom controllers/CRDs in a staging cluster running the new version.
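As a minimal sketch of such a pre-flight check, this scans a CRD dump for the long-removed apiextensions.k8s.io/v1beta1 group; in practice you would feed it `kubectl get crd -o yaml` and extend the pattern from the target release's deprecation notes (the sample input here is mocked):

```shell
#!/usr/bin/env bash
# Sketch: count CRDs still on the removed v1beta1 apiextensions group.
# Reads YAML on stdin; any non-zero count should block the upgrade.
check_crds() {
  grep -c 'apiVersion: apiextensions.k8s.io/v1beta1' || true
}

# Mocked input standing in for `kubectl get crd -o yaml`
sample_crds='apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition'

echo "$sample_crds" | check_crds   # prints 1 (one legacy CRD found)
```

The same shape works for any group/version the release notes flag: add patterns, fail the pipeline on a non-zero total.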
Follow-up: How would you automate the detection of incompatible CRDs or controllers before upgrade? Design a pre-flight check.
You're upgrading nodes and running kubectl drain. The drain hangs for 40 minutes on a single node. You see a DaemonSet pod stuck in Terminating state. kubectl logs shows the container's preStop hook is trying to gracefully drain connections from a connection pool, but it's deadlocked. You're blocking 50 other nodes from upgrading. What's your decision?
This is the classic drain timeout problem. You have two options:
Option 1: Safe Path (Recommended)
1. Retry with a longer grace period and a hard timeout so the drain cannot hang indefinitely:
kubectl drain node-x --ignore-daemonsets --delete-emptydir-data --grace-period=120 --timeout=300s
2. Check if the preStop hook is actually deadlocked or just slow:
kubectl describe pod daemonset-pod -n default
# Check events for termination messages
3. If still hanging after 5 minutes, kill just that pod (not drain):
kubectl delete pod daemonset-pod -n default --grace-period=0 --force
4. Now retry drain:
kubectl drain node-x --ignore-daemonsets --delete-emptydir-data
Option 2: Force Path (High Risk)
If you absolutely must upgrade immediately (security patch, etc.):
kubectl drain node-x --ignore-daemonsets --delete-emptydir-data --force --grace-period=0
This force-deletes the stuck pod immediately. Risk: If the preStop was flushing data or connections, you may lose in-flight transactions.
Best Practice Fix:
1. Fix the preStop hook in the DaemonSet so the hook itself times out. Lifecycle hooks have no timeoutSeconds field — a preStop hook is bounded only by the pod's termination grace period — so enforce the timeout inside the command:
spec:
  template:
    spec:
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "timeout 20 /graceful-shutdown.sh || echo 'force shutdown'"]
2. Set terminationGracePeriodSeconds reasonably (30-60s for normal apps):
terminationGracePeriodSeconds: 60
3. Monitor drain operations: set drain timeout based on your app's worst-case shutdown time, not infinite waits.
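Point 3 can be sketched as a helper that derives the drain timeout from the pods actually on the node rather than a guess. The input is hypothetical "<pod> <terminationGracePeriodSeconds>" pairs that you would really generate with kubectl and a jsonpath query:

```shell
#!/usr/bin/env bash
# Sketch: safe drain --timeout = worst-case terminationGracePeriodSeconds
# among the node's pods, plus a fixed buffer for eviction overhead.
safe_drain_timeout() {
  awk 'BEGIN{max=0} {if ($2 > max) max = $2} END{print max + 60}'
}

# Mocked pod list (names and grace periods are illustrative)
printf '%s\n' 'web-7f9 30' 'cache-2bd 120' 'queue-a01 45' \
  | safe_drain_timeout   # prints 180 (worst case 120s + 60s buffer)
```

The result feeds directly into `kubectl drain --timeout=${N}s`, replacing open-ended waits with a bound you can reason about.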
Follow-up: Design a drain safety check that prevents upgrades if any pod's preStop hook timeout exceeds X seconds. How would you enforce this?
After upgrading 50 nodes to 1.29, you notice a latency spike on your APIs (p99 went from 50ms to 800ms). The new kubelet version on upgraded nodes implements stricter resource accounting. Pods that were using 512MB are now hitting their 512MB limits and getting OOM-killed. You can't add more nodes quickly. Walk through diagnosis and mitigation.
This is a version-specific resource accounting change. Kubelet 1.29 fixed memory accounting bugs that gave older versions inflated capacity:
1. Confirm the issue by checking the upgraded nodes' allocatable capacity:
kubectl get nodes -o json | jq '.items[] | select(.status.nodeInfo.kubeletVersion | contains("1.29")) | {name:.metadata.name, allocatable:.status.allocatable}'
2. Compare against 1.28 nodes:
kubectl get nodes -o json | jq '.items[] | select(.status.nodeInfo.kubeletVersion | contains("1.28")) | {name:.metadata.name, allocatable:.status.allocatable}'
3. Check for OOMKilled pods on new nodes (the default table output doesn't show the termination reason, so query it explicitly):
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.namespace + "/" + .metadata.name'
4. Quick mitigation (increase limits on affected workloads):
# For critical deployments, bump memory request/limit by 10-20%
kubectl patch deployment web-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"768Mi"},"requests":{"memory":"512Mi"}}}]}}}}'
5. Pause the upgrade until you've adjusted resource limits on all workloads
6. Run a load test to verify new limits are sufficient
7. Consider downgrading one node back to 1.28 and comparing metrics side-by-side to understand the exact capacity difference
Long-term: This is a version migration cost. Plan resource budget adjustments during pre-production testing. Many teams need to add 10-15% capacity when upgrading Kubernetes due to these fixes.
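The comparison in steps 1-2 reduces to averaging allocatable capacity per kubelet version. A sketch of that core, with mocked input lines ("<kubeletVersion> <allocatableMemoryKi>") standing in for the jq queries above; the numbers are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: average allocatable memory (Ki) grouped by kubelet version.
# A meaningful drop between versions quantifies the accounting change.
avg_allocatable_by_version() {
  awk '{sum[$1] += $2; n[$1]++} END{for (v in n) printf "%s %d\n", v, sum[v]/n[v]}' | sort
}

printf '%s\n' 'v1.28.9 16254976' 'v1.28.9 16254976' 'v1.29.4 15728640' \
  | avg_allocatable_by_version
```

The delta between the two averages tells you how much headroom (the 10-15% mentioned above) to budget before continuing the rollout.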
Follow-up: How would you detect breaking changes in resource accounting across Kubernetes versions before upgrading production?
You're upgrading 200 nodes across 4 maintenance windows. After window 2 (100 nodes upgraded), you discover a CNI plugin version incompatibility on the new nodes. Network performance is degraded. You still have 100 nodes to upgrade. Do you continue or rollback? Design your rollback strategy.
This is a critical go/no-go decision. First, assess the impact:
1. Check whether the issue is isolated to the upgraded nodes:
kubectl get nodes -o wide | grep v1.29
# Run a network test from a pod pinned to an upgraded node, then from one on a 1.28 node, and compare
kubectl run perf-test --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-101"}}' \
  -- sh -c 'time wget -qO- http://other-service >/dev/null'
2. If impact is < 5% latency increase and affecting < 10% of traffic, you can continue with a fix. If > 20% latency, rollback immediately.
Rollback Strategy (if proceeding is too risky):
1. Cordon all upgraded nodes (prevent new workload scheduling). A label selector can't list hostnames with commas, so tag the upgraded nodes with a marker label first:
kubectl label node node-1 node-2 ... upgrade-state=rolled-forward
kubectl cordon -l upgrade-state=rolled-forward
2. Drain upgraded nodes (move workloads back to 1.28 nodes):
for node in $(kubectl get nodes -l upgrade-state=rolled-forward -o name); do
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data --grace-period=30
done
3. On each upgraded node, downgrade the packages. Note that kubeadm has no revert command and kubelet downgrades are not officially supported — validate the procedure on one node before fanning out:
ssh node-1
apt-get install --allow-downgrades kubeadm=1.28.x-00 kubelet=1.28.x-00
systemctl restart kubelet
4. Uncordon:
kubectl uncordon -l upgrade-state=rolled-forward
5. Verify cluster health and reschedule workloads
Investigation After Rollback:
Fix the CNI plugin version, then try again in 2 weeks after testing in staging
Prevention: Test the exact combination of Kubernetes + CNI versions in staging before production upgrade. Run 48-hour soak test.
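A minimal sketch of such a version gate follows; the matrix entries are placeholders, and the real minimums must come from your CNI vendor's support matrix:

```shell
#!/usr/bin/env bash
# Sketch: compatibility gate mapping a Kubernetes minor to the minimum
# CNI plugin version known to work with it (entries are placeholders).
cni_ok() {
  local k8s=$1 cni=$2 min
  case "$k8s" in
    1.28) min=1.3.0 ;;
    1.29) min=1.4.0 ;;
    1.30) min=1.5.0 ;;
    *) echo "no matrix entry for Kubernetes $k8s" >&2; return 2 ;;
  esac
  # Version-aware compare: OK when min sorts first (or equal)
  [ "$(printf '%s\n%s\n' "$min" "$cni" | sort -V | head -1)" = "$min" ]
}

cni_ok 1.29 1.4.0 && echo "compatible"
cni_ok 1.30 1.3.2 || echo "BLOCK upgrade: CNI too old for 1.30"
```

Run the gate in CI before every window; `sort -V` keeps the comparison correct for double-digit minors (1.10 > 1.9), where a naive numeric compare fails.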
Follow-up: Design an automated compatibility matrix that prevents upgrades of Kubernetes if the current CNI/storage/ingress versions are incompatible.
You're upgrading a 200-node cluster, and you've successfully upgraded 150 nodes to 1.29. The remaining 50 nodes are older hardware from 2021 that the previous team never removed. These nodes have very low usage but old CPUs that don't support some 1.29 kernel features. Do you upgrade them, skip them, or decommission them?
This is a real scenario—legacy hardware is a common blocker. Analyze first:
1. Check what's actually running on these old nodes:
kubectl describe node old-node-1 | grep -A 20 "Allocated resources"
kubectl get pods --field-selector spec.nodeName=old-node-1 -A
2. Test if 1.29 actually runs on this hardware. Try upgrading one node as a test:
ssh old-node-1
apt-get install kubeadm=1.29.x-00 kubelet=1.29.x-00
systemctl restart kubelet
# Monitor for errors
journalctl -u kubelet -n 50
3. If it works (even with warnings), continue the upgrade
4. If it fails (CPU feature set missing, kernel too old):
# Option A: skipping 1.29 is not an option — kubeadm moves control planes and
# nodes one minor version at a time, so there is no jumping ahead
Option B: Plan to decommission these nodes
# Move workloads to upgraded nodes, then remove the node object:
kubectl drain old-node-1 --ignore-daemonsets --delete-emptydir-data
kubectl delete node old-node-1
# Remove the instance in your infrastructure code
terraform destroy -target aws_instance.old_hardware
5. If you must keep them (cost, slow replacement):
- Keep them on 1.28 permanently (apply security patches but don't upgrade Kubernetes). A 1.28 kubelet is within the supported n-3 skew of a 1.30 API server today, but a future control-plane upgrade will break this
- Taint them so only specific workloads run there:
kubectl taint node old-node-1 hardware=legacy:NoSchedule
- Add tolerations to workloads that must run on old hardware
6. Document the decision: "Old hardware on 1.28, plan decommission by [date]"
Long-term Fix: Include hardware compatibility checks in your upgrade runbook. Audit hardware before committing to version upgrades.
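The hardware audit can start with a CPU-flag check like this sketch; the required-flag list is an assumption — derive the real one from your container runtime and kernel release notes:

```shell
#!/usr/bin/env bash
# Sketch: verify a node's CPU flags include everything the newer stack
# needs. Missing flags name the node as un-upgradeable hardware.
cpu_supports() {
  local flags=$1; shift
  for f in "$@"; do
    case " $flags " in
      *" $f "*) ;;
      *) echo "missing: $f"; return 1 ;;
    esac
  done
  echo "ok"
}

# On a real node: flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
cpu_supports "fpu sse2 sse4_2 popcnt avx" sse4_2 popcnt   # prints "ok"
cpu_supports "fpu sse2" avx || true                        # prints "missing: avx"
```

Run it over the fleet via ssh before committing to a version, and feed the failures straight into the decommission list.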
Follow-up: Design a hardware lifecycle management process that prevents the accumulation of un-upgradeable legacy nodes.
You've completed the 1.28 to 1.30 upgrade on your 200-node cluster. Everything looks good, but 2 weeks later, you get 500 alerts: some workloads are reporting "API Server Audit Log truncated" and "Event series truncated." Your audit logs are missing records. What went wrong and how do you fix it?
Audit log truncation after upgrade usually means the audit policy changed or the backend storage is too small:
1. Check the audit configuration on the API server. The audit policy is not an API object — it's a file referenced by the --audit-policy-file flag in the static pod manifest:
grep -- --audit /etc/kubernetes/manifests/kube-apiserver.yaml
cat /etc/kubernetes/policies/audit-policy.yaml | grep apiVersion
2. Verify the audit backend hasn't changed. Check where audit logs are being written:
kubectl get events -A --sort-by='.lastTimestamp' | head -20
# If events are sparse or showing "truncated series", events are being dropped
3. Check the APIServer logs for audit volume warnings:
kubectl logs -n kube-system kube-apiserver-master-1 | grep -i "audit\|truncate"
4. If using file-based audit logging, check disk space:
df -h /var/log/kubernetes/audit/
5. If using an external audit webhook, check the backend configured via --audit-webhook-config-file:
grep audit-webhook /etc/kubernetes/manifests/kube-apiserver.yaml
# Verify the webhook endpoint is responding
curl -k -X POST https://audit-webhook:443/audit -H "Content-Type: application/json" -d '{}'
6. Common fix: the audit policy shipped with your tooling may have tightened for 1.30. Edit the policy file (in a standard kubeadm setup it's a host file, not a ConfigMap):
vi /etc/kubernetes/policies/audit-policy.yaml
# Adjust the rules to log fewer events or batch them
# Example: reduce logging for low-value events (pods/exec, pods/logs, endpoints)
7. Restart the API server to apply the new policy. It's a static pod, so deleting the mirror Pod object via kubectl won't restart it — move the manifest out and back instead:
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ && sleep 5 && \
  mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# kubelet's static pod manager recreates the API server from the manifest
8. Monitor audit log volume by watching the audit log file's growth, not the API server pod logs:
wc -l /var/log/kubernetes/audit/audit.log
Prevention: Test audit log behavior changes before production upgrades. Configure audit log volume based on expected event rate, not "whatever fits on disk."
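A back-of-envelope retention check for that prevention step; the event rate and average event size are placeholders you would measure on a staging cluster running the target version first:

```shell
#!/usr/bin/env bash
# Sketch: days of audit retention a disk budget buys at a given event rate.
# Inputs: events/sec, average event size in bytes, disk budget in GB.
audit_retention_days() {
  local events_per_sec=$1 avg_bytes=$2 disk_gb=$3
  awk -v e="$events_per_sec" -v b="$avg_bytes" -v d="$disk_gb" \
    'BEGIN{printf "%.1f\n", d*1024*1024*1024 / (e*b*86400)}'
}

audit_retention_days 500 1500 100   # 100 GB at 500 ev/s x 1.5 KB: prints 1.7
```

If the result is below your compliance retention target, the pre-flight fails: grow the disk, thin the policy, or batch to an external backend before upgrading.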
Follow-up: Design a pre-upgrade audit log capacity check that predicts if the new version's audit policy will exceed your storage capacity.
You're at the end of your 1.28 to 1.30 upgrade. All nodes are upgraded, but when you run kubectl version, the client reports "preferred API version mismatch" on some resources (StorageClasses, Ingresses). Old manifests still work, but new deployments created post-upgrade use a different API version. Your GitOps system is confused. How do you reconcile this?
This is API version deprecation during upgrade. Different Kubernetes releases prefer different API versions (e.g., storage.k8s.io/v1beta1 → v1):
1. Check what the client sees vs. what is stored. Note that etcd stores objects as protobuf, so piping etcdctl output into jq won't work — use a decoder such as auger if you need to inspect raw storage:
kubectl get storageclass -o json | jq '.items[] | {name:.metadata.name, apiVersion:.apiVersion}'
# For CRDs, the stored versions are tracked in status:
kubectl get crd xxx-crd -o jsonpath='{.status.storedVersions}'
2. Verify which versions are actually served, and that conversion webhooks are working for any CRDs with multiple versions (kubectl api-resources shows only the preferred version per resource; kubectl api-versions lists every served group/version):
kubectl api-versions | grep -E "storage|networking"
3. Migrate manifests to the new preferred version. kubectl get returns whichever version you request; it's re-applying objects that migrates the stored version:
# Export all namespaced resources for reference/diffing
kubectl api-resources --verbs=list --namespaced -o name | xargs -I {} kubectl get {} -A -o yaml > all-resources-new-version.yaml
4. Update your GitOps repo to use new API versions. Edit all manifests that reference old versions:
# Find manifests using old versions
grep -r "apiVersion: storage.k8s.io/v1beta1" manifests/
# Update to new version
sed -i 's/storage.k8s.io\/v1beta1/storage.k8s.io\/v1/g' manifests/*.yaml
5. Re-apply manifests with kubectl apply (this triggers update and re-stores in new version):
kubectl apply -f manifests/
# Verify objects now use new version
kubectl get storageclass -o yaml | head -5
6. If GitOps tool (ArgoCD, Flux) is confused, resync:
argocd app sync default --force
# Or trigger Flux reconciliation
flux reconcile source git default
Prevention: Before upgrading, identify all resources using deprecated API versions and plan their migration.
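That prevention step can be sketched as a repo scan; the deprecated-version list here is a small illustrative subset (generate the real one from the target release's deprecation guide), and the demo runs against a throwaway temp directory rather than a real GitOps repo:

```shell
#!/usr/bin/env bash
# Sketch: flag manifests still using deprecated group/versions before upgrade.
# Pattern list is illustrative; each hit prints file:line:match.
scan_deprecated() {
  local dir=$1
  grep -rn -E 'apiVersion: (storage\.k8s\.io/v1beta1|networking\.k8s\.io/v1beta1|extensions/v1beta1)' "$dir" || true
}

# Demo on a temp "repo" with one offending manifest
repo=$(mktemp -d)
cat > "$repo/sc.yaml" <<'EOF'
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
EOF
scan_deprecated "$repo"   # flags sc.yaml
rm -rf "$repo"
```

Wire the scan into CI on the GitOps repo and fail the pipeline on any hit, so the upgrade never meets a manifest the new API server refuses.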
Follow-up: Design an automated migration pipeline that converts all manifests from old API versions to new ones before the upgrade, ensuring GitOps never sees version mismatches.