Production alert: 30 nodes are flapping between Ready and NotReady every 90 seconds. Kubelet metrics show garbage collection pauses of 40+ seconds. Workloads are constantly evicted and rescheduled. You have 10 minutes to stabilize.
Node flapping indicates the kubelet is missing heartbeats to the API server. The node controller (in kube-controller-manager) checks node status every node-monitor-period=5s and marks the node NotReady after node-monitor-grace-period=40s (default) without a heartbeat. Eviction does not start immediately: with taint-based eviction, the node.kubernetes.io/not-ready:NoExecute taint is applied, and pods carrying the default toleration are evicted only after tolerationSeconds=300s. So a kubelet that pauses >40s flaps the node NotReady, and repeated pauses churn the workloads.
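A back-of-envelope sketch of that timeline, assuming the defaults above and that ordinary pods carry the standard 300s node.kubernetes.io/not-ready toleration:

```shell
# Timeline from last successful heartbeat to pod eviction, with defaults:
grace_period=40     # node-monitor-grace-period: node marked NotReady
toleration=300      # default tolerationSeconds on the not-ready NoExecute taint
echo "node NotReady after ${grace_period}s"
echo "pod eviction begins ~$((grace_period + toleration))s after the last heartbeat"
# → node NotReady after 40s
# → pod eviction begins ~340s after the last heartbeat
```

So a single 40-second GC pause flaps the node, but pods only start moving if the outage persists several minutes.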
Root cause: garbage collection pause. Check kubelet logs: journalctl -u kubelet -n 200 | grep "duration=.*s". A long-duration image GC entry (the exact wording varies by version, e.g. an ImageGCManager line showing duration=45.2s) is the culprit. Kubelet is scanning all images on disk, calculating sizes, and deleting unused ones. On a node with 1000+ images, this is expensive.
Emergency fix: SSH to a node and restart kubelet with less aggressive image GC. The defaults are --image-gc-high-threshold=85 --image-gc-low-threshold=80 (percent of disk usage); raising them (e.g. 95/90) defers eviction work until the disk is fuller. (There is no flag for the image GC interval; the loop period is fixed inside kubelet.) Container GC is governed by --maximum-dead-containers and --maximum-dead-containers-per-container; note that lowering these increases cleanup work rather than disabling it, so during the incident clear zombie containers by hand instead: docker ps -a | grep Exited | awk '{print $1}' | xargs docker rm. Together this should cut GC work substantially; the exact saving depends on the image and container counts.
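Before piping IDs into docker rm, it's worth seeing what the filter would remove. The same grep/awk pipeline, run here against a fabricated docker ps -a sample rather than a live daemon:

```shell
# Fabricated `docker ps -a` output (the real command needs a running Docker daemon).
sample='CONTAINER ID   IMAGE     STATUS
1a2b3c4d5e6f   app:v1    Exited (137) 2 hours ago
2b3c4d5e6f7a   app:v1    Up 3 hours
3c4d5e6f7a8b   app:v2    Exited (0) 5 minutes ago'

# Same filter as the incident fix: keep Exited rows, print the ID column.
ids=$(printf '%s\n' "$sample" | grep Exited | awk '{print $1}')
echo "$ids"                                      # these would be piped to: xargs docker rm
echo "would remove: $(printf '%s\n' "$ids" | grep -c .) containers"
# → would remove: 2 containers
```

Dry-running the filter first guards against a pattern that accidentally matches running containers.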
Prevent recurrence: Audit image pulls—if nodes have >500 images, there's a leak in your image management (test containers, failed deployments). Clean up old images: docker image prune -a --filter "until=72h" --force on each node (on containerd nodes, crictl rmi --prune). Note that on-disk image GC is the kubelet's job via CRI, not the runtime's, so tune kubelet's thresholds rather than hunting for a runtime-side GC switch. Set kubelet flags: --eviction-hard=nodefs.available<5Gi,memory.available<100Mi --eviction-soft=nodefs.available<10Gi,memory.available<500Mi --eviction-soft-grace-period=nodefs.available=2m,memory.available=2m to evict gracefully before hitting hard limits.
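The eviction flags above map one-to-one onto the kubelet config file. A sketch of the equivalent KubeletConfiguration stanza (values are the ones proposed above, not universal defaults; merge into your existing config):

```yaml
# /var/lib/kubelet/config.yaml — KubeletConfiguration fragment (sketch).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "5Gi"
  memory.available: "100Mi"
evictionSoft:
  nodefs.available: "10Gi"
  memory.available: "500Mi"
evictionSoftGracePeriod:
  nodefs.available: "2m"
  memory.available: "2m"
```

The config-file form survives package upgrades better than unit-file flags and is the direction upstream is pushing.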
Follow-up: After fixing GC, nodes are stable but pod startup latency increased from 2s to 8s. What changed and how would you optimize it?
You're debugging a node that's stuck in NotReady status for 30 minutes even though kubelet is running and CPU/memory look normal. The node won't accept new pods. kubectl describe node shows no status conditions. Running kubectl get nodes doesn't list the node at all sometimes, then it reappears.
The node is partitioned from the API server—kubelet can't reach it to report status. Kubelet heartbeats two ways: it posts Node status (--node-status-update-frequency=10s default) and, on current versions, renews a Lease object in the kube-node-lease namespace every ~10s. Kubelet itself does not evict pods when heartbeats fail; the node controller marks the node NotReady after node-monitor-grace-period=40s, and taint-based eviction removes pods after the toleration window expires.
Diagnosis: SSH to the node and check API connectivity. Pull the server URL from the kubeconfig and hit a cheap endpoint: curl -k "$(awk '/server:/ {print $2}' /etc/kubernetes/kubelet.conf)/healthz". If connection refused or timeout, the API server is unreachable from this node. Cross-check from your workstation: kubectl get nodes—if most nodes are listed, the API server itself is up, and the partition is specific to this node.
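The endpoint extraction is easy to get wrong (the kubeconfig server line already contains the port). A sketch of the awk step, run here against a fabricated kubeconfig fragment instead of /etc/kubernetes/kubelet.conf:

```shell
# Fabricated kubeconfig snippet; on a node you would read /etc/kubernetes/kubelet.conf.
kubeconfig='apiVersion: v1
clusters:
- cluster:
    server: https://10.0.0.5:6443
  name: default'

# The server URL is the second field on the "server:" line, port included —
# do NOT append :6443 again.
endpoint=$(printf '%s\n' "$kubeconfig" | awk '/server:/ {print $2}')
echo "$endpoint"                      # → https://10.0.0.5:6443
# then: curl -k "$endpoint/healthz"
```

Even a 401/403 from the curl proves network reachability; only refused/timeout indicates a partition.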
Check kubelet logs for errors: journalctl -u kubelet -n 500 | grep -i "error\|unable\|refused". Look for: (1) failed pod status syncs (retried from a queue), (2) messages about the node being tainted or marked unschedulable by the node controller, (3) an expired client certificate: openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates—if it's expired, kubelet can no longer authenticate to the API server.
Fix network partition: Check iptables on the node: sudo iptables -L OUTPUT -n | grep 6443. If 6443 is blocked, add a rule: sudo iptables -I OUTPUT -p tcp --dport 6443 -j ACCEPT. If the kubeconfig points at a hostname rather than an IP, verify the node can resolve it (nslookup <apiserver-hostname> on the node itself—kubelet uses the host's resolver, not cluster DNS). Restart kubelet to force a reconnect: systemctl restart kubelet. Monitor: journalctl -u kubelet -f should show successful node status updates within ~10s.
Follow-up: Certificate is valid. API server is reachable from node (curl works). kubelet logs show no errors but node still NotReady. Where else could the heartbeat be failing?
A deployment scales from 10 to 1000 replicas. Kubelet starts falling behind: pod startup reaches 30s per pod instead of 2s. You see container image pulls are sequential even though there are 40 workers. After 30 minutes, the node becomes NotReady due to eviction pressure.
Kubelet image pull concurrency is limited. --max-pods=110 caps how many pods the node runs, but image pulls are serialized by default: --serialize-image-pulls=true means one pull at a time (there is no --max-concurrent-image-pulls flag). With 1000 replicas pulling distinct images, each pod queues behind the previous pull—at roughly 1s per pull, that's on the order of 1000 seconds end to end.
Immediate fix: disable serialization and cap parallelism in the kubelet config: serializeImagePulls: false plus, on 1.27+, maxParallelImagePulls: 10 (or higher based on node bandwidth). Restart kubelet: systemctl restart kubelet. Monitor image pull latency: journalctl -u kubelet -f | grep "Pulling image"—you should now see multiple pulls in flight. Also verify the cache hit rate: if all pods use the same image with IfNotPresent, the second pod shouldn't pull at all. If pulls are still slow, check: docker image ls | wc -l—with >500 images on disk, layer extraction can be bottlenecked by disk I/O.
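A sketch of the corresponding KubeletConfiguration fragment (maxParallelImagePulls requires kubelet 1.27+ and only takes effect with serialization off):

```yaml
# KubeletConfiguration fragment (sketch) — merge into /var/lib/kubelet/config.yaml.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false      # default true: one image pull at a time
maxParallelImagePulls: 10       # cap parallelism to what the NIC/registry can sustain
```

Size the cap to node bandwidth; past NIC saturation, more parallelism buys nothing (see the bandwidth math below in this scenario's follow-up).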
For 1000 replicas: pre-warm images on nodes before scaling. There is no kubectl create daemonset command; instead, apply a small DaemonSet manifest whose pod runs the target image, so every node pulls it once and caches it. Then scale the deployment—pulls hit the local cache instantly.
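A sketch of such a warmer DaemonSet (the name image-warmer is arbitrary; yourimage:tag is the image to pre-warm; the pod just sleeps so the image stays cached):

```yaml
# image-warmer.yaml — pulls the target image onto every node, then idles.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-warmer
spec:
  selector:
    matchLabels: {app: image-warmer}
  template:
    metadata:
      labels: {app: image-warmer}
    spec:
      containers:
      - name: warm
        image: yourimage:tag
        command: ["sleep", "infinity"]
        resources:
          requests: {cpu: 1m, memory: 8Mi}   # near-zero footprint while idling
```

kubectl apply -f image-warmer.yaml, wait for rollout, then scale the real deployment; delete the DaemonSet afterward.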
Long-term: pin down the pull policy. The default imagePullPolicy is IfNotPresent (fast when cached)—except for :latest tags, which default to Always. So deploy by immutable tag or digest instead of :latest; then IfNotPresent is both fast and correct, and Always is only needed for mutable tags. Also set a soft eviction threshold with a grace period for the image filesystem, e.g. --eviction-soft=imagefs.available<15% --eviction-soft-grace-period=imagefs.available=2m, so a pull burst has time to complete before the node starts evicting.
Follow-up: You set max-concurrent-image-pulls=10. Bandwidth is now 1 Gbps on each node. You're pulling 1GB images. How would you calculate the bottleneck and what would you tune?
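A back-of-envelope answer to this follow-up (assumed numbers from the question: 1 Gbps NIC, 1GB images, 10 parallel pulls):

```shell
# 1 Gbps ≈ 125 MB/s of usable NIC bandwidth (ignoring protocol overhead).
link_MBps=125
pulls=10
image_MB=1000                              # ~1 GB per image
echo "aggregate: ~$(( pulls * image_MB / link_MBps ))s for all ${pulls} pulls"
# → aggregate: ~80s for all 10 pulls
# Key point: once the NIC is saturated, the aggregate time is fixed by
# (total bytes / link rate) — raising concurrency past that point doesn't help.
# Concurrency only wins while the per-connection (registry-side) rate is the limit.
```

So the tuning levers beyond concurrency are reducing bytes moved (smaller images, shared base layers, pre-warming) or adding bandwidth (a pull-through registry mirror close to the nodes).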
Your StatefulSet has 100 replicas. After a node failure, 50 pods need to reschedule. Kubelet's eviction logic removes them in a burst: all 50 pods go down simultaneously, the API server is hammered with pod deletion requests and becomes unresponsive, and the failure cascades through the cluster.
Kubelet hard eviction is graceless by design: when a hard threshold is crossed (e.g., --eviction-hard=memory.available<100Mi), kubelet kills pods with zero grace period to reclaim resources fast. If 50 pods are evicted at once, all generate pod deletion and status-update requests against the API server—a thundering herd.
Root cause: no eviction rate limiting. Note that --eviction-pressure-transition-period=5m only controls how long kubelet waits before clearing a pressure condition once it lifts; nothing rate-limits the evictions themselves. If 50 pods must go, they go back-to-back.
Immediate mitigation: restart kubelet with soft eviction thresholds: --eviction-soft=memory.available<500Mi,nodefs.available<10Gi --eviction-soft-grace-period=memory.available=1m,nodefs.available=1m --eviction-max-pod-grace-period=300. The soft grace period means the threshold must stay breached for a full minute before any eviction starts, and soft-evicted pods then get up to 300s to shut down (SIGTERM first, SIGKILL at the deadline). That smooths evictions out instead of burst-killing.
API server protection: API Priority and Fairness is on by default in modern clusters (GA in 1.29—no feature gate needed) and ships with a built-in system-nodes FlowSchema that isolates kubelet traffic in its own priority level; verify with kubectl get flowschemas system-nodes. If you need custom shaping, create a FlowSchema matching the system:nodes group and bind it to a dedicated PriorityLevelConfiguration so a kubelet delete storm can't starve other clients.
For guiding which replicas die first: the controller.kubernetes.io/pod-deletion-cost annotation applies to ReplicaSet scale-down—the controller removes lower-cost pods first—not to kubelet eviction, and StatefulSets always scale down from the highest ordinal. For eviction ordering on the node itself, use pod priority and QoS: kubelet evicts BestEffort and over-request Burstable pods before Guaranteed ones, and lower PriorityClass before higher. Long-term: run the Descheduler with eviction policies that spread rebalancing over time rather than in bursts.
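A sketch of the annotation as documented for ReplicaSet scale-down (pod and cost values are illustrative; lower-cost pods are removed first, so the pod you want to survive gets the higher cost):

```yaml
# Sketch: during ReplicaSet scale-down, pods with LOWER deletion cost go first.
apiVersion: v1
kind: Pod
metadata:
  name: worker-keeper          # hypothetical: the replica you want to keep longest
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "1000"   # higher cost = removed last
```

The annotation is best-effort and only consulted at downscale time; it does not influence kubelet pressure eviction or taint-based eviction.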
Follow-up: Eviction grace period is 1 minute but your graceful shutdown handler needs 5 minutes to drain connections. How would you design this without breaking the pod lifecycle?
You're running DaemonSet pods on 1000 nodes. During a rolling update, all 1000 pods try to start simultaneously. Kubelet on each node reports OOMKilled for the new DaemonSet pods, but old ones still running. Nodes show memory available but pods can't start.
Kubelet admits pods against node allocatable using their declared requests, not live usage: the moment a pod is bound to the node, its full memory request counts—even before the image is pulled. If old and new DaemonSet pods coexist during the rollout, both requests count simultaneously. On a node with 16GB allocatable, an 8GB-request DaemonSet pod fits once, not twice, alongside everything else.
Check resource accounting: kubectl top node shows live usage via metrics-server (there is no --use-metrics-server flag; that's the default source), while kubectl describe node shows allocated requests. Compare both against free -m on the node. If allocated requests are near 100% while actual usage is low, the node is request-bound—usually because a pod deletion hasn't fully completed, so its request still counts.
Diagnosis: List pods on the node: kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node>. Count Running plus Terminating pods. If the old DaemonSet pod is still Terminating, its request still counts against allocatable, so the new pod can't be admitted. Check the grace period: kubectl get pod <pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}'—if it's 300 (5 min), the old pod can linger that long while the new one sits Pending.
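A quick sketch of the counting step, run here against fabricated kubectl get pods output (on a live cluster, feed it kubectl get pods -A --field-selector=spec.nodeName=<node>):

```shell
# Fabricated `kubectl get pods` listing for one node.
sample='NAMESPACE  NAME        READY  STATUS       AGE
prod       ds-old-abc  1/1    Terminating  2d
prod       ds-new-def  0/1    Pending      1m
prod       web-xyz     1/1    Running      5d'

terminating=$(printf '%s\n' "$sample" | grep -c Terminating)
pending=$(printf '%s\n' "$sample" | grep -c Pending)
echo "terminating=${terminating} pending=${pending}"
# → terminating=1 pending=1
```

A nonzero Terminating count paired with Pending pods of the same DaemonSet is the signature of this failure mode.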
Fix the rolling update: with the default strategy (rollingUpdate.maxUnavailable=1, maxSurge=0) the old pod is terminated before the new one starts, so both coexist only while the old one drains. Setting maxSurge=1 with maxUnavailable=0 (supported on newer releases) deliberately starts the new pod before terminating the old—fine only if the node has headroom for both. If pods are OOMKilled, the node is over-committed. Solutions: (1) pre-drain the node: kubectl drain <node> --ignore-daemonsets=true before the rolling update, (2) increase node size/add nodes, or (3) reduce the pod's resource requests if possible.
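A sketch of the surge-style update strategy (a manifest fragment; DaemonSet maxSurge requires a reasonably recent Kubernetes release):

```yaml
# DaemonSet spec fragment (sketch): start the new pod before stopping the old.
# The node must have headroom for BOTH pods' requests during the handover.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```

With the default (maxUnavailable: 1, maxSurge: 0) the handover is delete-then-create instead, trading a brief coverage gap for zero extra memory demand.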
Prevent recurrence: Set resource quotas per namespace: kubectl create quota ns-quota --hard=requests.memory=100Gi—prevent one namespace from consuming entire cluster. Monitor kubelet allocatable: kubectl describe node | grep -A 5 "Allocated resources"—alert if >85% reserved.
Follow-up: You've drained nodes for the update. But terminating pods are stuck in Terminating state for 10 minutes. How would you debug why SIGTERM isn't being received?
Your cluster uses local SSD storage for databases. A node's kubelet reports disk pressure (nodefs.available < 10Gi). But the node has 500GB free space on the SSD. df -h shows the space is there. Kubelet is still evicting pods to free up space.
Kubelet calculates free space from a filesystem mount, not total device capacity. If the SSD is mounted under a subdirectory or the kubelet image directory (/var/lib/kubelet) is on a different mount than root, kubelet sees a different filesystem. Check where kubelet stores images: ps aux | grep kubelet | grep -o "\-\-root-dir=[^ ]*". Default is /var/lib/kubelet.
Inspect mount points: df -h /var/lib/kubelet—this shows kubelet's actual working filesystem. If it says 5GB free but df -h / shows 500GB, the SSD is mounted elsewhere. Check mount table: mount | grep var or lsblk—find where SSD (/dev/nvme0n1 or similar) is mounted.
Root cause: kubelet's root directory sits on the wrong filesystem. --root-dir is a command-line flag (default /var/lib/kubelet)—check the systemd unit or /var/lib/kubelet/kubeadm-flags.env, since rootDir is not a KubeletConfiguration field. Either point --root-dir at the SSD (e.g. /mnt/ssd/kubelet) or, often cleaner, mount the SSD at /var/lib/kubelet itself. Then systemctl restart kubelet.
Alternatively, kubelet's view of usage is being skewed by what's inside its filesystems. Kubelet gets disk stats from cAdvisor and the container runtime (it does not literally run du), but du is the right manual probe: du -sh /var/lib/kubelet/* 2>/dev/null | sort -rh | head -20. Common culprits: (1) unbounded container logs—for Docker, configure rotation in /etc/docker/daemon.json: {"log-driver": "json-file", "log-opts": {"max-size": "10m", "max-file": "3"}}; (2) image layers and orphaned pod volumes left behind after evictions.
Long-term: set eviction thresholds as percentages so they scale with node size: --eviction-hard=nodefs.available<5% (same < syntax as the absolute form). Monitor the filesystem kubelet actually watches—node-exporter's node_filesystem_avail_bytes for the kubelet mount, or the kubelet Summary API (/stats/summary)—so alerts line up with kubelet's perception of free space.
Follow-up: You migrated kubelet root-dir to the SSD. Two weeks later, the SSD is still full even though no pods use much space. What's accumulating and how would you diagnose it?
Your cluster spans 3 availability zones. After AZ2 is cordoned for maintenance, nodes there start rapidly evicting pods. Kubelet logs show "triggering out-of-memory-killer" but node memory is 90% free. Pods are crashing immediately after starting.
This is a cgroup memory misconfiguration or swap problem. Kubelet creates a cgroup per pod carrying its memory limit: resources.limits.memory: 5Gi becomes memory.limit_in_bytes under cgroup v1 (memory.max under v2). If that limit is computed wrong, or swap skews the accounting, the OOM killer fires inside the pod cgroup even though the host has plenty of free memory.
Check the cgroup setup: stat -fc %T /sys/fs/cgroup shows the version (cgroup2fs = v2, tmpfs = v1; /proc/cgroups does not report the version). Check kubelet version: kubelet --version—cgroup v2 support only matured around 1.25. On v1, spot-check pod limits: find /sys/fs/cgroup/memory/kubepods* -name memory.limit_in_bytes | head -5 | xargs cat—values far below the pods' declared limits indicate misconfiguration.
Swap is a prime suspect: free -h | grep -i swap. If swap is >0, the node is misconfigured to begin with—kubelet by default refuses to start with swap enabled (--fail-swap-on=true), and its memory accounting assumes no swap, so enabled swap produces surprising OOM behavior under small limits. Disable it now: swapoff -a. Permanently: comment out the swap line in /etc/fstab (do not run swapon -a afterward—that would re-enable it); confirm with free -h after the next boot.
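A sketch of the fstab edit as a filter, shown here on a fabricated fstab (on a real node you would run the awk against /etc/fstab with a backup, or just edit the file by hand):

```shell
# Fabricated /etc/fstab.
fstab='/dev/sda1  /      ext4  defaults  0 1
/dev/sda2  none   swap  sw        0 0'

# Comment out any not-yet-commented line mentioning swap; print the rest unchanged.
printf '%s\n' "$fstab" | awk '/swap/ && $0 !~ /^#/ {print "# " $0; next} {print}'
# → /dev/sda1  /      ext4  defaults  0 1
# → # /dev/sda2  none   swap  sw        0 0
```

Pair this with swapoff -a for the running system; the fstab change only covers future boots.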
Verify the fix: restart kubelet after disabling swap: systemctl restart kubelet. Schedule a test pod: kubectl run test --image=nginx—it should start without an OOMKill. If OOMKills persist, look at kernel-level settings rather than kubelet: process limits via ulimit -a, and memory-overcommit sysctls such as vm.overcommit_memory that change how allocations fail under pressure.
For the cordoned nodes: cordon alone never evicts pods. If pods are being evicted during the maintenance window, a NoExecute taint was applied (e.g. kubectl taint nodes nodename maintenance=true:NoExecute), which makes the taint manager evict every pod lacking a matching toleration. Remove the taint when done, then kubectl uncordon nodename.
Follow-up: Swap is disabled. OOMKilled continues. Pod requests 2GB memory but kubelet shows "oom killer event" within 100ms of startup. What kernel tuning could cause this?
You're upgrading kubelet on your nodes from 1.28 to 1.31. After the rolling restart, nodes report they can only accept 20 pods each (instead of 110). Old nodes still accept 110. The scheduler is rejecting new pods across the cluster because node pod capacity has collapsed.
Something in the 1.31 kubelet is computing a lower pod capacity—measure before theorizing. Run kubectl describe node on an old node and a new node and diff the Capacity and Allocatable sections (pods, cpu, memory, ephemeral-storage), then read the kubelet release notes for 1.29–1.31 for allocatable and admission changes. --max-pods is normally authoritative for the pods figure, so a lower reported value means some other input is overriding or replacing your flag.
One hypothesis to check: pod overhead. If a RuntimeClass defines overhead, each pod's overhead is added to its resource requests at admission and scheduling, so fewer pods fit per node—effective capacity drops via resource accounting even though the pods count in Allocatable is unchanged. Confirm whether the observed 20-pod ceiling is resource-driven (cpu/memory exhausted) rather than the pods figure itself.
Check RuntimeClass definition: kubectl get runtimeclasses -o yaml | grep -A 5 "overhead". If overhead is set, that's the issue. For backward compatibility, remove overhead or reduce it: kubectl patch runtimeclass myruntime --type=merge -p '{"overhead": null}' (if not needed).
Alternative: PID pressure rather than memory. A low kernel pid_max (default 32768 on many distros) combined with kubelet's per-pod PID limit (podPidsLimit) and the pid.available eviction signal can stop a dense node from accepting pods well below maxPods—though kubelet does not derive maxPods from pid_max. Check: cat /proc/sys/kernel/pid_max. Raise it: sysctl -w kernel.pid_max=1000000, and persist kernel.pid_max=1000000 in /etc/sysctl.conf.
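A small sketch of the check-and-raise sequence (the fallback value is only for running the sketch on non-Linux shells; on a node the /proc read always works):

```shell
# Read the current kernel PID limit (fallback to the common default if /proc
# is unavailable, e.g. when sketching this on a non-Linux machine).
pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 32768)
echo "current pid_max=${pid_max}"

# To raise it on a node (as root):
#   sysctl -w kernel.pid_max=1000000
# Persist across reboots in /etc/sysctl.conf or a drop-in under /etc/sysctl.d/:
#   kernel.pid_max=1000000
```

After raising it, watch for PIDPressure conditions clearing in kubectl describe node.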
Immediate: explicitly set maxPods: 110 in /etc/kubernetes/kubelet/kubelet-config.yaml (don't rely on auto-detection), then systemctl restart kubelet. Verify: kubectl describe node <node> | grep -A 8 Allocatable should list pods: 110. For the cluster-wide fix, push the kubelet config to all nodes via your bootstrap tooling (or a privileged config-sync DaemonSet) ahead of the 1.31 upgrade.
Follow-up: You've set maxPods to 110 but kubelet still reports 20. Kubelet logs show "calculated allocatable: 20 pods". What internal calculation is overriding your flag?