Kubernetes Interview Questions

Pod CrashLoopBackOff Debugging


You deploy a config change to your web service at 9:15am. By 9:18am, your entire deployment is in CrashLoopBackOff. Prometheus alerts fire. Your team is paging you. Walk through your exact triage steps.

First command: kubectl describe pod <pod-name>. The Events section tells the story—I'm looking for the actual reason the container keeps dying. Is the last state OOMKilled? Is it actually ImagePullBackOff rather than a crash? Is the pod Unschedulable? The event message is gold.

Second: kubectl logs <pod-name> --previous to see stdout/stderr from the crashed container. If there are init containers, I grab those logs too: kubectl logs <pod-name> -c init-app --previous. The crash logs are my north star—they tell me whether the app crashed or couldn't start.

Third: check the deployment status immediately: kubectl get deployment -o wide and kubectl rollout status deployment/<name>. Then I look at the exact change I deployed: kubectl rollout history deployment/<name> to see revision numbers, then kubectl rollout history deployment/<name> --revision=N to diff what changed.

Fourth: validate my change manually. If I changed a ConfigMap, I verify it exists and is mounted correctly: kubectl describe configmap <name>, then check the pod's volume mounts. If I changed the image, I verify it exists in my registry and pulls cleanly: docker pull gcr.io/project/image:tag. If there are auth issues, I check imagePullSecrets in the pod spec.

Common culprits: (1) env vars referencing a ConfigMap/Secret that doesn't exist yet, (2) new volume mounts pointing to ConfigMaps that have a typo, (3) image pull failures due to wrong tag or bad credentials, (4) app expecting a file or port that the new config doesn't provide. My rollback is instant: kubectl rollout undo deployment/<name> while I investigate. Alerting the team that I'm investigating buys time.
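Those five steps condense into a script. A minimal sketch—the namespace, deployment, and pod names below are hypothetical, and the kubectl shell function is a dry-run stub that just prints each command (delete that line to run it against a real cluster):

```shell
kubectl() { echo "+ kubectl $*"; }   # dry-run stub for illustration; remove to execute for real

NS=prod DEPLOY=web POD=web-6d4cf56db9-abcde   # hypothetical names

kubectl describe pod "$POD" -n "$NS"                    # 1. events: why is it dying?
kubectl logs "$POD" -n "$NS" --previous                 # 2. the crashed container's output
kubectl rollout status deployment/"$DEPLOY" -n "$NS"    # 3. rollout state
kubectl rollout history deployment/"$DEPLOY" -n "$NS"   # 4. which revision changed what
kubectl rollout undo deployment/"$DEPLOY" -n "$NS"      # 5. instant rollback while investigating
```

The point of scripting it is ordering: logs and events first, rollback last, so the evidence is captured before the broken revision disappears.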

Follow-up: The pod started successfully on Monday but crashes on Tuesday with the exact same image and config unchanged. What happened?

A pod crashes with exit code 137. Your logs show nothing unusual. What's happening?

Exit code 137 = 128 + 9 = SIGKILL (signal 9). Kubernetes (or the kernel) forcefully killed the process—the app did not exit on its own. The usual reason is resource pressure. Check kubectl describe pod for Last State: Terminated with Reason: OOMKilled. If it was OOMKilled, the container hit its own memory limit; the node as a whole may have had plenty of memory free.
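The 128-plus-signal arithmetic is easy to verify in any POSIX shell, no cluster needed:

```shell
# Exit codes above 128 encode a fatal signal: code = 128 + signal number.
# SIGKILL is signal 9, so a SIGKILLed process reports 137 -- the same
# code an OOMKilled container shows.
sh -c 'kill -KILL $$'   # the child shell kills itself with SIGKILL
echo "exit code: $?"    # prints: exit code: 137
```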

Verify what the pod's limits are: kubectl get pod -o jsonpath='{.spec.containers[0].resources}' or use the describe output. Is the limit way too low relative to what the app needs? I cross-check with the actual RSS the app was using before it crashed. For Java apps especially, I account for heap + metaspace + off-heap—often devs set limits to heap size only and wonder why it crashes.
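The heap-only mistake is easy to see with back-of-envelope numbers—these figures are illustrative, not JVM sizing guidance:

```shell
# Hypothetical memory budget for a JVM container. The limit has to cover
# everything the process maps, not just -Xmx.
heap=2048       # -Xmx2g
metaspace=256   # class metadata (illustrative)
threads=200     # ~200 threads x ~1MB default stack (illustrative)
offheap=512     # direct buffers, code cache, GC bookkeeping (illustrative)
total=$((heap + metaspace + threads + offheap))
echo "limit should be at least ${total}Mi, not ${heap}Mi"
```

With these numbers a 2GB limit is ~1GB short, which is exactly the "ran fine locally" OOMKill pattern.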

Check if the node itself had memory pressure: kubectl describe node <node> to see MemoryPressure status. If the node was under pressure, Kubernetes evicts pods based on QoS class and resource usage. Even a pod under its limit can be evicted if the node is critical.

I also verify that no sidecar or init container leaked memory. If there's a Prometheus agent or logging sidecar, sometimes they buffer data and consume memory you didn't account for. Use kubectl top pod --containers to see per-container usage (requires metrics-server).

Follow-up: Your app is Java, runs fine locally with 2GB heap, but crashes with memory limit of 3GB in Kubernetes. Why?

Every pod crashes immediately after you add a readiness probe. Logs are clean, no errors. What's your first suspicion?

First suspicion: the probe itself is failing. Worth being precise here—a failing readiness probe marks the pod NotReady and pulls it out of Service endpoints; it doesn't restart the container. If pods are genuinely crashing, check whether a liveness probe was added in the same change. Either way, I check the exact probe definition: kubectl get pod -o yaml | grep -A10 readinessProbe. Common issues: (1) wrong port number, (2) wrong path (especially if the app serves under a basePath), (3) initialDelaySeconds too low—the app isn't ready by then, (4) incorrect HTTP method or headers, (5) the exec probe's script has the wrong shebang or doesn't exist in the image.

I verify the probe manually. If it's an HTTP probe on port 8080: kubectl exec <pod> -- curl -v http://localhost:8080/health. Does it return 200? If it times out or hangs, there's network or app logic issue. For exec probes, I try the exact command inside the pod: kubectl exec <pod> -- /bin/sh -c 'my-probe-script.sh' to see what error it throws.

If the probe passes manually but fails in Kubernetes, I check the timings: timeoutSeconds might be too short for a slow endpoint, or failureThreshold is 1 and a single transient failure flips the pod to NotReady (or, for a liveness probe, restarts it). I start with generous timeouts and a high failureThreshold to get the pod stable, then tighten them once it's working.
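A conservative probe to start from while debugging—path, port, and timings are illustrative assumptions, not recommended defaults:

```yaml
readinessProbe:
  httpGet:
    path: /health          # must match the app's real path, including any basePath
    port: 8080             # the containerPort, not the Service port
  initialDelaySeconds: 15  # give the app time to boot
  timeoutSeconds: 5        # generous while debugging
  periodSeconds: 10
  failureThreshold: 3      # tolerate transient blips
```

Once the pod is stable, tighten the numbers so a genuinely broken pod is detected quickly.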

Follow-up: The readiness probe is a shell script that checks a file exists in a shared volume. Sometimes it passes, sometimes it fails. Why?

Your init container runs fine during local Docker build but crashes when deployed to Kubernetes. It fails silently—no logs visible.

Get the init container logs: kubectl logs <pod> -c <init-container-name>. If that's empty, try the --previous flag: kubectl logs <pod> -c <init-container-name> --previous. If still nothing, check the pod's status: kubectl get pod <pod> -o jsonpath='{.status}' to see whether the init container even started or the pod failed at scheduling.

The init container environment in Kubernetes differs from Docker: no host mounts usually, no inherited env vars unless you explicitly pass them. Check if your init container is trying to access a volume that doesn't exist—kubectl describe pod will show which volumes are defined. If the init container expects /data but the volume isn't mounted, it crashes.

Permission issues are sneaky. If your init container runs as a different user (uid:gid) than the main container, and there's an emptyDir volume, permission denied errors can be hard to spot. I verify the security context: kubectl get pod -o yaml | grep -A5 securityContext and check if the init container has different runAsUser/runAsGroup.

Resource limits matter too. If the init container has a small memory limit and tries to unpack a large tarball or process a big file, it can OOMKill. Check the initContainers resource requests/limits separately from main container limits.
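All of these failure modes live in the pod spec; a hypothetical spec marking where to look:

```yaml
spec:
  initContainers:
  - name: init-app                 # hypothetical name
    image: busybox:1.36
    command: ["sh", "-c", "echo seed > /data/ready"]
    securityContext:
      runAsUser: 1000              # uid mismatch vs. the main container = permission denied on shared volumes
    resources:
      limits:
        memory: 128Mi              # init containers have their own limits; big unpack jobs OOMKill here
    volumeMounts:
    - name: data                   # a typo'd volume name here fails the whole pod
      mountPath: /data
  containers:
  - name: app
    image: registry.example.com/app:1.2.3
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}
```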

Follow-up: Your init container succeeds on 3 of 5 nodes but crashes on 2 others. All nodes are identical. What's happening?

A pod is stuck in Pending and never enters CrashLoopBackOff or Running. It just sits there. kubectl describe shows "No events yet". What's your next move?

Pending means the pod was accepted by the API server but the scheduler can't find a node for it. Re-run kubectl describe pod and watch the event stream with kubectl get events -w—sometimes events take a few seconds to surface.

Check if the pod is even being considered for scheduling: kubectl get pod -o yaml and look at the nodeName field. If it's empty, the scheduler hasn't placed it yet. Check if there's a NodeSelector or nodeAffinity: kubectl get pod -o yaml | grep -A5 nodeSelector. If you're selecting a node label that doesn't exist on any node, it'll be Pending forever.

Check node resources: kubectl top nodes and kubectl describe nodes. Are all nodes full (CPU, memory, ephemeral storage)? If so, look at pod request sizes: kubectl get pod -o yaml | grep -A5 resources. Is the pod requesting 16 cores on a 4-core node? That's an instant Pending.

Check for taints and tolerations: kubectl describe node <node> to see Taints. If the node is tainted (like NoSchedule) and the pod doesn't tolerate it, the pod won't schedule. Check pod tolerations: kubectl get pod -o yaml | grep -A10 tolerations.

If none of that applies, check if there's a PodDisruptionBudget or priority conflicts causing scheduling delays. In rare cases, a custom scheduler might be configured. The bottom line: Pending = scheduler can't place it, not image/config issues.
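The scheduling-relevant fields in one place—every name and value here is hypothetical:

```yaml
spec:
  nodeSelector:
    disktype: ssd          # Pending forever if no node carries this label
  tolerations:
  - key: dedicated         # must match the node's taint exactly
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/app:1.2.3
    resources:
      requests:
        cpu: "2"           # must fit within some node's allocatable CPU
        memory: 4Gi        # and memory, after existing pods are accounted for
```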

Follow-up: You add a toleration for the taint, the pod schedules, but then immediately crashes. Why was it Pending in the first place?

You have 10 identical pods. 9 run fine, 1 crashes immediately with a slightly different error each time. What's going on?

Different errors on each restart = race condition or node-specific issue. Get the exact error from the latest crash: kubectl logs <pod> --previous, note it, then let it crash again and compare. Are the errors actually different, or is it the same failure with only timestamps and counters changing?

Check which node the pod is scheduled on: kubectl get pod -o wide. Compare the crashing pod's node with the working pods. Are the working pods all on different nodes? Does the crashing pod always land on the same node? If so, that node might have missing libraries, different kernel version, different disk speed, or environment variable differences.

Run a diagnostic on the node: kubectl debug node/<node-name> -it --image=ubuntu gives you a pod on that node with the host filesystem mounted at /host (chroot /host to use the node's own tools). Check dmesg for hardware trouble: dmesg | tail -50. Look for CPU errors, disk I/O errors, or memory issues.

If the pod has local storage requirements (reading from /dev, host path volumes), node differences matter a lot. Check the pod's volumes: kubectl get pod -o yaml | grep -A20 volumes. If there's a hostPath, verify it exists on the crashing node—remembering that the node debug pod sees the host filesystem under /host: kubectl debug node/<node> -it --image=ubuntu -- ls -la /host/the/host/path.

Follow-up: The pod crashes only when scheduled to a node running an older kernel. How do you prevent this permanently?

You deploy a new version of your app. The pod enters CrashLoopBackOff. Logs show connection refused to a database. But the database is running, and the old pod version connects fine. What happened?

The new pod version has different network connectivity or DNS resolution than the old one. First, check if DNS is resolving the database hostname inside the pod: kubectl exec <pod> -- nslookup db.production.svc.cluster.local. Does it return an IP? If not, check /etc/resolv.conf inside the pod for the DNS servers, and verify they're reachable.

Check if there's a difference in network policies: kubectl get networkpolicies -A and specifically look for policies on the pod's namespace that might be blocking egress. Verify the pod has egress rules to the database port.

Compare the pod specs between old and new. kubectl rollout history deployment/<name> lists the revisions, and each revision is backed by a ReplicaSet, so you can diff the actual pod templates: kubectl get rs to find the old and new ReplicaSets, then diff <(kubectl get rs <old-rs> -o yaml) <(kubectl get rs <new-rs> -o yaml). Did you change the serviceAccountName, add a sidecar with network limitations, or modify nodeAffinity in a way that puts it on a different node with different network rules?

Check the actual connectivity from inside the pod: kubectl exec <pod> -- nc -zv db.production.svc.cluster.local 5432, or curl if it's HTTP. A timeout means packets are being dropped somewhere—think NetworkPolicy or firewall. A refusal means the host is reachable but nothing is listening on that port.

Verify the service endpoint exists and is populated: kubectl get endpoints db-service -o yaml. If endpoints are empty, no pods are backing the service, so any connection attempt fails.
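If a default-deny egress policy exists in the namespace, the new pods must be selected by an allow rule like this—names, labels, and ports are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-egress          # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web                    # if the new deployment changed labels, the policy stops selecting the pods
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: db
    ports:
    - protocol: TCP
      port: 5432
  - ports:                        # allow DNS too, or name resolution breaks under default-deny
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Note the DNS rule: a common failure mode is a correct database rule with blocked DNS egress, which surfaces as resolution failures rather than refused connections.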

Follow-up: The database service endpoint is populated correctly. The old pod can connect, but the new one can't, even though they're on the same node. What's the difference?

A pod crashes with exit code 143. This is different from 137. The logs show a clean shutdown. But it's still CrashLoopBackOff. Why?

Exit code 143 = 128 + 15 = SIGTERM (signal 15). Kubernetes is deliberately terminating the pod, not force-killing it. The pod received a SIGTERM and died from it—143 specifically means the process was terminated by the signal, though many runtimes (the JVM, for one) still run their shutdown hooks first, which is why the logs can look like a clean shutdown. Kubernetes then restarted it. SIGTERM-driven termination is expected during rollouts, node drains, and pod evictions.
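Same arithmetic as exit code 137, runnable in any POSIX shell—and it shows why an app that traps SIGTERM usually reports 0 rather than 143:

```shell
# Unhandled SIGTERM: the process dies with the signal, 128 + 15 = 143.
sh -c 'kill -TERM $$'
echo "no handler:   $?"    # prints: no handler:   143

# Trapped SIGTERM: the handler runs and chooses the exit code; a truly
# clean shutdown usually exits 0.
sh -c 'trap "exit 0" TERM; kill -TERM $$; sleep 1'
echo "with handler: $?"    # prints: with handler: 0
```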

If it's continuously restarting with 143, look at the restart policy: kubectl get pod -o yaml | grep restartPolicy. The default is Always, so any exit—even a graceful one—restarts the pod. If the workload is a one-shot task, it belongs in a Job with restartPolicy Never or OnFailure; a Deployment's pods only support Always.

But if this is a long-running service and it shouldn't be exiting at all, something is triggering the shutdown. Is there a liveness probe that's failing? Is the pod being evicted due to resource pressure? Check events: kubectl describe pod, and watch them live with kubectl get events -w. Look for "Killing" events—that means the kubelet is sending SIGTERM.

Verify the pod isn't being scaled down or the deployment isn't being updated: kubectl get deployment -o yaml | grep replicas. If you scaled to 0, obviously pods terminate.

If the pod is exiting cleanly and restarting indefinitely, and you don't want that, move the workload to a Job with restartPolicy: Never. If it should stay running, investigate why the app is exiting at all—check the logs from just before the exit.

Follow-up: The exit code is 143 but the logs show the app crashed with a panic, not a clean shutdown. How is this possible?
