Your Cassandra StatefulSet is running 3 replicas. Node-2 (running cassandra-1) suddenly fails and is terminated by the infrastructure. A new node is created automatically (via node auto-scaling). The pod cassandra-1 gets rescheduled to the new node, but the PersistentVolume from node-2 isn't mounted to the new node. Your cluster now has orphaned storage. Cassandra is inconsistent. How do you recover?
This is a fundamental StatefulSet + PV issue: PVs are pinned to topology (via node affinity or storage constraints), so when a node fails, the PV becomes unreachable from the replacement node. The rescheduled pod can't attach the old PV, and the data on it is stranded until you intervene.
Immediate diagnosis:
1. Identify the orphaned PV:
kubectl get pv | grep cassandra
kubectl describe pv cassandra-pv-1
Expected output: PVC not bound, but volume data exists in the old node's storage.
2. Check the new cassandra-1 pod:
kubectl get pod cassandra-1 -o yaml | grep -A 5 volumeMounts
kubectl exec cassandra-1 -- df -h | grep cassandra
If no volume mounted, that's the problem.
3. Verify PVC status:
kubectl get pvc
kubectl describe pvc cassandra-data-cassandra-1
If status is "Pending", the PVC can't find a suitable PV (likely due to node affinity mismatch).
Root cause analysis:
kubectl describe pvc cassandra-data-cassandra-1 | grep -A 10 Events
Look for: "no persistent volumes available" or "node affinity mismatch"
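The events usually tell you which of the two it is. As a quick sanity check you can also compare the hostname a PV is pinned to against the nodes that still exist. A minimal sketch (the PV name, jsonpath, and node names are placeholder assumptions):

```shell
# Hypothetical inputs; on a live cluster you would fetch them with e.g.:
#   PV_NODE=$(kubectl get pv cassandra-pv-1 -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
#   NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
pv_affinity_ok() {
  pv_node="$1"; shift
  for n in "$@"; do
    [ "$n" = "$pv_node" ] && return 0   # the pinned node still exists
  done
  return 1                              # node is gone: affinity mismatch
}

# node-2 was replaced, so a PV pinned to it can no longer bind:
if pv_affinity_ok "node-2" "node-1" "node-3" "node-4"; then
  echo "PV is schedulable"
else
  echo "node affinity mismatch: PV is pinned to a node that no longer exists"
fi
```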
Recovery steps:
Option A: If data loss is acceptable (dev environment):
1. Delete the orphaned PV:
kubectl delete pv cassandra-pv-1
2. Restart the pod to force a new volume:
kubectl delete pod cassandra-1
3. Trigger Cassandra repair:
kubectl exec cassandra-0 -- nodetool repair
Option B: If data must be recovered (production):
1. Manually reattach the orphaned PV:
aws ec2 describe-volumes --filters Name=tag:kubernetes.io/created-for/pvc/name,Values=cassandra-data-cassandra-1
2. Detach and reattach to new node:
aws ec2 attach-volume --volume-id vol-xxxxx --instance-id i-newnode --device /dev/xvdb
3. Mount on node:
ssh new-node-ip
sudo mkdir -p /mnt/cassandra-data
sudo mount /dev/xvdb /mnt/cassandra-data
4. Update the PV object's node affinity. Note that spec.nodeAffinity is immutable once set, so if the PV already carries affinity for the dead node, patching will be rejected and you must export, delete, and re-create the PV object (with reclaimPolicy Retain, so the backing volume survives):
kubectl patch pv cassandra-pv-1 -p '{"spec":{"nodeAffinity":{"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["new-node-name"]}]}]}}}}'
5. Force PVC to bind:
kubectl delete pod cassandra-1
Prevention for the future:
1. Use topology-aware provisioning so each volume is created in the zone where its pod actually lands (EBS volumes are inherently single-AZ, which is why WaitForFirstConsumer matters):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cassandra-storage
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
2. Implement pod disruption budgets:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cassandra
3. Monitor for orphaned PVs:
kubectl get pv | grep Available
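Released volumes are orphan candidates too, not just Available ones. A small sketch that pulls both states out of `kubectl get pv --no-headers` output (shown here with inlined sample output instead of a live cluster):

```shell
# STATUS is the 5th column of `kubectl get pv --no-headers`
# (NAME, CAPACITY, ACCESS MODES, RECLAIM POLICY, STATUS, CLAIM, ...).
find_orphan_pvs() {
  awk '$5 == "Released" || $5 == "Available" { print $1 }'
}

# Sample standing in for: kubectl get pv --no-headers | find_orphan_pvs
cat <<'EOF' | find_orphan_pvs
cassandra-pv-0  100Gi  RWO  Retain  Bound     default/cassandra-data-cassandra-0
cassandra-pv-1  100Gi  RWO  Retain  Released  default/cassandra-data-cassandra-1
cassandra-pv-2  100Gi  RWO  Retain  Bound     default/cassandra-data-cassandra-2
EOF
# → cassandra-pv-1
```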
Follow-up: How would you design a backup strategy to protect against complete PV loss? Should you use snapshots, replication, or both?
Your MongoDB StatefulSet has 5 replicas. During a cluster upgrade, Pod 3 is evicted (graceful termination). 30 seconds later, a replacement Pod 3 starts on a different node, claims the same PVC, and mounts the same PV. But the PV has stale data from 5 minutes ago—the old pod never fully synced. Your cluster now has data inconsistency. How does this happen and how do you prevent it?
The issue: StatefulSets guarantee ordinal identity and stable PV binding, but they don't guarantee data consistency during pod replacement. Between the old pod terminating and the new pod starting, uncommitted writes are lost.
Root cause: termination grace period is too short, or the application doesn't flush its state properly before termination.
Diagnosis:
1. Check pod termination logs:
kubectl describe pod mongodb-3
kubectl logs mongodb-3 --previous --tail=100 | tail -20
2. Check PV mount status:
kubectl get events | grep -E 'mongodb-3|PV'
3. Verify replication state before pod termination (server commands must be run through mongosh, not passed directly to the container):
kubectl exec mongodb-3 -- mongosh --quiet --eval 'db.printSecondaryReplicationInfo()'
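To turn that check into a number you can gate on, the worst-case lag can be extracted from `db.printSecondaryReplicationInfo()` output. A sketch assuming mongosh's usual "N secs ... behind the primary" lines (sample output inlined instead of a live replica set):

```shell
# Print the worst replication lag (seconds) across all secondaries.
# Live usage would be roughly:
#   kubectl exec mongodb-3 -- mongosh --quiet --eval 'db.printSecondaryReplicationInfo()' | max_lag
max_lag() {
  awk '/behind the primary/ { gsub(/^[ \t]+/, ""); if ($1+0 > max) max = $1+0 } END { print max+0 }'
}

cat <<'EOF' | max_lag
source: mongodb-1:27017
        syncedTo: Tue Jan 02 2024 10:15:33 GMT+0000
        0 secs (0 hrs) behind the primary
source: mongodb-3:27017
        syncedTo: Tue Jan 02 2024 10:10:33 GMT+0000
        300 secs (0.08 hrs) behind the primary
EOF
# → 300
```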
Fix 1: Increase termination grace period
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 300
      containers:
      - name: mongodb
        image: mongo:6.0
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                mongosh --quiet --eval 'db.adminCommand({fsync: 1})'
                mongosh admin --quiet --eval 'db.shutdownServer()' || true
Fix 2: Use pod lifecycle hooks to ensure data consistency
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - |
        mongosh --quiet --eval 'db.adminCommand({replSetStepDown: 10})'
        sleep 30
        mongosh --quiet --eval 'db.adminCommand({fsync: 1})'
Fix 3: Use a readiness probe that checks replication lag
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      mongosh --quiet --eval '
        const st = db.adminCommand({replSetGetStatus: 1});
        const me = st.members.find(m => m.self);
        if (me && me.stateStr === "PRIMARY") quit(0);
        const primary = st.members.find(m => m.stateStr === "PRIMARY");
        const lagSecs = primary && me ? (primary.optimeDate - me.optimeDate) / 1000 : 9999;
        quit(lagSecs < 10 ? 0 : 1);'
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
Fix 4: Use OrderedReady pod management (this is the default, but pin it explicitly)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  podManagementPolicy: OrderedReady # Ensures sequential startup
Follow-up: How do you handle ordered startup for a 10-replica StatefulSet where each startup takes 5 minutes? Design a faster startup strategy without losing consistency.
You have a PersistentVolume (10GB) that's almost full (9.8GB). Your application doesn't handle out-of-disk errors gracefully and crashes with ENOSPC. Manual expansion via kubectl patch is risky while the pod is running. How do you safely expand the volume without data loss or downtime?
PV expansion in Kubernetes requires careful coordination: expand the storage claim, then the filesystem inside the volume.
Phase 1: Pre-expansion validation
1. Check if StorageClass allows expansion:
kubectl get storageclass -o yaml | grep allowVolumeExpansion
2. Check current usage on the mount:
kubectl exec app-pod -- df -h | grep /data
3. Create a safety net before touching the volume (prefer a CSI VolumeSnapshot if your driver supports it; otherwise stream a tar off the pod so the archive doesn't land inside /data itself):
kubectl exec app-pod -- tar -czf - /data > data-backup-$(date +%s).tar.gz
Phase 2: Zero-downtime expansion (recommended)
1. Expand the PVC:
kubectl patch pvc app-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
2. Verify PVC expansion:
kubectl get pvc app-pvc
3. Check if filesystem auto-expanded:
kubectl exec app-pod -- df -h | grep /data
4. If not auto-expanded, restart the pod so the kubelet resizes the filesystem on remount (online expansion requires ExpandInUsePersistentVolumes, enabled by default in modern clusters; running resize2fs by hand only works where the block device is visible):
kubectl delete pod app-pod
5. Verify:
kubectl exec app-pod -- df -h
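To script the verification step, compare the size df reports against the new request. A sketch using `df -BG` output (sample inlined; `app-pod`, `/data`, and the 20Gi target are carried over from the steps above):

```shell
# Parse the size column out of `df -BG <mount>` (header line + one data line).
fs_size_gb() {
  awk 'NR == 2 { gsub(/G/, "", $2); print $2 }'
}

# Sample standing in for: kubectl exec app-pod -- df -BG /data
size=$(cat <<'EOF' | fs_size_gb
Filesystem     1G-blocks  Used Available Use% Mounted on
/dev/xvdb            20G    9G       11G  46% /data
EOF
)
if [ "$size" -ge 19 ]; then   # allow ~1G of filesystem overhead vs the 20Gi request
  echo "expansion verified: ${size}G"
else
  echo "filesystem still small: ${size}G"
fi
# → expansion verified: 20G
```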
Phase 3: Graceful expansion with brief downtime (if zero-downtime fails)
1. Stop the application:
kubectl scale deployment app --replicas=0
2. Expand PVC:
kubectl patch pvc app-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
3. Resize filesystem:
ssh node-running-old-pod
sudo mount /dev/xvdb /tmp/mount-test
sudo resize2fs /dev/xvdb
sudo umount /tmp/mount-test
4. Restart the pod:
kubectl scale deployment app --replicas=1
Phase 4: Prevent future issues
- alert: DiskUsageHigh
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
  for: 5m
  annotations:
    summary: "Volume {{ $labels.persistentvolumeclaim }} is 80% full"
Follow-up: How would you design a multi-tier storage strategy where cold data is moved to cheaper storage automatically?
You're running a Kubernetes StatefulSet for an Elasticsearch cluster (3 master, 5 data nodes). A rolling update is in progress. During the update, Pod data-2 is terminated before the data is synced to other nodes. The new data-2 pod starts fresh with an empty PV. Elasticsearch shard rebalancing triggers and tries to restore shards from data-2, but they're gone. The cluster health drops to RED. Walk through recovery.
This is a critical scenario combining stateful pod replacement, PV lifecycle, and distributed system consistency. Recovery depends on Elasticsearch's shard replica distribution.
Immediate diagnosis:
1. Check cluster health:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cluster/health | jq '.status, .unassigned_shards'
2. Identify missing shards:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/shards?v | grep UNASSIGNED
3. Check node status:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/nodes?v
Root cause: The old data-2 pod held primary shards. When it came back with an empty PV, those shards became unassigned. Elasticsearch delays reallocation while it waits for the node's data to return (index.unassigned.node_left.delayed_timeout, 1m by default, commonly raised to 5m for rolling restarts).
Recovery Phase 1: Immediate stabilization (5 minutes)
Option A: Wait out the delayed timeout (safe if replicas exist on other nodes):
sleep 300 # matches a 5m index.unassigned.node_left.delayed_timeout
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cluster/health | jq '.status'
Option B: Force empty-primary allocation (last resort — data on those shards is permanently lost, and the API requires an explicit accept_data_loss acknowledgement):
kubectl exec elasticsearch-master-0 -- curl -s -X POST localhost:9200/_cluster/reroute -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index-name",
        "shard": 0,
        "node": "data-2",
        "accept_data_loss": true
      }
    }
  ]
}'
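Doing this shard-by-shard by hand is error-prone. A sketch that turns `_cat/shards` output into one reroute command body per unassigned primary (node name `data-2` carried over from above; sample output inlined instead of a live cluster):

```shell
# _cat/shards columns: index shard prirep state ...  ("p" = primary).
# Only primaries with no surviving copy need allocate_empty_primary;
# unassigned replicas will recover from their primaries.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" {
    printf "{\"allocate_empty_primary\":{\"index\":\"%s\",\"shard\":%s,\"node\":\"data-2\",\"accept_data_loss\":true}}\n", $1, $2
  }'
}

# Sample standing in for:
#   kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/shards
cat <<'EOF' | unassigned_primaries
logs-2024.01 0 p UNASSIGNED
logs-2024.01 1 r UNASSIGNED
logs-2024.02 2 p STARTED 10.0.0.1 data-1
EOF
```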
Recovery Phase 2: Restore from snapshot (production-safe)
1. Verify snapshot repository:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_snapshot/_all
2. List available snapshots:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_snapshot/backup/_all
3. Restore specific indices:
kubectl exec elasticsearch-master-0 -- curl -s -X POST localhost:9200/_snapshot/backup/snapshot-id/_restore -H 'Content-Type: application/json' -d '{
"indices": "index-name-*",
"ignore_unavailable": true
}'
Prevention: Proper StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-data
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  template:
    spec:
      terminationGracePeriodSeconds: 600
      containers:
      - name: elasticsearch
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - |
                curl -s -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
                  "transient": {
                    "cluster.routing.allocation.enable": "none"
                  }
                }'
                sleep 30
Follow-up: Design a rolling update strategy for Elasticsearch that guarantees zero shard loss. How would you coordinate pod updates with shard rebalancing?
Your Redis Sentinel StatefulSet (1 master, 2 replicas) uses local node storage (hostPath PV, not networked). Master redis-0 has a catastrophic failure. New master is elected (redis-1). But redis-0's PV is gone—it's local node storage and the node is being replaced. Redis Sentinel doesn't know how to handle PV loss. Your cluster can't recover data. Design a better architecture.
hostPath volumes are dangerous for StatefulSets: data is tied to a single node and is lost when the node dies. For Redis Sentinel or any stateful workload, you need truly persistent, network-accessible storage.
Current architecture analysis:
1. Identify the problem:
kubectl describe statefulset redis | grep -A 5 volumeMounts
kubectl get pv | grep redis
2. Check data loss:
kubectl exec redis-1 -- redis-cli info replication
Migration to networked storage:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: redis-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
Create new StatefulSet with networked storage:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-new
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7.0
        ports:
        - containerPort: 6379
          name: redis
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: redis-storage
      resources:
        requests:
          storage: 50Gi
Data migration from old to new:
1. Make the new pod replicate from the old one (note the direction: the new node is the replica; REPLICAOF supersedes the deprecated SLAVEOF), wait for the sync, then promote it:
kubectl exec redis-new-0 -- redis-cli REPLICAOF redis-old-0 6379
sleep 30
kubectl exec redis-new-0 -- redis-cli REPLICAOF NO ONE
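A fixed sleep is a guess; it is safer to poll the new replica until the master link is up and the initial bulk sync has finished. A sketch that parses `redis-cli info replication` (sample output inlined; live input would be `kubectl exec redis-new-0 -- redis-cli info replication`):

```shell
# INFO replication lines end in CRLF, so avoid anchoring patterns at end-of-line.
sync_done() {
  awk -F: '
    /^master_link_status/    { up = ($2 ~ /^up/) }
    /^master_sync_in_progress/ { syncing = ($2+0 != 0) }
    END { print ((up && !syncing) ? "synced" : "still syncing") }'
}

cat <<'EOF' | sync_done
# Replication
role:slave
master_host:redis-old-0
master_link_status:up
master_sync_in_progress:0
EOF
# → synced
```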
Key difference:
OLD: hostPath → single node → failure = data loss
NEW: EBS networked storage → data survives node failure → Sentinel promotes replica automatically
Follow-up: Your cluster spans multiple availability zones. Networked storage (EBS) is single-AZ. How do you ensure data survives an entire AZ failure?
You have a RabbitMQ StatefulSet with 3 replicas and shared disk storage. During pod scheduling, rabbitmq-2 is assigned to node-3 before the PV is available (storage provisioning takes 2 minutes). The pod enters ImagePullBackOff then CrashLoopBackOff. Once storage arrives, the pod still can't mount it. The rabbitmq-2 ordinal is stuck. How do you recover and prevent this?
This is a scheduling race condition: StatefulSets assume storage is available, but asynchronous volume provisioning can delay binding. The pod is scheduled but can't mount its volume, leading to unrecoverable state.
Diagnosis:
1. Check pod status:
kubectl describe pod rabbitmq-2
2. Check PVC status:
kubectl get pvc rabbitmq-data-rabbitmq-2
kubectl describe pvc rabbitmq-data-rabbitmq-2
3. Check PV status:
kubectl get pv | grep rabbitmq
Root cause: PVC bound to PV but PV not yet mounted on the node. The pod is crashing because /var/lib/rabbitmq is empty.
Immediate recovery:
Option A: Force pod rescheduling:
kubectl delete pod rabbitmq-2
Option B: Manually mount the PV on the node:
ssh node-3
sudo mount /dev/xvdb /mnt/rabbitmq-storage
kubectl delete pod rabbitmq-2
Prevention: Use WaitForFirstConsumer volumeBindingMode
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rabbitmq-storage
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # KEY: bind at pod scheduling time
Better StatefulSet configuration:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  podManagementPolicy: OrderedReady # Ensures sequential startup
  template:
    spec:
      initContainers:
      - name: verify-storage
        image: busybox:latest
        command:
        - sh
        - -c
        - |
          # The mountPath directory always exists, so test for an actual
          # mount rather than for the directory:
          until grep -qs ' /var/lib/rabbitmq ' /proc/mounts; do
            echo "Waiting for volume mount..."
            sleep 2
          done
          df -h /var/lib/rabbitmq
        volumeMounts:
        - name: rabbitmq-data
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: rabbitmq-storage
      resources:
        requests:
          storage: 50Gi
Monitoring and alerting:
- alert: StatefulSetPodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 5m
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in Pending for 5 minutes"
Follow-up: How do you design a StatefulSet that's resilient to storage provisioning delays? What's the maximum tolerable startup time and how do you monitor it?
Your Kafka broker StatefulSet (kafka-0, kafka-1, kafka-2) each has a 500GB PV. Due to a misconfiguration, all three PVs are accidentally deleted from your storage backend (EBS in AWS). Kubernetes still has PVC/PV objects, but the underlying data is gone. You have no backup. Your Kafka cluster is completely down. How do you recover and what can you save?
This is a catastrophic scenario: the underlying storage is gone, PVs are orphaned, and Kubernetes objects point to non-existent data. Recovery depends on whether you have consumer group offsets stored elsewhere.
Diagnosis and assessment of damage:
1. Verify PVs are orphaned:
kubectl get pv | grep kafka
kubectl describe pv kafka-pv-0 | grep volumeHandle
2. Confirm volumes deleted:
aws ec2 describe-volumes --volume-ids vol-xxxxx
# Error: InvalidVolume.NotFound = data is gone
3. Check Kafka cluster state:
kubectl port-forward kafka-0 9092:9092 &
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
Recovery strategy (depends on acceptable data loss):
Scenario A: Recovery from EBS snapshots (point-in-time; anything written since the last snapshot is lost)
1. Verify backups exist:
aws s3 ls s3://kafka-backups/
2. Create new EBS volumes from the latest snapshots:
SNAPSHOT_IDS=("snap-xxxxx" "snap-yyyyy" "snap-zzzzz")
for i in 0 1 2; do
  aws ec2 create-volume --snapshot-id ${SNAPSHOT_IDS[$i]} --availability-zone us-east-1a --volume-type gp3
done
3. Update the PV objects to point at the new volumes:
kubectl patch pv kafka-pv-0 -p '{"spec":{"awsElasticBlockStore":{"volumeID":"vol-newid"}}}'
4. Restart Kafka:
kubectl delete pod kafka-0 kafka-1 kafka-2
Scenario B: Rebuild the cluster without data
1. Delete the orphaned resources:
kubectl delete statefulset kafka
kubectl delete pvc -l app=kafka
2. Create a new cluster with fresh PVs:
kubectl apply -f kafka-statefulset-new.yaml
3. Recreate topics from an external definition list:
cat topics.txt | while read topic; do
  kafka-topics.sh --bootstrap-server kafka-0:9092 --create --topic $topic --partitions 3 --replication-factor 3
done
Long-term prevention:
1. Automated backup strategy:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: amazon/aws-cli
            command:
            - /bin/bash
            - -c
            - |
              for vol in $(aws ec2 describe-volumes --filters Name=tag:app,Values=kafka --query 'Volumes[].VolumeId' --output text); do
                aws ec2 create-snapshot --volume-id $vol --description "Kafka backup $(date +%Y-%m-%d)"
              done
2. Use persistent, multi-zone storage
3. Enable audit logging to detect accidental deletions
4. Implement GitOps for topic definitions
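A backup job that silently stopped running is as bad as no backup, so also alert when the newest snapshot is too old. A sketch that compares ISO-8601 StartTime values lexically (sample timestamps inlined; live input would be something like `aws ec2 describe-snapshots --filters Name=tag:app,Values=kafka --query 'Snapshots[].StartTime' --output text | tr '\t' '\n'`):

```shell
# Exit 0 if any snapshot timestamp on stdin is newer than the cutoff.
# ISO-8601 timestamps sort lexically, so plain string comparison works.
backup_fresh() {
  awk -v cutoff="$1" '$0 > cutoff { fresh = 1 } END { exit fresh ? 0 : 1 }'
}

if printf '%s\n' '2024-01-01T02:00:14' '2024-01-03T02:00:09' | backup_fresh '2024-01-02T00:00:00'; then
  echo "backup is fresh"
else
  echo "ALERT: no snapshot since cutoff"
fi
# → backup is fresh
```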
Follow-up: Design a disaster recovery plan for Kafka that survives complete cluster destruction. What's your backup strategy and RTO/RPO targets?
Your ZooKeeper StatefulSet is using a local storage (hostPath) for state. A zk-0 pod crashes and is rescheduled to a different node. The new zk-0 has an empty local PV and restarts as a fresh node (no prior state). The quorum is broken because zk-0 doesn't know about previous commits. The cluster becomes unavailable. How do you design persistent state for ZooKeeper?
ZooKeeper requires persistent state across pod restarts. Using hostPath (local storage) creates exactly this problem: data is tied to a node and lost on reschedule.
Problem with hostPath:
- Pod crash on node-1 → rescheduled to node-2
- New pod has empty local storage (node-1's data is inaccessible)
- zk-0 joins quorum as new member, loses all state
- Cluster data inconsistency or quorum loss
Solution: Network storage (EBS, NFS, cloud-native)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zookeeper-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  serviceName: zookeeper
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.8
        ports:
        - containerPort: 2181
        - containerPort: 2888
        - containerPort: 3888
        env:
        - name: ZK_SERVER_HEAP
          value: "1024"
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper/data
        - name: datalog
          mountPath: /var/lib/zookeeper/datalog
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: zookeeper-storage
      resources:
        requests:
          storage: 10Gi
  - metadata:
      name: datalog
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: zookeeper-storage
      resources:
        requests:
          storage: 5Gi
When pod rescheduled: PVs follow pod to new node, ZooKeeper data is preserved.
Prevention and monitoring:
- Use network storage for all StatefulSet workloads
- Monitor ZooKeeper quorum health
- Alert on quorum loss
- Test pod rescheduling regularly
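The quorum check can be scripted against ZooKeeper's `srvr` four-letter command (e.g. `kubectl exec zookeeper-0 -- sh -c 'echo srvr | nc localhost 2181'`; on 3.5+, srvr must be allowed via 4lw.commands.whitelist), which reports each member's Mode. A sketch with inlined sample output, flagging when fewer than a majority report leader/follower:

```shell
# $1 = ensemble size; stdin = concatenated "Mode: ..." lines, one per member.
quorum_ok() {
  awk -v total="$1" '
    /^Mode: (leader|follower)/ { healthy++ }
    END { exit (healthy > total / 2) ? 0 : 1 }'
}

# Two followers + one leader out of a 3-node ensemble: quorum holds.
if printf 'Mode: leader\nMode: follower\nMode: follower\n' | quorum_ok 3; then
  echo "quorum healthy"
else
  echo "quorum at risk"
fi
# → quorum healthy
```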
Follow-up: How do you handle multi-region StatefulSet state when storage is single-region?