Your Cassandra StatefulSet is running 3 replicas. Node-2 (running cassandra-1) suddenly fails and is terminated by the infrastructure. A new node is created automatically (via node auto-scaling). The pod cassandra-1 gets rescheduled to the new node, but the PersistentVolume from node-2 isn't mounted to the new node. Your cluster now has orphaned storage. Cassandra is inconsistent. How do you recover?
This is a fundamental StatefulSet + PV issue: PVs are pinned to topology (via node affinity or storage constraints), so when a node fails, the PV becomes unreachable from the replacement node. The rescheduled pod can't attach the old PV, and the data on it is stranded until you intervene.
Immediate diagnosis:
1. Identify the orphaned PV:
kubectl get pv | grep cassandra
kubectl describe pv cassandra-pv-1
Expected output: PVC not bound, but volume data exists in the old node's storage.
2. Check the new cassandra-1 pod:
kubectl get pod cassandra-1 -o yaml | grep -A 5 volumeMounts
kubectl exec cassandra-1 -- df -h | grep cassandra
If no volume mounted, that's the problem.
3. Verify PVC status:
kubectl get pvc
kubectl describe pvc cassandra-data-cassandra-1
If status is "Pending", the PVC can't find a suitable PV (likely due to node affinity mismatch).
Root cause analysis:
kubectl describe pvc cassandra-data-cassandra-1 | grep -A 10 Events
Look for: "no persistent volumes available" or "node affinity mismatch"
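The events usually tell you which of the two it is. As a quick sanity check you can also compare the hostname a PV is pinned to against the nodes that still exist. A minimal sketch (the PV name, jsonpath, and node names are placeholder assumptions):

```shell
# Hypothetical inputs; on a live cluster you would fetch them with e.g.:
#   PV_NODE=$(kubectl get pv cassandra-pv-1 -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
#   NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
pv_affinity_ok() {
  pv_node="$1"; shift
  for n in "$@"; do
    [ "$n" = "$pv_node" ] && return 0   # the pinned node still exists
  done
  return 1                              # node is gone: affinity mismatch
}

# node-2 was replaced, so a PV pinned to it can no longer bind:
if pv_affinity_ok "node-2" "node-1" "node-3" "node-4"; then
  echo "PV is schedulable"
else
  echo "node affinity mismatch: PV is pinned to a node that no longer exists"
fi
```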
Recovery steps:
Option A: If data loss is acceptable (dev environment):
1. Delete the orphaned PV:
kubectl delete pv cassandra-pv-1
2. Restart the pod to force a new volume:
kubectl delete pod cassandra-1
3. Trigger Cassandra repair:
kubectl exec cassandra-0 -- nodetool repair
Option B: If data must be recovered (production):
1. Manually reattach the orphaned PV:
aws ec2 describe-volumes --filters Name=tag:kubernetes.io/created-for/pvc/name,Values=cassandra-data-cassandra-1
2. Detach and reattach to new node:
aws ec2 attach-volume --volume-id vol-xxxxx --instance-id i-newnode --device /dev/xvdb
3. Mount on node:
ssh new-node-ip
sudo mkdir -p /mnt/cassandra-data
sudo mount /dev/xvdb /mnt/cassandra-data
4. Update the PV object's node affinity. Note that spec.nodeAffinity is immutable once set, so if the PV already carries affinity for the dead node, patching will be rejected and you must export, delete, and re-create the PV object (with reclaimPolicy Retain, so the backing volume survives):
kubectl patch pv cassandra-pv-1 -p '{"spec":{"nodeAffinity":{"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["new-node-name"]}]}]}}}}'
5. Force PVC to bind:
kubectl delete pod cassandra-1
Prevention for the future:
1. Use topology-aware provisioning so each volume is created in the zone where its pod actually lands (EBS volumes are inherently single-AZ, which is why WaitForFirstConsumer matters):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cassandra-storage
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
2. Implement pod disruption budgets:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cassandra-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cassandra
3. Monitor for orphaned PVs:
kubectl get pv | grep Available
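Released volumes are orphan candidates too, not just Available ones. A small sketch that pulls both states out of `kubectl get pv --no-headers` output (shown here with inlined sample output instead of a live cluster):

```shell
# STATUS is the 5th column of `kubectl get pv --no-headers`
# (NAME, CAPACITY, ACCESS MODES, RECLAIM POLICY, STATUS, CLAIM, ...).
find_orphan_pvs() {
  awk '$5 == "Released" || $5 == "Available" { print $1 }'
}

# Sample standing in for: kubectl get pv --no-headers | find_orphan_pvs
cat <<'EOF' | find_orphan_pvs
cassandra-pv-0  100Gi  RWO  Retain  Bound     default/cassandra-data-cassandra-0
cassandra-pv-1  100Gi  RWO  Retain  Released  default/cassandra-data-cassandra-1
cassandra-pv-2  100Gi  RWO  Retain  Bound     default/cassandra-data-cassandra-2
EOF
# → cassandra-pv-1
```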
Follow-up: How would you design a backup strategy to protect against complete PV loss? Should you use snapshots, replication, or both?
Your MongoDB StatefulSet has 5 replicas. During a cluster upgrade, Pod 3 is evicted (graceful termination). 30 seconds later, a replacement Pod 3 starts on a different node, claims the same PVC, and mounts the same PV. But the PV has stale data from 5 minutes ago—the old pod never fully synced. Your cluster now has data inconsistency. How does this happen and how do you prevent it?
The issue: StatefulSets guarantee ordinal identity and stable PV binding, but they don't guarantee data consistency during pod replacement. Between the old pod terminating and the new pod starting, uncommitted writes are lost.
Root cause: termination grace period is too short, or the application doesn't flush its state properly before termination.
Diagnosis:
1. Check pod termination logs:
kubectl describe pod mongodb-3
kubectl logs mongodb-3 --previous --tail=100 | tail -20
2. Check PV mount status:
kubectl get events | grep -E 'mongodb-3|PV'
3. Verify replication state before pod termination (server commands must be run through mongosh, not passed directly to the container):
kubectl exec mongodb-3 -- mongosh --quiet --eval 'db.printSecondaryReplicationInfo()'
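To turn that check into a number you can gate on, the worst-case lag can be extracted from `db.printSecondaryReplicationInfo()` output. A sketch assuming mongosh's usual "N secs ... behind the primary" lines (sample output inlined instead of a live replica set):

```shell
# Print the worst replication lag (seconds) across all secondaries.
# Live usage would be roughly:
#   kubectl exec mongodb-3 -- mongosh --quiet --eval 'db.printSecondaryReplicationInfo()' | max_lag
max_lag() {
  awk '/behind the primary/ { gsub(/^[ \t]+/, ""); if ($1+0 > max) max = $1+0 } END { print max+0 }'
}

cat <<'EOF' | max_lag
source: mongodb-1:27017
        syncedTo: Tue Jan 02 2024 10:15:33 GMT+0000
        0 secs (0 hrs) behind the primary
source: mongodb-3:27017
        syncedTo: Tue Jan 02 2024 10:10:33 GMT+0000
        300 secs (0.08 hrs) behind the primary
EOF
# → 300
```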
Fix 1: Increase termination grace period
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 300
      containers:
      - name: mongodb
        image: mongo:6.0
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                mongosh --quiet --eval 'db.adminCommand({fsync: 1})'
                mongosh admin --quiet --eval 'db.shutdownServer()' || true
Fix 2: Use pod lifecycle hooks to ensure data consistency
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - |
        mongosh --quiet --eval 'db.adminCommand({replSetStepDown: 10})'
        sleep 30
        mongosh --quiet --eval 'db.adminCommand({fsync: 1})'
Fix 3: Use a readiness probe that checks replication lag
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      mongosh --quiet --eval '
        const st = db.adminCommand({replSetGetStatus: 1});
        const me = st.members.find(m => m.self);
        if (me && me.stateStr === "PRIMARY") quit(0);
        const primary = st.members.find(m => m.stateStr === "PRIMARY");
        const lagSecs = primary && me ? (primary.optimeDate - me.optimeDate) / 1000 : 9999;
        quit(lagSecs < 10 ? 0 : 1);'
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
Fix 4: Use OrderedReady pod management (this is the default, but pin it explicitly)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  podManagementPolicy: OrderedReady # Ensures sequential startup
Follow-up: How do you handle ordered startup for a 10-replica StatefulSet where each startup takes 5 minutes? Design a faster startup strategy without losing consistency.
You have a PersistentVolume (10GB) that's almost full (9.8GB). Your application doesn't handle out-of-disk errors gracefully and crashes with ENOSPC. Manual expansion via kubectl patch is risky while the pod is running. How do you safely expand the volume without data loss or downtime?
PV expansion in Kubernetes requires careful coordination: expand the storage claim, then the filesystem inside the volume.
Phase 1: Pre-expansion validation
1. Check if StorageClass allows expansion:
kubectl get storageclass -o yaml | grep allowVolumeExpansion
2. Check current usage on the mount:
kubectl exec app-pod -- df -h | grep /data
3. Create a safety net before touching the volume (prefer a CSI VolumeSnapshot if your driver supports it; otherwise stream a tar off the pod so the archive doesn't land inside /data itself):
kubectl exec app-pod -- tar -czf - /data > data-backup-$(date +%s).tar.gz
Phase 2: Zero-downtime expansion (recommended)
1. Expand the PVC:
kubectl patch pvc app-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
2. Verify PVC expansion:
kubectl get pvc app-pvc
3. Check if filesystem auto-expanded:
kubectl exec app-pod -- df -h | grep /data
4. If not auto-expanded, restart the pod so the kubelet resizes the filesystem on remount (online expansion requires ExpandInUsePersistentVolumes, enabled by default in modern clusters; running resize2fs by hand only works where the block device is visible):
kubectl delete pod app-pod
5. Verify:
kubectl exec app-pod -- df -h
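To script the verification step, compare the size df reports against the new request. A sketch using `df -BG` output (sample inlined; `app-pod`, `/data`, and the 20Gi target are carried over from the steps above):

```shell
# Parse the size column out of `df -BG <mount>` (header line + one data line).
fs_size_gb() {
  awk 'NR == 2 { gsub(/G/, "", $2); print $2 }'
}

# Sample standing in for: kubectl exec app-pod -- df -BG /data
size=$(cat <<'EOF' | fs_size_gb
Filesystem     1G-blocks  Used Available Use% Mounted on
/dev/xvdb            20G    9G       11G  46% /data
EOF
)
if [ "$size" -ge 19 ]; then   # allow ~1G of filesystem overhead vs the 20Gi request
  echo "expansion verified: ${size}G"
else
  echo "filesystem still small: ${size}G"
fi
# → expansion verified: 20G
```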
Phase 3: Graceful expansion with brief downtime (if zero-downtime fails)
1. Stop the application:
kubectl scale deployment app --replicas=0
2. Expand PVC:
kubectl patch pvc app-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
3. Resize filesystem:
ssh node-running-old-pod
sudo mount /dev/xvdb /tmp/mount-test
sudo resize2fs /dev/xvdb
sudo umount /tmp/mount-test
4. Restart the pod:
kubectl scale deployment app --replicas=1
Phase 4: Prevent future issues
- alert: DiskUsageHigh
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
  for: 5m
  annotations:
    summary: "Volume {{ $labels.persistentvolumeclaim }} is 80% full"
Follow-up: How would you design a multi-tier storage strategy where cold data is moved to cheaper storage automatically?
You're running a Kubernetes StatefulSet for an Elasticsearch cluster (3 master, 5 data nodes). A rolling update is in progress. During the update, Pod data-2 is terminated before the data is synced to other nodes. The new data-2 pod starts fresh with an empty PV. Elasticsearch shard rebalancing triggers and tries to restore shards from data-2, but they're gone. The cluster health drops to RED. Walk through recovery.
This is a critical scenario combining stateful pod replacement, PV lifecycle, and distributed system consistency. Recovery depends on Elasticsearch's shard replica distribution.
Immediate diagnosis:
1. Check cluster health:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cluster/health | jq '.status, .unassigned_shards'
2. Identify missing shards:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/shards?v | grep UNASSIGNED
3. Check node status:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/nodes?v
Root cause: The old data-2 pod held primary shards. When it came back with an empty PV, those shards became unassigned. Elasticsearch delays reallocation while it waits for the node's data to return (index.unassigned.node_left.delayed_timeout, 1m by default, commonly raised to 5m for rolling restarts).
Recovery Phase 1: Immediate stabilization (5 minutes)
Option A: Wait out the delayed timeout (safe if replicas exist on other nodes):
sleep 300 # matches a 5m index.unassigned.node_left.delayed_timeout
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cluster/health | jq '.status'
Option B: Force empty-primary allocation (last resort — data on those shards is permanently lost, and the API requires an explicit accept_data_loss acknowledgement):
kubectl exec elasticsearch-master-0 -- curl -s -X POST localhost:9200/_cluster/reroute -H 'Content-Type: application/json' -d '{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index-name",
        "shard": 0,
        "node": "data-2",
        "accept_data_loss": true
      }
    }
  ]
}'
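Doing this shard-by-shard by hand is error-prone. A sketch that turns `_cat/shards` output into one reroute command body per unassigned primary (node name `data-2` carried over from above; sample output inlined instead of a live cluster):

```shell
# _cat/shards columns: index shard prirep state ...  ("p" = primary).
# Only primaries with no surviving copy need allocate_empty_primary;
# unassigned replicas will recover from their primaries.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" {
    printf "{\"allocate_empty_primary\":{\"index\":\"%s\",\"shard\":%s,\"node\":\"data-2\",\"accept_data_loss\":true}}\n", $1, $2
  }'
}

# Sample standing in for:
#   kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_cat/shards
cat <<'EOF' | unassigned_primaries
logs-2024.01 0 p UNASSIGNED
logs-2024.01 1 r UNASSIGNED
logs-2024.02 2 p STARTED 10.0.0.1 data-1
EOF
```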
Recovery Phase 2: Restore from snapshot (production-safe)
1. Verify snapshot repository:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_snapshot/_all
2. List available snapshots:
kubectl exec elasticsearch-master-0 -- curl -s localhost:9200/_snapshot/backup/_all
3. Restore specific indices:
kubectl exec elasticsearch-master-0 -- curl -s -X POST localhost:9200/_snapshot/backup/snapshot-id/_restore -H 'Content-Type: application/json' -d '{
"indices": "index-name-*",
"ignore_unavailable": true
}'
Prevention: Proper StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-data
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  template:
    spec:
      terminationGracePeriodSeconds: 600
      containers:
      - name: elasticsearch
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - |
                curl -s -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
                  "transient": {
                    "cluster.routing.allocation.enable": "none"
                  }
                }'
                sleep 30
Follow-up: Design a rolling update strategy for Elasticsearch that guarantees zero shard loss. How would you coordinate pod updates with shard rebalancing?
Your Redis Sentinel StatefulSet (1 master, 2 replicas) uses local node storage (hostPath PV, not networked). Master redis-0 has a catastrophic failure. New master is elected (redis-1). But redis-0's PV is gone—it's local node storage and the node is being replaced. Redis Sentinel doesn't know how to handle PV loss. Your cluster can't recover data. Design a better architecture.
hostPath volumes are dangerous for StatefulSets: data is tied to a single node and is lost when the node dies. For Redis Sentinel or any stateful workload, you need truly persistent, network-accessible storage.
Current architecture analysis:
1. Identify the problem:
kubectl describe statefulset redis | grep -A 5 volumeMounts
kubectl get pv | grep redis
2. Check data loss:
kubectl exec redis-1 -- redis-cli info replication
Migration to networked storage:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: redis-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
Create new StatefulSet with networked storage:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-new
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7.0
        ports:
        - containerPort: 6379
          name: redis
        volumeMounts:
        - name: redis-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: redis-storage
      resources:
        requests:
          storage: 50Gi
Data migration from old to new:
1. Make the new pod replicate from the old one (note the direction: the new node is the replica; REPLICAOF supersedes the deprecated SLAVEOF), wait for the sync, then promote it:
kubectl exec redis-new-0 -- redis-cli REPLICAOF redis-old-0 6379
sleep 30
kubectl exec redis-new-0 -- redis-cli REPLICAOF NO ONE
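A fixed sleep is a guess; it is safer to poll the new replica until the master link is up and the initial bulk sync has finished. A sketch that parses `redis-cli info replication` (sample output inlined; live input would be `kubectl exec redis-new-0 -- redis-cli info replication`):

```shell
# INFO replication lines end in CRLF, so avoid anchoring patterns at end-of-line.
sync_done() {
  awk -F: '
    /^master_link_status/    { up = ($2 ~ /^up/) }
    /^master_sync_in_progress/ { syncing = ($2+0 != 0) }
    END { print ((up && !syncing) ? "synced" : "still syncing") }'
}

cat <<'EOF' | sync_done
# Replication
role:slave
master_host:redis-old-0
master_link_status:up
master_sync_in_progress:0
EOF
# → synced
```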
Key difference:
OLD: hostPath → single node → failure = data loss
NEW: EBS networked storage → data survives node failure → Sentinel promotes replica automatically
Follow-up: Your cluster spans multiple availability zones. Networked storage (EBS) is single-AZ. How do you ensure data survives an entire AZ failure?
You have a RabbitMQ StatefulSet with 3 replicas and shared disk storage. During pod scheduling, rabbitmq-2 is assigned to node-3 before the PV is available (storage provisioning takes 2 minutes). The pod enters ImagePullBackOff then CrashLoopBackOff. Once storage arrives, the pod still can't mount it. The rabbitmq-2 ordinal is stuck. How do you recover and prevent this?
This is a scheduling race condition: StatefulSets assume storage is available, but asynchronous volume provisioning can delay binding. The pod is scheduled but can't mount its volume, leading to unrecoverable state.
Diagnosis:
1. Check pod status:
kubectl describe pod rabbitmq-2
2. Check PVC status:
kubectl get pvc rabbitmq-data-rabbitmq-2
kubectl describe pvc rabbitmq-data-rabbitmq-2
3. Check PV status:
kubectl get pv | grep rabbitmq
Root cause: PVC bound to PV but PV not yet mounted on the node. The pod is crashing because /var/lib/rabbitmq is empty.
Immediate recovery:
Option A: Force pod rescheduling:
kubectl delete pod rabbitmq-2
Option B: Manually mount the PV on the node:
ssh node-3
sudo mount /dev/xvdb /mnt/rabbitmq-storage
kubectl delete pod rabbitmq-2
Prevention: Use WaitForFirstConsumer volumeBindingMode
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rabbitmq-storage
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # KEY: bind at pod scheduling time
Better StatefulSet configuration:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  podManagementPolicy: OrderedReady # Ensures sequential startup
  template:
    spec:
      initContainers:
      - name: verify-storage
        image: busybox:latest
        command:
        - sh
        - -c
        - |
          # The mountPath directory always exists, so test for an actual
          # mount rather than for the directory:
          until grep -qs ' /var/lib/rabbitmq ' /proc/mounts; do
            echo "Waiting for volume mount..."
            sleep 2
          done
          df -h /var/lib/rabbitmq
        volumeMounts:
        - name: rabbitmq-data
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: rabbitmq-storage
      resources:
        requests:
          storage: 50Gi
Monitoring and alerting:
- alert: StatefulSetPodPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 5m
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in Pending for 5 minutes"
Follow-up: How do you design a StatefulSet that's resilient to storage provisioning delays? What's the maximum tolerable startup time and how do you monitor it?
Your Kafka broker StatefulSet (kafka-0, kafka-1, kafka-2) each has a 500GB PV. Due to a misconfiguration, all three PVs are accidentally deleted from your storage backend (EBS in AWS). Kubernetes still has PVC/PV objects, but the underlying data is gone. You have no backup. Your Kafka cluster is completely down. How do you recover and what can you save?
This is a catastrophic scenario: the underlying storage is gone, PVs are orphaned, and Kubernetes objects point to non-existent data. Recovery depends on whether you have consumer group offsets stored elsewhere.
Diagnosis and assessment of damage:
1. Verify PVs are orphaned:
kubectl get pv | grep kafka
kubectl describe pv kafka-pv-0 | grep volumeHandle
2. Confirm volumes deleted:
aws ec2 describe-volumes --volume-ids vol-xxxxx
# Error: InvalidVolume.NotFound = data is gone
3. Check Kafka cluster state:
kubectl port-forward kafka-0 9092:9092 &
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
Recovery strategy (depends on acceptable data loss):
Scenario A: Recovery from EBS snapshots (point-in-time; anything written since the last snapshot is lost)
1. Verify backups exist:
aws s3 ls s3://kafka-backups/
2. Create new EBS volumes from the latest snapshots:
SNAPSHOT_IDS=("snap-xxxxx" "snap-yyyyy" "snap-zzzzz")
for i in 0 1 2; do
  aws ec2 create-volume --snapshot-id ${SNAPSHOT_IDS[$i]} --availability-zone us-east-1a --volume-type gp3
done
3. Update the PV objects to point at the new volumes:
kubectl patch pv kafka-pv-0 -p '{"spec":{"awsElasticBlockStore":{"volumeID":"vol-newid"}}}'
4. Restart Kafka:
kubectl delete pod kafka-0 kafka-1 kafka-2
Scenario B: Rebuild the cluster without data
1. Delete the orphaned resources:
kubectl delete statefulset kafka
kubectl delete pvc -l app=kafka
2. Create a new cluster with fresh PVs:
kubectl apply -f kafka-statefulset-new.yaml
3. Recreate topics from an external definition list:
cat topics.txt | while read topic; do
  kafka-topics.sh --bootstrap-server kafka-0:9092 --create --topic $topic --partitions 3 --replication-factor 3
done
Long-term prevention:
1. Automated backup strategy:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: amazon/aws-cli
            command:
            - /bin/bash
            - -c
            - |
              for vol in $(aws ec2 describe-volumes --filters Name=tag:app,Values=kafka --query 'Volumes[].VolumeId' --output text); do
                aws ec2 create-snapshot --volume-id $vol --description "Kafka backup $(date +%Y-%m-%d)"
              done
2. Use persistent, multi-zone storage
3. Enable audit logging to detect accidental deletions
4. Implement GitOps for topic definitions
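A backup job that silently stopped running is as bad as no backup, so also alert when the newest snapshot is too old. A sketch that compares ISO-8601 StartTime values lexically (sample timestamps inlined; live input would be something like `aws ec2 describe-snapshots --filters Name=tag:app,Values=kafka --query 'Snapshots[].StartTime' --output text | tr '\t' '\n'`):

```shell
# Exit 0 if any snapshot timestamp on stdin is newer than the cutoff.
# ISO-8601 timestamps sort lexically, so plain string comparison works.
backup_fresh() {
  awk -v cutoff="$1" '$0 > cutoff { fresh = 1 } END { exit fresh ? 0 : 1 }'
}

if printf '%s\n' '2024-01-01T02:00:14' '2024-01-03T02:00:09' | backup_fresh '2024-01-02T00:00:00'; then
  echo "backup is fresh"
else
  echo "ALERT: no snapshot since cutoff"
fi
# → backup is fresh
```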
Follow-up: Design a disaster recovery plan for Kafka that survives complete cluster destruction. What's your backup strategy and RTO/RPO targets?
Your ZooKeeper StatefulSet is using a local storage (hostPath) for state. A zk-0 pod crashes and is rescheduled to a different node. The new zk-0 has an empty local PV and restarts as a fresh node (no prior state). The quorum is broken because zk-0 doesn't know about previous commits. The cluster becomes unavailable. How do you design persistent state for ZooKeeper?
ZooKeeper requires persistent state across pod restarts. Using hostPath (local storage) creates exactly this problem: data is tied to a node and lost on reschedule.
Problem with hostPath:
- Pod crash on node-1 → rescheduled to node-2
- New pod has empty local storage (node-1's data is inaccessible)
- zk-0 joins quorum as new member, loses all state
- Cluster data inconsistency or quorum loss
Solution: Network storage (EBS, NFS, cloud-native)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zookeeper-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  serviceName: zookeeper
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.8
        ports:
        - containerPort: 2181
        - containerPort: 2888
        - containerPort: 3888
        env:
        - name: ZK_SERVER_HEAP
          value: "1024"
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper/data
        - name: datalog
          mountPath: /var/lib/zookeeper/datalog
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: zookeeper-storage
      resources:
        requests:
          storage: 10Gi
  - metadata:
      name: datalog
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: zookeeper-storage
      resources:
        requests:
          storage: 5Gi
When pod rescheduled: PVs follow pod to new node, ZooKeeper data is preserved.
Prevention and monitoring:
- Use network storage for all StatefulSet workloads
- Monitor ZooKeeper quorum health
- Alert on quorum loss
- Test pod rescheduling regularly
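The quorum check can be scripted against ZooKeeper's `srvr` four-letter command (e.g. `kubectl exec zookeeper-0 -- sh -c 'echo srvr | nc localhost 2181'`; on 3.5+, srvr must be allowed via 4lw.commands.whitelist), which reports each member's Mode. A sketch with inlined sample output, flagging when fewer than a majority report leader/follower:

```shell
# $1 = ensemble size; stdin = concatenated "Mode: ..." lines, one per member.
quorum_ok() {
  awk -v total="$1" '
    /^Mode: (leader|follower)/ { healthy++ }
    END { exit (healthy > total / 2) ? 0 : 1 }'
}

# Two followers + one leader out of a 3-node ensemble: quorum holds.
if printf 'Mode: leader\nMode: follower\nMode: follower\n' | quorum_ok 3; then
  echo "quorum healthy"
else
  echo "quorum at risk"
fi
# → quorum healthy
```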
Follow-up: How do you handle multi-region StatefulSet state when storage is single-region?