Kubernetes Interview Questions

etcd Backup and Disaster Recovery


You're on-call at 2 AM. Your monitoring alerts: etcd cluster lost quorum. Two of three nodes are down, and the third node shows stale reads. kubectl get nodes hangs. Walk through your incident response in the first 15 minutes.

First, check the immediate state without relying on API server:

etcdctl --endpoints=https://etcd-node-3:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  member list

This confirms which members are alive. If fewer than two of three members respond, quorum is lost: the cluster can no longer commit writes, and linearizable reads fail.
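The quorum arithmetic behind that check can be sketched as a tiny helper (a sketch; the function names are mine, not etcd's):

```shell
# Quorum for an N-member etcd cluster is floor(N/2) + 1 voting members.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# Report whether a cluster of $1 members with $2 responsive members still has quorum.
has_quorum() {
  if [ "$2" -ge "$(quorum_needed "$1")" ]; then echo yes; else echo no; fi
}

has_quorum 3 1   # prints "no": one survivor of three cannot form a majority
has_quorum 3 2   # prints "yes": two of three is a majority
```

This is also why three- and four-member clusters tolerate the same single failure: quorum for four is three.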

Check the etcd data directory size and WAL logs:

du -sh /var/lib/etcd/member/

Look for corruption or excessive logging:

etcdctl --endpoints=https://etcd-node-3:2379 ... endpoint health

DO NOT restart etcd yet. Instead:

1. Snapshot the healthy node: etcdctl snapshot save /backup/etcd-$(date +%s).db

2. Check the snapshot: etcdctl snapshot status /backup/etcd-*.db

3. Check whether the other two nodes are coming back up (systemctl status etcd or the container logs)

4. If both are completely dead, force-restore from snapshot on the healthy node and restart the cluster

Production path: Restore from your automated nightly etcd backup (stored in S3 or external repo), then remove the failed members and add new ones.

Follow-up: How do you rebuild the cluster after the two dead nodes are restarted? Walk me through member add/remove.

Your automated etcd backup script failed for 3 days undetected. You have a point-in-time snapshot from 72 hours ago. Your cluster is now corrupted and losing data. How do you restore without losing the last 12 hours of good workload state?

This is a partial-loss scenario. You cannot recover what was never backed up, but you can minimize further damage:

1. Stop all etcd nodes immediately to prevent further divergence

2. Verify the snapshot is healthy:

etcdctl snapshot status /backup/etcd-72h-ago.db

3. DO NOT restore yet. Capture the last known cluster state from outside etcd: query your monitoring system (e.g., Prometheus with kube-state-metrics) for the most recent inventory of Deployments, Services, and PVCs

4. Restore one member at a time from the snapshot:

rm -rf /var/lib/etcd/member/
etcdctl snapshot restore /backup/etcd-72h-ago.db \
  --data-dir=/var/lib/etcd \
  --name=etcd-1 \
  --initial-cluster=etcd-1=https://10.0.1.10:2380,etcd-2=https://10.0.1.11:2380,etcd-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

5. Start etcd and verify it's healthy before starting the API server

6. Reconcile missing objects: redeploy any Deployments, Services, or ConfigMaps that were created in the last 72h by reapplying them from your IaC (Terraform, Helm, GitOps repo)

Real teams mitigate this with: hourly snapshots to S3, automated snapshot verification, and Kubernetes Event API audit logs stored separately.
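To catch a silently failing backup job like this one, a minimal staleness check can run from cron alongside the snapshot job (illustrative sketch; the path and threshold are assumptions, and `stat -c` is GNU coreutils):

```shell
# Print the age in seconds of the newest *.db snapshot in a directory,
# or "missing" if there are none.
newest_age() {
  newest=$(ls -t "$1"/*.db 2>/dev/null | head -n1)
  if [ -z "$newest" ]; then echo missing; return; fi
  echo $(( $(date +%s) - $(stat -c %Y "$newest") ))
}

# Alert when the newest backup is missing or older than a threshold.
check_backups() {            # args: backup_dir max_age_seconds
  age=$(newest_age "$1")
  if [ "$age" = "missing" ] || [ "$age" -gt "$2" ]; then
    echo "ALERT: backups stale or missing in $1"
  else
    echo "OK: newest backup is ${age}s old"
  fi
}

check_backups /backup 7200   # hypothetical path; alert if no snapshot in 2 hours
```

Wiring the ALERT branch to a pager (rather than a log file nobody reads) is what turns a 3-day blind spot into a 2-hour one.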

Follow-up: How would you structure your etcd backup automation to catch backup failures automatically? What metrics would you monitor?

Your etcd cluster is running out of space. The server is compacting but still consuming 14GB. df shows 800MB free. You're 60 minutes from catastrophic failure. What's your immediate action?

etcd's backend fragments over time: compaction reclaims logical space, but disk space is only returned to the OS after a defragmentation. You're in a space crisis:

1. First, check database size versus in-use size (the gap is fragmentation):

etcdctl --endpoints=https://localhost:2379 ... \
  endpoint status --write-out=table

The Prometheus metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes give the same comparison.

2. Manually trigger defragmentation on all three nodes (one at a time):

etcdctl --endpoints=https://etcd-1:2379 ... defrag
etcdctl --endpoints=https://etcd-2:2379 ... defrag
etcdctl --endpoints=https://etcd-3:2379 ... defrag

This can take 30-90 seconds per node and briefly locks the database.

3. Immediately scale up the etcd volume in your cloud provider (AWS/GCP/Azure) and remount

4. Check if revision history is bloated:

etcdctl --endpoints=https://localhost:2379 ... endpoint status -w json

The current revision is in the response header. If the cluster has churned through tens of millions of revisions without compaction, enable auto-compaction in the etcd config:

--auto-compaction-mode=periodic --auto-compaction-retention=24h

(Use --auto-compaction-mode=revision with a retention count instead if you prefer revision-based compaction.)

5. Force a compaction up to the current revision (from the endpoint status above), then defragment again to reclaim the space:

etcdctl --endpoints=https://localhost:2379 ... compact <current-revision>

Permanent fix: Monitor etcd disk usage with Prometheus (etcd_disk_backend_commit_duration_seconds_bucket, etcd_db_total_size_in_bytes) and alert at 70% capacity. Scale volumes before you hit crisis.
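The 70% alerting rule reduces to simple arithmetic on the two size numbers (a sketch; the threshold is a policy choice, not an etcd default):

```shell
# Percent of disk/quota used, integer math.
pct_used() {           # args: used_bytes total_bytes
  echo $(( $1 * 100 / $2 ))
}

# "alert" once usage crosses the configured percentage, "ok" otherwise.
capacity_state() {     # args: used_bytes total_bytes alert_pct
  if [ "$(pct_used "$1" "$2")" -ge "$3" ]; then echo alert; else echo ok; fi
}

capacity_state 14000000000 16000000000 70   # prints "alert": 87% used
capacity_state 5000000000 16000000000 70    # prints "ok": 31% used
```

In practice you would express this as a Prometheus alerting rule over etcd_mvcc_db_total_size_in_bytes rather than a shell script, but the math is the same.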

Follow-up: What's the difference between revision-based and duration-based auto-compaction? When would you use each?

You're restoring a snapshot to a new etcd cluster for disaster recovery testing. The restore succeeds, but after 5 minutes, the new cluster splits brain and each member thinks it's the leader. How do you debug this?

Split-brain after a snapshot restore usually means each member was restored as its own single-member cluster, or the cluster discovery configuration doesn't match the real peers:

1. Check each member's configuration against the restore command:

etcdctl --endpoints=https://etcd-1:2379 ... member list -w json | jq '.members[] | {name, clientURLs, peerURLs}'

2. Verify the initial-cluster and initial-advertise-peer-urls match the network topology of your restore environment

3. Check the member IDs. After snapshot restore, if you're restoring to different IPs or hostnames, the peer discovery breaks:

# This is wrong if the IPs don't match the restore target:
etcdctl snapshot restore ... \
  --initial-cluster=etcd-1=https://OLD_IP:2380,...

4. Rebuild cleanly: Remove the corrupted member directory and re-snapshot restore with correct IPs:

rm -rf /var/lib/etcd/member
etcdctl snapshot restore /backup/etcd.db \
  --data-dir=/var/lib/etcd \
  --initial-cluster=etcd-1=https://NEW_IP_1:2380,etcd-2=https://NEW_IP_2:2380,... \
  --initial-advertise-peer-urls=https://NEW_IP_1:2380

5. Check logs for peer discovery failures:

journalctl -u etcd -n 100 | grep -E "failed.*connect|dial|member"

6. Verify network: ping and nc -zv each peer URL to confirm connectivity before starting etcd
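When restoring into a different network, the error-prone part is rewriting the --initial-cluster string by hand; a small helper makes each substitution explicit and reviewable (illustrative; the IPs are made up):

```shell
# Replace one host in an --initial-cluster string with its new address.
rewrite_peer() {       # args: cluster_string old_host new_host
  echo "$1" | sed "s|$2|$3|g"
}

cluster="etcd-1=https://10.0.1.10:2380,etcd-2=https://10.0.1.11:2380"
rewrite_peer "$cluster" 10.0.1.10 10.0.2.50
# prints etcd-1=https://10.0.2.50:2380,etcd-2=https://10.0.1.11:2380
```

Generating the string this way (one substitution per moved host) makes it easy to diff the old and new topologies before running the restore.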

Follow-up: If you're restoring to a completely different network (e.g., different VPC), how would you handle cluster membership changes?

You have a backup strategy: take etcd snapshot every hour, store in S3. But during a major incident, you realize the backups are corrupted—they restore but have silent data loss (objects exist in kubectl but not in etcd). How do you detect this before it's a disaster?

Silent corruption is insidious. You need continuous verification:

1. Automated backup validation (run hourly after snapshot):

# Verify snapshot structure
etcdctl snapshot status /tmp/latest-backup.db

Note the snapshot's hash, revision, and total key count from that output.

2. Compare the snapshot against production:

# Current revision and key count on the live cluster
etcdctl --endpoints=https://etcd-1:2379 ... endpoint status -w table

If the snapshot's revision lags far behind the live cluster, your backup pipeline is stale. For member-to-member integrity checks, compare etcdctl endpoint hashkv across endpoints (the snapshot hash is a checksum of the db file and is not directly comparable to hashkv).

3. Implement point-in-time test restores in a non-prod cluster weekly:

# Restore last week's backup to a DR cluster, then:
kubectl get ns            # verify all namespaces are present
kubectl get deploy -A     # verify key workloads
# Check that specific resource counts haven't changed unexpectedly

4. Audit log verification: Store Kubernetes API server audit logs separately (not in etcd). Compare audit events with the objects that actually exist:

# Count Create events vs existing objects
kubectl get deploy -A -o json | jq '.items | length'
# Should roughly match audit-log Create events for the Deployment kind

5. Implement periodic data integrity checks with etcdctl endpoint hashkv across members, or build a custom verification job that calculates object checksums

Best practice: Weekly drill restores in staging with full validation, monthly full-cluster restore DR test.
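Backup staleness ultimately boils down to revision lag behind the live cluster; expressed as a check (a sketch; the acceptable lag is workload-dependent):

```shell
# Compare a snapshot's revision against the live cluster's revision.
backup_freshness() {   # args: snapshot_rev live_rev max_lag
  lag=$(( $2 - $1 ))
  if [ "$lag" -gt "$3" ]; then echo "stale (lag=$lag)"; else echo "fresh (lag=$lag)"; fi
}

backup_freshness 990000 1000000 50000   # prints "fresh (lag=10000)"
backup_freshness 200000 1000000 50000   # prints "stale (lag=800000)"
```

Feeding the two revisions from `etcdctl snapshot status` and `etcdctl endpoint status` into a check like this, on every backup run, is what catches a 3-day-dead pipeline on day one.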

Follow-up: Design a monitoring dashboard that alerts on backup staleness, corruption risk, and restoration failure. What metrics and thresholds would you track?

You're upgrading etcd from 3.4 to 3.5 and need zero downtime. Your cluster is production with ~500K keys. Walk through your upgrade path, including backup and rollback strategy.

etcd upgrades are high-risk. Do not rush:

1. Pre-upgrade snapshot (to current version for rollback):

etcdctl --endpoints=https://localhost:2379 ... snapshot save /backup/pre-upgrade-3.4.db

2. Rolling upgrade, one member at a time (keep 2/3 running):

# Node 1: stop etcd and back up the member dir
systemctl stop etcd
cp -r /var/lib/etcd/member /backup/etcd-member-node1-backup

# Download the etcd 3.5 binary
cd /tmp && wget https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
tar xzf etcd-v3.5.0-linux-amd64.tar.gz
sudo cp etcd-v3.5.0-linux-amd64/etcd /usr/local/bin/

# Start the new version (the cluster version is renegotiated once all members upgrade)
systemctl start etcd

# Verify quorum is healthy (still 2 members voting, 1 upgrading)
etcdctl --endpoints=https://etcd-2:2379,https://etcd-3:2379 ... member list

3. Repeat for nodes 2 and 3, always keeping quorum

4. Verify cluster health after full upgrade:

etcdctl endpoint health
etcdctl member list
etcdctl check perf

5. If rollback is needed: stop all nodes, restore the old member dirs from backup, downgrade the binary, and restart. This discards every write made since the upgrade, so do it only within a short, pre-agreed window (e.g. 24 hours).

Risk mitigation: Do a test upgrade on a staging cluster first. Have rollback runbook tested and timed.
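A guard worth encoding in the runbook: etcd supports rolling upgrades only between adjacent minor versions (e.g. 3.4 to 3.5), which is easy to check before anyone touches a node (a sketch):

```shell
# Extract the minor version from an "X.Y" version string.
minor() { echo "$1" | cut -d. -f2; }

# Rolling upgrades are supported only one minor version at a time.
upgrade_path_ok() {    # args: from_version to_version
  if [ $(( $(minor "$2") - $(minor "$1") )) -eq 1 ]; then
    echo ok
  else
    echo "unsupported: go one minor at a time"
  fi
}

upgrade_path_ok 3.4 3.5   # prints "ok"
upgrade_path_ok 3.4 3.6   # prints "unsupported: go one minor at a time"
```

For a 3.4 to 3.6 jump, this forces the safe path: upgrade to 3.5 cluster-wide, verify, then repeat for 3.6.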

Follow-up: How would you handle a major etcd version jump (3.4 → 3.6) where data format changes? Would your rolling upgrade still work?

You discover that your etcd cluster is leaking memory. It's consuming 18GB on a node with 32GB total. The APIServer is throttling. Restarting etcd fixes it for 2 weeks, then it leaks again. How do you find the root cause?

Memory leaks in etcd are usually watchlist explosion or unbounded client connections:

1. Check active watches and client connections:

etcdctl --endpoints=https://localhost:2379 ... alarm list
etcdctl watch --prefix / --rev=1   # careful: replaying from revision 1 can itself spike memory

Check the metric: etcd_debugging_mvcc_watcher_total

2. Look at etcd metrics via Prometheus endpoint (:2379/metrics):

curl https://localhost:2379/metrics 2>/dev/null | grep -E "etcd_debugging_mvcc|etcd_memory|process_resident_memory"

Key metrics to check:

etcd_debugging_mvcc_keys_total
etcd_mvcc_db_total_size_in_bytes
etcd_debugging_store_expires_total
process_resident_memory_bytes

3. If memory grows unbounded despite low key count, check for watch leaks. Find clients opening watches without closing:

netstat -anp | grep :2379 | grep ESTABLISHED | wc -l

Compare to expected number of APIServer replicas + monitoring agents

4. Enable etcd debug logging to see watch operations:

# In etcd config:
--log-level=debug --log-outputs=stderr

Grep for Watch operations and correlate with memory spike

5. Common culprits: Kubernetes-dashboard, monitoring scrapers, custom controllers with buggy watch implementations. Patch or isolate these clients.

6. For immediate relief: etcd has no per-client watcher cap, but on etcd 3.5+ you can bound concurrent gRPC streams per connection, which limits how many watches a single client can hold open:

--max-concurrent-streams=1000
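The netstat comparison earlier in this answer can be automated as a simple anomaly check (illustrative; the expected-client count and slack factor are assumptions about your environment):

```shell
# Flag a suspicious number of etcd client connections.
conn_state() {         # args: observed expected slack_multiplier
  if [ "$1" -gt $(( $2 * $3 )) ]; then echo investigate; else echo normal; fi
}

# e.g. 3 API servers + 2 monitoring agents = 5 expected clients, allow 3x slack
conn_state 200 5 3   # prints "investigate": far more connections than expected
conn_state 12 5 3    # prints "normal"
```

Feeding it the live count, e.g. `conn_state "$(netstat -anp | grep :2379 | grep -c ESTABLISHED)" 5 3`, turns the manual check into something a cron job or alert rule can run.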

Follow-up: How would you implement alerting for memory usage patterns that indicate a leak vs. normal seasonal growth?

You're running etcd in a Kubernetes StatefulSet with PVCs. A node fails, and the StatefulSet tries to reschedule the etcd pod, but the PVC is stuck in NodeAffinity limbo — it's bound to the dead node. Your cluster loses quorum while waiting for recovery. How do you recover manually?

This is a real scenario with etcd-in-Kubernetes clusters:

1. First, identify the stuck pod and PVC:

kubectl get pods -n etcd
kubectl describe pvc -n etcd etcd-data-etcd-2
# Check VolumeAttachments for stale references

2. Force the PVC to detach from the dead node:

# VolumeAttachments are cluster-scoped:
kubectl delete volumeattachment csi-xxx-node-down
# Or delete the node object if it's truly dead:
kubectl delete node node-xxx

3. Clear the PVC's node binding if needed (dangerous; only if you're sure the node won't recover). For WaitForFirstConsumer volumes, remove the scheduler's selected-node annotation:

kubectl annotate pvc etcd-data-etcd-2 -n etcd volume.kubernetes.io/selected-node-

(For local PVs, the affinity lives on the PV's spec.nodeAffinity and the data cannot follow the pod to another node.)

4. Now the StatefulSet should reschedule the pod to a healthy node

5. BUT: The new pod will likely have an empty PVC (or old data). If quorum was lost, force the new etcd pod to join as a new member:

kubectl exec etcd-0 -n etcd -- etcdctl member add etcd-2 --peer-urls=https://etcd-2:2380
kubectl patch sts etcd -n etcd -p '{"spec":{"template":{"spec":{"containers":[{"name":"etcd","env":[{"name":"ETCD_INITIAL_CLUSTER_STATE","value":"existing"}]}]}}}}'

6. For production: Use etcd-operator or ETCD-on-K8s tools that handle this automatically. They manage member lifecycle and recovery.

Follow-up: Compare running etcd as a StatefulSet vs. external etcd cluster. What are the trade-offs in disaster recovery complexity?
