Kafka Interview Questions

Broker Failure and Leader Election


One of your 5 Kafka brokers dies (broker 2). It hosted the leader for 3 partitions. Walk through exactly what happens in the next 30 seconds and what your producers/consumers see.

Timeline of broker 2's failure (timings approximate; defaults differ between ZooKeeper and KRaft mode):
T=0: Broker 2 crashes. Producers and consumers connected to it get connection errors and reconnect to other brokers.
T=0-9s: The controller detects broker 2 is dead (no heartbeat within broker.session.timeout.ms, 9s by default in KRaft mode; in ZooKeeper mode, detection happens when broker 2's ZooKeeper session expires).
T=~9s: The controller triggers leader election for the 3 affected partitions. For each, it selects the first replica in the assignment list that is still in the ISR (excluding dead broker 2).
The controller commits the state change to the __cluster_metadata log (KRaft, Kafka 3.x) or sends LeaderAndIsr and UpdateMetadata requests to all brokers (ZooKeeper mode).
Brokers update their metadata. Producers refresh metadata and resend buffered messages to the new leaders.
T=up to 30s: Replication resumes under the new leaders; each partition's ISR shrinks to exclude broker 2.
Consumers: fetches against broker 2's old partitions fail with NOT_LEADER_OR_FOLLOWER; the consumer refreshes metadata and reconnects to the new leader. Lag rises briefly during this window.
Producers: individual requests time out after request.timeout.ms (30s default) and are retried; a batch fails outright only after delivery.timeout.ms (2 minutes default).
Run kafka-topics --describe --bootstrap-server localhost:9092 to see the leadership changes. Check the controller log for election entries. Monitor JMX: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec. If it is > 0, the ISR was empty and an unclean election happened (data-loss risk).
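The election step above can be sketched in a few lines. This is a simplified model, not Kafka's actual controller code: the new leader is the first replica in the partition's assignment list that is both alive and in the ISR.

```python
# Simplified sketch (not Kafka internals) of the controller's leader choice
# after a broker failure.

def elect_leader(assigned_replicas, isr, live_brokers):
    """Return the new leader id, or None if the partition goes offline."""
    for replica in assigned_replicas:
        if replica in isr and replica in live_brokers:
            return replica
    # No live in-sync replica: partition is offline unless unclean
    # leader election is enabled (which risks data loss).
    return None

# Broker 2 dies; partition assigned to [2, 0, 4] with ISR {0, 2, 4}.
leader = elect_leader([2, 0, 4], isr={0, 2, 4}, live_brokers={0, 1, 3, 4})
print(leader)  # 0: broker 2 is dead, broker 0 is next in assignment order
```

The assignment-list order matters: the first replica in the list is the "preferred leader", which is why leadership tends to return to it after recovery.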

Follow-up: If broker 2 was the controller, does election time double?

You have ISR=[0,1,2] for a critical partition. Broker 0 dies unexpectedly. ISR shrinks immediately to [1,2]. But broker 1's disk is 90% full and is lagging. Can broker 1 become the new leader?

Yes, broker 1 can become the new leader. The controller picks the first replica in the partition's assignment list that is still in the ISR, and broker 1 qualifies: leadership is determined by ISR membership, not disk space. (If broker 1 were lagging beyond replica.lag.time.max.ms it would have been dropped from the ISR; while it remains a member, it is eligible.) This is intentional: Kafka prioritizes availability. If Kafka refused leaders with full disks, you'd lose the partition entirely, which is worse. The right solution: immediately add disk space to broker 1 before it fails. To buy time, throttle replication: kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --alter --add-config 'leader.replication.throttled.rate=10485760'. Better: proactively move partitions off broker 1 to a healthy broker with kafka-reassign-partitions. In an emergency, broker 1 is the new leader and accepts writes; producers succeed and consumers continue. The risk: if broker 1's disk fills up, writes fail. Monitor immediately: df -h on broker 1's data directory; if usage exceeds 95%, scale up the disk or move partitions now. JMX: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent below 10% means the broker is saturated; a latency spike in kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs means the disk is slow.

Follow-up: If broker 1 becomes leader but then loses its connection to the controller, can it still accept writes?

Network partition: broker 2 is isolated from the cluster (can't reach controller or other brokers). Broker 2 still thinks it's the leader for 2 partitions. Producers keep sending to broker 2. Are writes durable?

Writes to broker 2 may succeed locally but are not replicated, and whether the producer considers them durable depends on acks. With acks=all and min.insync.replicas=2, broker 2 cannot get acknowledgements from followers it can no longer reach, so produce requests fail with NotEnoughReplicasException and the producer retries against the real leader once it refreshes metadata: no acked data is lost. With acks=1, broker 2 acknowledges writes that exist only in its own log. Meanwhile the controller notices broker 2 is unreachable, elects a new leader from the remaining ISR, and bumps the leader epoch. When broker 2's network recovers, it learns of the higher epoch, steps down to follower, and truncates any log entries that diverge from the new leader: acks=1 writes made during isolation are discarded. This leader-epoch fencing is what prevents a lasting split-brain; requests carrying a stale epoch are rejected. Note that unclean.leader.election.enable=false (the default since 0.11) addresses a different risk: it prevents a replica outside the ISR from being elected leader at all. To detect network trouble: run mtr or ping between brokers; alert on growth in kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica and on kafka.controller:type=KafkaController,name=OfflinePartitionsCount > 0 (partitions with no leader). If broker 2 stays isolated: shut it down to stop acks=1 writes landing on a stale leader, then investigate the network. When the network recovers, restart broker 2; it will follow the new leader and truncate its isolated writes. Recommended: spread replicas across racks/AZs (broker.rack) so a network partition affects one replica, not a whole side of the cluster.
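The fencing described above can be modeled in miniature. This is a toy sketch, not Kafka's implementation: every leadership change bumps an epoch, and a broker still acting on a stale epoch has its requests rejected.

```python
# Toy model of leader-epoch fencing: a partitioned-away broker that still
# believes it is leader cannot get its requests accepted once the
# controller has moved leadership on and bumped the epoch.

class Partition:
    def __init__(self, leader, epoch=0):
        self.leader = leader
        self.epoch = epoch

    def elect(self, new_leader):
        """Controller elects a new leader and bumps the epoch."""
        self.leader = new_leader
        self.epoch += 1

    def accept_from_leader(self, broker, claimed_epoch):
        """A request is honored only with the current leader and epoch."""
        return broker == self.leader and claimed_epoch == self.epoch

p = Partition(leader=2, epoch=5)
p.elect(new_leader=0)                       # broker 2 is partitioned away
assert p.accept_from_leader(2, 5) is False  # stale leader is fenced out
assert p.accept_from_leader(0, 6) is True   # new leader's epoch is current
```

The same epoch comparison is what lets the returning broker discover it is stale and truncate its divergent log suffix.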

Follow-up: How does Kafka prevent the isolated broker from keeping its stale leadership even after it reconnects?

You have 5 brokers. Brokers 0 and 1 die simultaneously. Replication factor is 3. ISR was [0,1,2]. Can Kafka still elect a leader?

ISR was [0,1,2], and now only broker 2 of those is alive. Yes, Kafka can still elect a leader: data partitions do not require a majority quorum (that applies to the KRaft controller quorum, and to ZooKeeper itself). Any replica still in the ISR is eligible, so the controller elects broker 2 and the ISR shrinks to [2]. The partition stays online for reads. Writes depend on min.insync.replicas: with min.insync.replicas=2 and acks=all, producers get NotEnoughReplicasException until another replica catches up and rejoins the ISR, so writes block but nothing acked is lost. With min.insync.replicas=1, broker 2 accepts writes with zero redundancy, which is risky. The partition only goes offline if no live broker remains in the ISR (for example, if the ISR had already shrunk to [0,1] before the crash). In that case the options are: restart broker 0 or 1 and wait, or set unclean.leader.election.enable=true to let an out-of-sync replica take over (dangerous: data-loss risk). In the real world two simultaneous broker deaths usually mean a cascading failure (a bug) or a widespread outage (e.g., an AZ failure); spread replicas across racks/AZs with broker.rack to avoid correlated failures. Monitor: alert on kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount > 0 and on OfflinePartitionsCount > 0; if the latter fires after a broker failure, escalate to on-call immediately.
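The interaction between live ISR members and min.insync.replicas can be sketched as a small decision function. This is a simplification (state names are invented for illustration), but it captures the three outcomes discussed above.

```python
# Sketch of partition availability after broker failures: a partition is
# offline only when no live replica remains in the ISR; under min.insync
# it stays readable but rejects acks=all produces.

def partition_state(isr, live_brokers, min_insync, acks_all=True):
    live_isr = [b for b in isr if b in live_brokers]
    if not live_isr:
        return "offline"            # no eligible leader at all
    if acks_all and len(live_isr) < min_insync:
        return "online-reads-only"  # produces fail: NotEnoughReplicas
    return "online"

# Brokers 0 and 1 dead, broker 2 still in the ISR:
print(partition_state(isr=[0, 1, 2], live_brokers={2, 3, 4}, min_insync=2))
# ISR had already shrunk to [0, 1] before the crash:
print(partition_state(isr=[0, 1], live_brokers={2, 3, 4}, min_insync=2))
```

The first call reports the reads-only case (writes blocked, nothing lost); the second reports the genuinely offline case that requires restoring a dead broker or an unclean election.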

Follow-up: If you increase replication factor to 4 for one partition, is the ISR automatically [0,1,2,3] or does it depend on replica assignment?

Broker 0 is the leader for a partition. It crashes. Broker 1 becomes new leader. Broker 0 recovers. Will it automatically rejoin as follower or do you need to reassign?

Broker 0 automatically rejoins as a follower; no reassignment is needed. When broker 0 restarts, it recovers its local log, fetches current cluster metadata (learning that broker 1 is now the leader), truncates any log entries that diverge from broker 1's log, and starts replica fetching. Once it has fully caught up to the leader's log end offset within replica.lag.time.max.ms, the leader adds it back to the ISR automatically. This is the beauty of Kafka's ISR model. Illustrative timeline: T=0: broker 0 crashes. T=10s: broker 1 becomes leader (ISR=[1,2]). T=30s: broker 0 restarts. T=30-45s: broker 0 refreshes metadata, recovers its log, and fetches from the leader. T=45-60s: broker 0 catches up and rejoins; ISR is [0,1,2] again. Note that broker 0 does not automatically take leadership back unless preferred leader election runs (auto.leader.rebalance.enable=true by default, or trigger it manually with kafka-leader-election --election-type preferred). To verify: run kafka-topics --describe and watch the ISR column grow back after the restart. If the ISR doesn't recover within a few minutes, something is wrong: check broker 0's logs for errors, verify the network, check the disk. Performance note: during catch-up, broker 0 consumes significant disk I/O and network bandwidth; for large partitions catch-up takes longer and can slow the cluster. Monitor replica lag via JMX during the restart; if lag is large and stuck, investigate broker 0's health.
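The recovery sequence (truncate divergent entries, then fetch until caught up) can be sketched as a toy loop. This is not Kafka's replica fetcher, just an illustration of the two phases, with logs modeled as Python lists.

```python
# Toy catch-up loop: a restarted follower first truncates the suffix of its
# log that diverges from the leader (entries the old leader never
# replicated), then fetches batches until its log end offset matches.

def catch_up(follower_log, leader_log, batch_size=2):
    """Mutate follower_log until it equals leader_log; return fetch rounds."""
    # Phase 1: truncate the divergent suffix.
    common = 0
    while (common < len(follower_log) and common < len(leader_log)
           and follower_log[common] == leader_log[common]):
        common += 1
    del follower_log[common:]
    # Phase 2: fetch from the leader until caught up.
    rounds = 0
    while len(follower_log) < len(leader_log):
        nxt = leader_log[len(follower_log):len(follower_log) + batch_size]
        follower_log.extend(nxt)
        rounds += 1
    return rounds

follower = ["a", "b", "x"]           # "x" was never replicated pre-crash
leader = ["a", "b", "c", "d", "e"]
rounds = catch_up(follower, leader)
assert follower == leader            # back in sync: eligible to rejoin ISR
```

Once the two logs match (and stay matched within replica.lag.time.max.ms), the leader re-adds the follower to the ISR.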

Follow-up: If broker 0 takes 2 hours to catch up after restart, what's the impact on write durability?

You're running a Kafka cluster in Kubernetes. Broker pod crashes. Kubernetes restarts it. Broker gets a new IP. It has the same broker.id and hostname (DNS). Does it rejoin the cluster correctly?

Yes, as long as broker.id and the advertised hostname are preserved. Kafka identifies brokers by broker.id (node.id in KRaft), not by IP. On restart, the broker reads its id from meta.properties in the data directory (persisted on the PVC) and registers with the controller under its DNS hostname, which now resolves to the new IP. The controller recognizes the same broker id and restores its leadership/ISR state accordingly. In Kubernetes: run Kafka as a StatefulSet with persistent volumes for the data directory (e.g., /var/lib/kafka/data). The StatefulSet gives each pod a stable ordinal (from which broker.id can be derived) and a stable DNS name (pod-name.headless-service), so both survive restarts and the broker recovers naturally. Two things can break this: if the advertised listener uses the pod IP instead of the stable DNS name, clients and peers lose the broker after every restart; and if the data volume is ephemeral, the restarted pod comes up with an empty log and, if meta.properties is gone, a fresh broker identity. Fix: 1) advertise the headless-service DNS name, 2) persist the data directory on a PVC, not ephemeral storage, 3) set broker.rack for cross-AZ replica placement. To verify the broker rejoined: run kafka-broker-api-versions --bootstrap-server localhost:9092 and check that the broker id appears, or kafka-topics --describe and confirm it is back in the ISR. If the cluster sees a new broker id, something went wrong (likely a new PVC). To recover: restore a PVC snapshot if available, or delete the pod and PVC and let the StatefulSet recreate them, forcing a full re-replication from the other brokers.
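A minimal broker config sketch for such a setup might look like the fragment below. The service name kafka-headless, namespace my-ns, and rack value are placeholders, and deriving broker.id from the pod ordinal is usually done by an init script or operator rather than a static file.

```properties
# meta.properties on the PVC must agree with this id across restarts.
broker.id=2

# Bind on all interfaces, but advertise the stable StatefulSet DNS name,
# never the pod IP (which changes on every restart).
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://kafka-2.kafka-headless.my-ns.svc.cluster.local:9092

# Data directory backed by a PersistentVolumeClaim.
log.dirs=/var/lib/kafka/data

# Rack awareness so replicas spread across availability zones.
broker.rack=us-east-1b
```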

Follow-up: If a pod gets a new PVC after restart, can the old PVC's data be recovered?

You run a leadership election. Controller selects broker 3 as new leader. Broker 3 accepts the LeaderAndISR request. Then broker 3's network link drops for 10 seconds. During those 10 seconds, what's the state of ISR?

ISR is in a transitional state. Broker 3 received LeaderAndIsr, considers itself leader, and starts accepting whatever traffic still reaches it. But leadership is propagated by the controller, not by the new leader: in ZooKeeper mode the controller sends LeaderAndIsr/UpdateMetadata to the other brokers, and in KRaft mode they learn it from the __cluster_metadata log. So followers (brokers 1, 2) may already know broker 3 is leader even while its link is down; they just cannot fetch from it, so replication stalls and the ISR cannot be cleanly updated until the link recovers. Once it does, followers resume fetching and the ISR stabilizes. If broker 3 instead crashes during the outage, the controller detects the lack of response, bumps the leader epoch, and elects a new leader from the ISR; any writes that reached only broker 3 and were acked with acks=1 are truncated away (with acks=all and min.insync.replicas=2 they would never have been acked, since followers couldn't fetch). To catch this early: monitor broker network latency and packet loss (alert above 100ms latency or 0.1% loss; mtr is useful). Also watch kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec; a sudden drop on one broker suggests isolation. Recovery action: if a broker stays isolated for more than ~30s, stop it (e.g., pkill -f kafka.Kafka) so the controller revokes its leadership and elects a replacement, fix the network, then restart it as a follower. Don't let a leader sit isolated for minutes.
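Why can't a delayed leadership notification, arriving after the 10-second outage, roll a broker's view back to the old leader? Because metadata updates carry the leader epoch and stale ones are ignored. A simplified sketch (invented class names, not Kafka code):

```python
# Sketch of epoch-ordered metadata: each broker keeps the highest leader
# epoch it has seen per partition and ignores any update carrying an older
# epoch, so late or reordered notifications cannot regress its view.

class BrokerView:
    def __init__(self):
        self.leader = {}  # partition -> (leader_id, epoch)

    def apply(self, partition, leader_id, epoch):
        """Apply a leadership update only if it is newer than what we have."""
        current = self.leader.get(partition)
        if current is None or epoch > current[1]:
            self.leader[partition] = (leader_id, epoch)
            return True
        return False      # stale update, ignored

b = BrokerView()
b.apply("orders-0", leader_id=3, epoch=7)         # broker 3 elected at epoch 7
late = b.apply("orders-0", leader_id=1, epoch=6)  # delayed stale update
assert late is False
assert b.leader["orders-0"] == (3, 7)
```

The same ordering rule means that once followers learn of broker 3's epoch-7 leadership, nothing the old leader sends can win them back.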

Follow-up: If broker 3's network recovers but the controller fails at the same time, who is in charge of electing the new leader?
