Partition Replication & In-Sync Replicas
You wake up to an alert: ISR for your critical orders topic shrank from [0,1,2] to [0]. What does this mean, what broke, and should you wake up the on-call?
ISR (In-Sync Replicas) shrinking to [0] means the follower brokers can't keep up with the leader. This is bad for both durability and availability: with only 1 replica, if that broker dies, the data is gone regardless of your acks setting, because there's nothing left to fail over to. Immediate diagnosis: check the leader's broker logs for "Shrinking ISR" messages. Run kafka-topics --bootstrap-server localhost:9092 --topic orders --describe to see replica status. Check the JMX metric kafka.server:type=ReplicaManager,name=IsrShrinksPerSec to see shrink events. Causes: 1) A follower broker is overloaded (disk I/O, network saturation, GC pause). 2) Network latency between brokers exceeds replica.lag.time.max.ms (default 10s in older releases, 30s since Kafka 2.5). 3) Disk full on a follower. 4) A follower crashed/restarted. Immediate action: check disk space (df -h) and broker CPU/memory. If the followers are healthy, wait 5-10 minutes for them to catch up. If they don't recover, check network connectivity: ping broker2 and telnet broker2 9092. If still broken, restart the lagging broker. This is worth waking up the on-call because with min.insync.replicas=2, producers using acks=all will block and orders stop flowing. Priority: restore the full ISR ASAP. Monitor: alert if ISR size < replication factor for >5 minutes.
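The eviction rule behind an ISR shrink can be sketched as a one-line predicate. This is a simplified model, not Kafka's implementation; the function name and the 10s default are illustrative:

```python
# Simplified model of the leader's follower-eviction rule: a follower is
# dropped from the ISR when it has not caught up to the leader's log end
# within replica.lag.time.max.ms.

def is_out_of_sync(last_caught_up_ms: int, now_ms: int,
                   replica_lag_time_max_ms: int = 10_000) -> bool:
    """True if the follower should be removed from the ISR."""
    return now_ms - last_caught_up_ms > replica_lag_time_max_ms

# A follower that last caught up 30s ago is evicted; one 2s behind stays.
assert is_out_of_sync(0, 30_000)
assert not is_out_of_sync(28_000, 30_000)
```

The same predicate explains why a single long GC pause that exceeds the lag window evicts an otherwise healthy follower.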
Follow-up: Your ISR recovered to [0,1,2] but then immediately shrank to [0] again. What's the likely root cause?
Your replication factor is 3. You have 5 brokers. ISR is [0,1,2]. Broker 0 (leader) dies. Who becomes leader? What's the new ISR?
When the leader dies, the controller elects a new leader from the remaining ISR members. Since ISR=[0,1,2] and broker 0 died, the new leader is broker 1 (the first surviving replica in the ISR, following the assigned replica order). New ISR=[1,2] (dead broker 0 is removed). The remaining follower, broker 2, now replicates from broker 1. Important: if you had configured unclean.leader.election.enable=true and all ISR members had died, Kafka would pick the first replica that recovers (even if not in the ISR), risking data loss. The default is unclean.leader.election.enable=false (safer, but the partition stays offline until an ISR member returns). To verify: run kafka-topics --bootstrap-server localhost:9092 --topic orders --describe and check the Leader column. Check JMX for kafka.controller:type=KafkaController,name=OfflinePartitionsCount: if >0, some partitions have no leader (bad). To recover broker 0: restart it. It will rejoin the ISR automatically once it catches up to the new leader (broker 1); catch-up time depends on data volume. Until then you have only 2 live copies instead of 3, even though the replication factor is still 3. If broker 0 doesn't come back, add a new broker 5 and rebalance: kafka-reassign-partitions --bootstrap-server localhost:9092 --topics-to-move-json-file topics.json --broker-list 1,2,5 --generate, then --execute the generated plan.
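Clean leader election can be modeled roughly like this (function and variable names are mine, not Kafka's; the real controller logic is more involved):

```python
# Toy model of clean leader election: the controller picks the first
# replica in the assigned replica list that is both alive and still in
# the ISR.

def elect_leader(assigned_replicas, isr, alive):
    for broker in assigned_replicas:
        if broker in isr and broker in alive:
            return broker
    # No eligible leader: the partition goes offline unless
    # unclean.leader.election.enable=true allows a non-ISR replica.
    return None

# Broker 0 (old leader) is dead; broker 1 is the first live ISR member.
assert elect_leader([0, 1, 2], isr={0, 1, 2}, alive={1, 2, 3, 4}) == 1
```

With ISR=[0] and broker 0 dead, the same function returns None: that is the OfflinePartitionsCount > 0 case.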
Follow-up: If broker 0 stays down for 2 days, can you safely remove it from the cluster? What changes?
You set min.insync.replicas=2 and replication_factor=3. ISR shrinks to [0,1]. Producers with acks=all suddenly hang. Why can't they write?
acks=all means the leader waits for acknowledgment from all current ISR members before returning success to the producer. With ISR=[0,1] and min.insync.replicas=2, the requirement is met (2 replicas in ISR), so producers should NOT hang; they should succeed. But if the ISR shrinks to [0] (only 1 replica) while min.insync.replicas=2, producers with acks=all will block because ISR < min.insync.replicas: the leader cannot satisfy the requirement, and producers get a NotEnoughReplicas error or a timeout. This is a safety mechanism: acks=all + min.insync.replicas=2 ensures at least 2 copies exist before a write is considered durable. Diagnosis: check ISR status and broker health. If broker 2 is lagging, investigate its I/O or restart it. Once the ISR recovers to ≥2, producers unblock immediately. To prevent this: either set min.insync.replicas=1 (less durable) or monitor ISR shrinks and page the on-call. In production, set min.insync.replicas=replication_factor-1 for balance: RF=3 → min.insync=2. This allows 1 broker to die without losing durability. Metric to track: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=orders combined with the producer timeout rate.
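The blocking behavior reduces to a single comparison. A minimal sketch, assuming a simplified model of the broker's acks=all admission check (the function name is illustrative):

```python
# Simplified model: a produce request with acks=all is rejected with
# NotEnoughReplicas when the current ISR is smaller than
# min.insync.replicas.

def produce_allowed(isr_size: int, min_insync_replicas: int) -> bool:
    return isr_size >= min_insync_replicas

assert produce_allowed(2, 2)       # ISR=[0,1]: writes succeed
assert not produce_allowed(1, 2)   # ISR=[0]: producers block / time out
```

Note the check is against the live ISR size, not the replication factor, which is why a topic can be "fully replicated on paper" yet unwritable.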
Follow-up: If you have 5 brokers, replication_factor=5, and ISR shrinks to [0,1,2], how many brokers must die before producers block with min.insync.replicas=2?
You misconfigured a topic: replication_factor=1 (accidentally). 1000 critical transactions were written before you noticed. Can you recover?
With replication_factor=1, there's only 1 copy of the data. If that broker dies, the data is gone; if the broker is alive, all messages are safe. If you discovered the mistake immediately and the broker didn't crash, you're OK: increase the replication factor with a partition reassignment. Note that the --generate flow only redistributes partitions at their existing RF, so write the plan by hand. Create increase-rf.json: {"version": 1, "partitions": [{"topic": "orders", "partition": 0, "replicas": [0, 1, 2]}]} (one entry per partition). Execute the reassignment: kafka-reassign-partitions --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --execute. Monitor: kafka-reassign-partitions --bootstrap-server localhost:9092 --reassignment-json-file increase-rf.json --verify. During reassignment, the new replicas copy data from the leader. This is I/O intensive and may slow production. Estimate the time as partition size divided by the bandwidth you allow replication (cap it with --throttle). If the broker dies before reassignment completes, the partially synced replicas are incomplete and the data is lost. Once RF=3, those 1000 transactions are replicated to 3 brokers. Permanent fix: set default.replication.factor=3 in the broker config to prevent accidents. Add an alert: if any topic has RF=1, page the on-call immediately.
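A back-of-envelope helper for the time estimate above (the numbers are illustrative; real throughput depends on the --throttle setting, disks, and network contention):

```python
# Rough reassignment ETA: the new replicas must copy the full partition
# from the leader, so time is about partition size divided by the
# replication bandwidth you allow.

def reassignment_eta_seconds(partition_size_gb: float,
                             throttle_mb_per_s: float) -> float:
    return partition_size_gb * 1024 / throttle_mb_per_s

# 50 GB partition throttled to 50 MB/s -> ~1024 s, about 17 minutes.
eta = reassignment_eta_seconds(50, 50)
```

The window computed here is also your exposure window: until --verify reports completion, you are still one broker failure away from losing those transactions.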
Follow-up: During a reassignment from RF=1 to RF=3, if the original broker dies, what happens to the replica on the new brokers?
Your ISR is [0,1,2]. Broker 1 dies. ISR becomes [0,2]. Before broker 1 recovers, you scale to 8 brokers. Broker 1 comes back online. Will it rejoin the ISR?
Yes, eventually. On restart, broker 1 truncates its log to a point consistent with the leader's and starts fetching from broker 0 (catch-up replication). It does not rejoin the ISR immediately: the leader adds it back only once broker 1's log end offset has caught up. (replica.lag.time.max.ms governs when a lagging replica is evicted from the ISR, not whether it may return.) Key point: ISR membership is dynamic; brokers join and leave based on sync status, not cluster membership. Scaling to 8 brokers doesn't affect this partition's replicas (still [0,1,2]): only newly created topics and partitions will be placed on brokers 3-7. To verify broker 1 rejoined: run kafka-topics --describe and check the ISR column. If it still shows [0,2] minutes later, broker 1 is stuck. Diagnose: check broker 1's logs for replica fetcher errors, verify network connectivity, check disk. If broker 1 can't recover, remove it from the replica set with kafka-reassign-partitions and move the replica to one of brokers 3-7. Monitor: alert if the ISR doesn't recover within 5 minutes of a broker restart.
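The catch-up-then-rejoin loop can be sketched as follows (a toy model with illustrative offsets and names, not broker internals):

```python
# Toy model of catch-up replication: the restarted follower fetches in
# rounds until its log end offset (LEO) reaches the leader's, at which
# point the leader adds it back to the ISR.

def caught_up(follower_leo: int, leader_leo: int) -> bool:
    return follower_leo >= leader_leo

def replicate_step(follower_leo: int, leader_leo: int, batch: int) -> int:
    """One fetch round: the follower advances by up to `batch` records."""
    return min(follower_leo + batch, leader_leo)

leo, leader_leo = 1_000, 5_000
while not caught_up(leo, leader_leo):
    leo = replicate_step(leo, leader_leo, batch=1_000)
# The follower rejoins the ISR only once leo == leader_leo.
```

In a live cluster the leader's LEO keeps advancing during catch-up, which is why a heavily loaded topic can leave a restarted follower chasing the leader for a long time.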
Follow-up: If broker 1 is permanently lost, how do you know it's safe to shrink RF from 3 to 2?
You're on a call with the database team. They ask: "Can you make Kafka behave like PostgreSQL replication?" They want synchronous replicas where a write blocks until replicas ack. You have acks=all and min.insync.replicas=2. Is this the same?
Almost, but not exactly. acks=all + min.insync.replicas=2 ensures at least 2 replicas (leader included) have a write before the producer sees success. This is similar to PostgreSQL's synchronous replication. Key differences: 1) Kafka's ISR is dynamic: a replica that falls behind is evicted, and if the ISR drops below min.insync.replicas, writes fail with NotEnoughReplicas (producers see errors and timeouts rather than transparent blocking). 2) PostgreSQL waits on the specific standbys named in synchronous_standby_names; Kafka waits on whichever replicas currently happen to be in the ISR. 3) With RF=3 and min.insync.replicas=2, Kafka keeps accepting writes after one follower dies (the ISR just shrinks); synchronous PostgreSQL stalls commits until the standby acks or you reconfigure. Similarity: both guarantee N copies before acknowledging. To make Kafka feel like PostgreSQL: set RF=3, min.insync.replicas=2, acks=all, and a tight replica.lag.time.max.ms (e.g. 5000) so laggards are evicted quickly. If enough followers stall, the ISR shrinks below 2 and producers block (like PostgreSQL waiting for a standby). Downside: Kafka becomes less available (like PostgreSQL). Upside: strong durability. The usual trade-off: set min.insync.replicas=RF-1 (use 2 of 3) for balance.
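The availability side of this trade-off reduces to simple arithmetic; a sketch (the function name is mine):

```python
# With acks=all, writes keep flowing as long as the ISR holds at least
# min.insync.replicas brokers, so you can lose RF - min.insync brokers
# before producers start blocking.

def tolerable_failures(replication_factor: int, min_insync: int) -> int:
    return replication_factor - min_insync

assert tolerable_failures(3, 2) == 1  # RF=3, min.insync=2: one death OK
assert tolerable_failures(4, 2) == 2  # RF=4, min.insync=2: two deaths OK
```

Durability is governed by min_insync itself (copies guaranteed at ack time), so raising one number without the other trades availability against durability.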
Follow-up: If you want Kafka to tolerate 2 broker deaths, what's the minimum replication factor and min.insync.replicas?
You're shipping orders to a warehouse via Kafka. Topic has RF=2. You run 5 brokers but only 3 have replicas for the orders partition. A broker dies. ISR is still valid. Should you panic?
No panic if the remaining replicas' brokers are alive and in sync. With RF=2, each partition of orders lives on exactly 2 brokers; across all its partitions the topic touches only 3 of your 5 brokers, and that can be intentional. When a broker hosting one of a partition's replicas dies, that partition's ISR shrinks to 1 copy. This is the inherent risk of RF=2: every partition has only 1 failure to spare. Better: increase RF to 3 and spread replicas across the 5 brokers. But first check why only 3 brokers host replicas. Was the topic manually assigned? Is there a constraint (rack awareness, a failed broker, an incomplete reassignment)? Run kafka-topics --describe to see the replica assignment. If brokers sit idle, you're wasting capacity; use kafka-reassign-partitions to move replicas onto the unused brokers, spreading load and improving fault tolerance. Monitor: alert if any critical topic's RF is below your standard (e.g. RF < 3), and escalate if the ISR stays below RF for >10 minutes.
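A hedged sketch of the reassignment JSON that would spread this partition onto a currently unused broker (broker IDs and target layout are illustrative; verify against your actual assignment before executing):

```json
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [0, 1, 4] }
  ]
}
```

Feeding this to kafka-reassign-partitions --execute both raises this partition to RF=3 and places a copy on the previously idle broker 4.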
Follow-up: You have RF=3 but only 2 brokers in ISR. Why would you NOT immediately fail over?
Your cluster runs with default settings. A network partition isolates broker 0 from brokers 1-4. Leader was on broker 0. What happens to producers and consumers? Can both sides continue?
A network partition raises the classic "split brain" worry: broker 0 (isolated) may briefly still believe it's the leader, while brokers 1-4 elect a new one. But with default settings Kafka prevents two legitimate leaders from committing writes. Leadership is arbitrated by the controller (backed by the ZooKeeper ensemble or the KRaft quorum), which sits on the majority side: it removes broker 0 and elects broker 1 from the ISR. Producers whose metadata still points at broker 0 may send it writes, but with acks=all those writes can never be committed: broker 0's followers are unreachable, so it cannot satisfy min.insync.replicas and the requests fail or time out. Producers and consumers on the majority side refresh their metadata, find broker 1, and continue. When the partition heals, broker 0 rejoins as a follower and truncates any records it accepted during isolation that the new leader doesn't have; broker 1's writes survive. To keep this safe: ensure RF ≥ 3 with min.insync.replicas=2 and acks=all, and keep unclean.leader.election.enable=false so a non-ISR replica can never be elected. Monitor the network: run mtr from broker 0 toward broker 1 to detect packet loss. If a network partition lasts >30s, investigate immediately (likely hardware failure or a cloud networking issue).
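What happens to broker 0's uncommitted writes on rejoin can be modeled as truncation to the common log prefix (a toy model; real Kafka uses leader epochs to locate the divergence point):

```python
# Toy model of post-partition reconciliation: the returning ex-leader
# keeps the prefix it shares with the new leader, discards its divergent
# suffix, and re-fetches the new leader's records.

def reconcile(follower_log: list, leader_log: list) -> list:
    common = 0
    while (common < len(follower_log) and common < len(leader_log)
           and follower_log[common] == leader_log[common]):
        common += 1
    truncated = follower_log[:common]       # drop divergent records
    return truncated + leader_log[common:]  # fetch leader's suffix

# Records broker 0 accepted while isolated ("x", "y") are discarded:
assert reconcile(["a", "b", "x", "y"], ["a", "b", "c"]) == ["a", "b", "c"]
```

This is also why acks=all matters: with acks=0 or acks=1, the discarded "x" and "y" could be writes a producer already considered successful.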
Follow-up: If broker 0 is isolated and RF=3, why can't broker 0 continue alone as a 1-broker ISR?