You're migrating a 50-broker Kafka cluster from ZooKeeper to KRaft. Producers and consumers are live with millions of msg/sec throughput. Walk through the migration process step by step, including risks and rollback strategy.
Step-by-step migration (per KIP-866, Kafka 3.4+): (1) Provision controllers: deploy 3-5 dedicated controller nodes with process.roles=controller, a unique node.id, and controller.quorum.voters listing every controller as id@host:port. (2) Enable migration mode: on the controllers, set zookeeper.metadata.migration.enable=true and zookeeper.connect so they can read the existing ZK metadata; start them and verify they form a quorum and elect a leader. (3) Roll brokers into migration mode: restart brokers one by one with zookeeper.metadata.migration.enable=true plus the controller quorum settings; they remain ZK-mode brokers but register with the KRaft controller. Once all brokers are registered, the controller copies ZK metadata into the KRaft log and enters dual-write mode, writing every subsequent metadata change to both the KRaft log and ZK. (4) Roll brokers into KRaft mode: restart brokers one by one with process.roles=broker and without zookeeper.connect; monitor startup time and under-replicated partitions after each restart. (5) Finalize: after a stability window (24-48 hours), remove zookeeper.metadata.migration.enable and zookeeper.connect from the controllers. This is the point of no return. (6) Decommission ZK: shut down the ZooKeeper ensemble.
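The steps above can be sketched as configuration. This is a minimal outline with placeholder hostnames, ports, node IDs, and ZK connect string; a real deployment needs listeners, security settings, and version-specific flags beyond this:

```properties
# --- controller.properties (each of the 3-5 dedicated controllers) ---
process.roles=controller
node.id=3000                          # unique per controller (placeholder)
controller.quorum.voters=3000@ctrl1:9093,3001@ctrl2:9093,3002@ctrl3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://ctrl1:9093
# Migration mode (KIP-866): controller reads existing metadata from ZK
# and dual-writes changes back to ZK until finalization.
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka

# --- broker server.properties during the migration roll ---
# Brokers keep their existing broker.id and zookeeper.connect (still
# ZK mode), and additionally register with the KRaft quorum:
zookeeper.metadata.migration.enable=true
controller.quorum.voters=3000@ctrl1:9093,3001@ctrl2:9093,3002@ctrl3:9093
controller.listener.names=CONTROLLER
```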
Risks: If the KRaft quorum loses its majority (2 of 3 controllers down), no leader can be elected and metadata operations stall. Mitigate: use 5 controllers to tolerate 2 failures, monitor quorum health continuously, and test controller failover before migrating production.
Rollback strategy: Rollback is possible at any point before finalization. Because the controller dual-writes every metadata change to ZooKeeper during migration, ZK stays current: roll the brokers back into ZK mode (restore zookeeper.connect, remove the migration and quorum configs), then shut down the KRaft controllers. After finalization, ZK is no longer updated and rollback is not supported.
LinkedIn migrated a 1000+ broker cluster with 24-hour full migration window. They used traffic shadowing to validate KRaft controller decisions before cutting over.
Follow-up: During the dual-write phase, if ZK and KRaft disagree on partition leaders, which one wins? How do you detect and resolve divergence?
Your KRaft cluster has 3 controllers. Controller 1 (leader) dies. How does leader election work? What's the impact on brokers and consumers during the transition?
KRaft uses a Raft-based protocol for controller election. When the leader dies: (1) The remaining controllers (2, 3) stop receiving fetch responses from the leader and time out (controller.quorum.fetch.timeout.ms, default 2 seconds). (2) One controller increments its epoch (the Raft term) and requests votes from the others. (3) With a majority (2 of 3 voters) granting the vote, a new leader is elected. (4) The new leader ensures followers are caught up on the metadata log. (5) Brokers discover the new leader and resume replicating metadata from it.
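The vote-counting in steps (2)-(3) can be sketched as a toy model. This is illustrative arithmetic only, not Kafka's actual implementation:

```python
# Toy model of a Raft-style election among KRaft controllers
# (illustrative only, not Kafka's implementation).

def majority(total_voters: int) -> int:
    """Votes needed to win an election in a quorum of this size."""
    return total_voters // 2 + 1

def elect(total_voters: int, candidate_epoch: int,
          reachable_epochs: dict[int, int]) -> bool:
    """A voter grants its vote only if the candidate's epoch is newer
    than the latest epoch that voter has already seen."""
    votes = 1  # candidate votes for itself
    for seen_epoch in reachable_epochs.values():
        if candidate_epoch > seen_epoch:
            votes += 1
    return votes >= majority(total_voters)

# Leader (node 1) dies at epoch 5; node 2 times out, bumps to epoch 6,
# and asks node 3 for a vote. 2 of 3 voters agree -> quorum reached:
print(elect(3, 6, {3: 5}))  # True
# If node 3 were also unreachable, the lone self-vote would not suffice:
print(elect(3, 6, {}))      # False
```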
Broker impact: KRaft brokers continuously replicate the metadata log from the active controller. During an election their fetches fail; they retry and rediscover the leader once one is elected. Expect a window of a few seconds (roughly the fetch timeout plus election time) during which metadata operations (topic creation, partition reassignment) stall or see elevated latency. Metadata that was already propagated keeps being served from each broker's local cache.
Consumer/producer impact: Minimal. Clients don't talk to the controller directly; they get metadata from brokers. As long as brokers are healthy, producers and consumers keep reading and writing. If a broker's metadata cache is stale (e.g., a leader change it hasn't seen yet), a client's first fetch may fail with a retriable error and succeed on retry shortly after.
Observation: Netflix uses 5 KRaft controllers for production. They've observed that controller failover takes 15-30 seconds to propagate metadata to all 500 brokers. This is <1% of overall SLA.
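The sizing arithmetic behind a 5-controller choice is simple: a quorum of n voters tolerates n minus majority(n) failures, so even node counts add cost without adding tolerance. A quick sketch:

```python
# Failures tolerated by an n-controller KRaft quorum: n - majority(n).
# Illustrative arithmetic only.

def tolerated_failures(n: int) -> int:
    majority = n // 2 + 1
    return n - majority

for n in (1, 3, 4, 5, 7):
    print(f"{n} controllers -> tolerates {tolerated_failures(n)} failure(s)")
# Note 4 controllers tolerate the same 1 failure as 3 controllers,
# at higher cost; odd sizes are the efficient choices.
```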
Follow-up: If 2 of 3 controllers die simultaneously, the cluster becomes unavailable. How do you prevent this? What's the trade-off between controller count and resource cost?
You're running a 10-broker ZooKeeper-based cluster. ZK is consuming 20GB RAM and has constant GC pauses of 2-3 seconds. Migrating to KRaft will remove this dependency. Estimate storage and CPU cost of KRaft controllers vs. ZK.
ZooKeeper cost: a 3-node ZK ensemble stores broker registrations, partition assignments and, on very old Kafka versions, consumer offsets (modern clusters keep offsets in the __consumer_offsets topic). RAM usage grows with topic/partition count: at 10K topics × 100 partitions = 1M znodes, the heap can reach 5-20GB. The long GC pauses come from that large heap; ZK also periodically snapshots its in-memory state to disk.
KRaft cost: 3 controllers store metadata in an append-only log (the __cluster_metadata topic), fully replicated to each controller. Records are small: a broker registration is on the order of 1KB, and routine heartbeats don't append new records unless state changes. For a 10-broker cluster, expect on the order of megabytes per day of log plus periodic snapshots, with retention settings bounding total disk use. Controllers are JVMs too, but their heap is small, so GC pauses are negligible compared with a 20GB ZK heap. CPU: Raft consensus plus metadata serving, typically single-digit percent on the active controller at this scale.
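The log-growth estimate above as back-of-envelope arithmetic. The record size and change rate are assumptions for illustration, not measured values:

```python
# Back-of-envelope KRaft metadata log growth. The ~1KB record size and
# the change rate are assumptions, not measurements.
record_bytes = 1024            # assumed size of one metadata record
changes_per_day = 10_000       # assumed metadata changes/day (small cluster)
controllers = 3                # each controller stores a full copy

per_controller_per_day = record_bytes * changes_per_day
print(f"{per_controller_per_day / 1e6:.1f} MB/day per controller")
print(f"{per_controller_per_day * controllers / 1e6:.1f} MB/day across the quorum")
# Snapshots plus retention (metadata.max.retention.bytes/ms) keep the
# on-disk total bounded rather than growing forever.
```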
Comparison: ZK: 20GB RAM with 2-3s GC pauses. KRaft: roughly 500MB heap per controller (1.5GB total across 3) and sub-10ms GC pauses thanks to the small heap and append-only workload. Removing ZK entirely saves 3 VMs and one extra system to run, patch, and monitor.
Trade-off: KRaft controllers are simpler (no external ZK cluster), but every broker requires controller.quorum.voters config. ZK is decoupled (any ZK cluster works). For greenfield, KRaft is cheaper.
Follow-up: If you have a 500-broker cluster and need to store 10 years of broker state history for audit, how does KRaft's append-only log compare to ZK's snapshot + incremental updates model?
During KRaft migration, a broker reads stale metadata from the old ZooKeeper-based controller while the new KRaft controller has updated it. Explain the ordering of metadata propagation and how to detect/prevent divergence.
In a dual-write setup, ZK and KRaft both accept metadata writes. Brokers might read from either. Risk: Broker A reads from ZK that topic "events" has 10 partitions; Broker B (already migrated) reads from KRaft that "events" has 20 partitions (the expansion not yet reflected in ZK). This can cause a producer to route a record to partition 15, which doesn't exist in the ZK view.
Ordering guarantee: Kafka metadata is versioned. In ZK mode, metadata updates carry a controller epoch and per-partition leader epochs; in KRaft, every change is a record at a monotonically increasing offset in the metadata log. Brokers compare versions: if a broker's cached version is behind the controller's, it re-fetches metadata.
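The version comparison reduces to a one-line check. A simplified sketch, collapsing the epochs/offsets described above into a single monotonically increasing version number:

```python
# Simplified staleness check a broker performs against the controller's
# metadata version (one monotonic version, as described in the text).

def needs_refresh(broker_version: int, controller_version: int) -> bool:
    """Broker must re-fetch metadata when its cached version is behind."""
    return broker_version < controller_version

print(needs_refresh(41, 42))  # True: stale broker re-fetches
print(needs_refresh(42, 42))  # False: up to date, cache is served as-is
```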
Prevention: During dual-write phase, don't perform metadata changes (e.g., expand partitions). Keep topology stable. After full cutover to KRaft, allow metadata changes. This avoids divergence windows.
Detection: Monitor each broker's metadata position. For KRaft-mode brokers, kafka-metadata-quorum.sh describe --replication lists every voter and observer (brokers appear as observers) with its log end offset and lag; a broker whose lag keeps growing is stale. Force a re-fetch by restarting that broker.
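A sketch of turning that replication view into an alert. The input rows here are hand-written stand-ins, not the actual output format of kafka-metadata-quorum.sh; adapt the parsing to your version's real output:

```python
# Flag lagging observers (brokers) from quorum replication info.
# The rows are stand-ins for parsed tool output, not a real format.

def stale_brokers(rows, max_lag=1000):
    """rows: (node_id, role, lag_in_records). Return lagging observers."""
    return [nid for nid, role, lag in rows
            if role == "Observer" and lag > max_lag]

rows = [
    (1,   "Leader",   0),
    (2,   "Follower", 12),
    (101, "Observer", 5),        # healthy broker
    (102, "Observer", 48_213),   # stale broker, needs attention
]
print(stale_brokers(rows))  # [102]
```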
Recovery: If divergence is detected, pause all producers/consumers until metadata converges. Set auto.create.topics.enable=false during migration to prevent accidental divergence.
Follow-up: How do you test for metadata divergence in a 100-broker cluster without affecting production? Can you run a validation layer?
A KRaft controller is consuming 40% CPU due to metadata propagation to 500 brokers. Each broker reconnects every 5 minutes, causing spike in load. Design a solution to reduce controller load.
Root cause: In KRaft, brokers normally hold a long-lived fetch session to the active controller and receive only new metadata records (deltas). A full snapshot — serializing all topics, partitions, replicas, and configs, potentially 100MB+ for a large cluster — is only sent when a broker has fallen behind the retained log. If 500 brokers are each pulling a full snapshot every 5 minutes, something is forcing constant re-syncs: metadata log retention too short for the reconnect window, or connection churn resetting fetch sessions.
Optimization 1 - Extend metadata log retention: raise metadata.max.retention.bytes / metadata.max.retention.ms so reconnecting brokers can catch up from the retained log instead of requiring a full snapshot transfer.
Optimization 2 - Tune snapshot cadence: metadata.log.max.record.bytes.between.snapshots controls how many bytes of new records accumulate before the controller takes a snapshot. Incremental replication is KRaft's default behavior, so the goal is simply to keep brokers on the delta path: a delta fetch is typically kilobytes, versus 100MB+ for a full snapshot.
Optimization 3 - Add more controllers: with 5 controllers instead of 3 you tolerate 2 failures instead of 1, but brokers fetch metadata only from the active controller, so extra controllers do not spread broker read load. More controllers buy fault tolerance, not metadata throughput.
Optimization 4 - Slow client metadata refresh: if the load is driven by clients rather than brokers, raise metadata.max.age.ms on producers/consumers (default 300000 = 5 minutes) so they refresh proactively less often. Trade-off: clients learn about topology changes more slowly between refreshes, though error-triggered refreshes still happen immediately.
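Optimization 4 as a client-side config sketch. These are standard Kafka client config names; the values are illustrative:

```properties
# Producer/consumer client configs (values illustrative):
# Proactive metadata refresh interval; default 300000 (5 min).
metadata.max.age.ms=900000
# Producer-only: how long to cache metadata for topics it has stopped
# writing to; default 300000 (5 min).
metadata.max.idle.ms=900000
```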
Production solution: Stripe uses optimizations 1 + 2. They reduced controller CPU from 40% to 8%. Combined with 5-controller setup, metadata latency <100ms.
Follow-up: If you raise client metadata refresh intervals to 30 minutes, can clients miss critical updates (e.g., a leader failover)? How do you maintain freshness while reducing load?
You're running KRaft with 3 controllers. A cascading failure: Controller 2 dies, then during recovery, Controller 1's disk fills up (no snapshots, append-only logs unbounded). Quorum is lost. Recover the cluster.
Prevention is key: the KRaft metadata log is not governed by the regular log.cleanup.policy; it is bounded by snapshots plus its own retention settings. Configure metadata.log.max.record.bytes.between.snapshots (how much new data accumulates before the controller snapshots) and metadata.max.retention.bytes / metadata.max.retention.ms so old segments covered by a snapshot are deleted, and alert on controller disk usage.
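A sketch of those prevention configs on the controller side. Values shown are at or near the defaults; treat them as illustrative starting points:

```properties
# controller.properties — bound metadata log growth (values illustrative).
# Take a snapshot after this many bytes of new records (default 20MB):
metadata.log.max.record.bytes.between.snapshots=20971520
# Delete log segments already covered by a snapshot beyond these bounds:
metadata.max.retention.bytes=104857600
metadata.max.retention.ms=604800000
```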
If already failed: (1) Check disk space on the surviving controllers. If there is headroom, restart the failed controller — it will catch up from the quorum. (2) If a controller's disk is full, stop it and free space: delete old segments under its __cluster_metadata-0 log directory, but never delete meta.properties (it holds the node's identity). On restart, the controller re-syncs from the quorum via snapshot. (3) If a majority of controllers is down, quorum is lost and no progress is possible. Recover by bringing failed controllers back with their existing state (clearing disk as above if needed) until a majority of the configured voter set is online. Only as a last resort — and accepting possible loss of recent metadata — reconfigure controller.quorum.voters down to a single surviving node so it can form a one-node quorum, then grow the quorum back.
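Step (2) — clearing segments while preserving meta.properties — demonstrated on a scratch directory. The path and file names are stand-ins for a real controller log dir; adapt carefully before running against one:

```shell
# Scratch-dir demo: delete metadata log segments but keep meta.properties.
LOGDIR=demo-metadata-logdir                      # stand-in path
mkdir -p "$LOGDIR/__cluster_metadata-0"
touch "$LOGDIR/meta.properties"                  # node identity: must survive
touch "$LOGDIR/__cluster_metadata-0/00000000000000000000.log"
touch "$LOGDIR/__cluster_metadata-0/00000000000000000000.index"

# Delete every file except meta.properties; directories stay in place:
find "$LOGDIR" -type f ! -name meta.properties -delete

ls "$LOGDIR"   # meta.properties and the now-empty __cluster_metadata-0 dir
```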
Pitfall: Deleting a controller's entire log directory destroys its identity along with its state. Delete only old log segments; keep the directory structure and meta.properties.
Production recovery: Uber experienced this; took 4 hours to recover. They now monitor KRaft log growth and auto-trigger compaction when reaching 80% retention.
Follow-up: How do you validate that recovered KRaft state is consistent with broker state? Should you rebuild the entire cluster from scratch or trust the quorum?
Post-migration to KRaft, you enable auto.create.topics.enable=true to auto-create topics on first produce. However, topic creation adds 2-5 seconds latency to producer. Explain the bottleneck and optimize.
Bottleneck: Auto topic creation triggers a synchronous metadata round trip: (1) Broker receives a produce request for an unknown topic; (2) Broker sends a CreateTopics request to the KRaft controller; (3) Controller appends the metadata record and fsyncs (~5ms); (4) Controller replicates to a majority of followers (50-100ms over the network); (5) Controller responds to the broker (~5ms); (6) Broker updates its metadata cache; (7) Broker handles the original produce request. That is 100-200ms per request in isolation; under load these steps queue behind other controller work, which is how they stretch into the 2-5 second spikes producers observe.
Optimization 1 - Pre-create topics: Don't rely on auto-create. Use auto.create.topics.enable=false and pre-create all topics at deploy time. Producers find topics immediately.
Optimization 2 - Batch topic creation: If you must auto-create, create topics in batches: collect all nonexistent topics from the metadata cache and send one CreateTopics request listing all of them (the Admin API accepts a batch), rather than one request per topic.
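A sketch of the batching step: deduplicate the wanted topics against a (stand-in) metadata cache and emit one batch. With a real client you would then pass the whole list to a single Admin createTopics call instead of looping:

```python
# Collect topics missing from a (stand-in) metadata cache so they can be
# created in one batched request instead of one request per topic.

def topics_to_create(wanted: list[str], cached: set[str]) -> list[str]:
    """Deduplicate and keep only topics the cache doesn't know about."""
    missing: list[str] = []
    for topic in wanted:
        if topic not in cached and topic not in missing:
            missing.append(topic)
    return missing

cached = {"orders", "payments"}
wanted = ["orders", "events", "events", "clicks"]
print(topics_to_create(wanted, cached))  # ['events', 'clicks'] -> one request
```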
Optimization 3 - Tune controller responsiveness: the Raft quorum must majority-commit each metadata record, so commit latency is bounded by the slowest member of the committing majority (min.insync.replicas does not apply to the metadata log). What helps: fast fsync (SSD/NVMe) on the controller log directories, low-latency network between controllers, and keeping controllers close together topologically.
Optimization 4 - Async metadata cache: Broker maintains a local cache of recently-created topics (TTL 10 sec). For repeated topics, answer from cache without hitting controller.
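Optimization 4 can be sketched as a minimal TTL cache. This is an illustrative stand-in, not broker internals; a real implementation would also need thread safety and size bounds:

```python
# Minimal TTL cache for "recently created topics" (illustrative sketch).
import time

class TopicTTLCache:
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self._created: dict[str, float] = {}   # topic -> creation timestamp

    def mark_created(self, topic: str) -> None:
        self._created[topic] = time.monotonic()

    def recently_created(self, topic: str) -> bool:
        """True if we created this topic within the TTL window, so the
        caller can skip a round trip to the controller."""
        ts = self._created.get(topic)
        return ts is not None and time.monotonic() - ts < self.ttl

cache = TopicTTLCache(ttl_seconds=10.0)
cache.mark_created("events")
print(cache.recently_created("events"))  # True: answered from cache
print(cache.recently_created("clicks"))  # False: falls through to controller
```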
Production fix: Shopify disabled auto-create and pre-creates topics during deployment. This reduced producer latency spikes from 2s to <10ms.
Follow-up: How do you handle topic creation failures during pre-deploy? Should the deploy fail, or should you gracefully fall back to auto-create on first produce?
KRaft controller is running. You decide to scale down from 5 controllers to 3 for cost savings. What's the procedure, and what happens to metadata consistency during downscaling?
Procedure (static quorum): (1) Identify the 2 controllers to remove (prefer non-leaders; if the leader is among them, a re-election happens on its shutdown). (2) Update controller.quorum.voters on the remaining controllers to exclude the 2 targets, rolling-restarting them to apply the change. (3) Gracefully shut down the 2 targets. (4) Verify the 3-node quorum has a leader and is healthy (kafka-metadata-quorum.sh describe --status). (5) Update controller.quorum.voters on all brokers to the new 3-node set so they stop contacting the removed nodes. (6) Decommission the 2 VMs. (On Kafka 3.9+ with KIP-853 dynamic quorums, controllers can instead be removed via kafka-metadata-quorum.sh remove-controller without static config edits.)
Risks during transition: if you shut down both targets at once, the voter set still requires a majority of its configured size: with 5 configured voters you need 3 alive, so 3 survivors leave zero headroom, and a single crash during the window loses quorum. Mitigate: remove one controller, wait for the quorum to stabilize (leader present, followers caught up), then remove the second.
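The quorum math through the downscale window, as a quick sketch (illustrative arithmetic only):

```python
# Majority requirement as the configured voter set shrinks during a
# 5 -> 3 downscale. Illustrative arithmetic only.

def majority(configured_voters: int) -> int:
    return configured_voters // 2 + 1

scenarios = [(5, 5), (5, 3), (5, 2), (4, 3), (3, 2)]
for voters, alive in scenarios:
    ok = alive >= majority(voters)
    print(f"{voters} configured voters, {alive} alive -> "
          f"need {majority(voters)}, quorum {'OK' if ok else 'LOST'}")
# With 5 configured voters and only 3 alive, one more crash (down to 2,
# below the required 3) loses quorum -- hence remove one node at a time.
```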
Metadata consistency: KRaft maintains log replication across remaining controllers. As controllers leave, they're removed from the voter set. Remaining controllers continue replicating metadata. No data loss because the log was already replicated to 3+ controllers before removal.
Gotcha: If you forget to update controller.quorum.voters on brokers, they'll keep trying to connect to the removed controllers. This causes 30-second delays on metadata fetches (retry timeout). Always update both controller and broker configs.
Follow-up: If a controller crashes unexpectedly during downscaling, and quorum is lost, can you recover without manual intervention? What's the minimum cluster size to survive cascading failures?