Kafka Interview Questions

Consumer Group Rebalancing Protocols


Rebalancing Protocols & Production Scenarios

Your consumer group just triggered a rebalance during peak traffic (10K msg/sec). Processing stopped for 2 minutes. When you check, lag jumped from 5K to 850K. What happened and how do you diagnose it?

A rebalance under the default eager protocol stops all processing while consumers rejoin the group and partitions are redistributed. A 2-minute pause at 10K msg/sec adds up to 1.2M messages of backlog; the observed 845K jump suggests ingestion dipped or some consumers drained a little before stopping. Diagnose with kafka-consumer-groups --bootstrap-server localhost:9092 --group orders-group --describe to see per-partition lag, and find the trigger in the broker's group coordinator log lines ("Preparing to rebalance group ..."), which include the reason. Common causes: missed heartbeats beyond session.timeout.ms (default 10s on older clients, 45s since Kafka 3.0), a long GC pause, or exceeding max.poll.interval.ms (default 5 minutes), which evicts the consumer mid-processing — look for CommitFailedException or REBALANCE_IN_PROGRESS in consumer logs. Mitigation: raise session.timeout.ms from 10s to 30s, raise max.poll.interval.ms if batches are slow, or lower max.poll.records from 500 to 100 so each poll loop finishes sooner.
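The backlog arithmetic and the tuning knobs above can be sketched in a few lines. The config keys are Kafka's canonical dotted names (as accepted by the Java consumer); holding them in a Python dict is purely illustrative, and the values are the ones suggested in the answer, not universal defaults.

```python
# Rough backlog math for a rebalance pause, plus the tuning knobs
# discussed above (values are the answer's suggestions, not defaults).

def rebalance_backlog(msgs_per_sec: int, pause_sec: int) -> int:
    """Messages added to lag while processing is fully stopped."""
    return msgs_per_sec * pause_sec

# 2-minute full stop at 10K msg/sec -> 1.2M messages of extra lag.
assert rebalance_backlog(10_000, 120) == 1_200_000

# Hedged tuning sketch using Kafka's dotted config names:
consumer_tuning = {
    "session.timeout.ms": 30_000,     # was 10s; tolerate longer GC pauses
    "max.poll.interval.ms": 600_000,  # headroom for slow batches before eviction
    "max.poll.records": 100,          # was 500; each poll loop finishes faster
}
```

The trade-off: a larger session timeout means a genuinely dead consumer is detected more slowly, so its partitions sit unprocessed longer before the group rebalances them away.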

Follow-up: Your rebalance took 45 seconds but lag only jumped 450K—why was that better? What metric tells you rebalance time?

You deployed a new consumer version with a bug. It crashes instantly after rebalancing, triggering cascading rebalances every 10 seconds for 5 minutes. 200 messages piled up. How do you stop the bleeding?

Cascading rebalances mean consumers keep crashing, rejoining, and forcing the group to resync. Immediate action: stop every instance running the bad build (pkill -f orders-consumer). That triggers one final rebalance and the healthy remaining consumers absorb all partitions. Confirm thrashing via JMX: kafka.consumer:type=consumer-coordinator-metrics exposes sync-rate and rebalance-rate-per-hour — a high sync rate while records consumed stays near zero means the group is resyncing instead of processing. Roll back the deployment. Before restarting, verify the fix handles edge cases in the rebalance callback; use an offset reset (--reset-offsets --to-earliest) only if you actually need to replay. Prevention: add an on_partitions_revoked() handler that gracefully commits offsets before the rebalance, monitor kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max to catch lag spikes, and alert on rebalance frequency via rebalance-rate-per-hour.
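The "alert on rebalance frequency" advice above can be prototyped without any Kafka dependency. This sketch assumes you already collect rebalance timestamps from your metrics pipeline (e.g. by scraping the coordinator metrics); the function and threshold are illustrative, not a standard API.

```python
# Simple crash-loop detector: flag when rebalances inside a sliding
# window exceed a threshold. Timestamps are assumed to come from your
# metrics pipeline; everything here is an illustrative sketch.

def is_thrashing(rebalance_ts, now, window_sec=300, threshold=10):
    """True if too many rebalances happened in the last window_sec seconds."""
    recent = [t for t in rebalance_ts if now - t <= window_sec]
    return len(recent) >= threshold

# A rebalance every 10 seconds for 5 minutes, as in the scenario above:
storm = list(range(0, 300, 10))          # 30 rebalance timestamps in 300s
assert is_thrashing(storm, now=300) is True
assert is_thrashing([0, 250], now=300) is False   # two rebalances is normal
```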

Follow-up: Why would switching to CooperativeStickyAssignor reduce the impact? What's the trade-off vs RangeAssignor?

You have 6 consumers in a group across 3 partitions on your orders topic. You scale to 9 consumers. Rebalance triggers. Partition ownership flips: P0/P1/P2 were on C1/C2/C3. Now they're on C4/C5/C6. Why did they move and what did this cost?

RangeAssignor works per topic: sort member ids lexicographically, sort partitions by number, then hand each consumer a contiguous range. With 3 partitions and more consumers than partitions, only the first 3 consumers in sort order get a partition. Member ids embed the client id plus a random UUID, so adding 3 consumers can change the sort order and move P0/P1/P2 to entirely different members — here C4/C5/C6. Under the default eager protocol this is a full reassignment: every consumer revokes everything, offsets flush, and the group resyncs, so all partitions stop processing for several seconds. Use kafka-consumer-groups --bootstrap-server localhost:9092 --group orders-group --describe --members --verbose to see member ids and their assigned partitions. Better option: CooperativeStickyAssignor, which tries to keep partitions on the same consumer and rebalances incrementally — only affected partitions move while the others keep processing. Set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.
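The sort-order effect is easy to demonstrate with a pure-Python sketch of RangeAssignor's per-topic logic (simplified from the Java implementation; the member ids below are made up for illustration):

```python
# Simplified RangeAssignor: sort member ids, then hand out contiguous
# partition ranges. Real member ids embed a client id plus a random UUID.

def range_assign(member_ids, num_partitions):
    members = sorted(member_ids)
    base, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for i, m in enumerate(members):
        n = base + (1 if i < extra else 0)   # first `extra` members get one more
        assignment[m] = list(range(start, start + n))
        start += n
    return assignment

# 6 members, 3 partitions: only the first 3 in sort order get a partition.
before = range_assign(["c1-aaa", "c2-bbb", "c3-ccc",
                       "c4-ddd", "c5-eee", "c6-fff"], 3)
# 3 new members whose ids happen to sort first: ownership flips entirely.
after = range_assign(["c1-aaa", "c2-bbb", "c3-ccc", "c4-ddd", "c5-eee",
                      "c6-fff", "c0-111", "c0-222", "c0-333"], 3)
assert before["c1-aaa"] == [0] and before["c4-ddd"] == []
assert after["c0-111"] == [0] and after["c1-aaa"] == []   # P0 moved
```

Nothing about the data changed — only the lexicographic position of the member ids — yet every partition changed hands, which is exactly the movement CooperativeStickyAssignor is designed to avoid.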

Follow-up: How would you measure the actual outage time? What JMX metrics give you rebalance latency?

Your team uses static membership: group.instance.id=consumer-1. You restart consumer-1 for a config change. It rejoins with same instance ID but a different member ID (uuid). Rebalance? Or does it skip?

Static membership is keyed by group.instance.id, not the ephemeral member ID. If the same instance id rejoins within session.timeout.ms, the broker hands back the same partition assignment and no rebalance fires — the restart is treated as a temporary disconnect, not a new member. The new member ID is used for "fencing": if a second process joins with the same instance id while the first is alive, the older session is fenced off with an error. The broker holds the member's slot for the full session.timeout.ms, so size that window to cover your restart — the old 10s default is too tight for a rolling restart; 30s or more is common (the broker bounds the allowed range via group.min.session.timeout.ms and group.max.session.timeout.ms). Benefit: zero rebalance overhead; partitions never move to other consumers, and the restarted consumer resumes from its last committed offset. Use this for stateful consumers or when you need predictable partition ownership. To verify, run kafka-consumer-groups --bootstrap-server localhost:9092 --group orders-group --describe --members and look for the instance id column (non-null = static member).
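The static-membership settings described above amount to three config entries. Shown here as a plain Python dict of Kafka's dotted config names purely for illustration (in Java you'd pass them as Properties); the 30s timeout is a suggested value, not a default.

```python
# Static-membership sketch: the stable instance id is what suppresses
# the rebalance on restart. Values are illustrative choices.
static_member_config = {
    "group.id": "orders-group",
    "group.instance.id": "consumer-1",  # stable id -> broker keeps the slot
    "session.timeout.ms": 30_000,       # restart must finish inside this window
}

# Each deployed instance needs a DIFFERENT instance id (consumer-1,
# consumer-2, ...); reusing one id fences the older session.
assert static_member_config["group.instance.id"] == "consumer-1"
```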

Follow-up: If session.timeout.ms expires but instance.id is still in the group, what happens to unprocessed messages on those partitions?

Your payments topic has 8 partitions. You have 12 consumers. 4 consumers sit idle, getting no partitions assigned. You check rebalancing protocol: it's set to 2.5.0 default (RangeAssignor). How is Kafka deciding who gets what, and why?

RangeAssignor divides each topic's partitions among the sorted consumers: 8 ÷ 12 gives 0 per consumer with a remainder of 8, so the first 8 consumers in member-id sort order get 1 partition each and the last 4 get nothing. That is not a bug in the assignor — a partition is owned by exactly one group member, so with a single 8-partition topic no strategy can give 12 consumers work. (RoundRobinAssignor only evens things out when the group consumes multiple topics; with one topic the idle count is identical.) Check the partition count with kafka-topics --bootstrap-server localhost:9092 --topic payments --describe. Action: either grow the topic to 12+ partitions (kafka-topics --alter --topic payments --partitions 12 — consumers rebalance once they refresh metadata) or reduce the group to 8 consumers. Note that kafka-consumer-groups --reset-offsets is not a rebalance tool: it rewinds committed offsets and requires the group to be inactive.
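The quota math above in two lines of Python — the same divmod the assignor performs per topic:

```python
# Why 4 of 12 consumers sit idle: one topic, 8 partitions, and a
# partition can belong to only one group member at a time.
partitions, consumers = 8, 12
per_consumer, extra = divmod(partitions, consumers)  # base quota + remainder
busy = min(partitions, consumers)                    # members that get work
idle = consumers - busy                              # members with nothing

assert (per_consumer, extra) == (0, 8)   # first 8 members get 1 partition each
assert (busy, idle) == (8, 4)            # 4 consumers are guaranteed idle
```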

Follow-up: If you add a 9th partition while 12 consumers are running, which consumer gets it and when?

Your consumer code has a bug: on_partitions_revoked() throws an exception during rebalance, so offsets never commit. Rebalance completes but next consumer processes old messages. How many times will those messages be reprocessed?

If on_partitions_revoked() throws before the commit, the group falls back to the last successfully committed offset, and every message after it is reprocessed by whichever consumer receives the partition — once per failed rebalance, so a crash-looping callback can replay the same batch repeatedly. If the consumer had processed 100 messages since the last commit (auto-commit defaults to every 5s), those 100 are reprocessed. To inspect commits directly, read __consumer_offsets with its dedicated formatter: kafka-console-consumer --bootstrap-server localhost:9092 --topic __consumer_offsets --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --from-beginning. More practical: run kafka-consumer-groups --group payments-group --bootstrap-server localhost:9092 --describe and watch whether CURRENT-OFFSET falls back after each rebalance. Fix: wrap the callback body in try-catch so it never throws, and call the blocking commitSync() before returning; increment a counter metric on any revoke exception. Prefer manual commits (enable.auto.commit=false) so offsets advance only after successful processing; for exactly-once, put the output writes and the offset commit in a single transaction.
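The fix described above — never let the revoke callback raise, and count failures — can be sketched like this. The listener and consumer interfaces are generic stand-ins for your Kafka client's (kafka-python and confluent-kafka each have their own callback signature), not a real API:

```python
# Sketch of the fix: on_partitions_revoked() commits synchronously and
# swallows exceptions, incrementing a counter you can alert on.

class SafeRevokeListener:
    def __init__(self, consumer):
        self.consumer = consumer       # anything with a blocking commit()
        self.revoke_failures = 0       # metric: alert when this increments

    def on_partitions_revoked(self, revoked):
        try:
            self.consumer.commit()     # must complete before partitions move
        except Exception:
            self.revoke_failures += 1  # a raise here would lose the commit

# Exercise the error path with a consumer whose commit always fails:
class BrokenConsumer:
    def commit(self):
        raise RuntimeError("commit timed out")

listener = SafeRevokeListener(BrokenConsumer())
listener.on_partitions_revoked(["payments-3"])
assert listener.revoke_failures == 1   # exception absorbed, callback returned
```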

Follow-up: If rebalance takes 30 seconds and auto-commit fires 6 times during that window, which offset gets committed?

You have a WebSocket server that holds state for 500 concurrent connections. Each consumer in your group handles a subset. When rebalance happens, connections drop for 2 seconds. Clients reconnect but state is lost. How do you fix this with consumer rebalancing?

This is a classic stateful-consumer problem. Solutions: 1) Switch to CooperativeStickyAssignor plus static membership (group.instance.id=server-1) to minimize partition movement — most partitions stay put, and only added or removed consumers cause movement. 2) Implement on_partitions_revoked() to drain gracefully: stop polling, finish in-flight work, commit offsets, then release the partitions. 3) Keep connection state in an external store (e.g. Redis) keyed by connection_id, and re-hydrate it when clients reconnect after a rebalance. 4) Raise session.timeout.ms from 10s to 30s (keeping heartbeat.interval.ms around a third of that) so transient pauses stop triggering spurious rebalances. 5) Implement on_partitions_assigned() to rebuild state before processing new messages. Metric: track rebalance duration via JMX kafka.consumer:type=consumer-coordinator-metrics (rebalance-latency-avg). Target: <5s. If rebalancing is still frequent, check consumer logs for crashes or GC pauses.
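Options 2, 3, and 5 above fit together in one handler pair: persist state on revoke, re-hydrate on assign. In this sketch a dict stands in for Redis, and the handler/store method names are illustrative, not any client's real API:

```python
# Persist-on-revoke / re-hydrate-on-assign sketch. `store` would be
# Redis in production; a dict stands in here so the example is runnable.

class StatefulHandler:
    def __init__(self, store):
        self.store = store   # external store, keyed by connection_id
        self.local = {}      # hot in-memory state per connection

    def on_partitions_revoked(self, revoked):
        for conn_id, state in self.local.items():
            self.store[conn_id] = state   # persist before partitions move
        self.local.clear()

    def on_partitions_assigned(self, assigned):
        # Re-hydrate state for the newly owned partitions. (Real code
        # would filter by partition ownership; omitted for brevity.)
        self.local.update(self.store)

store = {}
h = StatefulHandler(store)
h.local["conn-42"] = {"user": "alice", "seq": 7}
h.on_partitions_revoked(["ws-0"])
assert store["conn-42"]["seq"] == 7 and h.local == {}
h.on_partitions_assigned(["ws-0"])
assert h.local["conn-42"]["user"] == "alice"
```

With CooperativeStickyAssignor most partitions never pass through this revoke/assign cycle at all, so the store round-trip only happens for connections whose partition actually moved.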

Follow-up: What's the minimum session.timeout.ms you'd set in production? What breaks if you set it too low?

You're monitoring consumer lag during a rebalance. Lag went from 100K to 3M in 30 seconds, then back to 100K over next 2 minutes. Explain the curve and how you'd alert on this pattern.

Lag curve during a rebalance: 1) lag stable (100K) = normal processing. 2) Rebalance triggered = processing stops, and lag climbs toward 3M because new messages keep arriving with no consumer processing them. 3) Rebalance completes = consumers rejoin and burn down the backlog to 100K. This shape is an expected "stall", not an error or data loss. To alert on it without paging for every rebalance, correlate two signals: lag (JMX kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max) and rebalance activity (kafka.consumer:type=consumer-coordinator-metrics, e.g. rebalance-rate-per-hour and rebalance-latency-avg). Rule of thumb: a lag delta above 2M inside 30s while a rebalance is in progress is normal; the same spike with no rebalance means consumers are genuinely slow — that's the real incident. To measure rebalance overhead, record lag before and after and compute time-to-recovery. With a Prometheus exporter the alert combines a lag-rate query with a rebalance-latency threshold (exact metric names vary by exporter). Also track rebalance count per day; more than ~50/day usually means heartbeat/session config is too tight.
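The decision rule above — same spike, different verdict depending on rebalance state — is worth encoding explicitly. The thresholds mirror the answer's numbers and are assumptions, not universal values:

```python
# Classify a lag spike: expected stall during a rebalance, page-worthy
# incident otherwise. Thresholds are the answer's rule of thumb.

def classify_lag_spike(lag_delta, window_sec, rebalancing):
    if lag_delta > 2_000_000 and window_sec < 30:
        return "expected-rebalance-stall" if rebalancing else "page-oncall"
    return "ok"

# 100K -> 3M in under 30s while a rebalance is in progress: expected.
assert classify_lag_spike(2_900_000, 25, rebalancing=True) == "expected-rebalance-stall"
# Same spike with no rebalance: consumers are genuinely slow.
assert classify_lag_spike(2_900_000, 25, rebalancing=False) == "page-oncall"
# Small drift is fine either way.
assert classify_lag_spike(50_000, 25, rebalancing=False) == "ok"
```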

Follow-up: How would you use kafka-consumer-groups to measure rebalance latency in real-time?
