Consumer Lag Monitoring & Debugging
Consumer lag on your orders topic spiked from 50K to 2M in 5 minutes. Producers are still healthy (10K msg/sec). Walk through how you'd debug this in production.
Lag = messages produced - messages consumed. It grows when 1) consumers slow down, 2) producers speed up, or 3) a rebalance pauses consumption. Producers here are steady at 10K msg/sec, so focus on the consumer side. Immediate diagnosis: run kafka-consumer-groups --bootstrap-server localhost:9092 --group orders-group --describe. Compare CURRENT-OFFSET (last committed position) with LOG-END-OFFSET (last produced). If LOG-END-OFFSET is advancing but CURRENT-OFFSET has stalled, the consumer is stuck. Check consumer logs for errors: grep ERROR /var/log/orders-consumer.log | tail -50. Common causes: 1) Consumer GC pause: check JVM logs for "Full GC". A 30s full GC halts consumption and, if it exceeds max.poll.interval.ms, triggers a rebalance. 2) Slow processing: at 5 ms per message, a single-threaded consumer tops out around 200 msg/sec. Increase parallelism or optimize the processing path. 3) Slow external dependency: a synchronous payment API call at 100 ms per message caps each consumer near 10 msg/sec. Batch or parallelize the calls, and add timeouts plus a circuit breaker so a degraded dependency doesn't stall the poll loop. 4) Rebalance in progress: check kafka-consumer-groups --group orders-group --describe --members. If the member list keeps changing, a rebalance is underway; recheck after a few minutes. 5) Consumer crashed: verify the process is running: ps aux | grep orders-consumer. Metrics to check: kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max (max lag across partitions) and records-consumed-rate (messages/sec). If consumed-rate is far below the 10K msg/sec produce rate, the consumers cannot keep up and lag will keep growing until throughput is raised.
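The two-snapshot diagnosis above (CURRENT-OFFSET vs LOG-END-OFFSET) can be sketched as a tiny classifier: take two readings a minute apart (e.g. two runs of kafka-consumer-groups --describe) and decide what the gap is doing. The class name and labels are illustrative, not part of any Kafka API.

```java
public class LagClassifier {
    // cur = CURRENT-OFFSET, end = LOG-END-OFFSET, sampled at two times.
    static String classify(long cur1, long end1, long cur2, long end2) {
        boolean consuming = cur2 > cur1;          // CURRENT-OFFSET advancing?
        long lag1 = end1 - cur1;
        long lag2 = end2 - cur2;
        if (!consuming && end2 > end1) return "stuck";          // producers moving, consumer not
        if (consuming && lag2 > lag1)  return "falling behind"; // consuming, but too slowly
        if (consuming && lag2 < lag1)  return "catching up";
        return "idle";
    }

    public static void main(String[] args) {
        // LOG-END-OFFSET advanced 50K while CURRENT-OFFSET did not move:
        System.out.println(classify(1_000_000, 1_050_000, 1_000_000, 1_100_000)); // stuck
        // Both advanced, but lag grew from 50K to 90K:
        System.out.println(classify(1_000_000, 1_050_000, 1_010_000, 1_100_000)); // falling behind
    }
}
```

"Stuck" sends you to the consumer logs and GC checks; "falling behind" means the consumer is healthy but under-provisioned.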
Follow-up: If consumer processing is 100ms/msg and you have 8 partitions, what's the max throughput?
Lag is 2M. You check consumer logs—they show no errors. Consumer heartbeat is healthy. Partition replication is good. But lag won't drop. What's the next thing to check?
Lag stuck despite a healthy consumer usually means the consumer is processing but offsets aren't advancing. Check whether offset commits are failing: run kafka-topics --bootstrap-server localhost:9092 --topic __consumer_offsets --describe to verify the offsets topic has leaders and a full ISR. If a __consumer_offsets partition is offline, commits fail; with auto-commit the failure is only logged, so the consumer keeps processing while the committed offset never advances. Check the JMX metric kafka.consumer:type=consumer-coordinator-metrics,name=commit-latency-avg (should be well under 100ms). If commit latency is high or climbing, commits are backed up. Causes: 1) The __consumer_offsets leader is slow or overloaded. 2) Broker disk is full (offsets can't be written). 3) Offsets expired: offsets.retention.minutes (default 7 days) deleted the group's committed offsets during inactivity, so lag is now measured from a reset position. 4) Consumer code never commits: confirm you call consumer.commitSync() after processing, or that enable.auto.commit=true with a sane auto.commit.interval.ms (default 5s). To fix: 1) Check broker disk: df -h. Above 90%, clean up urgently. 2) Inspect the offsets topic's log dirs: kafka-log-dirs --bootstrap-server localhost:9092 --describe --topic-list __consumer_offsets. 3) Restart the consumer; with auto-commit enabled, each poll() cycle commits the previous batch's offsets. 4) Monitor: alert when commit-latency-avg exceeds 1s.
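Cause 4 above usually comes down to the commit configuration. A minimal sketch of the two commit modes using the real consumer config keys; the surrounding KafkaConsumer wiring (bootstrap servers, deserializers, poll loop) is assumed and not shown.

```java
import java.util.Properties;

public class CommitModes {
    public static void main(String[] args) {
        // Mode 1: auto-commit. poll() commits the previous batch's offsets
        // roughly every auto.commit.interval.ms; commit failures are only
        // logged, which is how lag can stall "silently".
        Properties auto = new Properties();
        auto.setProperty("enable.auto.commit", "true");
        auto.setProperty("auto.commit.interval.ms", "5000"); // default 5s

        // Mode 2: manual commit. Disable auto-commit and call
        // consumer.commitSync() after processing each batch, so a failed
        // commit surfaces as an exception instead of a stuck offset.
        Properties manual = new Properties();
        manual.setProperty("enable.auto.commit", "false");

        System.out.println(auto.getProperty("auto.commit.interval.ms"));
        System.out.println(manual.getProperty("enable.auto.commit"));
    }
}
```

If you see processing logs advancing while CURRENT-OFFSET is frozen, check which mode is active before blaming the brokers.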
Follow-up: If __consumer_offsets is corrupted, can you reset offsets to earliest and restart?
You have 3 consumers in a group across 8 partitions. Each consumer processes 100 messages/sec. Lag is 800K. Why? What's the optimal consumer count?
3 consumers × 100 msg/sec = 300 msg/sec total throughput. A lag of 800K means the group is 800,000 / 300 ≈ 2,667 seconds ≈ 44 minutes behind at current throughput, and if lag is still growing, producers are outpacing consumers. Measure the growth: if lag rises by 700 msg/sec while consumers do 300 msg/sec, producers are sending 1,000 msg/sec. Optimal consumer count: at most one consumer per partition; any extras sit idle. With 8 partitions the ceiling is 8 consumers, each owning 1 partition and processing independently with no coordination overhead. With 3 consumers, each handles 2-3 partitions serially inside one poll loop. Adding 5 consumers (total 8) raises throughput to 8 × 100 = 800 msg/sec, but that still trails a 1,000 msg/sec producer, so lag keeps growing, just more slowly. To actually catch up you must also raise per-consumer throughput (optimize processing, batch external calls) until total consumption exceeds the produce rate. Monitor: kafka-consumer-groups --group orders-group --describe after scaling should show 8 members. Then track the lag trend (e.g. the rate of change of your consumer-lag metric in your monitoring system): healthy catch-up means lag decreases steadily, which requires consume rate above produce rate.
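The catch-up arithmetic above can be written out directly. The rates are the hypothetical ones from the text; the 150 msg/sec "optimized" figure is an illustrative assumption.

```java
public class CatchUp {
    // Seconds until lag drains, or infinity if consumers can't outpace producers.
    static double catchUpSeconds(long lag, double produceRate, double consumeRate) {
        if (consumeRate <= produceRate) return Double.POSITIVE_INFINITY;
        return lag / (consumeRate - produceRate);
    }

    public static void main(String[] args) {
        // 8 consumers x 100 msg/sec = 800 msg/sec vs 1,000 msg/sec produced:
        System.out.println(catchUpSeconds(800_000, 1_000, 800));   // Infinity: still falling behind
        // After optimizing to 150 msg/sec per consumer (1,200 total):
        System.out.println(catchUpSeconds(800_000, 1_000, 1_200)); // 4000.0 seconds (~67 min)
    }
}
```

The infinity case is the trap in this scenario: scaling to 8 consumers slows the bleeding but never drains the backlog on its own.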
Follow-up: If you add 8 consumers to 8 partitions but each consumer is now slower (80 msg/sec), what happens to lag?
Lag is spiking to 1M every night at 2 AM, then recovers. Diurnal pattern. What's happening?
Diurnal lag spike = a predictable, time-based pattern. Likely causes: 1) A batch job runs at 2 AM and floods the topic (producer spike the consumers can't absorb). 2) A scheduled restart/deployment: the consumer restarts at 2 AM and the rebalance pauses processing for 30-60s. 3) A backup job competes for disk I/O or network. 4) A batch workload triggered by users in another timezone. Diagnosis: check broker produce metrics around 2 AM, e.g. kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders. If the rate spikes to 50K msg/sec against a normal 10K, a batch job is the cause. Check deployment logs for a 2 AM restart, and check cron: crontab -l plus grep '02:0' /var/log/syslog. Solutions: 1) Smooth the batch job: spread its sends over an hour instead of 10 minutes to cut the peak rate. 2) Pre-scale consumers before 2 AM. 3) Raise producer buffering: buffer.memory (default 32MB) if short bursts must be absorbed producer-side. 4) Route batch messages to a separate topic and replay them into the main topic after the rush. Confirmation: if the producer's batch-size-avg metric (kafka.producer:type=producer-metrics) jumps at 2 AM, the batch job is confirmed. Alerting: page on-call whenever lag exceeds 1M, regardless of time of day.
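Solution 1 (smoothing) is pure arithmetic: backlog accumulates only while the produce rate exceeds consumer capacity. A sketch with the burst numbers used above:

```java
public class BatchSpike {
    // Backlog accumulated while a producer burst exceeds consumer capacity.
    static long backlog(double produceRate, double consumeRate, double seconds) {
        return (long) Math.max(0, (produceRate - consumeRate) * seconds);
    }

    public static void main(String[] args) {
        // Batch job bursts 50K msg/sec for 10 min; consumers handle 10K msg/sec:
        System.out.println(backlog(50_000, 10_000, 600)); // 24000000 message backlog
        // The same 30M messages spread over 1 hour is ~8,333 msg/sec,
        // below consumer capacity, so no backlog forms at all:
        System.out.println(backlog(30_000_000.0 / 3600, 10_000, 3600)); // 0
    }
}
```

This is why spreading the batch job's sends is usually cheaper than provisioning consumers for a 10-minute peak.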
Follow-up: If your scheduled consumer restart at 2 AM takes 5 minutes, how much lag accumulates?
One consumer in your group lags significantly (lag=500K) but the other 2 consumers are healthy (lag=5K each). Partitions aren't evenly distributed. How do you fix it?
One lagging consumer = a partition hot-spot, an uneven assignment, or one slow consumer. Verify assignment: kafka-consumer-groups --group orders-group --describe --members --verbose shows which partitions each member owns. With 8 partitions and 3 consumers the split is 3/3/2, so one consumer carries more load even when traffic is even. Solutions: 1) Check for skewed partition traffic: compare per-partition end offsets sampled a few minutes apart, e.g. with kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic orders --time -1 (newer releases ship this as kafka-get-offsets). If one partition's offset advances 10x faster than the rest, that's the hot partition. 2) Use CooperativeStickyAssignor (partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor) so rebalances keep assignments even while moving as few partitions as possible. 3) Check whether the lagging consumer itself is slow (GC, high CPU): compare records-consumed-rate (kafka.consumer:type=consumer-fetch-manager-metrics, per client-id) across consumers. A gap above ~20% points at the host, not the data. 4) If one partition carries most of the traffic, the durable fix is re-keying the producer so writes spread across partitions; adding partitions alone won't split an existing hot key. A temporary mitigation is a dedicated consumer for the hot partition. Monitor: alert when the spread of lag across consumers is large relative to the mean (e.g. standard deviation above 50% of the mean).
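The skew alert at the end can be expressed as a coefficient-of-variation check over per-consumer lag. The 0.5 threshold is the illustrative figure from the text, not a Kafka default.

```java
public class LagSkew {
    // Flag a group as skewed when the spread of per-consumer lag is large
    // relative to the mean lag (std dev / mean > 0.5 here).
    static boolean skewed(long[] lags) {
        double mean = 0;
        for (long l : lags) mean += l;
        mean /= lags.length;
        double var = 0;
        for (long l : lags) var += (l - mean) * (l - mean);
        double stdDev = Math.sqrt(var / lags.length);
        return stdDev / mean > 0.5;
    }

    public static void main(String[] args) {
        // The scenario above: one consumer at 500K, two at 5K.
        System.out.println(skewed(new long[]{500_000, 5_000, 5_000})); // true: hot-spot
        System.out.println(skewed(new long[]{5_000, 5_200, 4_800}));   // false: balanced
    }
}
```

A relative threshold beats an absolute one here: lag of 500K vs 5K is an emergency, while 500K vs 480K on every consumer is just a uniformly busy group.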
Follow-up: If a partition has 100x more messages but same consumer throughput, what's the lag delta?
You enable read_committed on a consumer to read only committed messages. Suddenly, lag jumps 3x even though producer rate didn't change. Why?
A read_committed consumer reads only up to the LSO (Last Stable Offset), which sits at or below the high watermark: it cannot advance past the first message of any still-open transaction. If producers use transactions and many are in flight, the consumer stalls behind them, and the stall shows up as lag even though the consumer is healthy. Example: a producer sends 1,000 messages in one transaction that takes 5 seconds to commit. The consumer reads up to the transaction's first message, then waits; those 1,000 messages look like lag until the commit lands, at which point the consumer reads them immediately. Causes: 1) Producer transactions are slow (heavy processing inside the transaction). 2) A producer crashed mid-transaction; the broker only aborts it after the producer's transaction.timeout.ms (default 60s, capped by the broker's transaction.max.timeout.ms, default 15 min). 3) High transaction concurrency, so there is almost always an open transaction ahead of the LSO. Diagnosis: run kafka-console-consumer --bootstrap-server localhost:9092 --topic orders --from-beginning --isolation-level read_committed and compare throughput against the same command with --isolation-level read_uncommitted; a large gap confirms the LSO is the bottleneck. Check producer logs for transaction timeouts and, if your client version exposes them, the producer's transaction commit latency metrics. Solutions: 1) Shorten transactions: do the heavy work before beginning the transaction. 2) If you don't need transactional reads, use read_uncommitted. 3) For legitimately long transactions, raise the producer's transaction.timeout.ms (within the broker's transaction.max.timeout.ms). 4) Batch: one transaction of 100 messages commits far less often than 100 single-message transactions.
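The LSO bookkeeping above reduces to simple offset arithmetic. The offsets below are illustrative, not from the scenario's cluster.

```java
public class TxnLag {
    // With isolation.level=read_committed, monitoring computes lag against
    // the high watermark, but the consumer can only read up to the LSO;
    // everything between LSO and the high watermark is in open transactions.
    static long apparentLag(long highWatermark, long consumed) {
        return highWatermark - consumed;
    }
    static long inFlight(long highWatermark, long lso) {
        return highWatermark - lso;
    }

    public static void main(String[] args) {
        long consumed = 950_000, lso = 960_000, hw = 1_060_000; // hypothetical offsets
        System.out.println(apparentLag(hw, consumed)); // 110000: what the lag dashboard shows
        System.out.println(inFlight(hw, lso));         // 100000: uncommitted, not truly behind
        // Only lso - consumed = 10,000 messages are actually readable right now.
    }
}
```

Comparing these two numbers is the quickest way to tell "consumer is behind" from "producers have large open transactions".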
Follow-up: If read_committed lag is 2M but read_uncommitted lag is 50K, where are the 1.95M messages?