Kafka Interview Questions

Performance Tuning and Benchmarking

You need to push Kafka throughput from 100K msg/sec to 1M msg/sec (10x increase). Current setup: 10 brokers, 50 partitions, message size 1KB. Walk through tuning strategy, hardware scaling, and benchmarking methodology.

Bottleneck analysis: 100K msg/sec × 1KB messages = 100MB/sec total = 10MB/sec per broker across 10 brokers. That is well below what a single broker can sustain (100-500MB/sec each), so the limit is elsewhere: (1) network bandwidth, (2) disk I/O, or (3) suboptimal configuration.
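The per-broker arithmetic above can be sketched in a few lines (the helper name is illustrative, not a Kafka API):

```python
# Back-of-envelope load math for the scenario above (1KB messages, 10 brokers).

def per_broker_mb_per_sec(msg_per_sec, msg_size_bytes, brokers):
    """Total byte rate spread evenly across brokers, in MB/sec."""
    return msg_per_sec * msg_size_bytes / brokers / 1e6

current = per_broker_mb_per_sec(100_000, 1_000, 10)    # 10.0 MB/sec per broker
target = per_broker_mb_per_sec(1_000_000, 1_000, 10)   # 100.0 MB/sec per broker
print(current, target)
```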

Phase 1 - Software tuning (no new hardware): (1) Increase num.network.threads=12 (vs default 3) to allow more concurrent connections. (2) Leave num.io.threads at its default of 8, which is usually adequate. (3) Set socket.send.buffer.bytes=1048576 (1MB, vs the 102400-byte / 100KB default) for a larger TCP send window. (4) Set socket.receive.buffer.bytes=1048576 similarly. (5) Increase producer batch size: batch.size=1048576 (1MB, vs 16KB default). (6) Enable compression: compression.type=snappy, which roughly halves network load for text data. (7) Leave log flushing at its default (no forced flush interval); the OS page cache handles flushing efficiently, so don't set log.flush.interval.messages. Expected gain: 100K → 300K msg/sec (3x).
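The Phase 1 overrides can be collected in one place; a sketch using the standard Kafka property names from the text, with values as plain data (not a client library call):

```python
# Phase 1 tuning overrides. Keys are standard Kafka property names;
# values are the suggested settings from the text.

broker_overrides = {
    "num.network.threads": 12,                 # default 3: more request threads
    "socket.send.buffer.bytes": 1_048_576,     # 1MB, default 102400 (100KB)
    "socket.receive.buffer.bytes": 1_048_576,  # 1MB, default 102400 (100KB)
    # num.io.threads already defaults to 8 -- leave it alone
    # log.flush.* left at defaults: let the OS page cache manage flushing
}

producer_overrides = {
    "batch.size": 1_048_576,       # 1MB, default 16384 (16KB)
    "compression.type": "snappy",  # ~2x smaller payloads for text data
}
```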

Phase 2 - Hardware scaling: Add 10 more brokers (10 → 20 total), halving each broker's share of the load, and rebalance partitions from 50 → 100 so the new brokers take on leadership. Expected gain: 300K → 600K msg/sec (2x).

Phase 3 - Network infrastructure: Upgrade network from 1Gbps to 10Gbps. Upgrade broker NICs. Expected gain: 600K → 1M msg/sec (1.67x).

Benchmarking methodology: (1) Baseline: measure current throughput with kafka-producer-perf-test.sh --topic test --num-records 1000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092; (2) After each tuning step, re-run benchmark. Measure in isolated environment (no background traffic). (3) Run for 5 minutes minimum to get stable throughput. (4) Measure latency: p50, p95, p99. Track trade-off between throughput and latency.
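The per-run latency report described in step (4) can be computed with the standard library; a sketch, with dummy samples standing in for real measurements:

```python
import statistics

# Given per-message latencies (ms) collected during the measurement window,
# compute the p50/p95/p99 percentiles the methodology calls for.

def latency_report(samples_ms):
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

report = latency_report(list(range(1, 101)))  # 1..100 ms dummy samples
print(report)
```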

Key metrics: (1) Producer throughput: msg/sec. (2) Broker CPU: should be 60-80% (headroom for spikes). (3) Network utilization: should be <80%. (4) Disk I/O: should be <80% IOPS. (5) End-to-end latency: p99 should be <100ms for SLA.

Production example at LinkedIn: They scaled from 100K to 10M msg/sec via a similar approach: software tuning (~2x), hardware scaling (~50x), network upgrades (~2x), and compression (~2x). The factors overlap rather than multiplying cleanly: the net result is ~100x, achieved without changing the core architecture.

Follow-up: If you scale to 20 brokers and rebalance 50 → 100 partitions, during rebalancing (leader election, data movement), does client throughput drop? What's the recovery time?

You benchmark two setups: Setup A (sync replication, acks=all) vs Setup B (async replication, acks=1). Setup A: 100K msg/sec, p99=50ms. Setup B: 1M msg/sec, p99=5ms. Setup B crashes mid-benchmark, broker dies and doesn't recover. Why did Setup B lose data?

Root cause difference: Setup A (acks=all): the producer waits for all in-sync replicas (ISR) to acknowledge the write before returning. If the leader crashes immediately afterwards, at least one other replica still has the data (assuming ISR >= 2): no data loss. Setup B (acks=1): the producer waits only for the leader to acknowledge, not the followers. If the leader crashes before replicating to followers, the data is lost (it existed only on the crashed broker).

Throughput vs durability trade-off: (1) acks=0: fire-and-forget, ~10M msg/sec, no durability guarantee (highest data-loss risk). (2) acks=1 (leader only): ~1M msg/sec, low durability (data loss if the leader crashes before followers catch up). (3) acks=all alone: ~100K msg/sec, but the guarantee degrades if the ISR shrinks to just the leader. (4) acks=all + min.insync.replicas=2: ~100K msg/sec, highest durability: every acknowledged write exists on at least 2 replicas, so a single broker failure loses nothing.

Production choice: Use acks=all with min.insync.replicas=2 for critical data (payments, audit logs). Use acks=1 for non-critical data (metrics, logs). This balances throughput and durability.
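The two production profiles can be written out explicitly; a sketch using standard Kafka property names, with illustrative values (replication.factor and min.insync.replicas are topic-level settings, not producer settings):

```python
# Producer profile for critical data (payments, audit logs).
critical_producer = {
    "acks": "all",                # wait for all in-sync replicas
    "enable.idempotence": True,   # avoid duplicates on retry
}
critical_topic = {
    "replication.factor": 3,
    "min.insync.replicas": 2,     # survive one broker failure
}

# Producer profile for non-critical data (metrics, logs).
best_effort_producer = {
    "acks": "1",                  # leader-only ack: faster, less durable
}
```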

Benchmarking lesson: When comparing setups, control for durability. If Setup A uses acks=all and Setup B uses acks=1, they're not equivalent. Throughput difference is partly due to durability, not purely performance.

Fair comparison: (1) Setup A (acks=all, min.insync.replicas=2): 100K msg/sec. (2) Setup B (acks=1, no replication): 500K msg/sec. (3) Setup C (acks=all, min.insync.replicas=2, batching optimized): 150K msg/sec. Conclusion: optimize within your durability tier, don't sacrifice durability for speed.

Follow-up: If you need 1M msg/sec throughput with acks=all durability, what architectural changes are needed? Is it even possible?

You benchmark consumer performance. Consumer lag is 5 minutes (processing is slow). You increase num.threads from 1 to 4. Lag drops to 1 minute. You increase to 8 threads. Lag still 1 minute (not improving). Why?

Consumer threading model: KafkaConsumer is not thread-safe and is designed to run its poll loop on a single thread, so a num.threads setting controls application-level processing threads, not the consumer itself. Adding threads helps only while the work actually parallelizes: for CPU-bound processing, until cores run out; for I/O-bound processing (network calls, database queries), only until the downstream I/O saturates.

Lag analysis (toy numbers): (1) 1 thread processes 1 msg/sec; a 5-minute lag ≈ 300-message backlog. (2) 4 threads process 4 msg/sec in parallel; lag drops to ~1 minute (≈ 60 messages). (3) 8 threads still process ~4 msg/sec because the I/O bottleneck has been reached; lag stops improving.

Root cause: Likely downstream dependency (database, API) is bottleneck. With 1 thread: serial calls to DB, high latency. With 4 threads: parallel calls to DB, DB becomes saturated. With 8 threads: still saturated (you're now limited by DB throughput, not Kafka consumer).
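The plateau can be modeled in a few lines (the DB capacity and per-thread rate are assumed numbers matching the lag analysis above):

```python
# Toy model: per-thread throughput scales linearly until a shared
# downstream dependency (the DB) hits its fixed capacity.

DB_CAPACITY = 4.0    # msg/sec the downstream can absorb (assumption)
PER_MSG_SEC = 1.0    # seconds of work per message on one thread

def throughput(threads):
    return min(threads / PER_MSG_SEC, DB_CAPACITY)

print(throughput(1))  # 1.0
print(throughput(4))  # 4.0
print(throughput(8))  # 4.0 -> plateau: limited by the DB, not by threads
```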

Optimization: (1) Profile the consumer: log per-message processing time; if it exceeds ~100ms, the cost is likely I/O. (2) Check the downstream dependency: if you call an external API, measure its latency; if the API is slow, no amount of threading helps. (3) Add connection pooling and batching: reuse connections, and batch 10 messages into a single call instead of one call per message, cutting round-trips 10x. (4) Use async I/O: replace blocking DB queries with async drivers and callbacks. (5) Scale the downstream: if the DB is the bottleneck, add read replicas or a cache layer.
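The effect of batching in step (3) is easy to demonstrate; a toy sketch where `call_api` stands in for the real downstream dependency:

```python
# Amortizing a slow downstream call by batching: one round-trip serves
# a whole batch instead of a single message.

api_calls = 0

def call_api(batch):
    """Stand-in for the real API: one round-trip regardless of batch size."""
    global api_calls
    api_calls += 1

messages = list(range(1000))

# Per-message calls: 1000 round-trips.
for m in messages:
    call_api([m])
per_message_calls = api_calls

# Batched calls, 10 messages per request: 100 round-trips.
api_calls = 0
for i in range(0, len(messages), 10):
    call_api(messages[i:i + 10])
batched_calls = api_calls

print(per_message_calls, batched_calls)  # 1000 100
```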

Production pattern: Consumer lag is often not a Kafka problem, but an application problem. Before scaling Kafka, profile and scale the bottleneck (usually processing logic or external dependencies).

Follow-up: If you add connection pooling and reduce per-message latency from 100ms to 10ms, can you now achieve 1000 msg/sec with 8 threads?

You run Kafka broker benchmark on a single machine (8-core CPU, 64GB RAM, 2TB SSD). Throughput plateaus at 500K msg/sec. You add a second broker on same machine (2x brokers, 4-core CPU each). Throughput is still 500K msg/sec total. Why didn't it double?

Resource bottleneck: a single 8-core CPU shared by 2 brokers leaves 4 physical cores each (8 logical cores each with hyperthreading). Network bandwidth is shared: 2 brokers behind one NIC get half the bandwidth each. Disk I/O is shared: an SSD has a fixed IOPS budget (e.g., ~100K IOPS for write-heavy workloads), split between 2 brokers = ~50K IOPS each.

Root cause likely: Network or disk I/O is bottleneck, not CPU. Adding brokers without adding network/disk doesn't scale.

Verification: (1) Monitor CPU: if CPU is 100% on one core, CPU bottleneck (thread contention). If CPU is 40%, not bottleneck. (2) Monitor network: if NIC throughput is 1Gbps, network is bottleneck (add faster NIC). (3) Monitor disk: if disk IOPS are at max, disk is bottleneck (add faster storage or RAID).

Scaling strategy: (1) If CPU bottleneck: add more cores or machines. (2) If network bottleneck: upgrade NIC from 1Gbps to 10Gbps. (3) If disk bottleneck: upgrade from HDD to SSD, or add RAID 0 for striping. (4) Avoid co-locating multiple brokers on same machine (testing only). Production: 1 broker per machine for resource isolation.

Production topology: Kafka cluster should span multiple physical machines on different network switches, different storage arrays. This avoids single point of failure and resource contention.

Follow-up: If you have 8-core CPU and each broker thread uses 1 core, can you run 8 brokers on 1 machine? What's the trade-off in terms of SPOF (single point of failure)?

You benchmark Kafka on two different hardware setups: Setup A (general-purpose VM, SSD, 10Gbps network), Setup B (local NVMe, 40Gbps network). Setup B is 5x faster (500K → 2.5M msg/sec). Benchmark results vary by 20% on repeated runs. How do you reduce variance and ensure reproducible results?

Sources of variance: (1) OS noise (garbage collection, kernel scheduling, other processes competing for CPU). (2) Network jitter (latency variability, packet loss, retransmits). (3) Disk cache effects (first run cold cache, second run warm cache). (4) Warm-up time (JVM JIT compilation takes ~30 seconds).

Techniques to reduce variance: (1) Warm-up phase: Run benchmark for 5 minutes before measuring. This allows JVM to JIT-compile hot paths, caches to warm up, Linux disk cache to stabilize. Then run 10-minute measurement phase. Discard first 2 minutes of measurement. Report average of last 8 minutes. (2) Disable CPU frequency scaling: Set CPU governor to performance mode: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor. Prevents CPU from scaling down and causing latency jitter. (3) Flush disk cache before runs: sync; echo 3 > /proc/sys/vm/drop_caches. Ensures consistent starting state. (4) Isolate test machine: Shut down non-essential services (logging, monitoring, SSH if possible). Prevent background tasks from interfering. (5) Use high-resolution clock: Java benchmark should use System.nanoTime() (not System.currentTimeMillis() which has low resolution on some systems).
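The warm-up-and-discard protocol from technique (1) can be captured in a small harness; a sketch with synthetic per-second samples (real runs would feed in measured throughput):

```python
import statistics

# Discard the warm-up portion of a run, then report mean, stdev, and
# relative stdev (re-run with better isolation if the latter exceeds ~10%).

def summarize(samples_per_sec, discard_secs=120):
    kept = samples_per_sec[discard_secs:]
    mean = statistics.mean(kept)
    stdev = statistics.stdev(kept)
    return mean, stdev, stdev / mean

# 600s run: noisy first 2 minutes (JIT, cold caches), stable afterwards.
samples = [300_000] * 120 + [500_000] * 480
mean, stdev, rel = summarize(samples)
print(mean, rel)
```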

Benchmark reporting: Report mean ± std dev, not just the mean. Example: "500K msg/sec ± 5% (range 475K to 525K)". If the std dev exceeds ~10% of the mean, variance is too high: re-run with better isolation.

Production example at Confluent: They run benchmarks on isolated machines, warm-up 5 min, measure 10 min, repeat 5 times, report median ± percentiles (p5 to p95). This captures variance across runs.

Follow-up: If Setup A has 20% variance and Setup B has 5% variance, does that mean Setup B is more stable? Or is it just lower noise due to higher throughput?

You benchmark end-to-end latency: producer → broker → consumer. At 100K msg/sec, p99 latency is 50ms. At 500K msg/sec, p99 latency is 500ms. At 1M msg/sec, the benchmark crashes (out of memory). Diagnose and optimize.

Latency degradation: As throughput rises, queue depths grow. At 100K msg/sec the broker queues are shallow (roughly one message waiting), so queueing adds little to the ~50ms p99. At 500K msg/sec the queues are ~10x deeper, so each message waits behind ~10 others and p99 grows roughly 10x, to 500ms. This is expected behavior as a system approaches saturation.

Root cause of OOM at 1M msg/sec: (1) The producer generates messages faster than the broker can absorb them (1M msg/sec vs a ~500K msg/sec broker limit), so ~500K messages accumulate every second. At 1KB each that is ~500MB of new backlog per second; within 1-2 seconds the client's buffers and heap (often ~1GB by default) fill and the benchmark dies. (Note: the Java producer's own buffer.memory defaults to 32MB and blocks when full, so the OOM typically comes from the benchmark harness queueing records on top of it, or from an enlarged buffer.) (2) The broker, meanwhile, GCs continuously trying to keep up with 1M msg/sec; long stop-the-world pauses can push it to OOM as well.
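The backlog arithmetic behind the OOM, as a quick sketch (rates and heap size taken from the scenario above):

```python
# Backlog growth when the produce rate exceeds the broker's sustainable rate.

produce_rate = 1_000_000   # msg/sec attempted
broker_rate = 500_000      # msg/sec sustainable
msg_bytes = 1_000          # ~1KB messages

backlog_mb_per_sec = (produce_rate - broker_rate) * msg_bytes / 1e6
print(backlog_mb_per_sec)  # 500.0 MB of new backlog every second

heap_mb = 1_024            # ~1GB default-ish heap
seconds_to_fill = heap_mb / backlog_mb_per_sec
print(round(seconds_to_fill, 1))  # 2.0 -> heap exhausted in ~2 seconds
```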

Optimization 1 - Throttle the producer: max.in.flight.requests.per.connection=1 caps pipelining, but the cleaner controls are the perf test's --throughput flag (send at a fixed rate), the producer's bounded buffer.memory (send() blocks for up to max.block.ms when the buffer is full, providing natural backpressure), or broker-side quotas (producer_byte_rate). Any of these prevents unbounded memory accumulation. Trade-off: producer-side latency rises while it waits.

Optimization 2 - Increase broker throughput: Broker can only handle 500K msg/sec on current hardware. To benchmark at 1M msg/sec, scale broker: more CPU, faster network, faster disk. Then re-benchmark.

Optimization 3 - Tune JVM heap: Increase heap: -Xmx8G (vs default 1G). More memory for buffer. But this just delays OOM. If producer tries to produce 2M msg/sec, even 8GB heap will fill.

Correct approach: Benchmark should never exceed broker's sustainable throughput. If broker can sustain 500K msg/sec, benchmark at that. Observe that latency grows under load (expected). p99 latency at 500K/sec might be acceptable SLA (e.g., <500ms). If not, scale broker.

Production lesson: High throughput tests should be capacity tests: find the maximum sustainable throughput where latency is acceptable. Don't just push until it crashes.

Follow-up: If your SLA requires p99 latency <100ms at 1M msg/sec, is it achievable on commodity hardware? What would the cluster look like?

You benchmark Kafka with message sizes 1KB, 10KB, 100KB, 1MB. Throughput (msg/sec) decreases: 500K, 50K, 5K, 500. But bandwidth (MB/sec) stays constant at ~500MB/sec. Explain and optimize for large messages.

Throughput vs bandwidth: Throughput counts messages; bandwidth counts bytes. If message size grows while bandwidth stays pinned at ~500MB/sec, the byte pipe (network or disk) is saturated and msg/sec falls in inverse proportion to size. Example: 500K × 1KB = 500MB/sec; 50K × 10KB = 500MB/sec. Same bandwidth, different throughput.
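If the ~500MB/sec pipe is the binding constraint, expected msg/sec is just bandwidth divided by message size; a sketch that ignores per-message overhead:

```python
# Message rate under a fixed byte budget: rate = bandwidth / message size.

BANDWIDTH_BYTES = 500_000_000   # ~500 MB/sec

rates = {size: int(BANDWIDTH_BYTES / size)
         for size in (1_000, 10_000, 100_000, 1_000_000)}  # 1KB .. 1MB
print(rates)
```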

Root cause: Kafka's per-message overhead (headers, checksums, batching bookkeeping) is roughly fixed per message. For small messages (1KB) it is a meaningful fraction (~10%) of each message, so small-message throughput is overhead-limited. For large messages (1MB) the overhead is negligible (~0.1%), and throughput is limited purely by bandwidth: to move more large messages per second, you must move more bytes per second.

Optimization for large messages: (1) Increase batch size: batching amortizes per-request overhead. For 1MB messages, set batch.size=10MB so ~10 large messages ship per request (also raise max.request.size and the broker's message.max.bytes accordingly). Estimated throughput: 500 → 1000 msg/sec (2x). (2) Enable compression: large text payloads compress well; compression.type=snappy shrinks them 2-3x, raising effective bandwidth from 500MB/sec toward ~1500MB/sec and throughput toward ~1500 msg/sec. (3) Allow batching time: set linger.ms=50 (vs default 0) so the producer waits up to 50ms to fill a batch, increasing the batching ratio. (4) Tune socket buffers: for large messages, set socket.send.buffer.bytes=4MB so the broker pushes bigger chunks per write, reducing per-call overhead.

Expected gains: Batching (2x) × compression (3x) × buffer tuning (1.5x) ≈ 9x throughput improvement: 500 msg/sec → ~4500 msg/sec with 1MB messages. Logical (pre-compression) bandwidth rises to ~4500MB/sec; the compressed wire traffic is lower, but provisioning a ~40Gbps network leaves the needed headroom.
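The compounding of the three estimated multipliers above (illustrative numbers from the text, not measured results):

```python
# Multiply the independent estimated gains to get the combined speedup.

gains = {"batching": 2.0, "compression": 3.0, "buffer_tuning": 1.5}

total = 1.0
for g in gains.values():
    total *= g   # 2 * 3 * 1.5

print(total)  # 9.0 -> a 9x estimated improvement over baseline
```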

Production trade-off: Large-message optimization trades latency for throughput (waiting up to 50ms to fill a batch). For real-time use cases that may be unacceptable; reserve it for batch/archival workloads.

Follow-up: If you need 500K msg/sec throughput with 1MB messages (500GB/sec bandwidth), is that realistic on commodity hardware? What network/storage would you need?
