Producer Batching, Compression & Acks
Your producer throughput is 10K msg/sec. You need 500K msg/sec. Your current config is acks=1, no batching. What 3 tuning knobs do you turn first?
1) Enable batching: set linger.ms=10 (default 0). The producer waits up to 10ms before sending, letting messages accumulate into batches. Instead of 10K individual requests/sec, it sends ~1K batched requests (10 msgs each). Throughput rises; latency rises by up to 10ms.
2) Increase batch size: set batch.size=32768 (32KB, default 16KB). With 100-byte messages, ~327 fit per batch (vs ~163 before). Each batch is one network round-trip; fewer trips means higher throughput.
3) Enable compression: set compression.type=snappy. Each batch is compressed (typically 3-10x smaller on the wire), so the same network link carries roughly 3x more data.
Combined, batching cuts the request count by two orders of magnitude and compression roughly triples effective bandwidth, but neither alone delivers 50x. A cheaper (riskier) lever: acks=0 (fire-and-forget). The producer doesn't wait for a broker ack, so throughput jumps immediately. Risk: if the broker dies, there is no guarantee the message was stored. Acceptable for non-critical data (logs, metrics); for critical data (orders), stick with acks=1 or acks=all.
A real 50x gain needs:
1) More producer threads/processes. Scale horizontally.
2) A larger broker cluster (more partitions).
3) OS tuning: raise the broker's socket.send.buffer.bytes to 256KB (default 102400 bytes, ~100KB).
4) JVM tuning: 4GB heap, G1GC for low pause times.
Benchmark first to establish a baseline: kafka-producer-perf-test --topic test --num-records 1000000 --record-size 1000 --throughput -1 --producer-props bootstrap.servers=localhost:9092
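The batching arithmetic above can be sanity-checked with a quick sketch (assumes 100-byte messages and ignores Kafka's per-record and per-batch framing overhead, so real counts are slightly lower):

```python
# Back-of-envelope batching math (illustrative; real batches carry framing
# overhead, so slightly fewer records fit than this suggests).
MSG_SIZE = 100            # assumed average message size in bytes
DEFAULT_BATCH = 16_384    # batch.size default (16KB)
TUNED_BATCH = 32_768      # batch.size after tuning (32KB)

per_batch_default = DEFAULT_BATCH // MSG_SIZE   # ~163 messages per batch
per_batch_tuned = TUNED_BATCH // MSG_SIZE       # ~327 messages per batch

# At the 500K msg/sec target, batching turns per-message requests
# into per-batch requests:
requests_per_sec = 500_000 / per_batch_tuned
print(per_batch_default, per_batch_tuned, round(requests_per_sec))
```

The point of the sketch: the request count, not the message count, is what batching attacks, which is why it compounds with compression instead of replacing it.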
Follow-up: If you set linger.ms=10 and batch.size=32KB, which limit is hit first?
You enable batching and compression. Latency increased from 5ms to 25ms (linger.ms=20, snappy compression adds 5ms). Can you reduce it back to 10ms while keeping throughput gains?
Yes. Tuning trade-offs:
1) Reduce linger.ms from 20 to 5. Batches have less time to fill, so throughput drops slightly, but latency improves.
2) Use a faster codec: switch compression.type from snappy to lz4. For a 32KB batch, assume snappy costs ~5ms and lz4 ~2ms (measure on your hardware). Downside: lz4's compression ratio can be somewhat lower (the figures assumed here: 4x vs snappy's 6x). Trade-off: lower latency, slightly more bytes on the wire.
3) Compress only large batches: if a batch is under ~16KB, skip compression (the overhead isn't worth it). Kafka has no native per-batch threshold; this requires a wrapper around the producer.
4) Use acks=1 instead of acks=all. The leader acks as soon as it appends the batch, without waiting for followers. Lower latency, lower durability (if the leader dies before replication, the message is lost). Acceptable for non-critical data.
Combining linger.ms=5 + compression.type=lz4 + acks=1 might land around ~15ms latency at ~200K msg/sec (vs the previous 25ms at 150K). Measure with kafka-producer-perf-test --topic test --num-records 100000 --record-size 1000 --throughput -1 --producer-props bootstrap.servers=localhost:9092 --print-metrics and inspect request-latency-avg and record-queue-time-avg. Pick the tuning that best matches your SLA.
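The latency side of that combination reduces to simple addition. A minimal sketch, where the per-batch compression costs are the assumed figures from above, not measured constants:

```python
# Rough added-latency model: linger time plus per-batch compression cost.
# The 5ms (snappy) and 2ms (lz4) figures are assumptions from the answer
# above; measure them on your own hardware before relying on this.
def added_latency_ms(linger_ms: int, compress_ms: int) -> int:
    return linger_ms + compress_ms

current = added_latency_ms(20, 5)  # linger.ms=20 + snappy
tuned = added_latency_ms(5, 2)     # linger.ms=5 + lz4
print(current, tuned)
```

Under these assumptions the tuned configuration lands comfortably inside the 10ms budget, which is why cutting linger.ms is the first knob to reach for.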
Follow-up: If batch.size=32KB but average message is 100KB, what happens?
You set acks=all and min.insync.replicas=2. Producer blocks for 100ms per message (waiting for 2 replicas to ack). How many msg/sec can you send with 10 threads?
Calculation: one thread sends 1 message every 100ms = 10 msg/sec. 10 threads in parallel = 10 × 10 = 100 msg/sec maximum, assuming each thread sends and blocks on the ack. To increase:
1) Reduce ack latency. With min.insync.replicas=2 on 3 brokers, a slow follower delays every ack. Check broker disk I/O and network latency; if one broker is lagging, upgrade its disk or reduce its load.
2) Batch with acks=all. A batch of 100 messages still waits ~100ms for its ack, but that is 100 messages per 100ms = 1,000 msg/sec per thread, so 10,000 msg/sec across 10 threads. Batching is the key lever.
3) Add threads: 100 threads × 10 msg/sec = 1,000 msg/sec even unbatched (scales linearly until the broker saturates).
4) Enable pipelining: set linger.ms=50, batch.size=32KB. New batches accumulate while earlier batches await acks, and the producer keeps multiple batches in flight (max.in.flight.requests.per.connection, default 5). That is 5 in-flight batches × 100 msgs per 100ms = 5,000 msg/sec per thread, ~50,000 msg/sec across 10 threads.
JMX metrics to monitor: the kafka.producer:type=producer-metrics request-latency-avg should be ~100ms; if it is higher, acks are slow. record-send-rate shows throughput; if it is far below expectations with 10 threads, the producer is bottlenecked (check network and broker).
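The scaling options above collapse into one formula: threads × ack round-trips per second × messages covered per round-trip. A minimal sketch, assuming a fixed 100ms ack and perfect overlap of in-flight requests:

```python
ACK_MS = 100  # assumed acks=all round-trip time per request

def msgs_per_sec(threads: int, batch: int = 1, in_flight: int = 1) -> float:
    # Each thread completes 1000/ACK_MS ack round-trips per second; with
    # pipelining, `in_flight` batches of `batch` messages overlap per trip.
    return threads * (1000 / ACK_MS) * batch * in_flight

print(msgs_per_sec(10))                          # unbatched, blocking
print(msgs_per_sec(10, batch=100))               # batching
print(msgs_per_sec(10, batch=100, in_flight=5))  # batching + pipelining
```

The model is idealized (it assumes the broker keeps up and in-flight requests overlap perfectly), but it shows why batching and pipelining multiply rather than add.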
Follow-up: If you increase max.in.flight.requests to 20, what's the new throughput?
You enable compression. Monitoring shows CPU on producer is at 50%, broker is 20%. Throughput is 200K msg/sec. If you disable compression, CPU on producer jumps to 90%, throughput drops to 80K msg/sec. Why?
Compression trades CPU for bandwidth. With compression: the producer spends 50% CPU compressing and pushes 200K msg/sec; the broker spends 20% CPU receiving. Without compression: the producer spends 90% CPU serializing and sending, yet throughput falls to 80K msg/sec, because the network became the bottleneck.
Interpretation:
1) The producer is CPU-saturated after disabling compression (90%).
2) The broker has plenty of spare CPU (20% used).
3) The producer-to-broker link is saturated: a ~1Gbps link carries ~125MB/sec. With compression, 200K msg/sec × 1KB avg = ~200MB/sec of raw data becomes ~66MB/sec after ~3x snappy compression, which fits the link. Without compression the same rate would need ~200MB/sec, which exceeds the link, so TCP backpressure throttles the producer down to ~80K msg/sec (~80MB/sec).
Recommendation: keep compression enabled. 50% producer CPU is an acceptable price for 2.5x the throughput. If producer CPU climbs past 90%, the producer can't compress fast enough: upgrade the CPU or switch to a faster codec (lz4, or zstd at a low level). Diagnose with jps -lm to find the producer process, top -p <pid> to monitor its CPU in real time, and iftop or nethogs to see bandwidth used.
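The link-saturation argument can be checked numerically. A sketch using the scenario's assumptions (1KB average message, ~125MB/sec usable link, ~3x snappy ratio):

```python
LINK_MB_S = 125          # ~1Gbps link, usable bandwidth (assumed)
MSG_KB = 1               # assumed average message size
SNAPPY_RATIO = 3.0       # assumed compression ratio

def wire_mb_per_sec(msgs_per_sec: int, ratio: float = 1.0) -> float:
    # Bytes that actually hit the network per second, after compression.
    return msgs_per_sec * MSG_KB / 1024 / ratio

compressed = wire_mb_per_sec(200_000, SNAPPY_RATIO)
uncompressed = wire_mb_per_sec(200_000)
print(round(compressed), compressed < LINK_MB_S)       # fits the link
print(round(uncompressed), uncompressed < LINK_MB_S)   # exceeds it
```

Running the numbers shows ~65MB/sec compressed (fits) versus ~195MB/sec uncompressed (exceeds ~125MB/sec), which is exactly why throughput collapsed when compression was disabled.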
Follow-up: If you switch to zstd compression, what's the trade-off vs snappy?
acks=0 means fire-and-forget. No acks from broker. Producer can't retry. What if broker dies after receiving but before replicating?
With acks=0 the producer considers a message "sent" the moment it leaves the socket; it never waits for a response. If the broker dies before the message is replicated (or even appended to its log), the message is lost and the producer has no way to know. That is the trade-off: maximum throughput, zero delivery guarantee.
Use cases: 1) logs/metrics, where losing an occasional message is acceptable; 2) analytics data already replicated elsewhere. Never use it for payments, orders, or anything critical.
Mitigations with acks=0 are weak:
1) Replication factor 3 only helps if the leader replicated before crashing, which acks=0 never confirms.
2) retries=3 only fires on local network errors; with acks=0 the broker never returns a failure to retry.
3) The real fix is acks=1: the producer waits until the leader has appended the message to its log. Note the leader appends to the page cache; fsync to disk happens separately, so acks=1 can still lose the message if the leader crashes before replication. Still far safer than acks=0, for only a few extra ms of latency.
Recommended: acks=1 as the production minimum; reserve acks=0 for non-critical data; acks=all (with min.insync.replicas) gives guaranteed durability at the highest latency.
Illustrative timeline with acks=0: T=0 producer.send() starts; T=1us the message hits the network; T=2us send() completes without waiting for a broker ack. No blocking, max speed; but if the broker dies before appending and replicating, the message is gone. With acks=1: T=0 send() starts; T=1us message on the network; T=2-50us the leader appends to its log; T=51us the leader sends the ack; T=52us send() completes. Roughly ~50us per message, so a single unbatched thread tops out around ~20K msg/sec, about 10x lower than acks=0.
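Converting those illustrative timelines into single-thread throughput ceilings (the microsecond figures are the timeline's assumptions, not measurements):

```python
def ceiling_msgs_per_sec(round_trip_us: float) -> int:
    # One unbatched, blocking send per round-trip; idealized upper bound.
    return round(1_000_000 / round_trip_us)

print(ceiling_msgs_per_sec(2))    # acks=0: send() completes in ~2us
print(ceiling_msgs_per_sec(52))   # acks=1: ~52us incl. leader append + ack
```

Under these assumptions acks=1 caps a single blocking thread near ~19K msg/sec, matching the ~20K figure above; batching is what recovers the gap without giving up the ack.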
Follow-up: If you have acks=1 and leader dies 1ms after acking, can replica take over without data loss?
You batch messages and compress. After compression, batch is 32KB. Broker receives it, decompresses. Does broker store compressed or decompressed?
Broker stores compressed. Kafka writes log entries as received: a snappy-compressed 32KB batch is written to disk as 32KB. Consumers fetch the compressed batch and decompress it themselves. This is efficient end to end: disk I/O and fetch bandwidth both see the compressed size (32KB instead of a possible 128KB uncompressed). The only cost is decompression CPU on the consumer side.
Exceptions:
1) Re-compression: if the topic or broker compression.type is set to a specific codec (the default, "producer", keeps the producer's codec), the broker decompresses and re-compresses every batch, e.g. snappy in, lz4 stored. This adds latency and CPU and is usually avoided; leave the codec as-is.
2) Log compaction: the log cleaner must decompress batches to read keys, drop superseded duplicates, and re-compress the result. For compacted topics, monitor broker CPU during cleaning.
To verify the on-disk format: run kafka-dump-log --print-data-log --files /var/kafka/logs/test-0/00000000000000000000.log | head -20. The record payloads appear as raw (compressed) bytes. To inspect manually, extract the bytes and run them through a snappy decompressor (Python: `import snappy; snappy.decompress(data)`), keeping in mind that Kafka's record-batch framing surrounds the compressed payload.
Best practice: choose a compression codec once and stick with it; changing codecs mid-stream forces re-compression. Monitor the achieved ratio, e.g. via the producer's compression-rate-avg metric (the average compressed/uncompressed batch-size ratio): if compression saves less than ~1.2x, the overhead isn't worth it (turn it off); above ~3x it is clearly paying for itself (keep it enabled).
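The keep-or-disable heuristic above can be written as a tiny decision helper (the 1.2x and 3x thresholds are the rules of thumb from this answer, expressed as an uncompressed-to-compressed ratio):

```python
def compression_verdict(uncompressed_bytes: int, compressed_bytes: int) -> str:
    # Thresholds are the answer's rules of thumb, not Kafka constants.
    ratio = uncompressed_bytes / compressed_bytes
    if ratio < 1.2:
        return "disable"       # savings don't cover the CPU overhead
    if ratio > 3.0:
        return "excellent"     # clearly paying for itself
    return "keep"

print(compression_verdict(110, 100))   # ratio 1.1
print(compression_verdict(400, 100))   # ratio 4.0
```

Feed it byte counts from whatever you measure (producer metrics, or before/after sizes of sample batches); the verdict only makes sense over a representative sample of real payloads.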
Follow-up: If you enable log compaction on a snappy-compressed topic, which operation happens first: decompression or duplicate removal?