Kafka Interview Questions

Partition Strategy and Scaling Limits


Partition Count & Scalability Limits

Your orders topic has 10 partitions. You're hitting 100K msg/sec and need 500K. You decide to scale to 100 partitions. Walk through the risks and mitigation.

Risks of scaling 10→100 partitions:
1) Key remapping: the default partitioner maps keys with hash(key) % numPartitions, so changing the partition count changes which partition each key lands on. Per-key ordering is broken across the transition.
2) Rebalancing impact: consumer groups on the topic rebalance when the new partitions appear (lag spike, brief pause in consumption).
3) Memory overhead: each partition carries metadata and per-replica state (leader epoch, ISR tracking, index files). 100 partitions × 3 replicas is roughly 300 replica logs to track; manageable on its own, but it adds up at scale.
4) Recovery time: if a broker dies, its share of those replicas must be re-elected and re-replicated from other brokers (I/O intensive).
5) Consumer coordination: 100 partitions give full parallelism only with up to 100 consumers; fewer consumers each handle multiple partitions.
6) Controller load: leadership for 100 partitions must be tracked and re-elected on failure, adding latency to failure handling.

Mitigation:
1) Add the partitions with kafka-topics.sh --alter --topic orders --partitions 100. This does not move existing data; new partitions start empty.
2) If replica placement also needs rebalancing, generate a plan: kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --topics-to-move-json-file topics.json --broker-list 0,1,2,3,4 --generate.
3) Throttle the reassignment: --throttle 10485760 (10 MB/s) to avoid saturating the network.
4) Execute during a low-traffic window (e.g., 3-6 AM) and monitor with kafka-reassign-partitions.sh --verify plus the broker metric BytesInPerSec.
5) Scale gradually: don't jump 10→100; go 10→30→60→100, letting each step (typically a few hours of reassignment plus stabilization) settle.
6) Expand the consumer group: if you had 5 consumers, scale toward 100 (up to 1 per partition). This requires a config/code change and redeployment.

Best practice: plan partition count upfront. The count can only grow; Kafka does not support reducing partitions without recreating the topic.
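The throttle trade-off above (slower reassignment in exchange for a protected network) can be estimated up front. A minimal sketch; all numbers are illustrative inputs, nothing here queries a real cluster:

```java
// Rough estimate of how long a throttled reassignment takes:
// time ≈ bytes that must move / throttle rate.
public class ReassignEstimate {
    // Estimated seconds to move `moveGiB` of replica data at a
    // throttle of `throttleMiBPerSec` (the value behind --throttle).
    static long estimateSeconds(double moveGiB, double throttleMiBPerSec) {
        double moveMiB = moveGiB * 1024.0;
        return Math.round(moveMiB / throttleMiBPerSec);
    }
}
```

For example, moving 500 GiB of replicas at a 10 MiB/s throttle takes on the order of 14 hours, which is why the answer above recommends stepping 10→30→60→100 rather than one big jump.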

Follow-up: If reassignment times out (stuck at 50%), how do you resume or rollback?

You have 100 partitions but only 3 producers. Each producer sends to random partition (round-robin). Is this optimal?

Not optimal, though not for the reason you might expect. With round-robin, each producer cycles through all 100 partitions, so traffic is actually spread evenly; the real problems are:
1) Batching efficiency: a producer builds a separate batch per partition. Spread thinly across 100 partitions, batches stay small and linger.ms expires before they fill, meaning more requests for the same throughput.
2) No ordering: related messages (e.g., all events for one customer) land on different partitions, so consumers cannot rely on their order.

Better: key messages so related records share a partition. If all messages for customer X use customer_id as the key, you get:
1) Sequential ordering per customer.
2) Denser batches (linger.ms accumulates records destined for the same partition).
3) Simpler downstream aggregation (fewer cross-partition joins).
To implement: new ProducerRecord<>(topic, customerId, value). Kafka hashes the key (MurmurHash2) to pick the partition, so all messages for customerId=123 go to the same partition. Optional: a custom partitioner to control the mapping. Example: public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) { return (key.hashCode() & 0x7fffffff) % cluster.partitionsForTopic(topic).size(); }.

Downside: hot keys. If one customer sends 50K msg/sec while others send 1K each, that customer's partition (and its leader broker) is overloaded. Mitigations:
1) Shard hot keys: e.g., key = customerId + "-" + (n % 10), accepting per-shard rather than per-customer ordering.
2) Monitor: alert if per-partition lag variance exceeds ~50% (some partitions far behind).
3) Rebalance: split a hot customer across composite keys (e.g., customerId + batch_id).
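The hot-key sharding idea can be sketched as a small key-building helper. The shard count and key format here are illustrative assumptions, not a Kafka API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: spread a hot customer's messages across N sub-keys so they
// hash to up to N partitions instead of one. Ordering is then only
// guaranteed per sub-key, not per customer — an explicit trade-off.
public class ShardedKey {
    // Random shard: maximal spread, no ordering guarantee per customer.
    static String shard(String customerId, int shards) {
        int shard = ThreadLocalRandom.current().nextInt(shards);
        return customerId + "-" + shard;
    }

    // Deterministic variant: derive the shard from a secondary field
    // (e.g., an order id) so all events of one order keep one key,
    // preserving per-order ordering within the customer's shards.
    static String shard(String customerId, String orderId, int shards) {
        int shard = (orderId.hashCode() & 0x7fffffff) % shards;
        return customerId + "-" + shard;
    }
}
```

The resulting string is what you would pass as the ProducerRecord key in place of the bare customerId.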

Follow-up: If you have 100 partitions with 1 key per partition, and need to handle 100x scale, what happens?

Kafka has practical limits on partitions per broker; suppose your operational ceiling is 1M. You're at 500K partitions across 5 brokers (100K per broker). What happens as you approach the limit?

As partition count climbs toward the ceiling:
1) Metadata overhead: the broker holds all partition metadata in memory. 500K partitions means hundreds of MB of metadata (more under ZooKeeper than KRaft), and JVM heap grows with it.
2) Startup time: a broker restart must open and recover every local log, so restarts take minutes.
3) Controller latency: elections and leadership changes touch more partitions, so failover and metadata propagation slow down.
4) Log cleaner: with compaction enabled, the cleaner must cycle through 500K logs (slow, high I/O).
5) Replication fan-out: follower fetch sessions, ISR tracking, and re-replication after a failure all multiply, congesting the network and request handlers.
There is no single hard limit where the broker refuses new partitions; instead these symptoms compound until topic creation and metadata requests start timing out. This is why running hundreds of thousands of partitions per broker is not recommended.

To scale further:
1) Multi-cluster: split data across Kafka clusters (cluster A: 500K partitions, cluster B: 500K partitions) and route topics to the appropriate cluster.
2) Reduce per-broker partition count: add brokers. With 10 brokers, 1M partitions is 100K per broker, doubling the headroom.
3) Topic sharding: instead of 1 topic with 100K partitions, use 10 topics with 10K partitions each. Producers distribute across topics; consumers subscribe to all 10.

JMX metric to monitor: kafka.server:type=ReplicaManager,name=PartitionCount (replicas hosted per broker). Alert well below the ceiling (e.g., at 400K) to leave room to grow.

To future-proof, set partition budgets: e.g., 10 topics × 1000 partitions × 3 replicas = 30K replicas, about 6K per broker on a 5-broker cluster. Plan for ~10x growth before a redesign is needed.
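The budget arithmetic above can be wrapped in a quick check. A sketch where every number is an input you choose; it does not query a real cluster:

```java
// Sketch: per-broker replica budgeting for partition-count planning.
public class PartitionBudget {
    // Replicas each broker hosts if replicas are spread evenly.
    static int replicasPerBroker(int topics, int partitionsPerTopic,
                                 int replicationFactor, int brokers) {
        long totalReplicas = (long) topics * partitionsPerTopic * replicationFactor;
        return (int) Math.ceil(totalReplicas / (double) brokers);
    }

    // Brokers required to stay under a chosen per-broker ceiling.
    static int brokersNeeded(long totalReplicas, int perBrokerCeiling) {
        return (int) Math.ceil(totalReplicas / (double) perBrokerCeiling);
    }
}
```

With the example budget in the answer (10 topics × 1000 partitions × 3 replicas on 5 brokers), this yields 6K replicas per broker, leaving ample room under a 100K-scale ceiling.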

Follow-up: If you're at 900K partitions and need to add 200K, what's your recovery plan?

You have 1000 partitions across 3 brokers. Network bandwidth is 1Gbps per broker. At what message size do you hit network saturation?

Calculation: 1 Gbps = 125 MB/sec per broker, 375 MB/sec aggregate across 3 brokers. Assuming replication factor 3, every produced byte is written on 3 brokers, so effective produce capacity is roughly 375 / 3 = 125 MB/sec for the cluster. Spread over 1000 partitions, each partition's fair share is 125 MB / 1000 = 125 KB/sec.

At 1KB average message size, that is ~125 msg/sec per partition. A single hot partition at 1000 msg/sec (1 MB/sec) uses 8x its fair share but still only ~0.8% of its leader's 125 MB/sec link, which the network absorbs easily. Ten such hot partitions total 10 MB/sec (~8% of one link); the network still copes, but broker CPU for serialization, compression, and replication may become the bottleneck first.

Larger messages at the same 125 KB/sec per-partition budget:
1KB messages: ~125 msg/sec per partition.
10KB messages: ~12.5 msg/sec per partition (10x lower throughput).
100KB messages: ~1.25 msg/sec per partition.
To sustain 125 msg/sec of 100KB messages you would need 12.5 MB/sec per partition, i.e. 12.5 GB/sec across 1000 partitions, far beyond 1 Gbps links. Remedies, usually combined:
1) Compress (snappy often gives ~3x on text payloads), cutting the example to ~4 GB/sec (still not enough on its own).
2) Upgrade the network (10 or 25 Gbps NICs).
3) Shrink messages: batch/aggregate records, or store large payloads externally and send references.
4) Add brokers to spread partitions and aggregate more bandwidth.

Monitor kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and BytesOutPerSec per broker to measure actual throughput; alert as either approaches the 125 MB/sec link capacity.
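The bandwidth arithmetic can be captured in two helpers. A sketch using the decimal convention above (1 Gbps ≈ 125 MB/s, 1 MB = 1000 KB), with replication counted against broker bandwidth; the even-traffic assumption is the same simplification the answer makes:

```java
// Sketch: back-of-envelope partition bandwidth budgets.
public class BandwidthBudget {
    // Effective cluster produce capacity in MB/s: each produced byte
    // is written on `rf` brokers, so aggregate NIC bandwidth / rf.
    static double clusterIngestMBps(int brokers, double nicMBps, int rf) {
        return brokers * nicMBps / rf;
    }

    // Per-partition msg/sec budget for a given message size in KB,
    // assuming traffic splits evenly across partitions.
    static double msgsPerSecPerPartition(double clusterMBps, int partitions,
                                         double msgKB) {
        double partitionKBps = clusterMBps * 1000.0 / partitions;
        return partitionKBps / msgKB;
    }
}
```

Plugging in the question's numbers (3 brokers, 125 MB/s NICs, RF=3, 1000 partitions, 1KB messages) reproduces the ~125 msg/sec per-partition figure derived above.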

Follow-up: If you have 1000 partitions and 1 broker dies, how long to recover 1000 partitions?

You partition by timestamp: partition 0 = 2024 Jan, partition 1 = 2024 Feb, etc. Over 5 years, you have 60 partitions. Latest partition (Dec 2028) has 90% of traffic. Older partitions have 10% of traffic. Is this scalable?

Not scalable. This is a partition hot-spot: the latest partition takes nearly all writes while the rest sit idle, a classic anti-pattern with time-based partitioning. With 60 partitions and 60 consumers, the consumer on the latest partition handles ~9K msg/sec (90% of a 10K total) while the other 59 consumers share the remaining 1K (~17 msg/sec each, mostly idle). Per-consumer throughput is badly skewed.

Better design: partition by a business key (customer_id, order_id), not by time; time keys funnel all current writes into one partition by construction. If you must keep time in the scheme:
1) Composite key: hash (customer_id, timestamp) to a partition so recent data spreads across many partitions rather than landing in the latest one.
2) Time-bucket sharding: use 10 partitions per month instead of 1 (60 months × 10 = 600 partitions); the current month's ~9K msg/sec splits into ~900 msg/sec per partition. Better distribution.
3) Monitor and scale: track per-partition lag; if the current bucket falls behind, add partitions (kafka-topics.sh --alter). Note that new partitions receive only new data; existing data is not redistributed.
4) Retention: retention.ms ages out old segments, so stale partitions shrink toward empty, but the partition count itself never decreases and consumers are still assigned to them.

To detect the hot-spot: compare per-partition sizes via JMX kafka.log:type=Log,name=Size,topic=*,partition=*. If the latest partition is, say, 2x the median, the hot-spot is confirmed; alert and plan a redesign.
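The detection rule above (latest partition much larger than its peers) can be sketched as a median-ratio check. The sizes would come from the JMX Log Size metric; here they are plain inputs, and the 2x threshold is the illustrative value from the answer:

```java
import java.util.Arrays;

// Sketch: flag a hot partition when its log size (or lag) exceeds a
// chosen multiple of the median across all partitions.
public class HotSpotCheck {
    static boolean isHot(long[] partitionSizes, double ratio) {
        long[] sorted = partitionSizes.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        long max = sorted[sorted.length - 1];
        return median > 0 && max > ratio * median;
    }
}
```

Comparing against the median rather than the mean keeps one extreme partition from masking itself by dragging the baseline up.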

Follow-up: If you have 1000 partitions and want to merge old partitions into 1 archive partition, is this possible?

You scale from 10 to 1000 partitions. Consumer code fetches max.poll.records=500 per poll (1000 partitions × 500 = 500K records in memory). Does this crash?

No, not for the stated reason: max.poll.records=500 caps the total records returned by one poll() call, not records per partition, so a poll never hands you 500K records. The real memory pressure comes from fetch buffering: the consumer prefetches up to max.partition.fetch.bytes (default 1MB) per assigned partition, which with 1000 partitions suggests up to ~1GB of buffered data. Since Kafka 0.10.1 (KIP-74), fetch.max.bytes (default 50MB) additionally caps each fetch response per broker, keeping practical usage lower, but a default 1GB heap can still hit OutOfMemoryError if records are large or processing lags behind fetching.

Mitigations:
1) Tune fetch sizes: lower max.partition.fetch.bytes and fetch.max.bytes to bound buffer memory; max.poll.records mainly controls how much work each poll hands your code, not memory.
2) Increase the JVM heap: set -Xmx4g if one consumer genuinely must hold 1000 partitions.
3) Don't assign 1000 partitions to one consumer: add consumers to the group so each handles a manageable slice (ideal parallelism is up to one consumer per partition).
4) Process and commit incrementally: while (true) { ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1)); records.forEach(this::processRecord); consumer.commitSync(); }. This drains each poll before fetching more, rather than accumulating.

Also watch max.poll.interval.ms (default 300000, i.e. 5 min): if processing one batch takes longer than this between polls, the broker evicts the consumer and triggers a rebalance. If processing is slow, lower max.poll.records or raise the interval.
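A rough upper bound on the fetch-buffer memory discussed above can be computed from the two configs. A sketch using Kafka's documented defaults; the min() of the two caps is an approximation of worst-case buffering, not an exact accounting of client memory:

```java
// Sketch: worst-case consumer fetch-buffer estimate.
public class FetchMemory {
    static final long MAX_PARTITION_FETCH_BYTES = 1_048_576L;   // 1 MiB default
    static final long FETCH_MAX_BYTES = 52_428_800L;            // 50 MiB default

    // Per-partition buffers bound one way; since KIP-74, each fetch
    // response is also capped at fetch.max.bytes per broker. Take the
    // tighter of the two bounds as a rough worst case.
    static long worstCaseBytes(int partitions, int brokers) {
        long perPartition = (long) partitions * MAX_PARTITION_FETCH_BYTES;
        long perBrokerCap = (long) brokers * FETCH_MAX_BYTES;
        return Math.min(perPartition, perBrokerCap);
    }
}
```

For 1000 partitions on a 3-broker cluster, the per-broker cap (3 × 50 MiB = 150 MiB) dominates the naive 1000 × 1 MiB figure, which is why modern consumers survive wide assignments that would have exhausted older clients.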

Follow-up: If you have 1000 partitions and 10 consumers, what's the memory usage per poll?
