Kafka Interview Questions

Message Ordering Guarantees


You're producing events for user 12345 with messages: ("create_account", ts=1000), ("add_payment", ts=2000), ("verify_email", ts=3000). The consumer receives them as: "add_payment", "verify_email", "create_account". Your key is user_id. What went wrong and how do you fix it?

Root cause: The messages landed on different partitions despite sharing a key. By default, Kafka's partitioner computes hash(key) % numPartitions (murmur2 in the Java client), so identical keys map to the same partition, and order within a partition is preserved. Two things break this: explicitly specifying a partition in the ProducerRecord bypasses the partitioner entirely, and increasing the partition count changes the key → partition mapping for subsequent messages.

Debugging: Check your producer code for the explicit new ProducerRecord<>(topic, partition, key, value) constructor; if a partition is hardcoded or computed ad hoc, that's the issue. Use new ProducerRecord<>(topic, key, value) and let the partitioner choose.
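The key-to-partition mapping can be illustrated with a stand-in hash. (The Java client actually uses murmur2 internally; the helper below is illustrative, not the client's code, but the property it demonstrates is the same: a deterministic hash sends every message with the same key to the same partition.)

```java
import java.nio.charset.StandardCharsets;

public class KeyPartitioning {
    // Illustrative stand-in for Kafka's default partitioner. The real
    // client computes murmur2(keyBytes), masks to non-negative, and
    // takes it mod numPartitions; any deterministic hash shows the point.
    static int partitionFor(String key, int numPartitions) {
        int h = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return (h & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All three events for user 12345 hash to the same partition,
        // so the broker preserves their send order.
        int p = partitionFor("12345", 8);
        System.out.println(partitionFor("12345", 8) == p);
        System.out.println(partitionFor("12345", 8) == p);
    }
}
```
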

If reordering already happened: Consume all messages for user_id=12345, sort them by timestamp field (not Kafka offset), then replay. Or rebuild consumer state from changelog topic.

Long-term fix: Use acks=all (not acks=1) so writes are replicated before being acknowledged, set retries high (Integer.MAX_VALUE), and enable enable.idempotence=true. Note that retries alone can reorder messages when max.in.flight.requests.per.connection > 1; idempotence (or capping in-flight requests at 1) closes that gap. Together these guarantee order as long as the key → partition mapping stays consistent.
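Put together, a minimal ordering-safe producer configuration (standard Kafka producer property names; a reasonably recent client is assumed):

```properties
acks=all
enable.idempotence=true
# with idempotence on, up to 5 in-flight requests still preserve order
max.in.flight.requests.per.connection=5
retries=2147483647
```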

Follow-up: If you have 8 partitions and the hash function changes (e.g., JVM version upgrade), how do you safely rekey existing data without message loss or reordering during the transition?

Your consumer group has 3 instances consuming topic "orders" with 9 partitions. Instance A fails and consumer group rebalances. During rebalance, instance B processes "order_created" then "order_cancelled" out of order. Same order_id key. Explain why and fix.

During a rebalance, an instance may still be finishing records it polled under the old assignment while the offsets for that work are not yet committed. The new owner of a partition then resumes from the last committed offset, which can lag behind what the old owner actually processed, so the same events get replayed.

Root cause: With 9 partitions (0-8), suppose instance A owned {0, 1, 2} and instance B owned {3, 4, 5}. A crashes after processing "order_created" and "order_cancelled" for order_id X from partition 1, but before committing its offset. The rebalance assigns partition 1 to B, which rewinds to the last committed offset and replays both events, so downstream systems that already saw the cancellation now see "order_created" arrive again afterward. Note that the partition's log order was never violated; the at-least-once replay is what produces the apparent reordering.

Solution: (1) Use exactly-once semantics: enable transactions (set a transactional.id and commit offsets via sendOffsetsToTransaction so processing and offset commits are atomic) and consume with isolation.level=read_committed. (2) Make processing idempotent with a state store: track (order_id, last_applied_offset) in RocksDB or a compacted topic and skip already-applied events. (3) On rebalance, use ConsumerRebalanceListener.onPartitionsRevoked() to flush state and commit offsets before losing partitions.
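The state-store idea in (2) can be sketched without any Kafka dependency: remember the highest offset already applied per key and skip anything older. (Class and method names here are illustrative, not a Kafka API.)

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetGuard {
    private final Map<String, Long> lastApplied = new HashMap<>();

    // Apply an event only if its offset is newer than the last one applied
    // for this key. After a rebalance rewinds to an older committed offset,
    // replayed events are detected and skipped, so downstream side effects
    // are never applied twice.
    public boolean shouldApply(String orderId, long offset) {
        Long prev = lastApplied.get(orderId);
        if (prev != null && offset <= prev) {
            return false; // already applied by the previous partition owner
        }
        lastApplied.put(orderId, offset);
        return true;
    }
}
```

In production this map would live in RocksDB or a compacted changelog topic, so the new partition owner can restore it on assignment.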

Production example: payment pipelines commonly use a rebalance listener to flush pending writes to a changelog topic before giving up partitions, preserving consistency across failures.

Follow-up: How does rebalance-time ordering interact with exactly-once semantics? If a transaction is in-flight during rebalance, should it be committed or rolled back?

You're running a Kafka Streams app that joins stream A (user pageviews) with stream B (user properties). Stream B updates happen 50ms after stream A events due to eventual consistency. Order is violated: (user_id=123: event="view", then join sees stale properties). Design a fix.

This is a causal ordering issue, not Kafka partition ordering. Stream A and B are independent topics with independent ordering guarantees. Kafka only guarantees order within a partition, not across partitions or topics.

Solution: (1) Windowed join: a KStream-GlobalKTable join has no notion of waiting, so if the stream must wait for late-arriving properties, use a stream-stream join with a window and grace period, e.g. pageviews.join(propertyUpdates, (pageview, props) -> enrich(pageview, props), JoinWindows.ofTimeDifferenceAndGrace(Duration.ofMillis(100), Duration.ofMillis(100))); (2) Changelog topic: store properties in a compacted topic backing a KTable; the join then reads the latest version from the local state store, and a versioned state store (Kafka 3.5+) can even return the properties as of the pageview's timestamp; (3) Reorder buffer: hold pageviews in a custom processor for ~100ms so that property updates can arrive first.

Trade-off: Waiting 100ms increases latency for all joins. Monitor join hit/miss rate and adjust grace period based on actual eventual consistency lag.
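The reorder-buffer idea in option (3) can be simulated in plain Java. This is not Kafka Streams API, just a single-threaded model of the mechanism: with no grace period the join sees stale ("unknown") properties; with a grace period the late update is applied first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraceJoinDemo {
    record Event(long arrivalMs, String kind, String user, String value) {}

    // Hold each pageview for graceMs after its arrival so that property
    // updates arriving slightly later are still visible at join time.
    static List<String> join(List<Event> events, long graceMs) {
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong(Event::arrivalMs));
        Map<String, String> props = new HashMap<>(); // latest props per user
        List<Event> held = new ArrayList<>();        // buffered pageviews
        List<String> out = new ArrayList<>();
        for (Event e : sorted) {
            // emit any buffered pageview whose grace period has expired
            held.removeIf(pv -> {
                if (pv.arrivalMs() + graceMs <= e.arrivalMs()) {
                    out.add(pv.user() + "=" + props.getOrDefault(pv.user(), "unknown"));
                    return true;
                }
                return false;
            });
            if (e.kind().equals("props")) props.put(e.user(), e.value());
            else held.add(e);
        }
        for (Event pv : held) { // flush remaining pageviews at end of input
            out.add(pv.user() + "=" + props.getOrDefault(pv.user(), "unknown"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(0, "view", "123", "pageview"),
            new Event(50, "props", "123", "premium")); // arrives 50ms late
        System.out.println(join(events, 0));   // [123=unknown] stale join
        System.out.println(join(events, 100)); // [123=premium] waited
    }
}
```
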

Follow-up: If you set grace period to 1 second, how much latency does this add to your pipeline? How do you measure the actual eventual consistency lag between stream A and B?

Your producer sends messages with idempotence enabled. A network timeout occurs mid-request, causing the producer to retry. The broker receives both the original and retry. What happens, and how is ordering maintained?

With idempotence enabled (enable.idempotence=true), Kafka deduplicates at the broker level using the producer's (PID, sequence number) pair. Sequence numbers start at 0 and increase per partition for each PID. If the broker receives a duplicate (a PID and sequence it has already seen for that partition), it acknowledges success without appending again.

How it works: The broker caches the sequence numbers of the last 5 batches per PID per partition (which is why idempotence requires max.in.flight.requests.per.connection ≤ 5), and producer state is retained on the broker for a configurable expiration period. When a retry arrives with an already-seen sequence, the broker returns the original append result without creating a new offset. This ensures: (1) No duplicate messages in the log; (2) Order is preserved, because sequence numbers are strictly monotonic and an out-of-order sequence is rejected; (3) Retries are safe and transparent to the consumer.

Interaction with ordering: If batch A (seq=0) arrives, then batch B (seq=1), but A's ack is lost and the producer retries A, the broker sees: A (seq=0, new), B (seq=1, new), A (seq=0, duplicate, ignored). Result: offsets 0 and 1 contain A and B in order. The consumer sees A, then B, with no reordering.
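The broker-side dedup can be modeled in a few lines. This is a simplification: the real broker requires sequences to advance exactly by one and rejects gaps with OutOfOrderSequenceException, whereas this toy version only ignores non-advancing sequences.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdempotentLog {
    private final List<String> log = new ArrayList<>();          // the partition
    private final Map<Long, Integer> lastSeq = new HashMap<>();  // high-water seq per PID

    // Append only if this (pid, seq) advances the last sequence seen for
    // that producer; a retry carrying an old sequence is acknowledged
    // without being re-appended, so no duplicate offset is created.
    public void append(long pid, int seq, String msg) {
        Integer last = lastSeq.get(pid);
        if (last != null && seq <= last) {
            return; // duplicate retry: ack without a new offset
        }
        lastSeq.put(pid, seq);
        log.add(msg);
    }

    public List<String> contents() { return log; }

    public static void main(String[] args) {
        IdempotentLog partition = new IdempotentLog();
        partition.append(42L, 0, "A");
        partition.append(42L, 1, "B");
        partition.append(42L, 0, "A"); // retry of A after a lost ack
        System.out.println(partition.contents()); // [A, B]
    }
}
```
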

Production scale: LinkedIn runs Kafka at millions of messages per second with idempotence enabled; the dedup check is an in-memory sequence-number comparison per PID, so its overhead is negligible.

Follow-up: What happens if a producer crashes and restarts with a new PID, but sends the same sequence numbers? Can you detect if a replay is from the same logical producer or a different one?

You deploy a Kafka Streams topology that aggregates user sessions. The topology version changes, and you replace the state store. During deploy, some messages arrive out of order relative to the new state store initialization. Design a redeployment strategy that preserves order.

Rolling deploy strategy: (1) Scale up new replicas of the Streams app with new topology; (2) Use consumer group protocol to rebalance: old replicas stop consuming, new replicas take over; (3) During transition, use a changelog topic to ensure state is transferred. Set log.cleanup.policy=compact on the changelog topic and initialize the new state store by replaying the entire changelog from the beginning; (4) After new state store is fully populated, resume consumption from the last committed offset.

Implementation: In Streams, register the store with a StoreBuilder via Topology#addStateStore, or use Materialized.as(...) on the aggregation; Kafka Streams then restores the store from its changelog before processing resumes. Configure processing.guarantee=exactly_once_v2 to ensure the old and new deployments don't produce diverging results during the transition.

Avoiding ordering violations: Kafka Streams blocks processing on a task until its state store has replayed the changelog, and num.standby.replicas keeps warm copies on other instances so failover doesn't wait for a cold restore. On the old deployment, drain in-flight work before exiting: streams.close(Duration.ofSeconds(30)) waits for pending state updates to flush.

Production pattern: Rebalance new Streams topology after state store is warm, not before. Cold stores cause reordering of aggregations until they catch up.

Follow-up: If the changelog topic itself has gaps due to retention, how do you rebuild state on the new topology? Should you replay from source topics or accept eventual consistency?

Your consumer has max.poll.records=10 and consumes batches every 100ms. A slow message processing (one message takes 1.5 seconds) delays the next batch fetch, violating message arrival order during the delay. Debug and fix.

This is not an ordering violation—it's a processing delay issue. Kafka guarantees order within a partition and batch, not processing order. However, if you have downstream systems that depend on processing order (e.g., state machine), slow processing of one message can cause the next batch to be consumed out of order relative to wall-clock time.

Root cause: max.poll.records=10 means you fetch 10 records in one poll. If one takes 1.5s to process, the next poll (which should happen every 100ms) is delayed by 1.4s. The broker is unaware of processing delays—it continues appending new messages to the partition. When the next poll happens, you get records that arrived during your processing delay.

Fixes: (1) Use threading: process records in a worker pool so the poll loop keeps fetching (pausing partitions via consumer.pause() if workers back up); (2) Use Kafka Streams, which manages threading and per-partition ordering for you; (3) Increase max.poll.interval.ms (default 5 minutes), which, not session.timeout.ms, is the deadline between polls, since heartbeats run on a background thread; (4) Reduce max.poll.records so a single slow record delays a smaller batch.

Trade-off: Option 1 (threading) adds complexity but maintains throughput. Option 3 (a longer poll deadline) avoids spurious rebalances but delays detection of a genuinely stuck consumer.
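Option (1) can preserve per-key order by pinning each key to one worker: hash the key to pick a single-threaded executor, so messages with the same key are processed sequentially while different keys run in parallel. A minimal sketch (class name and sizing are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class KeyStickyPool {
    private final ExecutorService[] workers;

    public KeyStickyPool(int threads) {
        workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Same key -> same single-threaded executor -> per-key processing
    // order matches partition order, while distinct keys run in parallel.
    public void submit(String key, Runnable task) {
        int idx = (key.hashCode() & 0x7fffffff) % workers.length;
        workers[idx].submit(task);
    }

    public void shutdown() throws InterruptedException {
        for (ExecutorService w : workers) w.shutdown();
        for (ExecutorService w : workers) w.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        KeyStickyPool pool = new KeyStickyPool(8);
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < 100; i++) {
            int n = i;
            pool.submit("user-12345", () -> seen.add(n)); // all same key
        }
        pool.shutdown();
        // Same-key tasks ran on one thread, so submission order is kept.
        System.out.println(seen.get(0) == 0 && seen.get(99) == 99);
    }
}
```
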

Follow-up: If you enable multi-threaded processing with 8 worker threads, how do you maintain order for messages with the same key that depend on previous processing results?

You're implementing a distributed transaction across two topics: "debit" and "credit" for the same account. A producer atomically sends debit to topic A, then credit to topic B. A consumer crash causes one to be processed without the other. How do you ensure transactional consistency?

This requires exactly-once semantics across topics. Kafka transactions guarantee atomicity within a producer write, but not across independent consumer reads without additional orchestration.

Solution using Kafka transactions: (1) Produce both messages in a single transaction: producer.beginTransaction(); producer.send(debitRecord); producer.send(creditRecord); producer.commitTransaction(); so either both are written or neither becomes visible. (2) Consume with isolation.level=read_committed so only committed transactional data is read. (3) If debit and credit must be applied together downstream, add an orchestrator service that tracks both consumers' offsets and commits the ledger update only once both events have been processed.
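The corresponding client settings (standard Kafka property names; the transactional.id value is illustrative):

```properties
# producer
transactional.id=ledger-tx-1
enable.idempotence=true
acks=all

# consumer
isolation.level=read_committed
```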

Alternative—State Store: Produce debit and credit to a single topic as a compound message: { type: "transaction", debit: {...}, credit: {...} }. Single topic = single ordering guarantee. Consumer sees both atomically.

Production pattern: Payment systems use transactions + orchestrator. Debit producer writes to debit topic within transaction; orchestrator waits for both consumers to process, then commits the ledger update only after confirming both are safe.

Follow-up: If the orchestrator crashes after consuming debit but before consuming credit, how do you detect and recover from this half-processed transaction?

You have a changelog topic (compacted) that records user balance updates. You replay the changelog to rebuild state. However, deletes (tombstones, null values) arrive at different times than inserts, causing temporal ordering issues. How do you handle this?

Tombstones (null-value messages) in a compacted topic signal deletion. Compaction retains the latest record per key; a tombstone is itself retained for delete.retention.ms (default 24 hours) after compaction and then removed, so a consumer that replays too late may never see the delete at all.

Replay strategy for order: (1) Read compacted topic sequentially from offset 0. When you see a key, that's the current state (latest version). Tombstones mean "this key is deleted"; (2) Build state as a map: for each key, the last value seen (tombstone = deleted key, don't include in map). (3) After replaying all offsets, your state map reflects the current exact state without ordering issues.
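The replay strategy above can be sketched in a few lines (the record type and values are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogReplay {
    record Msg(String key, Long value) {} // value == null is a tombstone

    // Rebuild state by replaying the compacted topic sequentially from
    // offset 0: the last value seen per key wins, and a tombstone (null
    // value) removes the key. Out-of-order replay would break this.
    static Map<String, Long> rebuild(List<Msg> log) {
        Map<String, Long> state = new HashMap<>();
        for (Msg m : log) {
            if (m.value() == null) state.remove(m.key());
            else state.put(m.key(), m.value());
        }
        return state;
    }

    public static void main(String[] args) {
        List<Msg> log = List.of(
            new Msg("alice", 100L),
            new Msg("bob", 50L),
            new Msg("alice", 175L),  // later update wins
            new Msg("bob", null));   // tombstone: bob deleted
        System.out.println(rebuild(log)); // {alice=175}
    }
}
```
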

Pitfall: If you process messages out of order (seek to random offset and process), you might see a delete (tombstone) before the insert, leading to incorrect state. Always replay compacted topics sequentially from offset 0.

Production pattern: Kafka Streams state stores automatically handle this. When you initialize a state store from a changelog topic, it replays sequentially and handles tombstones correctly. Don't manually consume compacted topics out of order.

Follow-up: If a compacted topic's retention deletes a tombstone before you replay, can you still reconstruct historical deletes? Should you keep a separate delete log topic for audit purposes?
