Your e-commerce platform processes 50K orders/day with peak RPS of 5K during flash sales. Orders trigger downstream: inventory deduction (40ms latency), payment processing (3s, external), fulfillment notifications (1s), and analytics pipelines. You currently use SQS with 30-second polling. Checkout latency hit 8 seconds. Should you migrate to Kafka?
No direct migration for the checkout path—this is a coupling problem, not a queue problem. Checkout should be synchronous for payment (PCI compliance, immediate feedback). SQS works fine for post-purchase workflows. However, your analytics/inventory consumption patterns suggest event streaming: they need the entire order history, replay capability, and fan-out to 4+ consumers. Hybrid approach: keep SQS for critical transactional workflows (payment confirmation, fraud checks), add Kafka for secondary streams (inventory updates, analytics, fulfillment). Trade: Kafka adds operational complexity (broker management, partition rebalancing) and storage costs of ~$0.70/GB/month, vs SQS's ~$0.50 per million requests (~$3K/month SQS at your volume). If you have <100 consumers and single-use events, SQS wins. Kafka wins if you need topic subscriptions, message retention beyond SQS's 14-day maximum, or stream replay (critical for rebuilding consumer state). At 50K orders/day, 6 partitions are plenty—but key messages by customer_id, not order_id, so the same customer's orders land on the same partition and per-customer ordering is preserved.
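The keying rule can be made concrete with a minimal sketch (names hypothetical; Kafka's default partitioner does the equivalent with murmur2 when you set the record key to the customer id):

```python
import hashlib

NUM_PARTITIONS = 6

def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable partition assignment: the same customer always maps to the
    same partition, so that customer's orders stay in publish order.
    Any stable hash works; sha256 here just keeps the sketch stdlib-only."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

What matters is that the partition is a pure function of customer_id, so the mapping survives producer restarts and consumer rebalances.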
Follow-up: Your inventory consumer crashed for 4 hours during midnight inventory sync. Kafka retained messages for 48 hours. How would you replay just those 4 hours without reprocessing the entire day?
You operate a real-time bidding platform: 100K concurrent auctions, bid arrivals at 2M/second during peak, bid deadline guarantees within 200ms. You're using Kafka with batch processing (commit every 10 seconds). Twice last month, clock skew on brokers caused bids from 2 seconds ago to be applied after the deadline. What's your fix?
Clock skew is a timing-guarantee problem—Kafka orders messages within a partition, but message timestamps are only as trustworthy as the clocks that assign them. Your root cause: no NTP sync enforcement, or insufficient sync tolerance. Immediate fix: enforce NTP with the chrony daemon (<10ms sync), set the broker's `log.message.timestamp.type=LogAppendTime` (server-side timestamps only), and add explicit deadline validation in the bid processor: `if (current_time > bid_deadline) reject with 'deadline expired'`. But this doesn't fully solve it—you need causal ordering. Better architecture: use Kafka for the ordering guarantee only, and add Redis as a TTL-based deadline buffer. An incoming bid goes to the Kafka topic; the consumer reads it with its timestamp, checks the Redis cache of deadline-expired auctions, and applies only valid bids. Redis gives you sub-millisecond deadline checks. For hard guarantees, move to event sourcing: store the bid event with a server-assigned timestamp in Kafka and reconstruct bid state deterministically from events. Cost: Kafka topic (base), Redis cluster for the deadline cache ($200/month for 16GB), processor logic ~10% more complex. Alternative (simplest): enforce a single-writer pattern—a centralized bid ingestion service (stateless, auto-scaling) assigns timestamps, writes to Kafka, and ensures causal consistency through request ordering. No clock skew if there's a single source of truth.
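The deadline check plus the expired-auction cache can be sketched as follows—`expired_auctions` stands in for the Redis TTL set of closed auctions, and `now` is the server-assigned timestamp (all names illustrative):

```python
def process_bid(bid: dict, expired_auctions: set, now: float) -> str:
    """Apply a bid only if the server clock says the auction is still open.
    `expired_auctions` mirrors the Redis cache: once a deadline passes,
    the auction id is recorded so later bids are rejected in O(1)."""
    if bid["auction_id"] in expired_auctions:
        return "rejected: auction closed"
    if now > bid["deadline"]:
        # First bid seen past the deadline: mark the auction closed.
        expired_auctions.add(bid["auction_id"])
        return "rejected: deadline expired"
    return "accepted"
```

Because `now` comes from a single server-side source (LogAppendTime or the ingestion service), producer clock skew can no longer move a bid across the deadline.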
Follow-up: Your bid processor scales to 50 instances consuming the same Kafka partition for throughput. How do you ensure bid deadline ordering without data loss if 3 instances crash mid-commit?
Your notification system sends SMS/email/push across 3 regions. Notification creation happens in PostgreSQL (primary US-East). You need to guarantee each user gets exactly one notification per action (no duplicates), support 10M notifications/day, and process within an SLA of 30 seconds end-to-end. Current setup: one SQS queue per region, async Lambda consumers. You've seen duplicate notifications: Lambda processes a message twice when a connection timeout causes re-delivery before the commit.
SQS + Lambda without idempotency keys is the root cause. Your deduplication requirement demands idempotency at the application layer. Fix: (1) Generate a deterministic idempotency key in the notification creator: `sha256(user_id + action_id)`—derive it only from values that identify the action; including wall-clock time (e.g. a `timestamp_second`) breaks deduplication for retries that cross a second boundary. (2) Store the key in DynamoDB with a TTL (30 days); check before processing. (3) Return the cached response if the key exists. For 10M/day (~115 RPS average), DynamoDB on-demand writes (roughly $1.25 per million write request units) cost on the order of $12/day. (4) Enable SQS deduplication: `MessageDeduplicationId=idempotency_key` (FIFO queues only, 5-minute dedup window)—a second line of defense, not a replacement for the application-level check. The 30-second SLA means: a single region (US) processes synchronously in 8s (DB write + cache), async fan-out to regional queues (SNS for region replication), 20s buffer for regional delivery. Don't process the same message in multiple regions simultaneously—this is distribution, not redundancy. Architecture: (a) Notification service writes to DynamoDB + publishes to regional SNS topics. (b) Each region has an SQS consumer (dead-letter queue for regional failures). (c) The idempotency key ensures that if Lambda retries, only one notification is sent. Trade-off: SQS vs Kafka doesn't matter here because you're not replaying history—you're ensuring single delivery. SQS is cheaper and simpler. Kafka would force you to manage partition state and make dynamic consumer scaling harder.
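A sketch of steps (1)-(3), with an in-memory set standing in for the DynamoDB table; in production the check-and-insert must be a single atomic conditional put (e.g. `ConditionExpression="attribute_not_exists(pk)"`), and all names here are hypothetical:

```python
import hashlib

def idempotency_key(user_id: str, action_id: str) -> str:
    # Deterministic across retries: derived only from values that
    # identify the action, never from wall-clock time.
    return hashlib.sha256(f"{user_id}:{action_id}".encode()).hexdigest()

def send_once(user_id: str, action_id: str, seen: set, send) -> bool:
    """`seen` stands in for the DynamoDB idempotency table. Returns True
    if the notification was sent, False if it was a duplicate delivery."""
    key = idempotency_key(user_id, action_id)
    if key in seen:
        return False  # Lambda retry or SQS re-delivery: already handled
    seen.add(key)     # in DynamoDB: conditional put with a 30-day TTL
    send(user_id, action_id)
    return True
```

The in-memory set is only illustrative—two Lambdas don't share memory, which is exactly why the real check must be DynamoDB's atomic conditional write.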
Follow-up: Your DynamoDB idempotency table hit a hot partition (one user receiving 100K notifications/second during a bulk action). How do you scale idempotency checks without redesign?
You maintain an event store for your financial ledger: double-entry bookkeeping with immutable events. 500 events/second, 12 event types (debit, credit, reversal, interest accrual). Consumer services subscribe to Debit and Credit events only. Your producer occasionally publishes events out of order due to async processing (reversals sometimes published before their source debit). This caused ~0.01% of ledgers to have negative balances when they shouldn't. What's the minimal fix?
Out-of-order events break financial state machines. Root cause: the event producer has implicit dependencies (a reversal depends on its debit) that are never enforced. Two solutions: (1) Producer-side fix (simpler): ensure the Debit is published before any dependent Reversal. Use the transactional outbox pattern: write events to PostgreSQL, then have a relay publish them in dependency order. Adds ~50ms latency but guarantees causality. (2) Consumer-side fix (more complex): add event versioning + idempotent replay. Store each event with a version `{event_id, version, timestamp}`. The consumer maintains a "consumed_until_version" checkpoint. On out-of-order detection (a reversal for a non-existent debit), buffer the event in Redis and retry with exponential backoff. This adds operational overhead—it needs monitoring for buffered events and manual recovery if the source event never arrives. For financial systems, the producer-side fix is non-negotiable. Recommend: (a) Schema: add a `depends_on_event_id` field to Reversal events. (b) Producer validates the dependency exists before publishing. (c) Kafka ordering: same account_id → same partition ensures intra-account event ordering (if you're partitioning by account). (d) Consumer: maintain per-account state, reject out-of-order events explicitly, alert on rejection. Cost: negligible. Trade: you lose some producer throughput (a round-trip to verify the dependency), but financial systems require this. Regulation (SOX, PCI-DSS) typically requires exactly this audit trail anyway. At 500 events/sec, a single Kafka partition handles this comfortably (rule of thumb: <100K events/sec per partition).
Follow-up: How do you safely upgrade your event schema (add new required field) across 50 running consumer instances without event loss?
Your ML training pipeline ingests raw events from Kafka for feature engineering: model retraining every 24 hours on 100GB of events. Your Kafka cluster is multi-tenant: marketing, analytics, and ML all share the same cluster. ML consumer falls behind by 3 days during heavy marketing campaign (they spiked to 50K msg/sec). When ML finally catches up, retraining used stale feature distributions, degrading model accuracy by 7%. How do you isolate this in future?
This is resource contention + an SLA mismatch. Two levels of fix: (1) Kafka-level: implement quota management. Set a producer quota for marketing's client id (a `producer_byte_rate` equivalent to ~30K msg/sec, applied via `kafka-configs`). Note that Kafka has no consumer-group priority—quotas throttle the noisy tenant rather than reserving capacity—so give the ML topic its own partitions and, if contention persists, dedicated brokers. (2) Application-level: change the ML consumption model from real-time catch-up to snapshot-based. Instead of streaming Kafka → Model Training, do: Batch Export (s3_export_daily(kafka_topic, '24h')) → S3 Parquet (immutable, versioned) → Model Training reads from S3 with a guaranteed dataset version. This decouples ML from Kafka congestion. At 100GB/day, the S3 export costs ~$0.02/GB (~$2/day). (3) Add a circuit breaker: if consumer lag exceeds 4 hours, alert and skip the retraining cycle. Better to miss one retraining than to use stale data. Architecture: the Kafka topic keeps a 7-day retention policy, the ML batch job reads the last 24 hours from S3, and exports include event timestamp + dataset version. This gives you reproducibility: retraining on 2026-04-06 always uses the same feature dataset. Trade: you lose real-time retraining (you're now batch at midnight UTC) but gain determinism and isolation. If you truly need real-time updates, spin up a dedicated Kafka cluster for ML (smaller hardware, higher availability target).
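The circuit breaker from point (3) reduces to comparing committed consumer offsets against log end offsets (a sketch with hypothetical names; lag is expressed in messages here, but the same gate works on time-based lag):

```python
def max_lag(end_offsets: dict, committed: dict) -> int:
    """Per-partition lag = log end offset minus committed offset;
    the breaker trips on the worst partition, not the average."""
    return max(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

def should_retrain(end_offsets: dict, committed: dict, max_allowed: int) -> bool:
    """Gate the retraining cycle: skip (and alert) when the ML consumer
    is too far behind, rather than training on stale distributions."""
    return max_lag(end_offsets, committed) <= max_allowed
```

In production the offsets would come from the consumer group's admin API; the decision logic stays this simple.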
Follow-up: Your batch export job to S3 fails at 99% completion (23.5 hours into 24-hour window). Your next retraining still needs to run in 30 minutes. What's your recovery strategy?
You run a multi-region search service: documents indexed in Elasticsearch, updates published to Kafka. Across US, EU, and AP regions, you need <200ms P99 search latency. Your current setup: single Kafka cluster (us-east broker), regional ES instances subscribe via consumer groups. During US region degradation (broker latency spike to 5s), EU searches got stale results (10-minute index lag). Customers complained. How do you redesign for regional resilience?
A single-leader Kafka cluster is a single point of failure for regional consumers. For true regional resilience: (1) Deploy regional Kafka clusters (us-east, eu-west, ap-southeast) in an active-active topology. (2) Add a replication layer: cross-region topic replication (Kafka MirrorMaker or Confluent Replicator) with <500ms RPO (recovery point objective). This means if us-east dies, the eu-west cluster has data from <500ms ago. Trade: storage is 3x (one copy per region) and operational overhead grows (3 clusters to manage), but you get fault isolation. (3) For Elasticsearch: each region has an independent ES cluster fed by its regional Kafka. Search queries route to the local ES only (no cross-region search). Update flow: document update → regional Kafka → local ES (20ms) + async cross-region replication to the other Kafka clusters (MirrorMaker, 100-500ms latency). (4) Consistency model: eventual consistency across regions (searches see local data within 20ms, cross-region syncs within 500ms). If strong consistency is required, use synchronous writes to the primary region only and read from secondaries with the caveat that data may be stale. At your scale (<1M docs typically), individual regional ES instances work fine; no need for cross-region ES replication. Deployment: 3 Kafka clusters (~$3K/month each = $9K), 3 ES clusters (~$2K/month each = $6K), MirrorMaker instances ($500/month). Total: ~$15.5K/month. Alternative (lower cost, higher latency): keep the single Kafka cluster and add a regional cache layer (Redis, short TTL) to absorb broker lag. The update pipeline pushes fresh documents to Redis; searches read Redis first (20ms) and fall back to ES on a cache miss (100ms). Costs ~$5K/month but couples you to cache availability.
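The cache-first read path of the lower-cost alternative is a few lines (a dict stands in for Redis, and `es_lookup` for the Elasticsearch query; names are illustrative):

```python
def search(query: str, cache: dict, es_lookup):
    """Read-through cache: serve the regional Redis copy when present
    (~20ms path), otherwise query ES (~100ms path) and populate the
    cache so the next regional search is fast even during broker lag."""
    if query in cache:
        return cache[query]
    result = es_lookup(query)
    cache[query] = result  # with Redis, set alongside a short TTL
    return result
```

The TTL bounds staleness: during a broker latency spike, searches keep serving cached results instead of waiting on the lagging index pipeline.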
Follow-up: During a scheduled maintenance window, you take US broker offline for patching. MirrorMaker attempts to replicate from US to EU, but US is down. How do you prevent replication pipeline from blocking while maintaining data ordering?
You process click-stream data for a recommendation engine: 500M clicks/day from web and mobile. Your pipeline: Kafka → Flink jobs (sessionization, feature extraction) → DynamoDB feature store. Flink job checkpoints every 10 seconds to S3. During a deploy 3 weeks ago, your Flink job crashed mid-checkpoint, and 5 minutes of clicks (4M events) were never processed. Feature store grew stale, recommendations degraded. How do you prevent data loss and ensure exactly-once semantics?
This is checkpointing failure + state recovery. Flink's exactly-once requires two things: (1) Fault-tolerant source (Kafka) with offset commits, (2) Atomic state snapshots. Your problem: Flink crashed before committing Kafka offset, so on restart, it replayed 5 minutes of events—but your DynamoDB writes weren't idempotent, causing stale or duplicate features. Fix: (a) Enable Flink exactly-once processing mode (not at-least-once). This forces aligned checkpoints across all operators before committing Kafka offsets. Cost: 3-5% throughput hit. (b) Make DynamoDB writes idempotent: use update expressions with attribute-based deduplication. Store `(click_id, version)` in DynamoDB, only update if new version > old version. (c) Checkpoint every 5 seconds (vs 10) to minimize replay window. (d) Store Flink state in S3 with versioning: each checkpoint is immutable, tagged with timestamp. On recovery, restore from latest successful checkpoint (not mid-checkpoint). At 500M clicks/day (5.8K clicks/sec), Flink can handle this easily with 2-3 TaskManagers. Configuration: `execution.checkpointing.mode=EXACTLY_ONCE`, `state.backend.type=rocksdb`, `state.savepoints.dir=s3://state-bucket/`. Manual recovery procedure: if DynamoDB diverges, replay clicks from Kafka offset to feature store using batch job (Spark) with deduplication logic. Cost: negligible (S3 checkpoint storage ~$10/month). Trade: slightly higher latency (checkpoint alignment adds 200-300ms to full pipeline), but guarantees exactly-once which is critical for ML feature store.
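The versioned idempotent write from fix (b) can be sketched with an in-memory store; in DynamoDB this is an `UpdateItem`/`PutItem` guarded by a condition such as `attribute_not_exists(version) OR version < :v`, and all names here are hypothetical:

```python
def upsert_feature(store: dict, click_id: str, version: int, features: dict) -> bool:
    """Apply a feature update only if it carries a newer version.
    Replayed events after a failed checkpoint arrive with stale versions
    and become no-ops, so Kafka replay can never regress the store."""
    current = store.get(click_id)
    if current is not None and current["version"] >= version:
        return False  # duplicate or older replay: ignore
    store[click_id] = {"version": version, "features": features}
    return True
```

Combined with Flink's aligned checkpoints, this makes the Kafka→Flink→DynamoDB path effectively exactly-once: replays are harmless instead of corrupting.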
Follow-up: You need to A/B test a new feature extraction logic. How do you shadow your current Flink job with new logic without losing any events or affecting production?
You run a ride-hailing platform with 100K active drivers, 500K daily trips, 2M location pings/second. Driver location updates stream through Kafka to a real-time matching service (pairs riders with drivers). Your Kafka cluster processes 2M/sec, but P99 end-to-end latency (ping sent → rider sees driver on map) is 8 seconds instead of target 2 seconds. You're considering Pulsar as replacement. Evaluate the tradeoff: does Pulsar's multi-tenancy and geo-replication justify the operational lift?
Don't change queue systems; this is a latency problem in the matching service, not in Kafka. Kafka can handle 2M/sec comfortably (your 8s latency suggests buffering, not Kafka throughput). Likely root causes: (a) matching-service consumer lag (10+ instances, uneven partition distribution), (b) location update batching (pings sent every 2s, not streamed), (c) a matching algorithm doing O(n) iteration over drivers. Pulsar won't solve any of these. That said, the Pulsar vs Kafka comparison for ride-hailing: Kafka is simpler to operate, well understood, and proven at Uber/Lyft scale. Pulsar has advantages if you need: (1) multi-tenancy with isolation (ride-hailing has internal teams—logistics, fraud, pricing), (2) geo-replication with lower operational overhead (Pulsar's geo-replication is built in; Kafka requires MirrorMaker), (3) tiered storage (old location data moves to cheap storage automatically). For your case, I'd stick with Kafka and fix the matching logic. If you truly need Pulsar: the architecture is one Pulsar topic per tenant (driver-location, fraud-signals, pricing-events), built-in replication to 3 regions, with bookies handling tiered storage separately. Costs: Pulsar cluster ~$8K/month (brokers + bookies), Kafka equivalent ~$5K/month. Pulsar wins if you have 5+ internal consumer teams (isolation, independent scaling). For a single matching service, Pulsar is overengineered; Kafka is cheaper and sufficient. Actual fix for the 8s latency: (a) Reduce the batch window: pings every 100ms instead of 2s (may increase network usage 20x—acceptable). (b) Matching service: pre-index driver locations in memory (quad-tree or geohash), making lookup O(log n) instead of O(n). (c) Kafka tuning: `linger.ms=0` (no producer batching), `compression.type=snappy`, partition by geohash prefix for locality. Expected result: 200-500ms P99 latency is achievable without a platform change.
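The in-memory geo index from fix (b) can be sketched with a coarse grid, a stand-in for a proper geohash or quad-tree; the cell tuple also plays the role of the geohash-prefix partition key from fix (c). Names and cell size are illustrative:

```python
def geo_cell(lat: float, lon: float, cell_deg: float = 0.5) -> tuple:
    """Bucket a coordinate into a grid cell. Used both as the index key
    and as the Kafka partition key, so nearby drivers share a partition."""
    return (int(lat // cell_deg), int(lon // cell_deg))

def nearby_drivers(index: dict, lat: float, lon: float,
                   cell_deg: float = 0.5) -> list:
    """Candidate lookup over the rider's cell and its 8 neighbors:
    constant-time bucket reads replace the O(n) scan over all drivers."""
    cx, cy = geo_cell(lat, lon, cell_deg)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(index.get((cx + dx, cy + dy), []))
    return found
```

The matching service keeps `index` updated from the location topic and answers rider requests from memory, which is where the 8s → sub-second improvement actually comes from.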
Follow-up: Your fraud detection team needs location history for the past 30 days, but matching service only needs current location. How do you optimize data retention and query patterns across both consumers?