AWS Interview Questions

Kinesis vs SQS vs MSK: When to Use What


Your mobile analytics team receives 50K events/sec from 2M users. Events must be consumed in <2 second latency, and you have 15 consumer groups analyzing the same stream (marketing funnels, fraud detection, user segmentation). Your current SQS setup causes duplicate processing and queue depth swells to 100M. You're considering Kinesis Data Streams, MSK (Kafka), and staying with SQS. What drives your decision?

At 50K events/sec with 15 consumer groups and a sub-2s latency requirement, Kinesis Data Streams is the better fit than SQS. Key reasoning: (1) Kinesis fan-out lets all 15 consumer groups read the same shards independently with no queue-depth scaling; SQS would require SNS fan-out to 15 separate queues plus deduplication logic in every consumer. (2) Each shard ingests 1,000 records/sec or 1 MB/sec, so 50K/sec needs roughly 50 shards; at about $0.015/shard/hour that is ~$550/month in shard-hours, plus PUT payload units (~$0.014 per million 25KB units), landing around $2K-2.5K/month depending on record size and region. (3) Kinesis guarantees per-shard ordering and at-least-once delivery with replayable positions; SQS standard offers neither ordering nor replay, and visibility timeouts don't prevent duplicates across distributed consumers. (4) Enhanced fan-out (SubscribeToShard push, up to 20 registered consumers per stream) gives each consumer a dedicated 2 MB/sec/shard pipe with ~70ms propagation, comfortably under 2s; shared-throughput GetRecords polling degrades as consumer count grows. MSK adds operational burden: broker sizing, networking config, and failover tuning are on you; it's overkill unless you need >200K/sec or Kafka ecosystem features (transactions, Kafka Streams topologies, exactly-once semantics).
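The shard math can be sanity-checked from the documented per-shard ingest limits (1,000 records/sec or 1 MB/sec). A minimal estimator, with an illustrative us-east-1 shard-hour price as an assumption:

```python
# Rough shard-count estimator for Kinesis Data Streams (provisioned mode).
# Per-shard limits are the documented defaults: 1,000 records/sec or
# 1 MB/sec in. The price is an illustrative us-east-1 figure and varies
# by region -- treat it as an assumption, not a quote.
import math

SHARD_IN_RECORDS = 1_000      # records/sec per shard
SHARD_IN_BYTES = 1_000_000    # bytes/sec per shard
SHARD_HOURLY_USD = 0.015      # assumed provisioned shard-hour price

def required_shards(records_per_sec: int, avg_record_bytes: int) -> int:
    # Sized by whichever limit binds first: record count or byte volume.
    by_count = math.ceil(records_per_sec / SHARD_IN_RECORDS)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / SHARD_IN_BYTES)
    return max(by_count, by_bytes)

def monthly_shard_cost(shards: int, hours: int = 730) -> float:
    return shards * SHARD_HOURLY_USD * hours

shards = required_shards(50_000, 500)   # 50K events/sec at ~500B each
print(shards, round(monthly_shard_cost(shards)))   # 50 shards, ~$548/month
```

PUT payload units and enhanced fan-out charges come on top of this, which is how the total climbs into the low thousands per month.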

Follow-up: Your fraud detection consumer lags 30sec behind real-time due to network jitter. Shard iterator expires and you lose messages. How do you design for this scenario?

A backend service initially used SQS FIFO for order processing with guaranteed ordering. After launch, you notice: (1) throughput capped at 300 msg/sec (the SQS FIFO default without batching), (2) cost per message is 3x SQS standard, (3) consumer failures and retry logic cause head-of-line blocking for 5 minutes. Your CTO asks: "Should we switch to Kinesis or scale consumers?" What's your recommendation and why?

Switch to Kinesis with shard-level ordering (use order_id as the partition key) rather than global FIFO. SQS FIFO's 300 msg/sec ceiling (without batching; batching and high-throughput mode raise it, but per-message-group head-of-line blocking remains) exists because ordering is enforced through message groups. Kinesis sharding strategy: (1) Create 10 shards (10,000 records/sec aggregate capacity) and pass order_id as the partition key; Kinesis hashes it so every event for a given order lands on the same shard. (2) Each shard maintains ordering for the keys it owns; unrelated orders process in parallel. (3) Retry logic decouples from head-of-line blocking: a failed order on shard-5 doesn't block shard-1's orders. (4) Cost: 10 shards is roughly $110/month in shard-hours plus PUT payload units, typically cheaper than FIFO request pricing at sustained volume. (5) The AT_TIMESTAMP and AT_SEQUENCE_NUMBER iterators let you replay from a failure point; SQS offers no replay once a message is deleted. Tradeoff: you lose true global ordering, but keep business-level ordering where it matters (same order_id).
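To see why a partition key yields per-order ordering, here is a sketch of the routing Kinesis performs: the MD5 hash of the partition key, read as a 128-bit integer, is matched against each shard's hash-key range (modeled here as even splits of the key space):

```python
# Sketch of Kinesis partition-key routing: the service MD5-hashes the key
# to a 128-bit integer and picks the shard whose hash-key range contains it.
# Even splits of [0, 2^128) across N shards are assumed here; real streams
# can have uneven ranges after resharding.
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    slice_size = 2 ** 128 // shard_count
    # min() guards the top of the range when 2^128 doesn't divide evenly.
    return min(hashed // slice_size, shard_count - 1)

# Same order_id -> same shard -> events for that order stay ordered;
# different orders scatter across shards and process in parallel.
assert shard_for_key("order-12345", 10) == shard_for_key("order-12345", 10)
```

This is also why `order_id % shard_count` done client-side is unnecessary: supplying order_id as the PartitionKey gets you deterministic placement for free.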

Follow-up: What if your product manager requires true global FIFO ordering across all orders, and you can't partition? How do you handle 5K+ msg/sec throughput?

Your platform ingests clickstream data via Kinesis (24-hour retention) and you're building a real-time dashboard. The dashboard Lambda reads from Kinesis but suffers 30sec+ latency spikes. You trace the issue: Kinesis GetRecords() returns 10MB batches taking 3-5sec to deserialize, then 25sec stuck in Lambda. Consumer is running 1 concurrent executor. What's the bottleneck and fix?

The bottleneck is synchronous GetRecords processing through a single Lambda invocation per shard. Fixes in priority order: (1) Kinesis enhanced fan-out: replace shared GetRecords polling (5 calls/sec per shard, with read throughput shared across all consumers) with SubscribeToShard (push model, ~70ms typical propagation, dedicated 2 MB/sec per consumer per shard). Cost: roughly $0.015/consumer-shard/hour plus ~$0.013/GB retrieved (region-dependent), cheap insurance for a dashboard. (2) Lambda parallelism: a reserved concurrency of 1 serializes everything; remove that cap and set ParallelizationFactor (up to 10) on the event source mapping so multiple batches per shard process concurrently, while ordering is still preserved per partition key. (3) Batch tuning: reduce BatchSize and enable partial batch responses (ReportBatchItemFailures) so a few records that fail deserialization go to the on-failure destination instead of failing and retrying the whole 10MB batch. (4) CloudWatch metrics: alarm on GetRecords.IteratorAgeMilliseconds and GetRecords latency; trigger shard scaling when iterator age climbs. Result: latency drops from 30sec+ to low single-digit seconds (push delivery well under 1sec plus 1-2sec of Lambda execution).
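A hedged sketch of point (3): a handler returning partial batch responses. This requires `ReportBatchItemFailures` enabled on the event source mapping; `process_record` is a hypothetical application function standing in for the real deserialization/write logic:

```python
# Sketch of a Kinesis-triggered Lambda using partial batch responses.
# Failed records are reported back by sequence number so Lambda retries
# only from the first failure onward, instead of re-running the whole
# batch. process_record is a hypothetical stand-in for real logic.
import base64
import json

def process_record(payload: dict) -> None:
    if "event_type" not in payload:   # stand-in for real validation
        raise ValueError("malformed event")

def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            raw = base64.b64decode(record["kinesis"]["data"])
            process_record(json.loads(raw))
        except Exception:
            # Report only this record; earlier successes in the batch
            # are not reprocessed on retry.
            failures.append(
                {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
            )
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded; returning every sequence number would retry the full batch, so only append genuine failures.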

Follow-up: The dashboard is now real-time, but backfilling 72 hours of history takes 8 hours. Your team wants to use the stream for backfill. Is this wise? What's the production-safe approach?

Your team uses Kafka on EC2 (3-broker cluster) for multi-tenant data pipelines. You're considering migrating to MSK. Current setup: 120GB/day ingest, 4 consumer groups (each with 10 consumers), broker failover takes 45min due to manual config sync, and you spend 8 hours/month on Kafka tuning (JVM heap, compression ratios). Your cloud architect says "MSK costs 2.3x but ops-free." Sell the migration to your CFO with ROI math.

MSK ROI case: (1) Current EC2 cost: 3x m5.2xlarge at ~$0.384/hr = $0.384*3*730 ≈ $841/month, plus ~$30/month EBS ≈ $871/month. (2) MSK cost for 3 brokers of a comparable kafka.m5.2xlarge class at ~$0.77/broker/hour: $0.77*3*730 ≈ $1,686/month plus storage, roughly $815/month more than EC2, but that includes patching, monitoring integration, and automated broker recovery. (3) Ops savings: eliminating 8 hours/month of Kafka tuning at a loaded senior-engineer rate of ~$250/hr ≈ $2K/month. (4) Incident cost reduction: the current 45min manual failover at ~$1.5K per incident and 20+ incidents/year is ~$30K/year ≈ $2.5K/month; MSK replaces failed brokers automatically in minutes. (5) Rebalancing: MSK handles broker replacement without the 2-3 hours of planning EC2 requires. (6) Net: ($2K ops + $2.5K incidents) minus the $815 infrastructure delta ≈ +$3.7K/month in MSK's favor, and it removes the existential risk of cluster corruption (happened 2x in the past 2 years). Recommendation: migrate. Timeline: 4 weeks (testing → canary → full traffic cutover).
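The ROI arithmetic for the CFO can be laid out explicitly; every dollar figure below is an assumption carried over from the answer (illustrative prices and rates, not quotes):

```python
# Back-of-envelope ROI for the MSK migration. All inputs are assumptions:
# illustrative us-east-1 instance prices, an assumed $250/hr loaded
# engineering rate, and the incident figures from the scenario.
HOURS = 730  # hours per month

ec2_monthly = 0.384 * 3 * HOURS + 30   # 3x m5.2xlarge + ~$30 EBS
msk_monthly = 0.77 * 3 * HOURS          # 3 kafka.m5.2xlarge-class brokers
ops_savings = 8 * 250                    # 8 tuning hours/month eliminated
incident_savings = 30_000 / 12           # ~$30K/year of failover incidents

infra_delta = msk_monthly - ec2_monthly
net_monthly = (ops_savings + incident_savings) - infra_delta
print(round(ec2_monthly), round(msk_monthly), round(net_monthly))
# ~871  ~1686  ~3685 -> MSK nets out positive despite costing more
```

The point of showing the CFO the script rather than the conclusion: every contested input (rate, incident count) is a single line they can change.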

Follow-up: MSK migration is live but one consumer group (analytics team) is 50+ minutes behind. Other groups are real-time. What's your debugging approach?

You're designing a new payments platform that processes credit card transactions. Requirements: <100ms P99 latency, eventual consistency is acceptable, you need ordered writes per customer (atomicity), and 10K transactions/sec. Your architect proposes Kinesis with Lambda consumers writing to DynamoDB. Another proposes SQS + Lambda + DynamoDB. A third suggests Kafka/MSK. Pros and cons of each for payments?

For payments, I rank them: (1) SQS + Lambda + DynamoDB is risky: standard SQS has no per-customer ordering, and FIFO mode (which provides it via message groups) caps at 300 msg/sec without batching, with visibility-timeout retries that can add tens of seconds on failure. Verdict: not recommended. (2) Kinesis + Lambda + DynamoDB is solid: partitioning by customer_id pins each customer's transactions to one shard, guaranteeing per-customer order. Latency: enhanced fan-out push ~70ms + Lambda 10-50ms + DynamoDB 5-15ms comes to roughly 85-135ms, which is tight against a 100ms P99; since eventual consistency is acceptable, ack the PutRecord synchronously (single-digit ms) and let the downstream write complete asynchronously. Cost: cheap for 10K/sec (12 shards ≈ $130/month in shard-hours plus Lambda/DDB costs). Risk: Lambda cold starts on shard scale-up; mitigate with provisioned concurrency. (3) MSK is overkill unless you need complex stream processing in one place (e.g., fraud detection plus velocity checks in a single Kafka Streams topology); you'd take on broker operations for features this pipeline doesn't use. Verdict: Kinesis wins for payments. Implementation: idempotency keys on the transaction ID, enhanced fan-out for push delivery, and DynamoDB conditional writes on transaction_id to prevent duplicates on retry.
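One way to keep a latency budget honest in this kind of design review is to sum assumed component latencies explicitly; the figures below are rough estimates (enhanced fan-out propagation is often quoted near 70ms), not measurements:

```python
# Latency-budget check for the streaming path. All component numbers are
# assumptions to be replaced with measured P99s; the structure, not the
# values, is the point.
budget_ms = 100

streaming_path = {
    "kinesis_efo_propagation": 70,   # assumed typical push latency
    "lambda_execution": 50,           # assumed P99 incl. warm invoke
    "dynamodb_conditional_write": 15, # assumed P99 single-item write
}

total = sum(streaming_path.values())
within_budget = total <= budget_ms
print(total, within_budget)
# 135 False -> the async path alone busts the P99 budget, which is why
# the design acks the synchronous PutRecord and processes downstream async.
```

If a reviewer disputes any number, the fix is to benchmark that one component, not to re-argue the architecture.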

Follow-up: A customer complains they were charged twice for the same transaction. How do you investigate idempotency across Kinesis + Lambda + DynamoDB layers?

Your team has 3 consumer groups on the same Kinesis stream: realtime analytics (1s lag OK), machine learning (10min lag OK), and compliance audit (24-hour lag, needs 100% message retention). Each group has a wildly different SLA. Your current approach: one stream + three Lambda functions. The compliance group's Lambda times out hourly on large batches, and you're considering separate streams per group. Is splitting streams justified? What's the alternative?

Don't split streams; offload the compliance audit to Kinesis Data Firehose instead. Rationale: (1) Stream cost is per shard regardless of consumer count, so splitting into per-SLA streams just multiplies shard-hours ($0.015/hr per shard) and forces producers to write everything twice. (2) The problem is the consumer, not the stream: the compliance Lambda times out because its batches are too large. If it must stay on Lambda, give it its own enhanced fan-out consumer, cap BatchSize, and set MaximumBatchingWindowInSeconds (up to 300s) so it receives small, predictable batches. (3) Better: take Lambda out of the compliance path entirely. Point a Firehose delivery stream at the Kinesis stream as its source, convert records to Parquet (optionally transforming them via a Firehose Lambda processor), and deliver to S3 on buffering hints of 128MB or 300sec (S3 destinations accept 1-128MB and 60-900sec). Firehose handles batching and retries; pricing is per GB ingested (~$0.03/GB, region-dependent), which beats paying for Lambdas that time out hourly. (4) Retention: Kinesis can stay at 24-hour retention because Firehose continuously archives to S3, and S3 lifecycle rules then transition audit data to Glacier for long-term compliance. Result: the compliance workload decouples from the real-time consumers with no timeout risk.
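Firehose's delivery trigger is worth internalizing: it flushes when either buffering hint is reached, whichever comes first. A toy model of that condition (the 128MB/300sec hints match the figures above; real S3 destinations accept 1-128MB and 60-900sec):

```python
# Toy model of Firehose buffering hints for an S3 destination: delivery
# fires when the size hint OR the interval hint is hit, whichever first.
# Hint values are the ones used in the answer; adjust per workload.
SIZE_HINT_BYTES = 128 * 1024 * 1024   # SizeInMBs = 128
INTERVAL_HINT_SEC = 300               # IntervalInSeconds = 300

def should_deliver(buffered_bytes: int, seconds_since_flush: float) -> bool:
    return (buffered_bytes >= SIZE_HINT_BYTES
            or seconds_since_flush >= INTERVAL_HINT_SEC)
```

This is why the compliance path can't time out the way the Lambda did: Firehose never hands anyone an unbounded batch, it ships whatever it has when a hint trips.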

Follow-up: Firehose is now writing to S3 but you're seeing duplicate records in the audit trail. Your company is SOC2-audited. How do you ensure idempotent writes for compliance?

A critical service depends on consuming Kinesis stream and has SLA: "zero message loss, process within 5 minutes." Last week, a consumer Lambda crashed and didn't checkpoint for 12 minutes. When redeployed, it reprocessed 7K messages from 12 minutes ago (shard iterator logic). Your DynamoDB table now has duplicate data. You're considering: (1) move to SQS with visibility timeout, (2) add idempotency layer in DynamoDB, or (3) replay logic in Lambda. What's production-ready?

Option (2), an idempotency layer in DynamoDB, is the only production-ready solution here. Reasoning: (1) Moving to SQS trades one problem for another: a visibility timeout doesn't protect you if the consumer crashes before deleting, and FIFO mode (needed for ordering) reintroduces the 300 msg/sec ceiling; if you need ordering plus replay, you're back to Kinesis anyway. (2) Option (3), custom replay logic, requires tracking exactly which messages were and weren't processed, which is just reinventing idempotency with more moving parts. (3) The idempotency pattern: derive a key from what Kinesis already guarantees unique, the shard id plus sequence number (or better, a business key such as transaction_id). Schema: `{ idempotency_key: "shardId-000000000005:49590...", data: {...}, processed_at: timestamp }`. Write with `ConditionExpression="attribute_not_exists(idempotency_key)"`; a ConditionalCheckFailedException means the record was already processed, so skip it. A single conditional PutItem is atomic; avoid a separate read-then-write check, and never scan. (4) Observability: log the sequence_number of each processed batch so you can correlate a crash with its checkpoint; after redeploy, reprocessing from the stale shard iterator is harmless because every duplicate write fails the condition. Implementation: a ~50-line decorator, single-digit-ms overhead per record, and one write-capacity unit per conditional write (negligible versus the cost of corrupt data).
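A minimal sketch of the idempotency layer, with an in-memory dict standing in for DynamoDB; in production the put is a single conditional write with `ConditionExpression="attribute_not_exists(idempotency_key)"`, so the check-and-insert is one atomic call:

```python
# Sketch of the idempotency layer. The key combines shard id and sequence
# number, which Kinesis guarantees unique per record, so a replayed record
# maps to the same key. The dict is a stand-in for DynamoDB: put_if_absent
# mirrors a conditional PutItem, returning False the way
# ConditionalCheckFailedException signals a duplicate.
import hashlib

def idempotency_key(shard_id: str, sequence_number: str) -> str:
    return hashlib.sha256(f"{shard_id}:{sequence_number}".encode()).hexdigest()

class InMemoryStore:
    """Stand-in for a DynamoDB table keyed on idempotency_key."""
    def __init__(self):
        self._items = {}

    def put_if_absent(self, key: str, item: dict) -> bool:
        if key in self._items:
            return False          # duplicate: condition failed, skip
        self._items[key] = item
        return True               # first write: condition passed

store = InMemoryStore()
key = idempotency_key("shardId-000000000005", "49590338271490256608")
first = store.put_if_absent(key, {"amount": 42})
replay = store.put_if_absent(key, {"amount": 42})
print(first, replay)   # True False -> the replayed record is skipped
```

The shard id and sequence number shown are illustrative. Note that the key is stable across retries precisely because both inputs come from the record itself, never from wall-clock time at processing.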

Follow-up: Your idempotency table is growing at 1M rows/month. Storage is cheap but queries are slow (scan latency 200ms on 50M items). How do you optimize?

You're redesigning a data pipeline to support both real-time (Kinesis) and batch (S3 + Spark) processing. The problem: real-time consumers consume the stream but also need historical data. S3 acts as the source of truth (24-hour backfill), but during outages, real-time and batch data diverge (real-time Kinesis stops, S3 stale). Your team asks: "Should we use Kinesis for both real-time and batch?" What's your architecture recommendation?

Use a "cold + hot" architecture: Kinesis for hot (real-time), S3 + Parquet for cold (batch). Decoupling is essential because they have inverse cost/latency tradeoffs. (1) Hot layer (Kinesis): real-time consumers read with 1-2 sec latency via enhanced fan-out; 24-hour retention (the minimum), 5 shards ≈ $55/month in shard-hours plus PUT payload units and ~$0.5K compute. SLA: 99.9% availability. (2) Cold layer (Kinesis Firehose → S3): the same stream feeds a Firehose delivery stream that buffers to S3 as Parquet every 60 seconds or 128MB. Cost: roughly $30/month Firehose + $20 S3 storage at this volume. (3) Reconciliation job (Lambda + Step Functions, 1x/day): read the last 24 hours of S3 Parquet, cross-check row counts against the stream (AT_TIMESTAMP iterator from 24 hours ago); if divergence exceeds 1%, alert and trigger a manual audit. (4) On a Kinesis outage: producers buffer locally or fail over to direct S3 writes, and batch consumers keep reading the latest S3 snapshot. Result: eventual consistency, no data loss, roughly $1K/month all-in. Tradeoff: a 5-60 minute lag between real-time and batch views during failures (acceptable for most use cases). If you need zero lag, run a dedicated consumer that replicates the stream to S3 continuously, but complexity rises sharply.
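The reconciliation in step (3) reduces to comparing per-hour counts from the two layers; a sketch with illustrative counts (in the real job the inputs would come from Athena/Spark on the S3 side and stream reads on the Kinesis side):

```python
# Sketch of the daily reconciliation step: flag hours where the cold-layer
# (S3/Parquet) row count diverges from the stream's count by more than the
# tolerance. Counts below are illustrative placeholders.
def diverging_hours(s3_counts: dict, stream_counts: dict,
                    tolerance: float = 0.01) -> list:
    flagged = []
    for hour, expected in stream_counts.items():
        if expected == 0:
            continue
        got = s3_counts.get(hour, 0)
        if abs(expected - got) / expected > tolerance:
            flagged.append(hour)
    return flagged

# Hour h01 lost ~5% of rows in the cold layer -> alert on it.
print(diverging_hours({"h00": 1000, "h01": 950},
                      {"h00": 1005, "h01": 1000}))   # ['h01']
```

Keeping the stream's count as the denominator means "Firehose dropped rows" and "Firehose duplicated rows" both trip the same alarm, which is what you want for an audit trail.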

Follow-up: A batch job reads S3 Parquet but Kinesis is 3 hours ahead. You need to join them. How do you handle timestamp skew?
