You need real-time analytics on a 10TB/day clickstream. Users need to see up-to-date metrics (page views, revenue) within 60 seconds of an event. Historical data (last 6 months) must be queryable in <5s for dashboards. Should you use Lambda, Kappa, or Batch? What are the tradeoffs?
Architecture Decision Matrix:
Batch (Bad Choice Here): A batch job that chews through the day's 10TB takes ~2 hours, so metrics are always at least 2 hours stale and users see yesterday's data on dashboards. Unacceptable for the 60-second real-time requirement.
Lambda Architecture (Good Fit): (1) Real-time layer: Kafka ingests raw events (10TB/day ≈ 116MB/sec average, with spikes several times higher). A stream processor (Flink/Spark Streaming) reads Kafka and aggregates in memory: "last-minute page views = 50K, revenue = $2K". It emits metrics to Redis every 10s. Latency: <500ms. (2) Batch layer: a daily batch job reads all 10TB from S3 (where raw events are also dumped), processes them with Spark SQL, and writes aggregated results to a data warehouse (Redshift). This corrects any stream-processing errors (lost messages, etc.). (3) Serving layer: dashboards read live metrics from Redis (real-time, <1 min stale); historical queries hit Redshift (accurate, slower, but <5s for daily rollups). Result: the real-time requirement is met (60s) and the batch layer guarantees historical accuracy.
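A minimal sketch of the serving-layer routing described above. The client objects, Redis key scheme, and warehouse table are all illustrative assumptions, not a prescribed API:

```python
import time

def get_metric(metric, start_ts, end_ts, redis_client, redshift_conn,
               realtime_horizon_s=3600):
    """Route a dashboard query: recent windows come from the stream layer
    (Redis), older windows from the batch-corrected warehouse (Redshift).
    Key scheme and table layout are illustrative."""
    now = time.time()
    if end_ts >= now - realtime_horizon_s:
        # Recent data: read the stream-layer aggregate (stale < 1 min).
        return redis_client.get(f"{metric}:{int(start_ts)}:{int(end_ts)}")
    # Historical data: read batch-corrected results from the warehouse.
    cur = redshift_conn.cursor()
    cur.execute(
        "SELECT SUM(value) FROM metrics WHERE name = %s AND ts BETWEEN %s AND %s",
        (metric, start_ts, end_ts))
    return cur.fetchone()[0]
```

The one decision that matters here is the horizon: anything newer than the batch layer's last run must come from the stream layer, everything older from the warehouse.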
Kappa Architecture (Simpler, but Storage-Heavy): Skip the batch layer entirely. Kafka is the system of record. A stream processor (Flink) reads Kafka from the beginning, computes aggregates, and writes to two outputs: (1) real-time metrics → Redis (derived from the current stream position), and (2) historical aggregates → S3 (written by the same streaming job). When new code is deployed, restart Flink from the beginning of Kafka to recompute everything. Result: no batch-layer complexity, but Kafka must retain 6 months of data (expensive). Better for smaller data volumes (<1TB/day).
Recommendation: Use Lambda. 10TB/day is large enough to justify batch complexity. Stream layer handles real-time need, batch layer ensures accuracy.
Follow-up: After 3 days running Lambda, you notice stream processor lost 0.2% of events (bug in Kafka consumer). Batch job ran on stale data and didn't catch it. Now your metrics are permanently off. How do you detect and fix this?
Your Kafka cluster processes 100K messages/sec. Stream processor Flink job runs with 10 parallelism (10 consumer threads). One thread becomes slow (disk I/O contention), causing the others to wait for it (this is "backpressure"). Latency jumps from 100ms to 30 seconds. What's happening and how do you fix it?
The Problem: A Straggler (Slow Consumer): Kafka's partitions are spread across the 10 Flink tasks, so normally each task processes ~10K msg/sec. But task #3 hits a GC pause (stop-the-world garbage collection) and stops processing for 5 seconds. Messages pile up in task #3's input queue. Meanwhile, the other tasks finish their work and must wait for task #3 to catch up (Flink aligns all tasks at checkpoint barriers). Latency skyrockets.
Root Cause Detection: Check Flink UI: look at "Backpressure" metric. If it's >0.5, you have backpressure. Check per-task latency: one task at 30s, others at 100ms. That's the straggler.
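The "one task at 30s, others at 100ms" check can be automated. A simple heuristic, assuming you scrape per-task latencies from the Flink REST API into a dict (the input shape and threshold are assumptions):

```python
def find_stragglers(task_latencies_ms, factor=5.0):
    """Flag tasks whose latency exceeds `factor` times the median latency.
    A cheap straggler heuristic over per-task metrics; the input dict
    (task name -> latency in ms) is an assumed scrape format."""
    lat = sorted(task_latencies_ms.values())
    median = lat[len(lat) // 2]
    return [t for t, ms in task_latencies_ms.items() if ms > factor * median]
```

Comparing against the median rather than the mean keeps one extreme straggler from masking itself by dragging the baseline up.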
Fix #1: Increase Parallelism: Run with 15 tasks instead of 10. Each task now processes ~6.7K msg/sec, so if one task stalls, the others have enough slack to catch up in 1-2 seconds instead of getting stuck. But this is a band-aid; it doesn't fix the root cause.
Fix #2: Fix the GC Pause: Profile the Flink job: why is GC pausing for 5s? Likely cause: state grows unbounded (Kafka offset tracking, join state, etc.). Solutions: (A) increase heap size for more GC breathing room, (B) reduce state size (emit intermediate results earlier instead of holding them in memory), or (C) switch to a low-pause collector (G1GC, ZGC) instead of the default.
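Option (C) is a configuration change, not a code change. A sketch of the relevant flink-conf.yaml entries, assuming a JVM-based Flink deployment (the exact pause target and memory size are illustrative, not recommendations):

```yaml
# flink-conf.yaml — sketch, assumptions flagged inline.
# Switch TaskManager JVMs to the low-pause G1 collector and set a pause goal.
env.java.opts.taskmanager: >-
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=200
# Give state more headroom so full GCs are rarer (size is illustrative).
taskmanager.memory.process.size: 8g
```

If state is large enough that heap tuning stops helping, moving to RocksDB-backed state (off-heap) is the usual next step.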
Fix #3: Better Partition Assignment: Instead of round-robin, assign partitions by load: if partition #5 produces 50% of traffic (a hot partition), give it a dedicated task, and let the cooler partitions share tasks. Load balancing reduces stragglers.
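The "hot partition gets a dedicated task" idea is classic greedy bin-packing. A sketch, assuming you have per-partition message rates from broker metrics (the input shape is an assumption):

```python
def assign_partitions(partition_rates, num_tasks):
    """Greedy load-balanced assignment: give each Kafka partition to the
    currently least-loaded task, placing the hottest partitions first, so
    one hot partition can't sink a task shared with others.
    `partition_rates` maps partition id -> msg/sec (assumed metric feed)."""
    loads = [0.0] * num_tasks
    assignment = {}
    # Place hottest partitions first; cold partitions then fill the gaps.
    for pid, rate in sorted(partition_rates.items(), key=lambda kv: -kv[1]):
        task = loads.index(min(loads))
        assignment[pid] = task
        loads[task] += rate
    return assignment
```

With one partition carrying half the traffic, this naturally isolates it on its own task while the rest share the remaining tasks.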
Result: Latency back to 100ms. Backpressure <0.1.
Follow-up: After fixing GC, you notice latency is still 5s, but Backpressure metric says it's healthy (<0.1). What's the disconnect—how can latency be high but backpressure low?
You process financial transactions. Latency requirement: end-to-end (event → aggregation → database write) must be <500ms. Your Lambda architecture has: (1) Stream: Kafka → Flink aggregation → Redis (100ms). (2) Batch: S3 → Spark SQL → Redshift (every 6 hours). But batch job fails midway (timeout after 3 hours, hasn't finished). Redshift becomes stale. How do you handle this failure gracefully?
The Failure Scenario: Batch job crashes at 3-hour mark (halfway through 6TB processing). Redshift still has data from yesterday (24 hours stale). Users query dashboards and see yesterday's transactions. Financial reconciliation breaks.
Graceful Degradation (Kappa-Style Fallback): (1) When batch job fails, don't wait for retry. Instead, directly copy current stream aggregates (from Redis) to Redshift as a "fallback" source. This is less accurate (stream might have lost 0.1% of events) but it's better than 24-hour stale. (2) Emit a warning: "Metrics are slightly stale due to batch processing failure. Accuracy: 99.9%." (3) Start repair job: retry batch processing, but with checkpoint resume (if it crashed at 50%, restart from 50% instead of 0%). (4) Once batch completes, replace fallback data with accurate batch results.
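Step (1) above, as a sketch. The Redis/warehouse client interfaces, key names, and the `source` tag are illustrative assumptions:

```python
def on_batch_failure(redis_client, warehouse, metric_keys):
    """Graceful degradation: when the batch job fails, copy the current
    stream aggregates into the warehouse as flagged fallback rows, so
    dashboards never regress to day-old data. Interfaces are illustrative."""
    for key in metric_keys:
        value = redis_client.get(key)
        # Tag rows as fallback so the repair job knows exactly what to
        # replace once the batch recomputation succeeds.
        warehouse.upsert(key, value, source="stream_fallback",
                         note="~99.9% accurate: batch failed, stream copy")
```

Tagging the provenance (`source="stream_fallback"`) is the important part: it makes step (4), replacing fallback rows with accurate batch results, a targeted update rather than a full reload.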
Better Approach: Batch as Correction, Not Primary Source: Redesign: (1) Stream aggregates → primary metrics (Redshift). (2) Batch job does NOT overwrite metrics, only validates them. If batch result differs >0.5% from stream, alert (data quality issue). (3) This makes batch failure non-blocking: stream is always the source of truth. Batch is just quality check.
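The validation check in (2) is a one-liner worth writing down, since the division direction and the zero case are easy to get wrong. A sketch using the 0.5% threshold from the design:

```python
def batch_disagrees(stream_value, batch_value, tolerance=0.005):
    """Batch-as-correction check: return True (fire a data-quality alert)
    when the stream aggregate drifts more than `tolerance` (0.5% here)
    from the batch recomputation. Never overwrites the metric."""
    if batch_value == 0:
        # Batch says zero: any nonzero stream value is a disagreement.
        return stream_value != 0
    drift = abs(stream_value - batch_value) / abs(batch_value)
    return drift > tolerance
```

Because batch only validates, this function's output feeds an alerting pipeline, not a write path, which is what makes batch failure non-blocking.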
Result: Batch failure doesn't cause stale data. Users always see stream-based metrics (<60s old). Batch helps catch drift over time.
Follow-up: After implementing stream-as-primary, you notice stream drifts 2% per day due to small processing bugs. After 30 days, stream is 60% off from truth. How would you catch this earlier?
You need to deploy a new Flink job to production. The new job has different aggregation logic (it groups by (country, product) instead of just (product)). You must migrate all state from old job to new job without losing in-flight events. How do you do this?
The Challenge: Old job has state: "aggregated_metrics" = map of (product_id → metrics). New job needs: "aggregated_metrics" = map of ((country_id, product_id) → metrics). State schema changed. Can't just copy state.
Blue-Green Deployment: (1) Keep old job running (blue). Deploy new job as secondary (green). (2) For 10 seconds, send traffic to both blue and green: Kafka events → both jobs simultaneously. (3) Blue job emits results to Redis (old schema). Green job emits to Redis with different key prefix (new schema). (4) At T=10s, redirect all reads from blue schema to green schema (application layer switches which Redis key it reads). (5) Verify new results match old results for 5 minutes (data quality validation). (6) Shut down blue job. Green is now primary.
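Step (4), the application-layer switch, can be as small as a prefix flag in front of the Redis reads. A sketch; the prefix names and client interface are illustrative:

```python
class MetricsReader:
    """Application-layer cutover switch for blue-green: reads go to the
    blue (old job) key prefix until the flag flips, then to green (new
    job). Prefix scheme and client interface are illustrative."""
    def __init__(self, redis_client):
        self.redis = redis_client
        self.active_prefix = "blue"   # old job's schema is primary at start

    def cutover(self):
        # Step (4): one atomic flip of the read path; both jobs keep writing.
        self.active_prefix = "green"

    def get(self, metric):
        return self.redis.get(f"{self.active_prefix}:{metric}")
```

Keeping both jobs writing through the cutover is what makes step (5)'s validation possible: you can still read the blue keys side-by-side to verify the green results before shutting blue down.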
Avoid State Migration Directly: Don't try to transform state in-place. Too risky. Instead, let new job recompute state from Kafka. If Kafka retains 7 days of data, new job can replay from 7 days ago, recompute all aggregations, and converge to current state in 2-3 hours. During convergence, serve reads from old job. Once new job catches up, switch.
Cost: Blue-green setup temporarily doubles compute (2 jobs running). But eliminates risk of state corruption.
Follow-up: During blue-green overlap, Kafka sees duplicate events: both jobs consume the same events. If job is non-idempotent (adding same event twice doubles the metric), metrics get skewed. How do you handle this?
Your event stream has late-arriving data: an event happened 2 days ago but only arrives in Kafka today (delayed network, offline mobile app sync, etc.). Your stream processor aggregates by event time (actual time event occurred) not processing time (when system received it). How do you handle 2-day-late events without recomputing everything?
Late Data Problem: Event: "User clicked product on Jan 1" only arrives Jan 3. Your aggregation for Jan 1 already published. If you ignore late event, metric is off by 0.01%. If you reprocess Jan 1, you must re-emit all downstream aggregations (expensive). What's the balance?
Solution: Allowed-Lateness Window: (1) The stream processor groups events by event_time into 1-minute windows. (2) Each window keeps accepting late events for 2 days: if an event for a past window arrives within that grace period, it's merged into the window's aggregate. (3) After 2 days the window is closed and no more events are accepted. (4) When a window fires, it emits a preliminary result immediately, then re-emits an updated result each time late events arrive. (5) Downstream systems must therefore expect multiple emissions per window and apply updates incrementally (e.g., window #1 fires at 10:05 with a preliminary result; at 10:37 late events trigger a re-emission and the downstream system applies the delta).
Configuration: In Flink, call .allowedLateness(Time.days(2)) on the windowed stream. Each window fires at its event-time boundary and keeps accepting late events through the 2-day grace period.
Cost-Benefit: Allows 99.5% of events to be correctly included. Requires downstream systems to handle multiple emissions per window (idempotent updates).
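The mechanics above can be modeled in a few lines. This is a toy model of allowed-lateness semantics (merge-and-re-fire within the grace period, drop after it), not Flink's actual API:

```python
from collections import defaultdict

class LatenessWindows:
    """Toy model of event-time windows with allowed lateness: events
    inside the grace period merge into their window and re-fire it with
    an updated total; events past it are dropped. Mirrors the behavior
    described above in spirit, not Flink's API."""
    def __init__(self, window_s=60, lateness_s=2 * 86400):
        self.window_s = window_s
        self.lateness_s = lateness_s
        self.totals = defaultdict(int)

    def add(self, event_time, value, watermark):
        window = event_time // self.window_s * self.window_s
        if watermark - window > self.window_s + self.lateness_s:
            return None  # too late: window closed, event dropped
        self.totals[window] += value
        # Each call re-emits the window's current total (the "delta" the
        # downstream system must apply idempotently).
        return (window, self.totals[window])
```

The return value makes the downstream contract explicit: the same window key can be emitted multiple times with successively corrected totals.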
Follow-up: With 2-day allowed lateness, Flink maintains state for every 1-minute window for 2 days. That's 2880 windows in memory. For 1000 metrics per window (product × country breakdown), state explodes to 2.88M metric objects. Memory pressure causes OOM errors. How do you cap memory without losing late data?
Your e-commerce platform ingests order events (10TB/day) to compute real-time revenue dashboards. Requirement: revenue is accurate within 1%. You notice stream processor has occasional bugs (drops 0.5% of events randomly). Batch job catches this 6 hours later. How do you detect this bug faster and correct the stream?
The Drift Problem: Stream says "revenue in last hour = $500K". Batch (6 hours later) recalculates and says "revenue should be $502.5K" (0.5% more). Where did the $2.5K go? Lost in stream processing.
Solution: Real-Time Data Quality Validation: (1) Run two stream processors in parallel (hot-hot deployment): (A) Main processor: Kafka → aggregation → Redis. (B) Validation processor: Kafka → independent implementation of same logic → separate Redis. Both process same events. (2) Every 5 minutes, compare results: if Main ≠ Validation by >0.1%, fire alert. (3) Human investigation: why do they differ? (Maybe Main has a bug, maybe Validation has a bug.) (4) Deploy fix, restart the buggy processor. (5) As next layer of defense: Batch job runs every 6 hours, compares against both stream processors. If Batch differs from Main by >1%, recompute Main from scratch (throw away stream state, replay from Kafka).
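The comparison in step (2) deserves care: dividing by the wrong side, or by zero, produces false alarms. A sketch of the periodic check, using the 0.1% threshold from the design:

```python
def processors_disagree(main_value, validation_value, threshold=0.001):
    """Hot-hot validation check: True means the two independent stream
    pipelines differ by more than `threshold` (0.1% here) and an alert
    should fire. Uses the larger magnitude as the baseline so the check
    is symmetric in which processor is 'right'."""
    baseline = max(abs(main_value), abs(validation_value), 1e-9)
    diff = abs(main_value - validation_value) / baseline
    return diff > threshold
```

Run on a 5-minute timer against both Redis instances; the symmetry matters because, as the text notes, either Main or Validation could be the buggy one.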
Cost: 2x compute for dual stream processors. But you catch drift in 5 minutes instead of 6 hours. Revenue reconciliation stays accurate.
Follow-up: Dual processors now take 3 minutes to detect drift (from disagreement to alert). But 3 minutes of a 10TB/day stream is roughly 20GB of events processed before you know. How do you reduce detection latency further?
Your company switches payment processors (A → B). New processor sends events in different format. Your stream processor needs to support both formats simultaneously during transition (1-month overlap period). After 1 month, processor A is dead and you can remove old format support. How do you design this migration?
Schema Evolution Strategy: (1) Define schema versioning: events from processor A carry "version": 1, events from processor B carry "version": 2. (2) The Flink job branches on version: if version == 1, parse the old schema; if version == 2, parse the new one. Both paths normalize into the same downstream aggregation. (3) Monitor the migration: emit an "events_by_version" counter. Over the month you should see version=1 decreasing and version=2 increasing. (4) Keep both versions parseable for the whole overlap, so historical version=1 events can still be replayed for debugging. (5) On day 31, remove old-schema support (the version=1 branch becomes dead code). If any version=1 events still arrive, the job fails loudly (circuit breaker).
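The versioned branch in step (2) looks like this in sketch form. The field names (id, transaction_id, amount, amount_cents) are illustrative assumptions about the two processors' payloads:

```python
def parse_event(raw):
    """Versioned deserialization during the A -> B processor migration.
    Both branches normalize to one internal shape so downstream
    aggregation is version-agnostic. Field names are illustrative."""
    version = raw.get("version", 1)
    if version == 1:   # old processor A format (assumed: dollars)
        return {"txn_id": raw["id"], "amount_cents": raw["amount"] * 100}
    if version == 2:   # new processor B format (assumed: cents)
        return {"txn_id": raw["transaction_id"],
                "amount_cents": raw["amount_cents"]}
    # Unknown versions fail loudly (the circuit-breaker behavior for
    # day 31+, once a version branch has been removed).
    raise ValueError(f"unsupported event version: {version}")
```

On day 31, deleting the `version == 1` branch leaves exactly the fail-loudly behavior the strategy calls for.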
Version Compatibility: Don't break schema by adding required fields. Use optional fields: old version doesn't have "processor_id", new version includes it but makes it optional. This prevents deserialization errors.
Testing: Before switch, test both processors in shadow mode: run new processor in parallel, compare output. After 1 month of shadow testing, switch primary.
Follow-up: After 3 weeks, you realize new processor is generating 2x event volume because it's double-counting some transactions. Your dual stream processors both report the same high count (they both parse the new schema). How would you detect this without comparing to a third system?
Your Kappa architecture uses Kafka as the system of record for 1 year of historical data. But Kafka storage fills up (400GB cluster, 80% full). Adding storage is expensive. New requirement: keep 3 years of historical data queryable. What's your strategy?
The Scaling Problem: Pure Kappa works for <6 months of data. Beyond that, Kafka storage becomes prohibitive. You need to offload cold data.
Hybrid Kappa-Lambda (Lambda in reverse): (1) Keep only recent data in Kafka (60 days). This is the "hot" data, accessed frequently by the stream processor and interactive queries. (2) Archive older data: every 60 days, snapshot the oldest Kafka segments (days 0-60), compress them, and archive to S3. This is the "cold" data. (3) Query strategy: interactive queries (dashboards) read recent stream-processor results (cache/DB); historical ad-hoc queries (analysts) read from the S3 archives. (4) Reprocessing: if you deploy a new Flink job, replay from S3 and Kafka combined: S3 supplies the archived history, Kafka supplies the latest 60 days, with no gap between them.
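The gap-free replay in (4) hinges on one invariant: the Kafka offset where each archive ends is recorded, so the replay switches sources at exactly that point. A sketch; the reader interfaces are illustrative:

```python
def replay_events(s3_archives, kafka_reader, cutover_offset):
    """Hybrid replay for a newly deployed job: stream cold history from
    the S3 archives first (oldest to newest), then continue from Kafka
    at the recorded cutover offset, so there is no gap and no overlap.
    Reader interfaces and the offset bookkeeping are illustrative."""
    for archive in s3_archives:          # oldest -> newest 60-day snapshots
        yield from archive.read_events()
    # cutover_offset must be the offset recorded when the last archive
    # was taken; otherwise events are skipped or double-counted.
    yield from kafka_reader.read_from(cutover_offset)
```

Recording that cutover offset alongside each archive is also what answers "where does a crashed replay resume": from whichever source the saved checkpoint position falls in.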
Storage Savings: 400GB of Kafka covers 60 days (~6.7GB/day). Three years of archive is roughly 7TB uncompressed, a few TB compressed on S3, and S3 is several times cheaper per GB than Kafka broker storage. Total cost: one small Kafka cluster (~$50K/year) + S3 storage (~$1K/year) ≈ $51K/year, versus roughly $150K/year for a Kafka cluster sized to retain 3 years.
Follow-up: When replaying from S3 + Kafka combined (1 year history), reprocessing takes 8 hours instead of 2 hours (S3 read is slower than Kafka). During those 8 hours, new events arrive in Kafka. Your new Flink job crashes at hour 6 of reprocessing. Where does it resume—from S3 or Kafka?