Your distributed database (3-region replication: US-EAST, EU, APAC) suffers a 15-minute network partition. US-EAST and EU split from APAC. During partition: EU receives user1 = "Alice" write, APAC receives user1 = "Bob" write. After partition heals, you see both values in logs. Walk through the consistency issue and your recovery strategy.
This is the CAP trade-off playing out during a partition. During the split, EU (with US-EAST) chooses availability and accepts user1=Alice; APAC independently chooses availability and accepts user1=Bob. Both writes succeed with a local quorum on their side of the split. Post-heal, the distributed log shows two conflicting writes with timestamps (Alice: 1000ms, Bob: 1002ms). Recovery strategy depends on the workload: (1) LWW (last-write-wins): timestamp-based, so Bob wins (1002 > 1000) and Alice's write is silently discarded; clock skew between regions makes the "winner" essentially arbitrary. (2) CRDT (conflict-free replicated data type), here a multi-value register: merge Alice and Bob into the set {Alice, Bob} and let the client decide. (3) Application-layer conflict resolution: use timestamps plus user context (e.g., if Bob came from a secondary account, discard it). Best practice for user profiles: CRDT-style versioned writes with conflict markers. Log both values and alert the user: "Your profile was modified from 2 sources, choose a version." The system behaves as AP (available under partition) and pays for it at recovery with explicit conflict resolution, sometimes requiring human intervention.
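A minimal sketch of the two mechanical options, LWW and a multi-value-register merge, using the scenario's values (the function names `lww` and `mv_register_merge` are ours, not any particular database's API):

```python
def lww(writes):
    """Last-write-wins: the highest timestamp survives; the other write is silently lost."""
    return max(writes, key=lambda w: w["ts"])["value"]

def mv_register_merge(writes):
    """Multi-value register (CRDT-style): keep every concurrent value and
    surface the whole set so the application or user can resolve it."""
    return {w["value"] for w in writes}

writes = [
    {"value": "Alice", "ts": 1000},  # EU write during the partition
    {"value": "Bob",   "ts": 1002},  # APAC write during the partition
]

print(lww(writes))                        # prints Bob: the later timestamp wins
print(sorted(mv_register_merge(writes)))  # prints ['Alice', 'Bob']: conflict preserved
```

The contrast is the whole point: LWW drops Alice's write without a trace, while the merge keeps both values and defers the decision.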
Follow-up: If this user's account was a payment method, how would your consistency strategy change? Walk through 2 write scenarios with $ at stake.
Your eventually consistent e-commerce system writes order to US-EAST master, replicates async to APAC replica (200ms latency). A customer immediately queries APAC replica (same session), sees order status = "pending" but it shows as "shipped" on US-EAST. Customer support gets confused. How do you architect this to prevent read-your-writes inconsistency?
This is a read-your-writes consistency violation. Session stickiness alone won't work (the customer may use a different device or browser). Solutions: (1) Write-through metadata: after the write succeeds on the master (US-EAST), write metadata (order_id, commit_timestamp=1000ms, version=1) to a shared cache. Before reading, the client adds a hint "read_after_timestamp=1000ms" to the APAC query; APAC waits for replication to pass that point before replying. Overhead: ~200-500ms additional latency. (2) Quorum reads: write with W=2 and read with R=2 of 3 replicas. Since R + W > N, every read quorum must overlap the write quorum, so at least one replica in the read set has the fresh value even if APAC hasn't replicated yet. (3) Causal consistency with vector clocks: the order write increments vector[US-EAST]. On an APAC read, the client presents its vector; APAC serves the read only if its own vector dominates the client's, otherwise it blocks or redirects. (4) Session token: the master's write ack includes a token carrying the commit LSN. The client attaches the token to the APAC read; APAC checks "have I replicated past this LSN?" and blocks the read until it has. Best for e-commerce: implement (4) plus the cache-aside metadata of (1). Cost: ~50ms worst-case extra read latency versus exposing the full 200ms replication lag; customers accept this trade-off.
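Option (4) can be sketched in a few lines; `Replica`, `read_your_writes`, and the LSN-as-integer token are illustrative names for this sketch, not a real driver API:

```python
import time

class Replica:
    """Toy async replica: applies replicated writes tagged with an LSN."""
    def __init__(self):
        self.applied_lsn = 0
        self.data = {}

    def replicate(self, lsn, key, value):
        self.data[key] = value
        self.applied_lsn = lsn

def read_your_writes(replica, key, token_lsn, timeout_s=0.5, poll_s=0.01):
    """Block until the replica has applied past the client's session token
    (the commit LSN returned by the master), then read; time out otherwise."""
    deadline = time.monotonic() + timeout_s
    while replica.applied_lsn < token_lsn:
        if time.monotonic() >= deadline:
            raise TimeoutError("replica lagging; redirect this read to the master")
        time.sleep(poll_s)
    return replica.data[key]

apac = Replica()
token = 1000                                # master's write ack carries the commit LSN
apac.replicate(1000, "order42", "shipped")  # replication catches up
print(read_your_writes(apac, "order42", token))  # prints shipped
```

If replication has not caught up within the timeout, a real client would fall back to reading from the master rather than serving the stale "pending" status.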
Follow-up: Your cache is now the source of truth for session tokens. Cache TTL expires mid-session. What happens and how do you recover?
A multi-region DynamoDB setup (Global Tables with 3-region replication, 100ms cross-region latency) processes high-frequency updates to a counter (user activity log). Region A increments counter to 100, Region B simultaneously increments same counter to 100. Both regions have LWW enabled. What's the final value and is this a bug?
Classic distributed-counter lost-update bug. Both regions write independently with timestamps from their own NTP-synced clocks. Region A: counter=100, timestamp=1000000000.123 (UTC); Region B: counter=100, timestamp=1000000000.567 (B's clock ~400ms ahead). After sync, LWW sees Region B's timestamp as later, so counter=100 (Region B's value). The true combined total is 200: Region A's increments are lost. This is a fundamental flaw of LWW under concurrent increments, so yes, it's a bug. Solutions: (1) Strong consistency: conditional updates. Before each increment, do a strongly consistent read of the current value, then write with an optimistic-lock version and retry on conflict. Cost: every increment becomes 2 round trips, roughly doubling latency. (2) Counter shards: each region maintains its own shard, and queries sum all shards (eventually consistent). Increments are single-region writes, replicated asynchronously; the total can be temporarily stale, but increments are never lost. (3) Logical clocks: tag each increment with (logical_clock, region_id, delta) and apply all of them on sync, e.g. (1, A, +1) and (1, B, +1) both apply, so two concurrent unit increments yield 2, not 1. (4) CRDT grow-only counter (G-Counter): Region A's increments accumulate in {A: 100}, Region B's in {B: 100}; merge = {A: 100, B: 100}; a query returns the sum, 200. Best for high-frequency activity counting: (4), accepting that the counter's value is the sum across all regions.
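A minimal G-Counter sketch matching option (4); the class and method names are ours for illustration, not DynamoDB's API:

```python
class GCounter:
    """Grow-only counter CRDT: each region increments only its own slot,
    merge takes the per-region max (idempotent), query sums all slots."""
    def __init__(self):
        self.slots = {}

    def increment(self, region, n=1):
        self.slots[region] = self.slots.get(region, 0) + n

    def merge(self, other):
        for region, count in other.slots.items():
            self.slots[region] = max(self.slots.get(region, 0), count)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter(), GCounter()
a.increment("A", 100)   # Region A's increments total 100
b.increment("B", 100)   # Region B's increments total 100
a.merge(b)
print(a.value())        # prints 200: no increment lost, unlike LWW
```

Because merge is a per-slot max, re-delivering the same replication message is harmless: merging `b` again leaves the value at 200.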
Follow-up: If this counter is tracking "balance" (financial), not activity log, does your approach change? Why or why not?
Your Cassandra cluster (RF=3, quorum=2) experiences a 30-second slow network degradation (not full partition). Reads hit 1000ms latency instead of 50ms. Some clients time out after 500ms. After network recovers, read-repair shows 20% of rows have stale replicas. Walk through the consistency model and repair strategy.
Cassandra with RF=3 and QUORUM reads (2 of 3 replicas must respond) tolerates 1 replica down. A slow network (degraded, not partitioned) is worse: all 3 replicas respond slowly, the coordinator waits for the fastest 2, and reads take ~1000ms. Clients with a 500ms timeout give up and retry through a different coordinator, possibly hitting a different pair of replicas. Meanwhile, writes the slow replica never acknowledged leave it stale. Post-recovery, read-repair compares replica responses on reads and finds 20% of rows stale. Immediate fix: let foreground read-repair reconcile rows on digest mismatch, and run anti-entropy repair (nodetool repair) to reconcile the rest in the background. Cost: extra IO during repair, but the replicas converge. Consistency model in play: QUORUM plus read-repair gives eventual consistency, typically converging within minutes. For stronger guarantees you could read at ALL (every replica must respond), but then one slow replica blocks every read, which is exactly wrong for this 1000ms-latency scenario; keep QUORUM plus background repair instead. Best approach: keep QUORUM for availability, repair in the background, and monitor the hinted-handoff queue (writes buffered for unresponsive replicas). If that queue keeps growing, escalate: it means a replica is effectively dead, not just slow. Replace it.
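The quorum reasoning above rests on the intersection rule R + W > N: a read quorum is guaranteed to see the latest write only if it must overlap the write quorum. A tiny sketch of the rule (function name ours):

```python
def overlap_guaranteed(n, w, r):
    """With W write acks and R read responses out of N replicas, the read set
    and write set must share at least one replica iff R + W > N."""
    return r + w > n

# RF=3: QUORUM writes (W=2) + QUORUM reads (R=2) always overlap.
print(overlap_guaranteed(3, 2, 2))  # prints True
# RF=3: ONE write (W=1) + QUORUM read (R=2) can miss the written replica.
print(overlap_guaranteed(3, 1, 2))  # prints False
```

This is why the 20% stale rows are a convergence problem, not a correctness problem, as long as both reads and writes stay at QUORUM: any read still touches at least one up-to-date replica, and read-repair fixes the rest.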
Follow-up: Hinted-handoff queue hits 50K items and grows. Replica is still slow, not dead. Do you drain the queue or mark replica as failed?
You're building a financial trading system (settlement consistency: must guarantee money never double-spent across replicas). You have 2-region active-active setup with asynchronous replication. A user attempts to buy $10K stock but account has only $10.5K. Both regions race to process—US processes, EU processes ~50ms later seeing stale balance. Both approve. How do you prevent this?
This is a critical consistency failure: if both regions approve, the balance drops to -$9.5K, violating the invariant balance >= 0. Async replication cannot provide the strong consistency this requires. Solutions, ranked by consistency: (1) Serialize on one master: all balance writes go to the US master only, no active-active. EU reads go to US (higher latency but correct); the EU replica is read-only and never takes writes. Cost: EU users see ~150ms latency to US. (2) Consensus (Raft/Paxos): a write commits only after a quorum of replicas agrees. With just 2 regions, quorum means both, so if either region is down all writes fail (CP); in practice you add a third region or arbiter so a majority survives one failure. Cost: write latency rises to ~150ms (cross-region round trip). (3) Sharded masters: US users trade via the US master, EU users via the EU master; no cross-region writes to a single balance. Cross-region transfers require two-phase commit (slow, on the order of seconds). Cost: complexity, but it works if trades are mostly intra-region. (4) Pre-authorization + post-settlement: process the trade optimistically but place a hold on the balance in both regions; after a settlement window, verify no double-spend and reverse one trade if detected. Keeps latency low, but the failure modes are ugly for trading. Best for financial: **(2) consensus via Raft**. A trade commits only when a quorum agrees; a single region down blocks writes but never produces inconsistency. Trade-off: ~150ms write latency for 100% correctness.
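The invariant at stake can be shown with option (1)'s core idea: serialize the check-and-debit on a single authority so two concurrent trades can never both pass the balance check. `Account` and `debit` are illustrative names; a real system would run this inside a database transaction or a consensus round, not an in-process lock:

```python
import threading

class Account:
    """Single-authority account: debit is an atomic check-and-set, so two
    concurrent $10K trades against a $10.5K balance cannot both succeed."""
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def debit(self, amount):
        with self._lock:              # check and update are one atomic step
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False              # insufficient funds: reject, never go negative

acct = Account(10_500)
print(acct.debit(10_000))  # prints True: first trade approved
print(acct.debit(10_000))  # prints False: second trade rejected, invariant holds
print(acct.balance)        # prints 500
```

The active-active bug in the scenario is precisely that each region ran the `if self.balance >= amount` check against its own stale copy, outside any shared atomic step.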
Follow-up: Your Raft cluster has 5 replicas (3 in US, 2 in EU). US loses 1 replica temporarily (network glitch, comes back in 30s). Do writes still succeed?
Your MongoDB cluster (primary + 2 secondaries, 3-region) uses writeConcern: "majority". A write succeeds to primary + 1 secondary (2/3 quorum), but both are in same data center. Network partition splits: primary DC (with 2 replicas) vs 1 lonely secondary in EU. After 10 seconds, primary DC realizes partition and steps down. What's your consistency state and recovery?
MongoDB writeConcern "majority" acknowledges a write once a majority of voting members confirm it. Here: primary + 1 secondary (same DC) = 2 of 3, a valid majority. The problem: both confirming members sit in one data center, so a single DC failure destroys every copy of an "acknowledged, durable" write; the EU secondary never received it. One premise to correct: after the partition, the primary still sees a majority (itself plus its local secondary), so it does not step down; it keeps accepting writes, which is exactly the danger. The lone EU secondary cannot be elected either, since it holds only 1 of 3 votes. Consistency state: **an acknowledged write can be lost if the primary DC dies**; the majority guarantee was satisfied numerically but not geographically. Recovery: (1) If the primary DC recovers, it rejoins and the writes are safe. (2) If the primary DC is permanently dead, the EU member must be manually force-reconfigured into a new replica set, losing every write that was confirmed only inside the dead DC. Best practice: spread voting members across 3+ separate DCs/AZs so any majority spans at least two, e.g. primary (US-EAST-1A) + secondary (US-EAST-1B) + secondary (EU-WEST-1). To enforce geography explicitly, tag members by region and define a custom write concern over those tags so a write succeeds only with [US + EU] confirmation, never [US + US]. Cost: every write pays cross-region latency (~150ms), but the durability is real.
Follow-up: How do you set up writeConcernTag in MongoDB to enforce geo-distributed majority?
A messaging system uses Kafka (RF=3, min.insync.replicas=2) with 3 brokers in US and 2 in EU. During EU network latency spike (200ms → 2s), EU brokers fall behind replication. A producer publishes message, Kafka acks after 2 replicas confirm (both US). Then US DC has power failure. EU brokers missed the message entirely. Producer thinks it's durable; it's not. Walk through the trade-off and fix.
Classic min.insync.replicas misunderstanding. With acks=all, Kafka acknowledges once every replica currently in the ISR (in-sync replica set) has the message; min.insync.replicas=2 only requires the ISR to contain at least 2 replicas. The ISR is *dynamic*: once EU brokers lag beyond replica.lag.time.max.ms, they drop out of the ISR, leaving only US replicas. Message published, the 2 US replicas ack, US DC loses power: message gone. The durability SLA breaks if min.insync.replicas is misread as a geographic guarantee. Fixes: (1) Set min.insync.replicas=3 (with RF=3) so all replicas must confirm, and use rack/region-aware replica placement so those replicas actually span geographies. Cost: write latency is bottlenecked by the slowest replica (2s during the EU lag spike); producers time out and backpressure builds. (2) Tune log.flush.interval.ms / log.flush.interval.messages to fsync aggressively (OS-level durability on each broker). This kills throughput: fsync-per-write can cut Kafka from ~100K msgs/sec to a few thousand. (3) Raise replica.lag.time.max.ms (e.g., to 5000ms) so EU brokers ride out short lag spikes without leaving the ISR. Cost: the latency tail grows. (4) Geo-aware tiering: make US (RF=2 within US) the durable write path and replicate to EU asynchronously, accepting EU data loss during a US failure. Best practice: make the trade-off explicit. If availability > durability (messages can be retried), combine (3) with min.insync.replicas=2 anchored in US and a non-blocking EU replica, accepting loss if the whole US DC fails. If durability is critical, require a cross-region ISR via min.insync.replicas=3 plus region-aware placement and pay the latency.
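The knobs discussed above, collected in one place. These are real Kafka setting names, but the values are illustrative assumptions for this scenario, not recommendations:

```properties
# Producer side:
acks=all                               # min.insync.replicas only applies with acks=all
# Topic/broker side:
min.insync.replicas=2                  # reject acks=all writes if the ISR shrinks below 2
replica.lag.time.max.ms=5000           # let lagging followers stay in the ISR up to 5s
unclean.leader.election.enable=false   # never promote an out-of-sync replica to leader
```

Disabling unclean leader election matters here: if the US DC dies and an out-of-sync EU broker were allowed to become leader, the acknowledged-but-unreplicated messages would be silently lost rather than the partition going offline.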
Follow-up: Your producer is now timing out because min.insync.replicas=3 and EU is slow. How do you diagnose and whether to increase timeouts or change replica strategy?
Your database layer uses multi-master replication (MySQL with master-master: US master and EU master, bidirectional sync). A customer updates their profile photo in US master at 12:00:00.000 UTC, update is "photo_url=new.jpg". EU master processes a DELETE of same row at 12:00:00.050 UTC. Both updates replicate to each other (150ms later). US now has photo_url=new.jpg (DELETE loses), EU now has photo_url=new.jpg (but row should be deleted). Walk through the conflict and suggest resolution.
Multi-master replication without version control produces exactly this conflict: a US UPDATE and an EU DELETE on the same row with overlapping write windows. Resolution options: (1) Last-write-wins (timestamp): the DELETE (12:00:00.050) is newer than the UPDATE (12:00:00.000), so the DELETE should win. But in practice the DELETE removes the row, then the replicated UPDATE arrives, finds no row, and (in idempotent/row-based modes) re-inserts it with photo_url. Net result: the row is resurrected with the new photo, which is inconsistent. (2) Collision detection: MySQL async replication has no built-in conflict detection; by the time both statements appear in each other's binlogs, the masters have already diverged (or replication has halted with an error). (3) Logical versioning (Lamport-style): every write carries (logical_clock, server_id, action). US UPDATE: (1, US, UPDATE); EU DELETE: (1, EU, DELETE). Same logical clock means they're concurrent and unordered. A deterministic tie-break on server_id ("EU" sorts before "US" alphabetically) picks the same winner on both sides, but which operation wins is arbitrary. Best fixes: (1) **Avoid master-master for mutable rows**: make US the authoritative master and EU a read-only replica; all writes funnel through US. Trade-off: US is a SPOF. (2) **Version vectors per row**: the row starts at {US: 5, EU: 3}; the UPDATE produces {US: 6, EU: 3}, the DELETE produces {US: 5, EU: 4}. Neither vector dominates the other, so the writes are provably concurrent: record the DELETE as a tombstone alongside the UPDATE and surface the conflict instead of silently picking. (3) **Application-layer resolution**: on a detected conflict, alert the user: "Your photo update and the deletion collided. Choose: keep the photo (undo delete) or delete (lose the photo)." Recommended: **(1)** for simplicity, or **(2)** when active-active availability is required.
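The version-vector comparison behind fix (2) can be sketched directly; `compare` is our name for the ordering check, and the vectors mirror the UPDATE/DELETE example above:

```python
def compare(v1, v2):
    """Order two version vectors: 'before', 'after', 'equal', or 'concurrent'.
    v1 happens-before v2 iff every component of v1 is <= the matching
    component of v2 (and they are not equal)."""
    keys = set(v1) | set(v2)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"

update_vv = {"US": 6, "EU": 3}   # US UPDATE bumps its own slot from {US: 5, EU: 3}
delete_vv = {"US": 5, "EU": 4}   # EU DELETE bumps its own slot from {US: 5, EU: 3}
print(compare(update_vv, delete_vv))  # prints concurrent: a real conflict, needs a policy
```

Unlike wall-clock timestamps, the vectors prove the two writes were causally unrelated, so the system can deliberately keep both (tombstone plus value) instead of guessing a winner.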
Follow-up: If this row contains a balance field (financial), does CRDT resolution work, or do you need consensus?