System Design Interview Questions

Multi-Region Active-Active Design


AWS region us-east-1 goes down completely (power outage at data center, lasts 4 hours). Your production system spans us-east-1 (60% of traffic) and us-west-2 (40% of traffic). Users in us-east-1 are now down. What's your architecture so this never causes outage?

The Failure Scenario: if us-east-1 goes down, all traffic normally served by that region must automatically fail over to us-west-2. But failover means degradation: affected users see latency jump (300ms instead of 50ms), and the sudden 60% increase in load could overload us-west-2 and cascade into a second failure.

Active-Active Architecture: (1) Database layer: Primary DB in us-east-1, read replica in us-west-2. Set up continuous replication (RTO ~30s). On us-east-1 failure, promote the us-west-2 replica to primary (~30s of write downtime). Users can tolerate this. (2) API layer: 50 instances in us-east-1, 50 instances in us-west-2. Both regions serve traffic simultaneously. Route53 health checks: if us-east-1 instances fail the health check 3 times in a row, remove that region from DNS entirely. Users automatically route to us-west-2. (3) Session/cache layer: Redis in both regions with active-active replication (Redis cluster mode). Writes go to both regions simultaneously via application-level write fan-out. (4) Content delivery: S3 with cross-region replication. Requests route to the nearest bucket via CloudFront.

Failover Flow: us-east-1 fails → Route53 detects unhealthy instances within 30s → DNS stops returning us-east-1 addresses → new requests route to us-west-2 automatically → existing connections to us-east-1 timeout and retry (client reconnects to us-west-2). Total user impact: <1 minute of degradation (connection timeout).
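
The failover decision above can be sketched as a small simulation — a region is pulled from DNS rotation after 3 consecutive failed health checks and restored on recovery. The class and constant names here are illustrative, not an AWS API:

```python
# Route53-style failover logic: remove a region from DNS answers after
# 3 consecutive failed health checks; restore it on the first success.

FAILURE_THRESHOLD = 3

class HealthChecker:
    def __init__(self, regions):
        self.failures = {r: 0 for r in regions}  # consecutive failures per region
        self.in_rotation = set(regions)

    def record_check(self, region, healthy):
        if healthy:
            self.failures[region] = 0
            self.in_rotation.add(region)              # region recovers automatically
        else:
            self.failures[region] += 1
            if self.failures[region] >= FAILURE_THRESHOLD:
                self.in_rotation.discard(region)      # stop returning its addresses

    def resolve(self):
        # DNS answer: only healthy regions are returned to clients
        return sorted(self.in_rotation)

hc = HealthChecker(["us-east-1", "us-west-2"])
for _ in range(3):
    hc.record_check("us-east-1", healthy=False)
print(hc.resolve())  # ['us-west-2'] -- us-east-1 pulled from DNS
```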

Cost: more compute (50 instances in each region versus 75 in a single region — a ~33% premium here, approaching 2x if each region must absorb the other's full load alone). Availability gain: from 99.95% (one region) to 99.99%+ (two independent regions). For e-commerce with $1M/hour downtime risk, the ~0.04-percentage-point improvement is roughly 3.5 fewer expected downtime hours per year — about $3.5M/year of avoided loss against a $2M/year infrastructure cost.
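
A back-of-envelope check of the availability claim: two independent regions fail together only if both are down at the same time, so combined availability is one minus the product of the individual failure probabilities.

```python
# Availability math for two independent regions at 99.95% each.
single = 0.9995
both_down = (1 - single) ** 2       # probability both regions are down at once
combined = 1 - both_down            # 0.99999975 -- better than "four nines"

hours_per_year = 8760
downtime_single = (1 - single) * hours_per_year    # ~4.4 hours/year
downtime_combined = both_down * hours_per_year     # ~0.002 hours/year

print(f"combined availability: {combined:.8f}")
print(f"expected downtime: {downtime_single:.2f}h -> {downtime_combined:.4f}h per year")
```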

Follow-up: During us-east-1 failure, you promote us-west-2 database to primary. But 50 read replicas in eu-west-1, ap-southeast-1 are still replicating from the now-dead us-east-1. They go out of sync. How do you fix this without manual intervention?

Your company operates an international SaaS with data residency requirements: EU users' data must stay in eu-west-1, US users in us-east-1, Asia users in ap-southeast-1. Each region must be able to survive independently. How do you design multi-region active-active with data residency?

The Constraint: You can't replicate EU data to US (GDPR). You can't replicate US data to Asia (data sovereignty). Each region must be fully independent.

Architecture (Sharded Active-Active): (1) Data sharding: data is sharded by customer_region at the application layer. A customer in Berlin is always routed to eu-west-1, even if a US-based API instance receives the request. This is done via application middleware that checks customer location and routes queries/writes to the correct region's database. (2) API layer: all 3 regions have full API capability. API instances in us-east-1 can serve requests for US customers, EU customers (by redirecting to eu-west-1), and Asia customers (by redirecting to ap-southeast-1). (3) Read/write patterns: EU customers always read/write to eu-west-1 (low latency, compliant). US customers always read/write to us-east-1. But if eu-west-1 is down, can EU customers temporarily read from a us-east-1 backup? No — GDPR doesn't allow it. So each region must have independent capacity for 100% of its own load (not 50% like normal active-active).
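
The residency-aware middleware in step (1) can be sketched as a routing table that pins every query to the customer's home region, regardless of which API instance handles the request. Endpoint names here are illustrative:

```python
# Residency routing: every query goes to the customer's home-region database.

DB_ENDPOINTS = {
    "EU": "mysql.eu-west-1.internal",
    "US": "mysql.us-east-1.internal",
    "ASIA": "mysql.ap-southeast-1.internal",
}

def db_endpoint_for(customer_region):
    # Fail closed: an unknown region must never silently fall back to
    # another jurisdiction's database.
    endpoint = DB_ENDPOINTS.get(customer_region)
    if endpoint is None:
        raise ValueError(f"no resident database for region {customer_region!r}")
    return endpoint

# A Berlin customer served by a US API instance still hits eu-west-1:
print(db_endpoint_for("EU"))  # mysql.eu-west-1.internal
```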

Failover Policy: If eu-west-1 fails, EU users get a maintenance page "We're experiencing regional outages. Your data remains in Europe and will be restored when the region recovers." They cannot failover to us-east-1 due to regulations.

Cost: 3x infrastructure (3 full independent regions). Availability within each region: 99.99%. But availability for EU users is actually that of eu-west-1 only (can't fail over cross-region). Cost-benefit only works if downtime cost is very high.

Follow-up: A customer is headquartered in US but has 20% of staff in EU. They need to access EU data from US sometimes. How do you handle cross-region access without violating data residency?

You're designing real-time collaboration software (like Google Docs) across 3 regions. Two users editing same document from us-east-1 and eu-west-1 simultaneously. Conflict-free replicated data type (CRDT) is used for operational transform. But network partition happens between regions for 30 seconds. What's your design?

The Challenge: During partition, users in us-east-1 can edit the document, users in eu-west-1 can edit separately. Both generate changes locally. When partition heals, you have conflicting edits that must merge consistently across both regions. CRDTs handle this, but you need coordination architecture.

Active-Active with CRDT: (1) Each region has full copy of document + CRDT state. User in us-east-1 makes edit → sent to us-east-1 replica + queued for eu-west-1. (2) During partition: us-east-1 continues accepting edits, applies them to local CRDT. eu-west-1 continues accepting edits independently. Both regions keep operation logs. (3) When partition heals (both regions see each other again): (A) each region sends operation logs to other region, (B) CRDT automatically merges edits using vector clocks (e.g., user A deleted "hello", user B inserted "world"—merge applies both operations correctly), (C) within 2-3 seconds, both regions converge to same document state. (4) Users see final merged result.
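
The convergence mechanics in step (3) can be shown with a far simpler CRDT than the sequence types used for real text editing: a state-based G-Counter. Each replica accepts writes during the partition; merge is an element-wise max, which is commutative, associative, and idempotent, so both replicas reach the same state regardless of merge order.

```python
# Minimal state-based CRDT (G-Counter) illustrating partition-then-converge.

class GCounter:
    def __init__(self, replica_id, replicas):
        self.replica_id = replica_id
        self.counts = {r: 0 for r in replicas}

    def increment(self, n=1):
        # local edits only touch this replica's own slot
        self.counts[self.replica_id] += n

    def merge(self, other):
        # element-wise max: commutative, associative, idempotent
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts[r], c)

    def value(self):
        return sum(self.counts.values())

regions = ["us-east-1", "eu-west-1"]
east = GCounter("us-east-1", regions)
west = GCounter("eu-west-1", regions)

# During the 30s partition, each side keeps accepting edits independently:
east.increment(5)
west.increment(3)

# Partition heals: replicas exchange state and merge.
east.merge(west)
west.merge(east)
print(east.value(), west.value())  # 8 8 -- both regions converge
```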

User Experience: During the partition, users keep editing with no local lag — they simply stop seeing the other region's edits. When the partition heals, they might see their cursor position shift as remote edits merge in, but the content converges correctly. No permanent conflicts.

Tradeoff: Availability is maintained (both regions keep working during partition). Consistency is eventual (takes ~3s to converge after partition heals). This is acceptable for collaboration tools.

Follow-up: During partition, one region generates 10M edit operations (huge document, heavy editing). Other region generates 1M edits. When merging, would CRDT convergence time be fast enough, or could it cause user-visible lag?

Your financial platform replicates account balances across 4 regions (active-active). A user makes a withdrawal from us-east-1 at the exact same moment their account is charged from eu-west-1 (fraud scenario: money drained from both regions before replication catches up). How do you prevent double-spending across regions?

The Race Condition: User balance = $1000 in the us-east-1 replica. Attacker sends a $900 withdrawal to us-east-1 AND a $900 withdrawal to eu-west-1 simultaneously. Both go through (each region's balance check sees $1000 independently). After replication, both replicas converge to -$800, and $1800 has been drained from a $1000 account.

Solution: Distributed Transaction Commit: (1) Two-phase commit (2PC) coordinator: every financial transaction goes through central coordinator, not directly to regional replicas. User initiates withdrawal → coordinator in us-east-1 writes to transaction log. (2) Phase 1 (lock): coordinator locks the account in us-east-1 replica: "this account is being modified, no other transactions allowed." Waits for lock to replicate to all other regions (eu-west-1, ap-southeast-1). (3) Phase 2 (commit): Once lock is replicated globally, coordinator confirms transaction, deducts money in all regions simultaneously. (4) If any region rejects the transaction (lock timeout, network partition), entire transaction is aborted (no deduction anywhere).
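
The 2PC flow above can be sketched with in-memory "region replicas" that prepare (lock plus balance check) and then commit or abort. This is a minimal illustration, not a production implementation — a real coordinator must also persist its decision log to survive crashes:

```python
# Two-phase commit sketch: lock everywhere, then commit everywhere,
# or abort everywhere if any region refuses.

class RegionReplica:
    def __init__(self, balance):
        self.balance = balance
        self.locked = False

    def prepare(self, amount):
        # Phase 1: lock the account and verify funds
        if self.locked or self.balance < amount:
            return False
        self.locked = True
        return True

    def commit(self, amount):
        self.balance -= amount
        self.locked = False

    def abort(self):
        self.locked = False

def withdraw(regions, amount):
    prepared = []
    for r in regions:
        if r.prepare(amount):
            prepared.append(r)
        else:
            for p in prepared:        # any refusal aborts the whole transaction
                p.abort()
            return False
    for r in regions:                 # Phase 2: all locked -> commit everywhere
        r.commit(amount)
    return True

replicas = [RegionReplica(1000) for _ in range(2)]
print(withdraw(replicas, 900))        # True  -- first withdrawal succeeds
print(withdraw(replicas, 900))        # False -- refused: balance is now 100
print([r.balance for r in replicas])  # [100, 100] -- no double-spend
```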

Cost: Latency increases (must wait for the lock to replicate to all regions, ~100ms). But financial transactions are not latency-critical: users will accept an extra 100ms far more readily than money drained from their account.

Tradeoff: Availability vs consistency. 2PC reduces availability (if 1 region is partitioned, all withdrawals block). But it prevents double-spending (data integrity is critical for finance).

Follow-up: During high-load periods (Black Friday sale, everyone buying), 2PC creates cascading lock contention. Transaction latency jumps from 100ms to 5s (users waiting for lock). How do you maintain financial correctness while avoiding this bottleneck?

You're designing a multi-region MySQL setup with active-active replication. Every write goes to every region simultaneously. But you notice "replication lag" after 1-2 minutes in eu-west-1, while us-west-2 is caught up instantly. Users in eu-west-1 read stale data. What's happening and how do you fix it?

Root Cause: Replication lag appears when a replica's write-apply queue backlogs. A single-threaded replication worker can apply about 5000 writes/sec. us-west-2 keeps up, but eu-west-1's higher RTT to the other regions (~150ms) delays binlog delivery, and at 6000 writes/sec of global write load the applier falls behind by ~1000 writes every second. The backlog keeps growing for as long as the load persists — by the time you look, the replica is 1-2 minutes behind.
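
The lag arithmetic, made explicit: a single-threaded applier capped at 5000 writes/sec facing 6000 incoming writes/sec falls further behind every second the load persists, and can only drain the backlog once the incoming rate drops.

```python
# Replication backlog growth under sustained overload.
incoming = 6000     # writes/sec arriving from all regions
apply_rate = 5000   # writes/sec a single replication worker can apply
backlog_growth = incoming - apply_rate   # 1000 writes/sec of new lag

burst_seconds = 60
backlog = backlog_growth * burst_seconds             # 60,000 queued writes
catch_up = backlog / apply_rate  # drain time once incoming writes stop entirely
print(backlog, f"{catch_up:.0f}s to drain once load subsides")
```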

Diagnosis: SSH into the eu-west-1 MySQL host and run SHOW SLAVE STATUS; check the Seconds_Behind_Master field (on MySQL 8.0.22+, the equivalents are SHOW REPLICA STATUS and Seconds_Behind_Source).

Fix #1: Parallel Replication: Enable parallel replication in MySQL 5.7+ (slave_parallel_type=LOGICAL_CLOCK, slave_parallel_workers=8). Instead of 1 worker applying writes sequentially, 8 workers apply independent transactions concurrently. When the workload has little transaction interdependence, throughput can jump from 5K toward 40K writes/sec, eliminating the lag.

Fix #2: Regional Write Coalescing: Don't send every write to every region. Instead: writes go to primary region (us-east-1), then replicate to read replicas in all regions at lower priority. This decouples hot writes from regional replication. Lag drops from 2 minutes to 5 seconds.

Fix #3: Read from Primary: For reads that must be fresh: always read from us-east-1 primary. Accept 100ms latency hit. For reads that tolerate staleness (analytics, reporting): read from regional replicas. This splits read load.
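
The read split in Fix #3 can be sketched as a tiny routing function: freshness-critical reads pin to the primary and accept the cross-region latency, staleness-tolerant reads stay on the local replica. Endpoint names are illustrative:

```python
# Read routing by freshness requirement.

PRIMARY = "mysql.us-east-1.primary"
REPLICAS = {
    "eu-west-1": "mysql.eu-west-1.replica",
    "us-west-2": "mysql.us-west-2.replica",
}

def route_read(local_region, needs_fresh):
    if needs_fresh:
        return PRIMARY                           # accept ~100ms cross-region hit
    return REPLICAS.get(local_region, PRIMARY)   # cheap local read, maybe stale

print(route_read("eu-west-1", needs_fresh=True))   # mysql.us-east-1.primary
print(route_read("eu-west-1", needs_fresh=False))  # mysql.eu-west-1.replica
```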

Follow-up: After enabling parallel replication, you notice data corruption on eu-west-1 replica—some rows have partial updates. Why did parallel replication cause this, and how do you prevent it?

Your company acquires a startup. The startup's infrastructure is in GCP (us-central1), yours is in AWS (us-east-1, eu-west-1). You need to integrate their user database with yours and run as active-active post-integration. How do you unify data across cloud providers?

The Challenge: You can't use native replication (MySQL replication works within AWS, not across AWS ↔ GCP). You need a cloud-agnostic replication solution.

Solution: Kafka-Based Change Data Capture (CDC): (1) Startup's GCP MySQL has Debezium (CDC tool) running: captures every INSERT/UPDATE/DELETE as event in Kafka topic "startup-users-changes". (2) Kafka cluster lives in GCP (same region as startup DB for low latency). (3) Your AWS setup has consumer reading from GCP Kafka: new event arrives → apply to AWS MySQL in us-east-1. (4) Your AWS MySQL also has Debezium capturing changes → publishes to AWS Kafka → consumed by eu-west-1 MySQL. (5) Startup's GCP MySQL also has consumer for AWS Kafka: keeps GCP in sync.

Architecture Flow: Startup user update (GCP) → Kafka (GCP) → AWS consumer → us-east-1 MySQL + AWS Kafka → eu-west-1 MySQL → Kafka (GCP) consumer → Startup MySQL. Because this topology is a loop, every CDC event must carry an origin tag so each consumer skips changes that originated in its own database — otherwise every update would circulate forever. With that filter in place, everything stays in sync.

Conflict Resolution: If both sides update same user simultaneously: use timestamp or version vector. "Last write wins" is simple but risky. Better: application-level awareness (don't allow concurrent updates to same user).
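
The timestamp-based "last write wins" rule mentioned above can be sketched in a few lines. It is simple but drops one side's update on concurrent writes, and skewed clocks can pick the wrong winner — which is why the text recommends application-level coordination instead. The record layout here is illustrative:

```python
# Last-write-wins merge: pick the record with the latest timestamp,
# breaking exact ties by origin ID so every region chooses the same winner.

def merge_lww(record_a, record_b):
    # each record: (value, timestamp_ms, origin)
    return max(record_a, record_b, key=lambda r: (r[1], r[2]))

gcp_update = ("alice@new.example", 1700000000500, "gcp-us-central1")
aws_update = ("alice@old.example", 1700000000200, "aws-us-east-1")
print(merge_lww(gcp_update, aws_update)[0])  # alice@new.example
```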

Costs & Tradeoffs: Kafka increases infrastructure cost (+$10K/month). But it's cloud-agnostic and scales to any number of replicas (add Azure, Digital Ocean, etc. later without re-architecting).

Follow-up: Kafka topic has 5-second lag. A user updates their profile on startup side, immediately logs into your AWS side, but their new profile data hasn't replicated yet. They see stale data. How do you handle this user-visible inconsistency?

You deploy a cache layer (Redis) across 3 regions in active-active mode. Write a key in us-east-1, it should instantly appear in eu-west-1 and ap-southeast-1 caches. But network is unreliable (1-2% packet loss between regions). How do you ensure cache consistency?

Naive Approach (Fails): Every write to Redis in us-east-1 is replicated to eu-west-1 and ap-southeast-1 with fire-and-forget messages. With 1-2% packet loss, some replication messages never arrive, and the caches silently drift apart.

Solution: Ordered Replication with Acks: (1) Every write to us-east-1 Redis includes a sequence number (write #1, write #2, etc.). (2) Replication message is sent to eu-west-1 + ap-southeast-1 with sequence number. (3) Both regions acknowledge receipt: "I received write #5". (4) us-east-1 tracks: which writes have been ack'd from eu-west-1? Which from ap-southeast-1? (5) If eu-west-1 ACK is missing for write #5 after 5 seconds, resend it. (6) When eu-west-1 restarts, request missing writes (sequence #1-#100) from us-east-1.
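
Steps (1)-(6) can be simulated with a small bookkeeping class: the origin keeps every write until all replicas have ack'd its sequence number, and can list exactly what a lossy network dropped so it knows what to resend. Names here are illustrative:

```python
# Sequenced, ack-tracked replication: track per-replica acks, keep
# un-acked writes as resend candidates.

class ReplicatedWriter:
    def __init__(self, replicas):
        self.seq = 0
        self.pending = {}                    # seq -> (key, value), not yet durable
        self.acked = {r: set() for r in replicas}

    def write(self, key, value):
        self.seq += 1
        self.pending[self.seq] = (key, value)
        return self.seq                      # sequence number sent with the message

    def ack(self, replica, seq):
        self.acked[replica].add(seq)
        if all(seq in acks for acks in self.acked.values()):
            del self.pending[seq]            # durable everywhere -> forget it

    def unacked(self, replica):
        # everything this replica still owes an ack for (resend after timeout)
        return sorted(s for s in self.pending if s not in self.acked[replica])

w = ReplicatedWriter(["eu-west-1", "ap-southeast-1"])
s1 = w.write("user:42", "v1")
s2 = w.write("user:42", "v2")
w.ack("eu-west-1", s1); w.ack("ap-southeast-1", s1)   # write #1 fully acked
w.ack("ap-southeast-1", s2)                           # eu-west-1 lost write #2
print(w.unacked("eu-west-1"))  # [2] -- resend after the 5s timeout
```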

Consistency Guarantee: Once both remote regions have ack'd, the write is durable across all 3. If a user reads from eu-west-1 immediately after writing to us-east-1, they get eventual consistency (~100ms window). If a user must read their own write immediately, route those reads back to us-east-1 (read-after-write consistency).

Trade-off: Write latency increases from 10ms to 150ms (wait for all 3 acks). But cache stays consistent. For most use cases, 150ms is acceptable for cache (data in cache is non-critical).

Follow-up: After one region fails (eu-west-1 outage), you're waiting for ACKs that never come. All writes to us-east-1 now block for 5s timeout waiting for eu-west-1 ACK. Latency for other users (not in eu-west-1) suffers. How do you handle this?

Your company expands globally. You now run 10 regions: 5 in AWS, 3 in Azure, 2 in GCP. Each region runs independently. But you need global transaction support: transfer 1000 credits from user in Singapore (GCP) to user in Ireland (AWS). How do you coordinate?

Multi-Cloud Consensus: Single-region transactions use ACID (easy). Cross-cloud transactions need distributed consensus. (1) Coordinator service (runs in all 10 regions, elects a leader via Raft): transaction request arrives at Singapore region. Singapore coordinator becomes "transaction leader". (2) Two-phase commit across all 10 regions: Phase 1 (prepare): Singapore coordinator asks all 10 regions "can you deduct 1000 credits from this user and lock the account?" All 10 respond "yes, locked". (3) Phase 2 (commit): coordinator instructs all 10 to finalize the deduction. All 10 confirm. Transaction complete.

Complexity: If any region is partitioned (can't reach), entire transaction aborts. This is the cost of strong consistency across 10 regions.

Practical Alternative: Eventual Consistency with Compensation: (1) Deduct 1000 credits from Singapore immediately (optimistic). (2) Queue a "credit transfer" job: send 1000 credits to Ireland region. (3) Ireland region receives job, credits user. (4) If transfer fails (Ireland offline), job retries forever until success. (5) If deduction succeeds but transfer fails permanently, Ireland issues refund to Singapore (compensation transaction). This gives lower latency (not waiting for all 10 regions) but requires handling failure scenarios.
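
Steps (1)-(5) amount to a compensation (saga-style) flow, which can be sketched as: deduct optimistically, retry the remote credit, and refund if the credit cannot be applied. In a real system the retry would run forever from a durable queue; it is bounded here only for illustration, and all names are hypothetical:

```python
# Compensation flow: optimistic deduction, bounded retry, refund on failure.

def transfer(source, dest, amount, max_retries=3):
    if source["balance"] < amount:
        return "insufficient-funds"
    source["balance"] -= amount              # (1) optimistic local deduction

    for _ in range(max_retries):             # (2)-(4) retry the remote credit
        if dest["online"]:
            dest["balance"] += amount
            return "transferred"

    source["balance"] += amount              # (5) compensation: refund the source
    return "refunded"

singapore = {"balance": 5000, "online": True}
ireland = {"balance": 0, "online": False}
print(transfer(singapore, ireland, 1000))        # refunded -- Ireland offline
ireland["online"] = True
print(transfer(singapore, ireland, 1000))        # transferred
print(singapore["balance"], ireland["balance"])  # 4000 1000
```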

Trade-off: Strong consistency (2PC) = high latency (500ms+), complex coordination. Eventual consistency = low latency (50ms), simpler code, requires compensation logic.

Follow-up: During the 2PC commit phase, one region becomes unavailable and times out. The other 9 regions have already committed the deduction. You now have 1000 credits deducted from Singapore with no corresponding credit to Ireland. How do you recover?
