Your fintech platform processes KYC (Know Your Customer) verifications. Regulations require: RPO (Recovery Point Objective) = 15 minutes, RTO (Recovery Time Objective) = 2 hours. You're in us-east-1 only. A region-wide outage occurs; AWS restores it in 3 hours. Your backup is 8 hours old. You've lost 3 hours of new verifications (500 customers). Regulators fine you. Design a 15-min RPO / 2-hour RTO architecture from scratch.
Fintech DR architecture for 15-min RPO / 2-hour RTO: (1) Primary region (us-east-1): RDS PostgreSQL (Multi-AZ), DynamoDB (on-demand), Lambda (auto-scaling). (2) Secondary region (us-west-2, standby): cross-region RDS read replica (continuous streaming replication, typically 1-2 sec lag), DynamoDB Global Tables for cross-region replication, S3 cross-region replication. (3) RPO 15 min achieved by: (a) RDS: automated backups give 5-minute transaction-log granularity (Aurora backs up continuously to S3), and the streaming replica keeps the standby only seconds behind. (b) DynamoDB: Global Tables replicate writes to us-west-2 within seconds; also enable Point-in-Time Recovery (PITR, 35-day retention), but note a PITR restore creates a new table and is for corruption recovery, not fast regional failover. (c) Application: write to primary; an async queue (SQS) mirrors important events to S3/DynamoDB in the secondary every 5 min as an additional audit trail. (4) RTO 2 hours achieved by: (a) DNS failover: Route53 with health checks (60-second check interval); if the primary fails its health check, traffic shifts to us-west-2 within ~2 minutes. (b) RDS promotion: promoting a cross-region replica typically takes minutes; Aurora Global Database's managed failover targets about a minute. (c) Lambda: stateless, pre-deployed to us-west-2 and activated in seconds. (d) Testing: weekly DR drill rotating standby to primary role. Cost: 2x infrastructure (primary + hot standby) = ~2x cloud bill, but regulatory compliance is worth it. (5) Implementation: 3 weeks (RDS replica setup, DynamoDB Global Tables + PITR, Route53 health checks, DR runbook + team training). Alternative if budget-constrained: warm standby (us-west-2 with ~1/5 resources) that scales up automatically on failover. Saves ~40%; since replication is unchanged, RPO stays near 15 min, but RTO stretches toward 4 hours while capacity scales (check whether regulators accept that).
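The 15-min RPO and 2-hour RTO targets above are only real if drills measure them; a minimal verification sketch in Python (the function name and drill timestamps are illustrative, not an AWS API):

```python
from datetime import datetime, timedelta

# Targets from the architecture above.
RPO_TARGET = timedelta(minutes=15)
RTO_TARGET = timedelta(hours=2)

def verify_drill(last_replicated_at: datetime,
                 failure_at: datetime,
                 service_restored_at: datetime):
    """Return (data_loss_window, downtime, rpo_met, rto_met) for one drill."""
    data_loss = failure_at - last_replicated_at   # writes not yet on the standby
    downtime = service_restored_at - failure_at   # time until traffic is served again
    return data_loss, downtime, data_loss <= RPO_TARGET, downtime <= RTO_TARGET

# Example drill: replication was 2 min behind, failover completed in 9 min.
t0 = datetime(2024, 1, 1, 12, 0)
loss, down, rpo_ok, rto_ok = verify_drill(t0 - timedelta(minutes=2), t0,
                                          t0 + timedelta(minutes=9))
print(rpo_ok, rto_ok)  # True True
```

Recording these two deltas per drill is exactly the evidence a regulator would ask for after the scenario's 3-hour loss.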
Follow-up: Failover to us-west-2 succeeds, but the standby DynamoDB has stale KYC records (10 minutes behind primary). Customers are denied, then approved again. How do you handle replay of denied KYC during failover?
Your analytics platform has 50 TB of data in S3 (us-east-1). RPO requirement: 1 hour (regulatory). RTO requirement: 24 hours (business can tolerate a day of reports being stale). Your data team says: "We'll use S3 cross-region replication." But replication takes 4+ hours for 50TB. You only have a 1-hour RPO window. How do you achieve RPO within the replication constraint?
S3 cross-region replication alone doesn't meet a 1-hour RPO because it's asynchronous and a bulk copy of 50TB takes hours. The solution is to separate "replication" from "RPO achievement": (1) Separate data into hot (1-7 days old, <10TB) and cold (7+ days old, 40TB). (2) Hot data: use S3 replication with Replication Time Control (RTC), which replicates 99.99% of new objects within 15 minutes. Pair with: (a) Versioning: enable on both primary and backup buckets; every write creates a new version, so the backup bucket preserves history for point-in-time recovery. (b) S3 Object Lock: optional, if compliance requires immutability. (3) Cold data (40TB): this data is historical and no longer changing, so once the initial copy lands it carries no ongoing RPO exposure; keep it in sync with a daily AWS DataSync or S3 Batch Operations copy job in a 23:00-04:00 UTC window (5 hours), comfortably inside the 24-hour RTO. (4) RPO achievement: (a) Hot data: ≤1-hour RPO via RTC replication + versioning; if the primary region fails at T=0, restore from the backup bucket in us-west-2 with at most the last ~15 minutes of writes in flight. Data loss = well under 1 hour. (b) Cold data: only the current day's tiering moves are at risk, acceptable given the business SLA. (5) Cost: replication charges = per-request costs plus inter-region data transfer (~$0.02/GB) plus the RTC per-GB fee; for this footprint, roughly ~$1,000/month. DataSync for cold deltas: 1TB/day × 30 = 30TB transferred × $0.0125/GB ≈ $375/month. Total: ~$1,400/month for RPO/RTO achieved. Alternative if budget is tight: hourly scheduled sync jobs instead of RTC. Failover becomes manual, but RPO is still met.
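The hot/cold split above means "RPO" is really two numbers; a minimal sketch measuring the actual data-at-risk window per tier (function and field names are illustrative):

```python
from datetime import datetime, timedelta

def data_at_risk(now: datetime,
                 last_hot_replication: datetime,
                 last_cold_sync: datetime,
                 hot_rpo: timedelta = timedelta(hours=1),
                 cold_rpo: timedelta = timedelta(hours=24)) -> dict:
    """Report, per tier, how much data would be lost if the region died now."""
    hot_gap = now - last_hot_replication
    cold_gap = now - last_cold_sync
    return {
        "hot_gap": hot_gap, "hot_rpo_met": hot_gap <= hot_rpo,
        "cold_gap": cold_gap, "cold_rpo_met": cold_gap <= cold_rpo,
    }

now = datetime(2024, 1, 1, 12, 0)
status = data_at_risk(now,
                      last_hot_replication=now - timedelta(minutes=12),
                      last_cold_sync=now - timedelta(hours=10))
print(status["hot_rpo_met"], status["cold_rpo_met"])  # True True
```

Alarming on these gaps continuously (rather than trusting that the jobs ran) is what distinguishes the advertised RPO from the effective one.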
Follow-up: You replicated 50TB successfully, but during failover, you realize the replication job ran only 10 hours ago. Current data (last 1 hour) is lost. Was RPO actually 1 hour or 10 hours?
Your SaaS platform has customers in 3 regions (US, EU, Asia). Each region has its own infrastructure (database, cache, storage). A major customer in EU has a contract: "99.99% availability, failover within 30 min if primary data center fails." You're currently using single-region deployments (single RDS, single ElastiCache). How do you architect multi-region DR for this customer?
Multi-region architecture for the 30-min RTO SLA: (1) Primary region (eu-west-1): RDS Aurora cluster (Multi-AZ), ElastiCache Redis (Multi-AZ), application servers (auto-scaling). (2) Secondary region (eu-central-1, warm standby): Aurora cross-region read replica (streaming, typically <1 sec lag), ElastiCache replica kept in sync via Global Datastore, application servers (minimum footprint, ready to scale). (3) Data consistency: (a) RDS: all writes go to the primary (eu-west-1); reads can hit the replica (eventual consistency with ~1-sec lag is acceptable for most SaaS reads). (b) ElastiCache: writes go to the primary, the secondary is read-only; Global Datastore provides the cross-region replication and promotion. (c) Reads that must be strongly consistent go to the primary; lag-tolerant reads can be offloaded to the secondary, e.g. Route53 weighted routing sending ~10% of read traffic there. (4) Failover automation: (a) Health checks: Route53 health checks on the primary (a custom health-check Lambda querying RDS metrics); a failure is detected within ~60 sec. (b) DNS cutover: the Route53 policy shifts 100% of traffic to the secondary within ~120 sec of the failed health check. (c) RDS promotion: Aurora replica promotion typically takes 30-60 sec (cross-region promotion can take longer; Aurora Global Database's managed failover targets about a minute). (d) Total failover time: ~120 sec (Route53) + ~60 sec (promotion) = ~3 minutes, well under the 30-min SLA. (5) Cost: ~1.8x infrastructure rather than 2x (the secondary can run smaller). For a €5K/month primary, the secondary adds ~€4K/month = €9K/month total. ROI: SLA compliance enables a contract premium (+10-20% margin). (6) Testing: monthly failover drill; rotate primary/secondary roles quarterly to validate both can handle full production load.
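The detection step in (4)(a) can be sketched as a consecutive-failure threshold: the endpoint is declared down only after N failed checks in a row, which bounds detection delay at interval × threshold. (Route53's real intervals and thresholds are configurable; the numbers here are illustrative.)

```python
class FailoverTrigger:
    """Declare failover only after N consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failed_over = False

    def record_check(self, healthy: bool) -> bool:
        """Feed one health-check result; return True the moment failover fires."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.failed_over:
            self.failed_over = True  # e.g. flip DNS weights to the standby here
            return True
        return False

trigger = FailoverTrigger()
results = [trigger.record_check(h) for h in [True, False, False, False]]
print(results)  # [False, False, False, True]
```

Requiring consecutive failures is what keeps a single dropped probe from triggering a full (and expensive) regional cutover.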
Follow-up: During failover, the warm standby in eu-central-1 was provisioned with only 2 app servers. Primary handled 100K concurrent users; failover causes CPU to spike to 100%, requests timeout. How do you prevent this during failover?
Your batch processing job runs nightly (2 AM UTC) and processes 200GB of data. It takes 4 hours (finishes 6 AM). If the job fails at 5 AM, you have until 2 PM to retry, or customers see stale reports. RPO is 24 hours (one day stale is acceptable). Design a fault-tolerant batch pipeline with checkpointing.
Fault-tolerant batch pipeline with checkpointing for the 24-hour RPO: (1) Architecture: Step Functions orchestrates a Glue/Spark job that processes data in stages (stages 1-5, ~40GB each). (2) Checkpointing mechanism: (a) After each stage, write progress to a DynamoDB table: { job_id, stage_number, timestamp, status, data_hash }. (b) Before each stage, query DynamoDB for the last completed stage; if stage 3 completed at 4:45 AM, restart from stage 4 and skip nearly three hours of re-work. (c) Idempotency: each stage reads its input and writes its output deterministically, with no external side effects, so re-running a stage is always safe. (3) Failure handling: (a) Step Functions retry policy: max 3 retries with exponential backoff; if all retries fail, publish an SNS notification for manual escalation. (b) Glue job timeout: 6 hours (2 hours of buffer beyond the normal 4-hour runtime); on timeout, Glue cancels the run and Step Functions retries. (c) DynamoDB write failures: fall back to writing the checkpoint to S3 as JSON. (4) Resume logic: (a) An operator manually triggers `ResumeJob(job_id, start_stage=4)` in the console (or Step Functions auto-retries if the failure was transient). (b) The Glue job re-reads the checkpoint, skips stages 1-3, and processes stages 4-5. Runtime: ~2 hours (vs 4 hours from scratch). (c) If data corruption is detected (data_hash mismatch), trigger a full re-run from stage 1 as a safety mechanism. (5) RPO/RTO: RPO = 1 day (yesterday's completed run is the baseline). RTO = 2-4 hours (resume from the last checkpoint vs full re-run). Implementation: 2 weeks (DynamoDB schema, Glue checkpointing code, Step Functions logic, testing). Cost: minimal (DynamoDB on-demand at this write volume costs a few dollars/month, Glue runtime unchanged, S3 storage <$1/month).
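The checkpoint/resume logic in (2) and (4) can be sketched as follows, with an in-memory dict standing in for the DynamoDB table (the stage work and schema are illustrative):

```python
import hashlib

checkpoints: dict[int, dict] = {}  # stage_number -> {"status", "data_hash"}

def data_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def run_stage(stage: int, payload: bytes) -> bytes:
    out = payload + f"|stage{stage}".encode()  # stand-in for the real 40GB transform
    checkpoints[stage] = {"status": "COMPLETED", "data_hash": data_hash(out)}
    return out

def resume_job(stages: int, outputs: dict[int, bytes]) -> list[int]:
    """Re-run only stages lacking a valid checkpoint; return the stages executed."""
    executed, payload = [], b"input"
    for stage in range(1, stages + 1):
        cp = checkpoints.get(stage)
        if cp and cp["status"] == "COMPLETED" and \
           data_hash(outputs[stage]) == cp["data_hash"]:
            payload = outputs[stage]            # checkpoint valid: skip the stage
        else:
            payload = run_stage(stage, payload)  # missing or corrupt: re-run it
            outputs[stage] = payload
            executed.append(stage)
    return executed

outputs: dict[int, bytes] = {}
first = resume_job(5, outputs)               # fresh run: all 5 stages execute
del checkpoints[4]; del checkpoints[5]       # simulate a crash after stage 3
second = resume_job(5, outputs)              # resume: only stages 4-5 re-run
print(first, second)  # [1, 2, 3, 4, 5] [4, 5]
```

The hash comparison is the safety mechanism from (4)(c): if a stored stage output no longer matches its recorded hash, that stage (and everything downstream) re-runs instead of silently propagating corrupt data.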
Follow-up: Job resumed from stage 3 checkpoint, but stage 4 output is 5% smaller than expected. Data was silently dropped. How do you detect data loss during checkpointing?
Your company has a "pilot" Kubernetes cluster on EC2 in us-east-1. It runs 20 microservices. If the cluster fails, the entire platform goes down (no DR). Your VP of Eng says: "We need multi-region K8s clusters for high availability, but that requires Istio, service mesh, and doubles complexity." You have 8 weeks to implement DR. What's your architecture: full multi-region active-active, or simpler alternative?
Skip active-active; implement active-passive multi-region K8s for fast failover without service-mesh complexity: (1) Primary cluster (us-east-1, active): EKS managed cluster, 20 microservices deployed normally. (2) Secondary cluster (us-west-2, standby): EKS managed cluster with identical config but no application pods running (only system pods); idle cost is the EKS control plane plus a minimal node group, on the order of $150-300/month. (3) Continuous sync: (a) Terraform/Helm manages both clusters' configs from one source of truth: a Git repository. (b) Application images: pushed to ECR in both regions (cross-region replication). (c) Persistent data (DB, cache): RDS cross-region replica, ElastiCache replication. (4) Failover mechanism: (a) Route53 health check on the primary cluster's ingress endpoint. (b) If the primary fails, the health check fails within ~60 sec. (c) Route53 switches traffic to us-west-2. (d) Manual trigger: a DevOps operator runs `kubectl apply -f production-workloads.yaml` against the us-west-2 cluster; nodes scale out and pods start in minutes. (e) Total RTO: ~60 sec of detection, plus node scale-up and pod startup (typically 5-15 min), plus operator response time; plan for tens of minutes, not seconds, which is still far better than no DR at all. (5) Cost optimization: (a) Primary cluster: 20 nodes, ~$2K/month. (b) Standby cluster: minimal node group, ~$100-300/month. (c) During failover, the standby auto-scales to 20 nodes (+$2K/month pro-rated, temporary: 10 min to 1 hour). (d) Total steady-state cost: ~$2.1-2.3K/month (~10% overhead for standby). (6) Alternative: a single region with multi-AZ EKS (~15% cost overhead for ~99.9% availability, vs multi-region ~99.99%). Acceptable if the business can tolerate a ~4-hour RTO for a full-region failure (rare). Choose based on SLA: 99.9% (multi-AZ, single region) vs 99.99% (multi-region). Implementation: 4 weeks (Terraform multi-region setup, cross-region replication, Route53 health checks, DR drill).
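One way to honor the single source of truth in (3)(a) is to render region-specific endpoints from one template at deploy time, so the standby cluster never ships configs hard-coded to the primary region (the template keys and endpoint names here are illustrative):

```python
# One template in Git; the region is injected at deploy time per cluster,
# so no manifest ever hard-codes us-east-1.
TEMPLATE = {
    "DB_HOST": "orders-db.{region}.example.internal",
    "QUEUE_URL": "https://sqs.{region}.amazonaws.com/123456789012/orders",
}

def render_config(region: str) -> dict[str, str]:
    """Render the shared config template for one cluster's region."""
    return {key: value.format(region=region) for key, value in TEMPLATE.items()}

primary = render_config("us-east-1")
standby = render_config("us-west-2")
print(standby["DB_HOST"])  # orders-db.us-west-2.example.internal
```

In practice the same idea is expressed as Helm values files or Kustomize overlays per region; the point is that the rendered artifact, not a hand-edited copy, is what gets applied in each cluster.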
Follow-up: Failover to us-west-2 succeeded, but one microservice has a stale configmap (points to us-east-1 endpoints). It fails because primary region is down. How do you prevent stale configs during failover?
Your mobile app has <20ms P99 latency SLA. Users are in US (60%), EU (25%), Asia (15%). You're using us-east-1 only, which causes 150ms latency for Asia users. You want to implement disaster recovery + low-latency with read replicas in 3 regions. But your RTO SLA is 15 minutes (currently 4 hours). How do you achieve both?
Low-latency + fast failover with multi-region read replicas: (1) Architecture: primary database (us-east-1, RDS Aurora MySQL) takes all writes; read replicas in us-west-1 (US latency), eu-west-1 (EU latency), and ap-southeast-1 (Asia latency), each region with its own application servers (Aurora Global Database is the managed way to run this topology). (2) Latency optimization: (a) Route users by geography with Route53 geolocation routing: US users → us-east-1 app servers, EU users → eu-west-1, Asia users → ap-southeast-1. (b) Each region reads from its local replica (single-digit ms instead of ~150ms cross-region). (c) Result: all regions can meet the <20ms P99 read-latency SLA. (3) Writes: all writes go to the us-east-1 primary; write latency from other regions is the network round-trip (~100-200ms). Acceptable if write-heavy ops are rare (most apps are read-heavy). (4) Failover design (15-min RTO): (a) If the us-east-1 primary fails, promote the eu-west-1 replica to primary; with Aurora Global Database this is a managed operation targeting about a minute. (b) Route53 shifts write traffic to eu-west-1 for the duration of the outage. (c) The remaining replicas (us-west-1, ap-southeast-1) re-attach to the new primary; Global Database handles this, whereas plain cross-region replicas must be recreated, which takes longer (budget for it in the runbook). (d) Total failover time: ~60 sec (promotion) + ~30 sec (DNS propagation) = under 2 minutes in the best case, comfortably inside the 15-min SLA even with manual steps. (5) Cost: primary + 3 read replicas ≈ 4x database cost ($800/month for the primary Aurora cluster, ~$3.2K total), plus app servers in three extra regions. Reasonable for a global SLA. (6) Data consistency: replicas normally lag <1 sec, but a read issued immediately after a write can still miss it. Make writes idempotent so retries are safe; for payment flows, add versioning to detect duplicates, or route a session's reads to the primary until its local replica has caught up. Implementation: 6 weeks (multi-region Aurora setup, Route53 config, geo-routing in the app, testing).
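The consistency caveat in (6) can be enforced with a read-your-writes guard: each session remembers the position of its last write, and reads fall back to the primary until the local replica has applied at least that position. (The log-sequence-number mechanism sketched here is illustrative, not an Aurora API.)

```python
class SessionRouter:
    """Route reads to the local replica only once it has a session's writes."""

    def __init__(self):
        self.primary_lsn = 0   # position of the latest committed write
        self.replica_lsn = 0   # position the local replica has applied

    def write(self, session: dict) -> None:
        """Commit a write on the primary and stamp the session with its position."""
        self.primary_lsn += 1
        session["last_write_lsn"] = self.primary_lsn

    def route_read(self, session: dict) -> str:
        """Use the replica only if it has caught up to this session's last write."""
        needed = session.get("last_write_lsn", 0)
        return "replica" if self.replica_lsn >= needed else "primary"

router = SessionRouter()
session = {}
router.write(session)                # e.g. payment submitted in Asia
print(router.route_read(session))    # primary  (replica still behind)
router.replica_lsn = router.primary_lsn
print(router.route_read(session))    # replica  (caught up)
```

The cost is that a session's first read after a write pays the cross-region round-trip, but only that session and only until the replica catches up; all other traffic keeps its local latency.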
Follow-up: Asia replica is 120 seconds behind primary, but an Asia user submits a payment. They immediately read from local replica and see no transaction (it's still in-flight to us-east-1). How do you prevent read-before-write consistency violations?
Your company has a "disaster recovery contract" with a compliance auditor: "Test failover at least quarterly. Document the actual time taken (RTO) and data loss (RPO). Any deviation >10% is a finding." You have a 1-hour RTO SLA and 15-min RPO SLA. Your last 3 quarterly tests were 58 min, 1 hour 3 min, and 59 min. You're within 10%, but averaging 1 hour. Your VP asks: "Should we invest in automation to cut RTO to 30 min?" What's your recommendation?
Recommendation: invest in automation, but the ROI depends on compliance cost vs operational cost. Analysis: (1) Current state: average RTO ~1 hour, within the 10% tolerance (the 1-hour SLA allows up to 66 min). Quarterly tests cost ~4 hours of labor each = 16 hours/year. Risk: during a real outage, a manual failover averaging 1 hour leaves no margin; normal variance could breach the SLA. (2) Automation investment (Route53 health checks + auto-promotion + Lambda orchestration): reduces RTO to ~30 min reliably. Cost: 3 weeks dev + 1 week testing + 1 week ops setup = $15-20K of senior-engineer time. Ongoing: ~1 hour/month maintenance ≈ $1.2K/year. (3) ROI: (a) Compliance benefit: a documented 30-min RTO (half the SLA) removes the audit risk of a future finding; one compliance exception costs $10-50K in legal/remediation, so $15K of insurance is reasonable. (b) Business benefit: during a real outage (assume one per 3 years, industry average), auto-failover saves ~30 min of downtime. For a SaaS with $5M ARR, a 30-min outage is roughly $1K of direct customer impact plus brand damage; halving the failover time roughly halves that. (c) Operational benefit: removes human error and slow decision-making from failover; automation is deterministic. (4) Decision matrix: (a) If quarterly compliance audits require documented RTO proof: automating is ROI-positive ($15K once, vs $10K+ per audit finding). (b) If there are zero compliance requirements: automation is optional; manual failover is cheap and acceptable. (c) If outages cost >$2K each or SLAs carry penalties: automation ROI is 2-3x. Recommendation: if compliance-driven, automate now; if operational-only, defer 6 months and re-evaluate after the next incident. Timeline impact: automation cuts drill effort to ~1 hour/quarter (observing the automated run) vs the current 4 hours/quarter of manual orchestration.
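The ROI argument above reduces to arithmetic; a sketch using the figures from the analysis (the function is illustrative, and the $10-50K finding-cost range is what decides the sign):

```python
def automation_roi(build_cost: float, annual_maintenance: float, years: int,
                   avoided_findings: int, finding_cost: float,
                   expected_outages: int, outage_saving: float) -> float:
    """Net benefit of automating failover over a planning horizon (benefit - cost)."""
    cost = build_cost + annual_maintenance * years
    benefit = avoided_findings * finding_cost + expected_outages * outage_saving
    return benefit - cost

# $17.5K build, $1.2K/yr upkeep over 3 years, one avoided audit finding,
# one outage in 3 years where faster failover saves ~$1K of impact.
low = automation_roi(17_500, 1_200, 3, 1, 10_000, 1, 1_000)   # finding at $10K
high = automation_roi(17_500, 1_200, 3, 1, 50_000, 1, 1_000)  # finding at $50K
print(low, high)  # -10100 29900
```

The sign flips across the finding-cost range, which is exactly why the decision matrix splits on whether the driver is compliance (automate now) or operations only (defer and re-evaluate).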
Follow-up: After automation, your first production failover is triggered, but auto-promotion fails (Lambda times out). Manual override succeeds, but took 45 min total. Worse than the 1-hour SLA, but less than before. How do you debug and prevent automation failures?