AWS Interview Questions

RDS Aurora Storage Engine and Failover


Your Aurora cluster (writer + 2 read replicas) experiences a writer failover, and the writer instance becomes unavailable. Failover takes 45 seconds, during which your application gets connection errors. Users report a 45-second outage. You need to reduce RTO (Recovery Time Objective). What architectural changes do you make?

Aurora failover time (~45 seconds here) breaks down into: (1) Detection time — internal health checks run every few seconds; detecting a dead writer takes roughly 10-15 seconds, and you cannot tune this cadence directly. (2) Promotion time — Aurora promotes a read replica to writer, which involves: (a) stopping replication on the chosen replica (~5 seconds), (b) promoting it to the writer role (~10 seconds), (c) updating the cluster writer endpoint's DNS (~15-30 seconds, depending on client-side DNS caching and TTL). To reduce RTO: (1) Control which replica gets promoted — create a replica in the writer's AZ, sized like the writer, and give it the highest failover priority: `aws rds create-db-instance --db-instance-identifier read-replica-same-az --db-cluster-identifier my-cluster --engine aurora-mysql --db-instance-class db.r6g.large --availability-zone us-east-1a --promotion-tier 0`. A deterministic, right-sized failover target avoids promotion surprises and keeps post-failover latency low for same-AZ clients. (2) Use connection pooling with fast failover detection — a pooler such as pgBouncer or ProxySQL can detect dead connections and retry quickly, masking part of the window; also make sure clients honor the writer endpoint's short DNS TTL instead of caching a stale address. (3) Use Aurora Global Database for cross-region failover (if applicable) — typically ~1 second RPO (Recovery Point Objective), but it adds complexity. (4) Most effective: application-side failover logic — instead of relying solely on Aurora failover, have the application detect writer unavailability (via connection timeout), retry with backoff, and either queue writes or serve reads from replicas in the meantime. This can cut the perceived outage from 45 seconds to well under the failover time. (5) Use RDS Proxy for connection management — the proxy holds client connections open across failover and routes to the new writer without waiting for DNS propagation: `aws rds create-db-proxy --db-proxy-name my-proxy --engine-family MYSQL --auth ... --role-arn arn:aws:iam::account:role/proxy-role --vpc-subnet-ids subnet-1 subnet-2`, then register the cluster with `aws rds register-db-proxy-targets --db-proxy-name my-proxy --db-cluster-identifiers my-cluster`. RDS Proxy typically reduces client-visible failover to a few seconds. Combination approach: RDS Proxy + a tier-0, same-AZ replica sized like the writer usually achieves single-digit-second RTO.
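The application-side failover logic in point (4) can be sketched in a few lines: retry the writer connection with exponential backoff instead of surfacing errors to users while Aurora promotes a replica. This is a minimal illustration, not a production driver — `connect` is a hypothetical stand-in for a real driver call such as `pymysql.connect`.

```python
import time

def connect_with_failover(connect, max_attempts=5, base_delay=0.05):
    """Retry a writer connection with exponential backoff.

    `connect` is any zero-argument callable that returns a live
    connection or raises ConnectionError on failure (hypothetical
    stand-in for a real database driver call).
    """
    pending_writes = []  # writes queued while the writer is unavailable
    for attempt in range(max_attempts):
        try:
            return connect(), pending_writes
        except ConnectionError:
            # Writer unreachable: back off briefly instead of surfacing
            # the error; Aurora is promoting a replica in the meantime.
            time.sleep(base_delay * (2 ** attempt))
    raise ConnectionError("writer still unavailable after retries")
```

In a real application the `pending_writes` queue would be drained against the new writer once the connection succeeds.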

Follow-up: You implement RDS Proxy and add a same-AZ read replica. Failover RTO drops to 8 seconds. Application still reports brief errors during this window. How do you eliminate those errors entirely?

Your Aurora cluster has auto-scaling enabled for read replicas. During a traffic spike, a new read replica is added. It takes 15 minutes to become ready. During that window, queries are slower than expected. Why the delay, and how do you speed it up?

Aurora replica provisioning time (15 minutes) is dominated by: (1) Instance launch — the underlying instance boots and the engine starts (~2-3 minutes). (2) Attaching to the cluster volume — Aurora replicas share the cluster's storage volume, so no bulk data copy is needed, but engine startup and metadata initialization still take time. (3) Replication catch-up — if there's write lag, the replica must catch up to the current LSN (Log Sequence Number) before serving consistent reads. (4) Buffer pool warm-up — until frequently-accessed pages are cached in memory, reads hit storage and queries run slower; this is usually why queries stay slow even after the replica reports available. To speed up: (1) Pre-create standby read replicas before spikes — `aws rds create-db-instance --db-instance-identifier replica-standby --db-cluster-identifier my-cluster --engine aurora-mysql --db-instance-class db.r6g.large`. Having warm replicas ready eliminates launch and warm-up delays. (Note: restoring a snapshot creates a separate cluster rather than adding a replica to this one, so snapshot restores don't help here.) (2) Configure Aurora read-replica auto scaling (which runs on Application Auto Scaling, not EC2 Auto Scaling groups) to scale earlier: `aws application-autoscaling register-scalable-target --service-namespace rds --resource-id cluster:my-cluster --scalable-dimension rds:cluster:ReadReplicaCount --min-capacity 2 --max-capacity 6`, then `aws application-autoscaling put-scaling-policy --service-namespace rds --resource-id cluster:my-cluster --scalable-dimension rds:cluster:ReadReplicaCount --policy-name scale-up --policy-type TargetTrackingScaling --target-tracking-scaling-policy-configuration '{"TargetValue": 60.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "RDSReaderAverageCPUUtilization"}}'`. Firing at 60% average reader CPU buys ~15 minutes of provisioning lead time before traffic peaks. (3) Use newer engine versions (Aurora MySQL 3 / PostgreSQL 13+) — replica initialization is generally faster. (4) Consider Aurora Serverless v2 readers — they scale capacity in place in seconds instead of provisioning new instances. Typical replica-ready time is 5-10 minutes when the instance class is modest and the workload warms quickly. For production, pre-create replicas during off-peak hours ahead of predictable peaks rather than relying on purely reactive scaling.
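The "scale earlier" advice in point (2) can be made concrete: given a provisioning lead time and an observed CPU growth rate during spikes, you can back out the target-tracking threshold that fires before the fleet saturates. A small sketch of that arithmetic (all numbers illustrative, not AWS defaults):

```python
def scale_threshold(cpu_ceiling, growth_pct_per_min, lead_time_min, floor=10.0):
    """CPU% target that fires early enough to cover provisioning lag.

    cpu_ceiling: CPU% at which readers are effectively saturated.
    growth_pct_per_min: how fast average reader CPU climbs in a spike.
    lead_time_min: minutes a new replica needs to become ready.
    """
    threshold = cpu_ceiling - growth_pct_per_min * lead_time_min
    return max(threshold, floor)  # never set an unusably low target
```

For example, with saturation at 90% CPU, load climbing 2%/minute, and a 15-minute provisioning time, the policy must fire at or below 60% — matching the 60% target in the CLI example above.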

Follow-up: You pre-create standby read replicas to keep them warm. During a traffic spike, auto-scaling is triggered, but the new replicas still take 10 minutes to be marked as "ready". You check the replica status and it's "available" after 5 minutes but doesn't receive traffic. Why?

Your Aurora cluster has a large dataset (2 TB). You run a long-running analytical query on a read replica. Partway through, the read replica crashes. The query is lost, and you need to re-run it. What can you do to prevent this loss in the future?

Aurora replicas are ephemeral in the sense that their in-flight state (running queries) is lost on instance failure. To prevent losing long work: (1) Running analytics on the writer is not a real fix — the writer is no more crash-proof than a replica, and heavy scans compete with OLTP traffic there. (2) Use Aurora Parallel Query (an Aurora MySQL feature) — it pushes scan, filter, and projection work down to the Aurora storage layer, reducing buffer-pool and CPU pressure on the instance. Enable it via the `aurora_parallel_query` cluster parameter on supported Aurora MySQL versions (older versions required creating the cluster in a dedicated parallel-query engine mode). (3) Export data to S3 for analysis — use AWS DMS or Aurora's native export: `aws rds start-export-task --export-task-identifier my-export --source-arn arn:aws:rds:region:account:cluster:my-cluster --s3-bucket-name my-bucket --s3-prefix exports/ --iam-role-arn arn:aws:iam::account:role/export-role --kms-key-id <kms-key-arn>`. Then run analytics in Athena or Redshift. (4) Create a dedicated read replica just for analytics — a crash then affects only analytics, not OLTP reads; since all replicas share the cluster volume, a crashed replica can simply be replaced, but the in-flight query is still lost, so checkpoint long jobs by writing intermediate results as you go. (5) Use RDS Enhanced Monitoring and CloudWatch to detect replica health issues early (e.g. the `EngineUptime` metric resetting) and reschedule long-running queries before failure. (6) Set query timeouts so runaway queries fail fast and can be restarted in chunks instead of losing hours of progress. In your database client: `SET SESSION max_execution_time = 3600000;` (1 hour; MySQL, applies to SELECT statements). (7) For critical analytical workloads, use Redshift or Athena instead of Aurora — they're designed for long-running queries and tolerate worker failure far better than a single replica. Most practical approach: export to S3 + Athena for analytics, and reserve Aurora for OLTP.
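The checkpointing idea in points (4) and (6) generalizes: break the analytical scan into keyset-paginated chunks and record the last key processed, so a replica crash costs only the current chunk instead of the whole query. A sketch against the Python DB-API (table and column names are hypothetical; `sqlite3` stands in for any driver such as `pymysql`):

```python
import sqlite3  # stand-in for any DB-API driver (e.g. pymysql)

def scan_in_chunks(conn, chunk_size=1000):
    """Keyset-paginated scan: resumable because each iteration
    records the last primary key it processed."""
    last_id, total = 0, 0.0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM txns WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        total += sum(amount for _, amount in rows)
        # Checkpoint: persist last_id (and partial total) externally to
        # resume here after a crash instead of restarting from id 0.
        last_id = rows[-1][0]
    return total
```

Keyset pagination (`WHERE id > ?` plus `ORDER BY id LIMIT ?`) stays fast on large tables where `OFFSET`-based paging degrades linearly.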

Follow-up: You export data to S3 and run queries in Athena. Athena query performance is slower than Aurora. You're considering running analytics back on Aurora but with query timeout protection. Is there a better way?

Your Aurora cluster stores financial transaction data. You enable automated daily backups (default 1-day retention). One day, you discover data corruption from 3 days ago that affects the entire cluster (introduced by a buggy migration script). Backups from the past 3 days are all corrupted. Can you recover to a non-corrupted state?

By default, Aurora retains backups for 1 day (adjustable up to 35 days). If all backups in the retention window are corrupted, you cannot restore to a clean state using automated backups. To prevent this: (1) Increase backup retention — `aws rds modify-db-cluster --db-cluster-identifier my-cluster --backup-retention-period 14` (14 days). This extends the recovery window; however, if corruption goes undetected for longer than the window, you're still stuck. (2) Aurora Global Database doesn't help here — it replicates the corruption to the secondary region within seconds. It protects against region failure, not logical corruption from a bad migration. (3) Use Database Activity Streams — they stream all SQL activity to Amazon Kinesis (from which you can archive to S3). If corruption is detected, you can identify the corrupting statements and manually repair specific data. (4) Take manual snapshots before risky operations — before running migration scripts, create a snapshot: `aws rds create-db-cluster-snapshot --db-cluster-identifier my-cluster --db-cluster-snapshot-identifier pre-migration-backup`. Manual snapshots are retained until you delete them. (5) Use point-in-time recovery (PITR) — Aurora supports restoring to any second within the backup retention window: `aws rds restore-db-cluster-to-point-in-time --source-db-cluster-identifier my-cluster --db-cluster-identifier my-cluster-restored --restore-type copy-on-write --restore-to-time "2024-01-01T10:00:00Z"`. This creates a new cluster at the chosen point in time. (6) Aurora MySQL also offers Backtrack — if enabled in advance, it can rewind the cluster in place by up to 72 hours without a restore. For financial/critical systems, best practice: (1) backup retention of 14-35 days, (2) manual snapshots before risky operations, (3) Database Activity Streams for audit, (4) a staging environment for testing migrations before production, (5) Aurora Global Database for region-level redundancy.
In this scenario — default 1-day retention and corruption introduced 3 days ago — every automated backup in the window postdates the corruption, so automated backups alone cannot recover a clean state; only a manual snapshot taken before the migration (or an out-of-band export) would. Had retention been at least 4 days, a PITR restore to just before the migration ran would work.
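The retention arithmetic above can be checked mechanically: a clean PITR target exists only if the retention window still reaches back past the corruption event. A tiny illustrative helper (function name and the one-minute margin are hypothetical):

```python
from datetime import datetime, timedelta

def clean_restore_point(now, corruption_time, retention_days):
    """Latest clean PITR target, or None when the retention window
    no longer reaches back before the corruption event."""
    window_start = now - timedelta(days=retention_days)
    if corruption_time <= window_start:
        return None  # the clean state has already aged out of backups
    # Restore to just before the corrupting migration ran.
    return corruption_time - timedelta(minutes=1)
```

With 1-day retention and 3-day-old corruption this returns `None`, matching the scenario; with 14-day retention it returns a timestamp just before the migration.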

Follow-up: You increase backup retention to 14 days and discover that backups from 2 days ago are clean (pre-corruption). You restore to a point in time 2.5 days ago. The restored cluster has the correct data. But now you need to sync 1.5 days of transactions from the current corrupted cluster (orders, payments, etc.) to the restored cluster. How?

Your Aurora cluster uses the Aurora parallel query feature to speed up analytical queries. Parallel query returns results faster than regular queries, but you notice it's not always used — some queries run in sequential mode. Why, and when should you force parallel query?

Aurora Parallel Query (an Aurora MySQL feature; PostgreSQL's "parallel workers" are a separate, community-PostgreSQL mechanism) is automatically used for queries that: (1) scan large tables (typically > 100 MB), (2) use table scans rather than index seeks, (3) have predicates that can be pushed down to the storage layer (WHERE-clause filtering). Queries that don't use Parallel Query: (1) queries resolved by indexes (faster than a parallel scan anyway), (2) small tables (faster to scan sequentially), (3) queries with non-parallelizable constructs (certain aggregations, functions, and data types), (4) tables whose pages are mostly already in the buffer pool — Parallel Query reads from the storage layer and bypasses the buffer pool, so Aurora prefers a normal scan for well-cached data. To diagnose which queries use Parallel Query: (1) Check the engine status counters — `SHOW GLOBAL STATUS LIKE 'Aurora_pq%';` — comparing `Aurora_pq_request_attempted` with `Aurora_pq_request_executed` shows how often PQ was considered versus actually used. (2) Use EXPLAIN: `EXPLAIN SELECT ... \G` — Aurora MySQL reports "Using parallel query" in the Extra column when PQ is chosen. (3) Enable it per session with `SET SESSION aurora_parallel_query = ON;` (the variable is `aurora_pq` on older Aurora MySQL versions). Even then, Aurora falls back to a sequential plan if the query isn't suitable — you can't truly "force" it beyond enabling the setting. Best practices: (1) For known long-running analytical queries, use Parallel Query on a read replica dedicated to analytics. (2) For OLTP workloads, keep Parallel Query off on the writer to avoid overhead. (3) Monitor query performance — if a query is slow even with Parallel Query, the issue is likely join complexity or missing indexes, not parallelization. (4) Consider Redshift for heavy analytics instead — it's purpose-built for parallel execution and scales better than Aurora.
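The eligibility rules above can be summarized as a predicate. This is only an illustration of the documented criteria, not Aurora's actual planner logic — the 100 MB and 50% thresholds are illustrative stand-ins:

```python
def likely_parallel_query(table_mb, uses_index, pct_in_buffer_pool):
    """Rough sketch of when Aurora MySQL tends to choose Parallel Query:
    a large table, a full scan rather than an index seek, and data that
    is not already cached (PQ reads from the storage layer, bypassing
    the buffer pool)."""
    if uses_index:
        return False  # an index seek beats a parallel scan
    if table_mb < 100:
        return False  # small tables scan quickly in sequential mode
    if pct_in_buffer_pool > 50:
        return False  # mostly-cached data favors a normal buffered scan
    return True
```

This also previews the follow-up: a scan over a table that is largely resident in a 200 GB buffer pool gains little from a mechanism designed to avoid that buffer pool.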

Follow-up: You enable Parallel Query on a read replica and run a 10-minute full-table scan. Parallel Query is being used (you confirm in EXPLAIN), but it's only 20% faster than sequential. Why isn't it more effective?

Your Aurora MySQL cluster has row-level security enabled for compliance. You add a new column to track data sensitivity. After the migration, queries that worked before now run slowly. The schema change (ALTER TABLE) took 5 minutes, and queries are still slower an hour later. Explain what's happening and how to recover performance.

ALTER TABLE behavior in Aurora MySQL depends on the change: (1) Adding a nullable column is online-safe — MySQL 8.0 can use ALGORITHM=INSTANT (metadata-only), and 5.7 uses ALGORITHM=INPLACE, which avoids long table locks but may still rebuild the table and its indexes. (2) Changes such as altering a column's data type require ALGORITHM=COPY, which locks the table and rewrites it entirely. (3) Either way, after the ALTER the InnoDB statistics can be stale, so the optimizer picks suboptimal plans — which is why queries stay slow an hour later. To recover performance: (1) Recompute statistics immediately — `ANALYZE TABLE my_table;` (statistics are persistent by default in MySQL via `innodb_stats_persistent`, but they still need refreshing after a rebuild). (2) Rebuild the table if it's fragmented — `OPTIMIZE TABLE my_table;` (for InnoDB this maps to an online `ALTER TABLE ... FORCE` rebuild; there's only a brief lock during the final metadata swap). (3) Check query plans — run `EXPLAIN SELECT ...` for slow queries and compare against pre-ALTER plans. If the join order or index choice changed for the worse, add hints: `SELECT * FROM table1 USE INDEX (idx_name) JOIN table2 ...`. (4) Check whether the new column bloated the row size, increasing I/O per page scanned: `SELECT DATA_LENGTH, AVG_ROW_LENGTH FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'my_table';`. (5) For future ALTERs, request a non-blocking path explicitly so MySQL errors out instead of silently falling back to a copy: `ALTER TABLE my_table ADD COLUMN new_col INT, ALGORITHM=INPLACE, LOCK=NONE;` (or `ALGORITHM=INSTANT` on 8.0). After any significant ALTER, run `ANALYZE TABLE my_table` (and `OPTIMIZE TABLE my_table` if fragmentation is suspected) to ensure clean statistics and indexes.
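The row-size check in point (4) can be automated: compare `AVG_ROW_LENGTH` from before and after the ALTER and flag meaningful growth. A minimal sketch — the 20% threshold is an arbitrary illustrative choice, not a MySQL constant:

```python
def row_bloat_pct(avg_row_len_before, avg_row_len_after):
    """Percent growth in average row length after a schema change
    (values as reported by INFORMATION_SCHEMA.TABLES.AVG_ROW_LENGTH)."""
    return (avg_row_len_after - avg_row_len_before) / avg_row_len_before * 100.0

def flag_bloat(before, after, threshold_pct=20.0):
    """True when the new column noticeably widened rows, which raises
    I/O per page scanned and can slow queries even with a good plan."""
    return row_bloat_pct(before, after) > threshold_pct
```

Run the INFORMATION_SCHEMA query before and after the migration, store both values, and alert on the delta as part of the migration checklist.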

Follow-up: You run ANALYZE TABLE and OPTIMIZE TABLE. Query performance improves 30%, but it's still 20% slower than before the ALTER. You check EXPLAIN and the query plan is identical to before. What could explain the remaining slowdown?

Your Aurora cluster runs MySQL 5.7 with a large InnoDB buffer pool (200 GB). You're upgrading to MySQL 8.0 (on a new cluster). Post-upgrade, queries are slower by 10-15%, even though CPU and memory metrics look healthy. The buffer pool is still 200 GB. What's the culprit?

MySQL 5.7 to 8.0 upgrades can cause slowdowns due to: (1) Query optimizer changes — 8.0's optimizer rewrites some queries differently and can regress edge cases. (2) Removal of the query cache — the query cache was removed in 8.0, so workloads with many repeated identical SELECTs lose that shortcut entirely. (3) Stale statistics — migration doesn't automatically refresh InnoDB statistics on the new cluster; old statistics lead to suboptimal plans. (4) Changed defaults — the default character set moved from latin1 to utf8mb4, and charset/collation mismatches between columns, or between client connections and tables, can prevent index use in joins and string comparisons. (5) Non-default settings not carried over — e.g. if you ran READ-COMMITTED isolation on 5.7, make sure the 8.0 configuration sets it too (`SET GLOBAL transaction_isolation = 'READ-COMMITTED';`); both versions default to REPEATABLE-READ, so only an explicit override creates a difference. Debug: (1) Recompute statistics for all tables — generate the statements client-side: `SELECT CONCAT('ANALYZE TABLE ', table_schema, '.', table_name, ';') FROM information_schema.tables WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');` and feed the output back through your client. (Don't use `INTO OUTFILE` on RDS — it writes to the server's filesystem, which you can't read.) (2) Compare query plans between 5.7 and 8.0 for slow queries: run `EXPLAIN FORMAT=JSON SELECT ...` on both versions and check whether the optimizer chose different indexes or join orders. (3) Capture the slow query log — `SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.5;` — run the workload for an hour, then diff the slow queries between versions. (4) Benchmark key queries — run a representative query many times on both versions and compare the distributions, not single runs. Most likely fix: refresh statistics. Run `ANALYZE TABLE` on all large tables and re-benchmark — the majority of upgrade slowdowns resolve with fresh stats; if not, compare plans and charset/collation settings next.
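The statement-generation step in point (1) can run entirely client-side through any DB-API driver, avoiding server-side file writes. A sketch — the schema exclusion list matches MySQL 8.0's system schemas, and the `%s` paramstyle follows MySQL drivers such as `pymysql`:

```python
SYSTEM_SCHEMAS = ("mysql", "information_schema", "performance_schema", "sys")

def analyze_all_tables(cursor):
    """Fetch every user table and issue ANALYZE TABLE for each,
    refreshing optimizer statistics after the 5.7 -> 8.0 upgrade."""
    cursor.execute(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_schema NOT IN (%s, %s, %s, %s)",
        SYSTEM_SCHEMAS,
    )
    statements = [
        f"ANALYZE TABLE `{schema}`.`{table}`"
        for schema, table in cursor.fetchall()
    ]
    for stmt in statements:
        cursor.execute(stmt)  # statistics refresh; fast, brief metadata lock
    return statements
```

Returning the statement list makes the run auditable — log it alongside the upgrade ticket so you can confirm every table was analyzed.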

Follow-up: You recompute statistics with ANALYZE TABLE on all tables in the new 8.0 cluster. Query performance improves slightly (5%), but it's still 10% slower than 5.7. You run a test query 1000 times and measure individual execution times — they're consistent, so no variance issue. What's the remaining 10%?
