Database corrupted at 3am. Logs show DELETE statement at 2:58am dropped critical data. Restore to 2:59am state using PITR. Walk through steps: backup location, WAL archive, recovery_target_time, verification.
PITR recovery requires a base backup + continuous WAL archiving. First, note the target: the DELETE committed at 2:58am, so recovery must stop *before* 2:58 — replaying through 2:59 would reapply the DELETE. Steps: (1) Verify recovery infrastructure: Check the backup exists: ls -lh /mnt/backups/base_backup_* (or S3: aws s3 ls s3://backup-bucket/). Check the WAL archive: ls -lh /mnt/wal_archive/ | tail -20 (segments should be present up to ~3am). (2) Prepare a recovery directory: mkdir -p /var/lib/postgresql/recovery-cluster; chown postgres:postgres /var/lib/postgresql/recovery-cluster. (3) Restore the base backup: tar -xzf /mnt/backups/base_backup_2026-04-07_0200.tar.gz -C /var/lib/postgresql/recovery-cluster. (4) Make archived WAL available: set restore_command = 'cp /mnt/wal_archive/%f %p' (preferred over bulk-copying segments into pg_wal, which also works). (5) Configure recovery in /var/lib/postgresql/recovery-cluster/postgresql.auto.conf: recovery_target_time = '2026-04-07 02:58:00'; recovery_target_inclusive = false (stop before any transaction committed at or after 02:58:00, i.e. before the DELETE); recovery_target_action = 'promote' (the default pauses recovery at the target). Create recovery.signal: touch /var/lib/postgresql/recovery-cluster/recovery.signal. (6) Start PG on the recovery directory: sudo -u postgres pg_ctl -D /var/lib/postgresql/recovery-cluster start. Check logs: tail -f /var/lib/postgresql/recovery-cluster/log/postgresql.log; after replay you should see "database system is ready to accept connections". (7) Verify: psql -d mydb -c "SELECT COUNT(*) FROM critical_table" should match the pre-DELETE count. Verify on this temporary instance first, then promote it to primary (or extract the recovered rows back into the original cluster).
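The recovery configuration step can be scripted. A minimal sketch: the local ./recovery-cluster path stands in for the real restored data directory, and the 02:58:00 target with inclusive = false is chosen to stop just before the 2:58am DELETE commits.

```bash
#!/bin/bash
# Sketch: write PITR recovery settings into the restored cluster's config.
# ./recovery-cluster is a local stand-in for the real data directory
# (e.g. /var/lib/postgresql/recovery-cluster).
PGDATA=${PGDATA:-./recovery-cluster}
WAL_ARCHIVE=${WAL_ARCHIVE:-/mnt/wal_archive}
mkdir -p "$PGDATA"
cat >> "$PGDATA/postgresql.auto.conf" <<EOF
restore_command = 'cp $WAL_ARCHIVE/%f %p'
# Stop before the 02:58 DELETE commits:
recovery_target_time = '2026-04-07 02:58:00'
recovery_target_inclusive = false
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"   # tells postgres to start in recovery mode
echo "recovery configured in $PGDATA"
```

After this, start the instance on $PGDATA and watch the log for replay to finish.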
Follow-up: Explain recovery_target_inclusive. Why set true vs. false for time-based recovery?
Large database (2TB). pg_basebackup takes 4 hours. During backup, primary receives writes. Backup becomes inconsistent mid-way. How to ensure backup consistency?
pg_basebackup copies data files while writes continue, so the copy itself is "fuzzy"; consistency comes from the WAL collected during the backup (-Xs to stream it, -Xf to fetch it at the end), which is replayed at restore time. Root issue: a long backup means a large window of changes, all of which must be covered by WAL. Solutions: (1) pg_basebackup -Xs ensures WAL is streamed alongside the data copy: pg_basebackup -D /backup/base -h localhost -U postgres -v -P -Xs -R (-Xs streams WAL; -R writes standby/recovery config). Missing WAL, not concurrent writes, is what makes a base backup unusable. (2) For faster backups use pgbackrest: pgbackrest --stanza=my-db backup (full/differential/incremental backups, parallel workers via --process-max, checksums). Note pg_basebackup itself has no parallel-workers flag. (3) Filesystem or volume snapshots (LVM/ZFS/EBS) are near-instant and, combined with WAL archiving, restore like a crash-consistent base backup. (4) Backup from a standby, not the primary (zero load on the primary). (5) Compressed backup: pg_basebackup --format=tar --gzip --compress=9 (more CPU, less disk/network). (6) Rate-limit to avoid saturating the network: pg_basebackup --max-rate=100M. Best practice: back up from a standby with pgbackrest (parallel + compression) during off-peak hours. Measure: time pg_basebackup -D /backup/test -Ft -Xs -v -P 2>&1 | tee backup.log.
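The -Xs invocation above can be wrapped in a small script. A sketch, with compression and rate-limiting added; the host, user, output path, and the 100M rate cap are assumptions, and DRY_RUN (the default) only records and prints the command instead of contacting a server.

```bash
#!/bin/bash
# Sketch: consistent-despite-writes base backup (host/user/paths assumed).
# -Xs streams the WAL generated during the long copy into the backup itself,
# so replay at restore time turns the fuzzy copy into a consistent snapshot.
BACKUP_DIR=${BACKUP_DIR:-./base_backup}
CMD=(pg_basebackup -D "$BACKUP_DIR" -h localhost -U postgres
     -Xs -R -v -P --format=tar --gzip --max-rate=100M)
printf '%s ' "${CMD[@]}" > backup_cmd.txt   # record the exact command
if [ "${DRY_RUN:-1}" != 1 ]; then
  "${CMD[@]}"          # real run: needs a reachable, configured server
else
  cat backup_cmd.txt; echo
fi
```

Run with DRY_RUN=0 against a real server once the replication role and pg_hba.conf entries are in place.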
Follow-up: Explain -Xs vs. -Xf in pg_basebackup. Which is faster and safer?
Backup RPO: 1 hour (data loss tolerance). Design backup strategy: Full vs. incremental, frequency, retention, and verify RTO (4 hours to restore).
RPO=1h, RTO=4h: lose at most 1h of data, restore within 4h. Strategy: (1) Full base backup nightly: pg_basebackup ... -Xs at 2am. (2) Continuous WAL archiving: archive_command ships every completed segment to /mnt/wal_archive; also set archive_timeout = '15min' (at most '1h') so low-traffic segments are still archived within the RPO window — this is what actually bounds data loss to under 1h. (3) Recovery path: restore the latest full backup, then replay all WAL archived since that backup started (up to ~24h of WAL for a nightly backup); total time must fit the 4h RTO, so measure it rather than assume. (4) Retention: keep 2 full backups (2 days) plus 7 days of WAL (covers replay from either backup); pgbackrest handles retention/expiry automatically. (5) Automation: daily cron job: 0 2 * * * /backup_script.sh. (6) Verify backups (critical): monthly restore test on staging: extract the backup, replay WAL with no recovery target (full replay), verify data; document the recovery procedure. Testing: (1) Measure full backup time: time pg_basebackup ... > /dev/null. (2) Measure restore time: time (untar + WAL replay); if it threatens the 4h RTO, take base backups more often or add differential backups (pgbackrest). (3) Verify: row counts on key tables before corruption vs. after restore should match. Monitoring: (1) Backup success: alert if the cron job fails. (2) WAL archiving health: SELECT last_archived_time, last_failed_time, failed_count FROM pg_stat_archiver; alert if NOW() - last_archived_time greatly exceeds archive_timeout or failed_count grows. (3) Backup age: alert if the latest backup is older than 25h (the nightly run was missed).
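The "backup age" alert can be a small cron-run check. A sketch, assuming backups land as base_backup_* files in a single directory (the path, filename pattern, and 25h threshold are assumptions); the demo creates a fresh dummy file so the check passes.

```bash
#!/bin/bash
# Sketch: alert if the newest base backup is older than MAX_AGE_HOURS.
# Directory layout and threshold are assumptions; wire the ALERT branch
# to your paging system in a real deployment.
BACKUP_DIR=${BACKUP_DIR:-./backups}
MAX_AGE_HOURS=${MAX_AGE_HOURS:-25}
mkdir -p "$BACKUP_DIR"
touch "$BACKUP_DIR/base_backup_demo.tar.gz"   # demo stand-in for a real backup

latest=$(ls -t "$BACKUP_DIR"/base_backup_* 2>/dev/null | head -1)
if [ -z "$latest" ]; then
  echo "ALERT: no base backup in $BACKUP_DIR" | tee backup_check.txt
else
  age=$(( ( $(date +%s) - $(stat -c %Y "$latest") ) / 3600 ))
  if [ "$age" -ge "$MAX_AGE_HOURS" ]; then
    echo "ALERT: $latest is ${age}h old" | tee backup_check.txt
  else
    echo "OK: $latest is ${age}h old" | tee backup_check.txt
  fi
fi
```

stat -c %Y is GNU coreutils; on BSD/macOS use stat -f %m instead.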
Follow-up: Explain full vs. differential vs. incremental backups. Which saves most disk?
Restore from backup; during recovery, WAL apply is slow (10MB/s replay). With 2TB of WAL to replay, the ETA is roughly 55 hours. Optimize recovery speed without losing data.
Slow WAL replay is usually I/O-bound (random reads to fetch the pages being modified), and it is applied by a single startup process — PostgreSQL has no parallel WAL replay — so optimization means feeding that one process faster. (1) Check the WAL source: ls -lh /path/to/wal/archive/ | head -20. If it is on a network mount (NFS), copy to local SSD first: rsync -av /mnt/nfs_wal_archive/ /local/nvme_wal/ and point restore_command there. (2) Give replay more cache: raise shared_buffers (e.g. 25% of RAM) so replayed pages stay in memory; maintenance_work_mem does not affect replay. (3) Enable recovery prefetching (PG15+): recovery_prefetch = on with a higher maintenance_io_concurrency lets the startup process issue reads for blocks referenced by upcoming WAL before they are needed — a large win on SSD/NVMe. (4) Reduce checkpoint pressure during replay: large max_wal_size and checkpoint_timeout = '30min' avoid frequent restartpoints. (5) fsync = off for the duration of the replay only (risky: a crash mid-recovery means starting the restore over, but no live production data is at stake); set fsync = on and CHECKPOINT before going live. (6) Use faster storage: if the target is HDD, recover onto SSD first, then copy the data directory over. (7) Check whether replay is CPU-bound: top during recovery; if the startup process pins one core at 100%, storage tuning won't help — only faster single-core CPU or less WAL to replay will. (Note: "restore data first, build indexes after" applies to logical restores with pg_restore, not to physical WAL replay, which replays index pages as-is.) Measure: the server log prints "redo done" with timing when recovery finishes; compare before/after. Typical: with prefetching + SSD + generous shared_buffers, replay can reach 50-100MB/s; 2TB @ 100MB/s ≈ 5.5h instead of ~55h. Best practice: rehearse recovery and document actual times. If RTO is still unmet, consider: (1) a warm standby instead of restore-from-scratch; (2) more frequent base backups so less WAL needs replaying; (3) splitting databases across clusters so restores can run in parallel.
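The replay-tuning settings above can be applied in one step before starting the recovery instance. A sketch: the values are illustrative, ./recovery-cluster stands in for the real data directory, and recovery_prefetch requires PG15+.

```bash
#!/bin/bash
# Sketch: append WAL-replay tuning to the recovery instance's config
# before starting it. Values are illustrative assumptions, not a
# one-size-fits-all recommendation.
PGDATA=${PGDATA:-./recovery-cluster}
mkdir -p "$PGDATA"
cat >> "$PGDATA/postgresql.conf" <<'EOF'
shared_buffers = 8GB              # keep replayed pages cached
max_wal_size = 64GB               # fewer restartpoints during replay
checkpoint_timeout = 30min
recovery_prefetch = on            # PG15+: prefetch blocks named in upcoming WAL
maintenance_io_concurrency = 64   # depth of that prefetching
EOF
echo "replay tuning written to $PGDATA/postgresql.conf"
```

Revert fsync-style shortcuts and re-check these values before the instance serves production traffic.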
Follow-up: Explain recovery_prefetch and maintenance_io_concurrency (PG15+). How do they speed up WAL replay given that replay itself is single-process?
Backup corrupted: Restore to staging; data looks inconsistent (counts off, FKs broken). Root cause: Backup taken mid-transaction. Recovery plan.
A corrupted restore usually indicates: (1) pg_basebackup crashed/was interrupted mid-flight. (2) An incomplete WAL archive (segments missing, so replay stopped short of consistency). (3) Filesystem corruption (rare). (4) A bad restore procedure. Note "backup taken mid-transaction" alone is not a root cause: a base backup is always fuzzy, and WAL replay past the backup's end point is what makes it consistent — symptoms like broken FKs mean replay never reached that point. Diagnose: (1) Check backup integrity: tar -tzf backup.tar.gz | wc -l (should list all files); gunzip -t backup.tar.gz tests compression integrity. (2) Verify the WAL archive: ls /mnt/wal_archive/ | sort and scan for holes — segment names are fixed-width hex and must form a contiguous sequence. (3) Check the restore procedure: did you restore the backup, configure restore_command, create recovery.signal, and did recovery reach at least the backup's end LSN before stopping? (4) Validate the restored cluster: psql -d restored_db -c "SELECT COUNT(*) FROM table_with_fk"; check for unvalidated foreign keys: SELECT conname FROM pg_constraint WHERE contype = 'f' AND NOT convalidated. Recovery: (1) If the backup is corrupted: fall back to an older backup (24-48h) and replay more WAL forward. (2) If WAL is missing: check the WAL archive mirror (separate disk/bucket): ls /backup/wal_archive_mirror/. (3) If the restore was partial: delete the restored data directory, restore fresh from a known-good backup, and replay WAL, stopping at a known-good LSN/time. (4) Post-restore validation: PostgreSQL has no PRAGMA integrity_check (that's SQLite); use the amcheck extension — pg_amcheck --all (PG14+) or bt_index_check()/verify_heapam() via SQL — plus manual count and FK queries. Prevention: (1) Test backup restores on staging monthly. (2) Monitor backup success: log output, file sizes matching expectations. (3) Use pgbackrest or barman, which store and verify checksums (e.g. pgbackrest verify).
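The "scan for holes" step can be made mechanical. A sketch that parses the 24-hex-digit segment names, assuming the default 16MB wal_segment_size (where the final 8-hex field only runs 00000000..000000FF per 4GB "log" file, hence the *256 arithmetic); the demo archive contains a deliberate hole.

```bash
#!/bin/bash
# Sketch: detect gaps in a WAL archive. Segment name = 8-hex timeline +
# 8-hex log + 8-hex segment; with 16MB segments, index = log*256 + segment
# must increase by exactly 1 between consecutive files.
ARCHIVE=${ARCHIVE:-./wal_archive}
mkdir -p "$ARCHIVE"
touch "$ARCHIVE"/0000000100000000000000{01,02,04}   # 03 deliberately missing

prev=-1
: > gaps.txt
for f in $(ls "$ARCHIVE" | grep -E '^[0-9A-F]{24}$' | sort); do
  idx=$(( 16#${f:8:8} * 256 + 16#${f:16:8} ))
  if [ "$prev" -ge 0 ] && [ "$idx" -ne $((prev + 1)) ]; then
    echo "GAP before $f" | tee -a gaps.txt
  fi
  prev=$idx
done
[ -s gaps.txt ] || echo "archive is contiguous"
```

A real deployment would also ignore timeline-history and .backup files; pgbackrest's verify command does this bookkeeping for you.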
Follow-up: Explain backup validation and checksums. How do you detect corrupted backups?
Failover to standby during disaster. Standby promoted but primary disk failed. Design failover procedure: promotion, data safety, replication setup post-failover.
Failover: primary fails, promote the standby. Procedure: (1) Ensure the primary is truly down (avoid split-brain): pg_isready -h primary_host -p 5432 should time out/refuse; fence the old primary (power it off or block its VIP) before promoting. (2) Promote the standby: pg_ctl promote -D /var/lib/postgresql/data (or SELECT pg_promote(); on PG12+). Wait for promotion: tail -f standby_log | grep "database system is ready to accept". (3) Update app connection strings to point to the new primary (old standby). (4) Data safety: with asynchronous replication, transactions committed on the old primary but not yet streamed to the standby are lost at promotion. Quantify the gap: note the standby's pg_last_wal_replay_lsn() at promotion and cross-check critical tables against app audit logs. If the loss is unacceptable, salvage WAL from the old primary once its disk is replaced, or restore from backup at a known-good point. (5) Re-establish replication post-failover: (a) Bring the old primary back (disk replaced). (b) If its data directory survived, pg_rewind against the new primary resynchronizes it quickly; with a failed disk, take a fresh base backup from the new primary: pg_basebackup -D /var/lib/postgresql/data -h new_primary -U replication -Xs -R. (c) Start the old primary as a standby: systemctl start postgresql. (d) Verify: SELECT * FROM pg_stat_replication on the new primary should show it streaming. (6) Post-failover monitoring: (a) Replication lag on the standby: SELECT NOW() - pg_last_xact_replay_timestamp() AS lag. (b) Alert if lag > 10s. (7) Update DNS/VIP: move the virtual IP to the new primary so clients fail over seamlessly. Best practice: automate failover with Patroni/etcd or repmgr; manual failover is error-prone.
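The manual steps above can be captured as a runbook script so nothing is skipped under pressure. A sketch with hypothetical host names and paths; DRY_RUN (the default) logs each step to failover_plan.txt instead of executing it.

```bash
#!/bin/bash
# Sketch: failover runbook as a script. Hosts and paths are assumptions;
# DRY_RUN=1 (default) prints and records each step rather than running it.
PRIMARY=${PRIMARY:-primary_host}
STANDBY_PGDATA=${STANDBY_PGDATA:-/var/lib/postgresql/data}
: > failover_plan.txt
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "WOULD RUN: $*" | tee -a failover_plan.txt
  else
    "$@"
  fi
}

run pg_isready -h "$PRIMARY" -p 5432 -t 3   # 1. confirm the primary is down
run pg_ctl promote -D "$STANDBY_PGDATA"     # 2. promote the standby
# 5d. later, on the new primary, confirm the rebuilt standby is streaming:
run psql -h new_primary -c "SELECT * FROM pg_stat_replication"
```

Tools like Patroni implement this loop (plus fencing and leader election) automatically; a script like this is only a stopgap for manually operated clusters.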
Follow-up: Explain synchronous_commit implications for failover. Data loss scenarios?