Replica is 30 seconds behind primary during peak writes (10k txns/sec). pg_stat_replication shows write_lag=30s but flush_lag and replay_lag are <1s. Diagnose the bottleneck and propose a fix.
In pg_stat_replication, write_lag is the time between WAL being flushed on the primary and the standby reporting it has *written* that WAL to its OS; flush_lag adds the standby's fsync, and replay_lag adds apply. Normally write_lag <= flush_lag <= replay_lag, so write_lag dominating points at WAL delivery (network) or the standby's WAL-write path rather than replay. Check on the primary: SELECT pid, usename, write_lag, flush_lag, replay_lag, sync_state FROM pg_stat_replication. Root causes: (1) Network saturation: monitor throughput between primary and standby; sustained 10k txns/sec can saturate a 1GbE link during bursts. (2) Primary I/O bursts delaying WAL sends: check iostat or vmstat on the primary during peak load. (3) Standby pg_wal disk slow to absorb incoming WAL: check iostat on the standby. Fixes: (1) Increase wal_buffers on the primary to smooth WAL writes: ALTER SYSTEM SET wal_buffers = '64MB' (requires a restart, not just pg_reload_conf()). (2) Spread checkpoints to reduce WAL flush spikes: ALTER SYSTEM SET checkpoint_completion_target = 0.9; ALTER SYSTEM SET max_wal_size = '4GB'. (3) Put the standby's pg_wal on faster storage (SSD/NVMe) if it is on mechanical disk. (4) As a durability trade-off, synchronous_commit = remote_write or remote_apply throttles the primary to the standby's pace, which caps lag at the cost of commit latency. Monitoring: alert on EXTRACT(epoch FROM write_lag) from pg_stat_replication; note that log_replication_commands logs replication protocol commands, not WAL contents, so it is for connection auditing rather than lag diagnosis.
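The alerting point above can be sketched as a small check function. The threshold and the way lag is fetched (via psql, shown only in a comment) are assumptions, not part of the original setup:

```shell
# Hypothetical lag check (a sketch, not the original setup's tooling).
# In production the lag value would come from something like:
#   psql -Atc "SELECT COALESCE(EXTRACT(epoch FROM write_lag), 0)::int
#              FROM pg_stat_replication"
# Here it is passed in as $1 so the logic is testable in isolation.
check_write_lag() {
  lag=$1        # observed write_lag, in whole seconds
  threshold=$2  # alert threshold, in whole seconds
  if [ "$lag" -gt "$threshold" ]; then
    echo "ALERT: write_lag ${lag}s exceeds ${threshold}s"
  else
    echo "OK: write_lag ${lag}s"
  fi
}

check_write_lag 30 10   # the scenario above: prints "ALERT: write_lag 30s exceeds 10s"
```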
Follow-up: Explain the difference between write_lag, flush_lag, and replay_lag. Which is most critical for RPO/RTO?
WAL disk consumed 500GB in one hour; pg_wal/ filled and replication halted. There are 15 inactive replication slots. Root cause and recovery plan.
Replication slots pin WAL: the primary keeps every segment from each slot's restart_lsn forward, so 15 abandoned slots can retain unbounded WAL. Check: SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots. Slots go inactive when a standby or consumer is decommissioned without dropping its slot, or crashes and never reconnects. Immediate recovery: (1) Confirm which slots are truly abandoned (no consumer will ever return) — dropping a slot a standby still needs forces a rebuild of that standby. (2) Drop them: SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE NOT active. (3) Run CHECKPOINT; the checkpointer then recycles the no-longer-pinned segments. Quantify retention first if useful: SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / (1024*1024) AS wal_retained_mb FROM pg_replication_slots. (4) Only as a last resort, if the server cannot start because pg_wal is full, free space by moving unrelated files off the filesystem; never hand-delete WAL segments the server may still need, and note pg_archivecleanup is designed for an archive directory, not a live pg_wal/. Prevention: (1) Cap slot retention (PG 13+): ALTER SYSTEM SET max_slot_wal_keep_size = '10GB'; slots exceeding it are invalidated (wal_status = 'lost' in pg_replication_slots) rather than filling the disk — they are not dropped automatically. (2) Keep a modest slot-independent cushion: ALTER SYSTEM SET wal_keep_size = '4GB' (wal_keep_segments on PG 12 and older). (3) Monitor pg_wal disk usage and inactive slots hourly; alert above 80% used.
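The pg_wal_lsn_diff() arithmetic above can be reproduced outside the server, which is handy when the instance will not start because the disk is full. A sketch, assuming only the documented LSN format ('hi/lo' hex pairs, byte position = hi * 2^32 + lo); the function names are hypothetical:

```shell
# Convert a PostgreSQL LSN ('hi/lo' in hex) to an absolute byte position,
# mirroring what pg_wal_lsn_diff() computes server-side.
lsn_to_bytes() {
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))   # hi * 2^32 + lo
}

# MB of WAL a slot pins: current LSN minus the slot's restart_lsn.
wal_retained_mb() {
  echo $(( ( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ) / 1048576 ))
}

wal_retained_mb "1/8000000" "1/0"   # 0x8000000 bytes behind: prints 128
```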
Follow-up: Explain the difference between max_slot_wal_keep_size and wal_keep_segments. Why would you set both?
Streaming replication is configured but stopping randomly during heavy load. Replica logs show: "wal receiver timeout expired". Primary logs show no errors. What's happening?
"wal receiver timeout expired" means the standby's WAL receiver got no traffic at all — neither WAL data nor a keepalive — from the primary for wal_receiver_timeout (default 60s). The walsender normally sends keepalives well inside that window, so silence usually means the messages are not getting through or not being sent. Check SHOW wal_receiver_timeout on the standby and SHOW wal_sender_timeout on the primary; badly mismatched values make one side give up first. Root causes: (1) Network packet loss or saturation during peak load dropping keepalives. (2) Primary walsender stalled behind a checkpoint I/O burst. (3) An overloaded primary too busy to service the walsender promptly. Fixes: (1) Raise the receiver timeout to ride out transient stalls: ALTER SYSTEM SET wal_receiver_timeout = '120s' on the standby. (2) Keep wal_sender_timeout and wal_receiver_timeout consistent on both sides; wal_receiver_status_interval (default 10s) controls the standby's status replies and mainly protects against the *sender* timing out, so lower it only if the primary is the side disconnecting. (3) Check the network path: tcpdump -i eth0 -n 'tcp port 5432' for retransmits; monitor loss and latency with mtr. (4) On the primary, spread checkpoint I/O: ALTER SYSTEM SET checkpoint_timeout = '15min'; ALTER SYSTEM SET checkpoint_completion_target = 0.9. (5) On the standby, inspect receiver state: SELECT status, last_msg_receipt_time FROM pg_stat_wal_receiver. Reload or restart as each parameter requires after changes.
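The timeout relationships above can be gathered into one configuration sketch; the values are illustrative, not recommendations for every workload:

```
# Standby side (postgresql.conf / ALTER SYSTEM) -- illustrative values
wal_receiver_timeout = '120s'         # how long the receiver waits for any
                                      # traffic (data or keepalive) before
                                      # dropping the connection
wal_receiver_status_interval = '10s'  # how often the standby reports its
                                      # write/flush/apply positions back

# Primary side -- keep consistent with the standby
wal_sender_timeout = '120s'           # walsender gives up on an unresponsive
                                      # standby after this long; keepalives
                                      # are sent well inside this window
```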
Follow-up: Why would a replica apply process being slow cause the receiver to timeout? Explain the WAL receiver/apply interaction.
Your RPO requirement is 0 (no data loss on failover). Current setup: asynchronous streaming replication. Design a synchronous strategy, explain trade-offs with latency and failover.
Asynchronous replication (the default) acknowledges commits to the client before any standby confirms receipt, so a primary crash can lose the last moments of committed transactions. Synchronous replication makes the primary wait for standby confirmation before ACKing: ALTER SYSTEM SET synchronous_commit = 'remote_apply'; ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica1, replica2)'. A bare list 'replica1,replica2' means FIRST 1: replica1 is the preferred synchronous standby and replica2 only steps in if it disappears; ANY 1 (...) is quorum commit and stalls less when one standby is slow. With remote_apply the primary waits until the standby has applied the WAL, so reads on the standby see the committed data. Trade-offs: (1) RPO = 0, but every commit pays at least the replication RTT (typically 1-10ms on a LAN). (2) If all listed synchronous standbys are down or slow, commits hang until synchronous_standby_names is changed — synchronous replication trades availability for durability. (3) Network hiccups become commit-latency spikes. Optimization: remote_write waits only for the standby to write the WAL to its OS (no fsync) — faster than remote_apply and loses data only if both machines crash simultaneously; often the right middle ground. Best practice: (1) Use quorum commit (ANY 1 of 2, or ANY 2 of 3) so a single slow standby cannot stall writes. (2) Automate failover (e.g., Patroni or pg_auto_failover) so promotion does not depend on a human. (3) Monitor commit latency; alert if p99 exceeds ~50ms. Measure the added latency with \timing in psql around a small committed INSERT, before and after enabling synchronous commit.
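The quorum setup above, as a configuration sketch (replica1/replica2 are the illustrative names from the text; the application_name wiring is how standbys match entries in the list):

```
# Primary (postgresql.conf / ALTER SYSTEM)
synchronous_standby_names = 'ANY 1 (replica1, replica2)'
                                    # quorum: commit ACKs once either
                                    # standby confirms
synchronous_commit = remote_write   # wait for the standby to write (not
                                    # fsync) the WAL; use remote_apply for
                                    # read-your-writes on the standby

# Each standby must advertise a matching name in its connection string:
# primary_conninfo = 'host=primary ... application_name=replica1'
```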
Follow-up: Compare synchronous_commit modes: on, remote_write, remote_apply. Which is best for financial vs. analytics workloads?
After network blip, streaming replication reconnects but replica is now 2 hours behind. Manually kickstart catch-up without basebackup.
After reconnecting, the standby resumes streaming from its last written LSN — catch-up works only if the primary still has that WAL (retained by a replication slot, wal_keep_size, or the archive). Check lag on the standby: SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag. Kickstart: (1) Confirm streaming resumed: SELECT * FROM pg_stat_replication on the primary; no row for the standby means it is not connected (check the standby's logs for connection or missing-WAL errors). (2) Restart the standby: systemctl restart postgresql; the walreceiver reconnects using primary_conninfo (postgresql.auto.conf on PG 12+) and requests WAL from where it left off. (3) Inspect the standby's position if unsure: pg_controldata shows the current timeline and last checkpoint LSN. (4) If the primary reports the requested WAL has been removed, streaming cannot recover; fall back to restore_command against the WAL archive if configured, otherwise re-seed with pg_basebackup -D /var/lib/postgresql/data -h primary -U replication -v -P -W into an emptied data directory. (5) If streaming works but keeps dropping, keep the sender patient: ALTER SYSTEM SET wal_sender_timeout = '300s' on the primary. (6) Physical replay is single-threaded (the startup process), so catch-up speed is bounded by the standby's CPU and I/O; recovery_prefetch (PG 15+) can help on high-latency storage. Monitor progress by sampling SELECT pg_last_wal_replay_lsn() and computing the rate with pg_wal_lsn_diff(); catch-up commonly runs at tens of MB/s, dominated by standby disk speed.
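Sampling the replay LSN twice gives a replay rate, from which an ETA follows. The arithmetic can be sketched as below; the sampling itself (via psql) is assumed and shown only in comments, and the 50GB/80MB/s figures are purely illustrative:

```shell
# Estimate catch-up time from bytes still behind and an observed replay rate.
# In production the inputs would come from something like:
#   psql -Atc "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
#                                     pg_last_wal_replay_lsn())"
# sampled over an interval to derive bytes/sec.
catchup_eta_seconds() {
  bytes_behind=$1
  rate_bytes_per_sec=$2
  # ceiling division so a partial second still counts
  echo $(( (bytes_behind + rate_bytes_per_sec - 1) / rate_bytes_per_sec ))
}

# e.g. ~50 GB left to replay at a sustained 80 MB/s:
catchup_eta_seconds $(( 50 * 1024 * 1024 * 1024 )) $(( 80 * 1024 * 1024 ))   # prints 640
```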
Follow-up: Explain recovery timeline and timeline switch. When does a timeline change occur?
You're implementing PITR (point-in-time recovery) with WAL archiving. Design: archive destination, archive_command, recovery_target_timeline, and how to verify recovery.
PITR requires a base backup plus continuous WAL archiving. Setup: (1) Create an archive directory owned by the postgres user: mkdir -p /mnt/wal_archive && chown postgres: /mnt/wal_archive && chmod 700 /mnt/wal_archive. (2) Configure archiving: ALTER SYSTEM SET archive_mode = on; ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f' (the test guard refuses to overwrite an existing segment). Use rsync for a remote destination: archive_command = 'rsync -a %p backuphost:/backups/wal/%f'. (3) archive_mode requires a restart; archive_command changes take effect on SELECT pg_reload_conf(). (4) Take a base backup: pg_basebackup -D /backup/base -h localhost -U postgres -v -P -W -Xs -R (-Xs streams the WAL generated during the backup). (5) To recover: restore the base backup into an empty data directory, create recovery.signal (touch /var/lib/postgresql/data/recovery.signal), and set in postgresql.auto.conf: restore_command = 'cp /mnt/wal_archive/%f %p'; recovery_target_timeline = 'latest'; plus exactly one recovery target, e.g. recovery_target_time = '2026-04-07 14:30:00' (recovery_target_name works if you created a restore point with pg_create_restore_point()). Do not copy archived WAL into pg_wal/ by hand — restore_command fetches segments on demand. (6) Verify: start PostgreSQL; the log shows segments being restored from the archive, then recovery stopping at the target, and the server pauses or promotes per recovery_target_action; query the restored data to confirm state at the target time. Staff note: test the recovery path quarterly; document and automate it.
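The recovery-side settings above, gathered into one sketch; the paths and target time are the illustrative values from the text:

```
# postgresql.auto.conf in the restored data directory
restore_command = 'cp /mnt/wal_archive/%f %p'   # fetch archived segments
                                                # on demand during replay
recovery_target_timeline = 'latest'             # follow timeline switches
recovery_target_time = '2026-04-07 14:30:00'    # pick exactly ONE
                                                # recovery_target_* setting
recovery_target_action = 'pause'                # inspect before promoting;
                                                # 'promote' to go live at once

# plus an empty recovery.signal file in the data directory:
#   touch /var/lib/postgresql/data/recovery.signal
```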
Follow-up: Explain recovery_target_inclusive. Why would you set it false? How do you create named savepoints for recovery?
WAL archiving is enabled but archive_status shows 256 .ready files not yet .done. Logs show archive_command is slow. Design optimization without losing data.
Each completed segment gets a .ready file in pg_wal/archive_status/; the archiver runs archive_command for it and renames .ready to .done on success (exit status 0). A backlog of .ready files means archiving cannot keep pace. Check: ls /var/lib/postgresql/data/pg_wal/archive_status/*.ready | wc -l. Root cause: slow archive destination (network or disk) or an expensive archive_command — it runs serially, one segment at a time. Optimize: (1) Do not background the copy (cp ... &): the archiver trusts the command's exit status, and backgrounding reports success before the copy finishes, risking WAL loss. Use a purpose-built tool with parallel/asynchronous push instead, e.g. pgbackrest: archive_command = 'pgbackrest --stanza=main archive-push %p' (with asynchronous archiving enabled in its config), or barman-wal-archive. (2) If the destination is remote and slow, stage locally and ship asynchronously with a separate, supervised process — the archive_command itself must still only report success after the local copy is durable. (3) Reduce WAL volume at the source: ALTER SYSTEM SET wal_compression = on compresses full-page images. (4) Monitor: SELECT archived_count, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver. (5) The archiver retries failed segments automatically (roughly every minute); no restart is needed — fix the underlying command or destination and the backlog drains on its own. Prevention: alert when the .ready backlog exceeds a threshold (e.g., 100 segments, about 1.6GB at the default 16MB segment size); page on-call before the disk fills.
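The backlog check above can be wrapped as a function for an alerting hook. The directory layout (.ready files in archive_status/) is PostgreSQL's; the function names and threshold handling are assumptions:

```shell
# Count WAL segments still waiting to be archived (.ready files).
archive_backlog() {
  find "$1" -maxdepth 1 -name '*.ready' 2>/dev/null | wc -l
}

# Alert if the backlog exceeds a threshold (in segments).
check_archive_backlog() {
  n=$(archive_backlog "$1")
  if [ "$n" -gt "$2" ]; then
    echo "ALERT: $n segments awaiting archive"
  else
    echo "OK: $n segments awaiting archive"
  fi
}

# Usage on a real server (paths from the text above):
#   check_archive_backlog /var/lib/postgresql/data/pg_wal/archive_status 100
```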
Follow-up: Explain .ready/.done files in pg_wal/archive_status/. What happens if archiver crashes with files .ready?