The primary failed and Patroni promoted a replica, but the app is now erroring: writes fail and reads time out. Diagnose: was the promotion clean? Is the new primary healthy? Is there replication lag?
Post-promotion issues usually indicate one of: (1) incomplete failover, (2) network partition, or (3) data loss.

Diagnose:
(1) Patroni logs: tail -f /var/log/patroni/patroni.log. Look for a promotion message (e.g. "promoted self to leader").
(2) Cluster state: patronictl -c /etc/patroni/patroni.yml list. Shows each node's role (Leader/Replica).
(3) New primary health: psql -c "SELECT pg_is_in_recovery();" should return f (not in recovery).
(4) Replication (if other replicas remain): SELECT * FROM pg_stat_replication; on the new primary should show replicas streaming and catching up.
(5) App connections: SELECT count(*) FROM pg_stat_activity; a count near max_connections means connection exhaustion.
(6) Write failures: most often the app is still connecting to the failed old primary. Inspect expensive statements with SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC; and confirm the app has retry logic.

Root causes & fixes:
(1) Hardcoded primary IP: failover promoted a replica, but the app still targets the dead node. Fix: use a VIP (virtual IP) that follows the leader, or update DNS. Check app config: grep 'database_host' app.conf.
(2) Partial failover: the old primary rebooted and came back as a second primary (split-brain). Patroni's DCS leader lock should prevent this; verify consensus health with etcdctl member list (if using etcd for consensus). Stop PostgreSQL on the old primary: systemctl stop postgresql.
(3) Read failures: replica lag. On the replica: SELECT now() - pg_last_xact_replay_timestamp() AS lag;. If lag exceeds the app timeout, reads fail. Add replica capacity or tune (e.g. shared_buffers) on the replica.
(4) Connection pool exhaustion: connections held by crashed transactions. Restart the pooler or kill idle sessions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle';.

Recovery:
(1) Verify the cluster is stable: patronictl -c /etc/patroni/patroni.yml list should show 1 leader and N-1 replicas.
(2) Re-point the app: DNS/VIP should resolve to the new primary. Verify: nslookup postgres.internal returns the new primary's IP.
(3) Monitor replication: watch "SELECT * FROM pg_stat_replication" on the new primary; ensure replicas catch up.
(4) Failback (if desired): once the old primary is repaired and restarted, Patroni detects it and reinitializes it as a replica (via pg_rewind or a full reinit if the timeline diverged).
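The app-side retry logic mentioned above can be sketched as a generic backoff helper. This is a minimal illustration, not a prescribed implementation: the function name and parameters are my own, and the commented-out connect call is a hypothetical stand-in for whatever driver the app uses; the key idea is that each retry re-invokes the factory, so DNS/VIP re-resolution picks up the new primary.

```python
import time

def call_with_retry(fn, retries=5, base_delay=0.1, retriable=(ConnectionError,)):
    """Retry fn with exponential backoff; re-raise after `retries` failures.

    Right after a failover, the first attempts may still hit the dead
    primary until DNS/VIP converges, so each retry calls fn() afresh
    (re-resolving the target) instead of reusing a cached connection.
    """
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Usage sketch (hypothetical driver call):
# conn = call_with_retry(lambda: psycopg2.connect(host="postgres.internal"))
```

Wrapping writes the same way turns a failover from hard errors into a few seconds of elevated latency, provided the VIP/DNS cutover completes within the backoff window.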
Follow-up: Explain split-brain scenario. How does Patroni prevent it with quorum?
Patroni 3-node cluster: 2 replicas, 1 primary, with etcd co-located on the same nodes. The primary's hardware dies (unrecoverable) and takes its etcd member with it; with a second etcd member unreachable, the DCS has lost quorum, so Patroni cannot decide which replica to promote. Design the recovery: manual intervention without data loss.
Quorum issue: with the consensus store short of a majority, Patroni's leader election stalls. Without quorum, automatic promotion would risk split-brain, so Patroni deliberately refuses to act.

Check:
(1) Consensus state: etcdctl member list (if using etcd) shows which members are up.
(2) Patroni: patronictl -c /etc/patroni/patroni.yml list shows no leader.

Manual recovery:
(1) Option A (restore DCS quorum first; preferred): bring etcd back to a majority (restart members, or remove dead ones with etcdctl member remove <member-id> and re-add replacements), restart etcd and Patroni, then let the normal election run, or force a specific candidate: patronictl -c /etc/patroni/patroni.yml failover --candidate <replica> --force.
(2) Option B (manual promotion; last resort if the DCS is unrecoverable): (a) stop Patroni on all surviving nodes so it cannot interfere; (b) find the most advanced replica by running SELECT pg_last_wal_replay_lsn(); on each and picking the highest LSN, which is what avoids data loss; (c) promote it, bypassing Patroni: pg_ctl promote -D /var/lib/postgresql/data; (d) re-point apps via VIP/DNS; (e) rebuild the DCS, restart Patroni, and let the other replicas rejoin.
(3) Option C (restore the failed primary): if the old primary's disk is still readable, copy the data directory to new hardware: rsync -av old_primary:/var/lib/postgresql/data new_hardware:/var/lib/postgresql/data. Start it and let replicas catch up.
(4) Option D (PITR if all replicas fail): restore from backup, promote, re-establish replicas.

Prevention:
(1) Multi-zone deployment: spread the 3 nodes across AZs so quorum survives a single-AZ failure.
(2) Separate or managed consensus: run etcd on dedicated nodes rather than co-located with PostgreSQL; or sidestep self-managed HA entirely with a managed offering (e.g. AWS RDS Multi-AZ).
(3) Monitoring: alert if the cluster has no leader for more than 5 minutes.
(4) Runbook: document the manual promotion procedure and periodically test it in staging.

Design for resilience:
(1) 3-node minimum: quorum is 2, so the cluster tolerates one node failure.
(2) 5-node for high-SLA: quorum is 3, tolerating two failures.
(3) Lag monitoring, on each replica: SELECT now() - pg_last_xact_replay_timestamp() AS lag; alert if lag > 10s.

Recommendation: a Patroni HA cluster is the production standard; quorum prevents split-brain. Document a runbook for quorum-loss recovery.
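The node-count sizing follows from simple majority math (a 3-node cluster tolerates one failure, a 5-node cluster two); a minimal sketch, with helper names of my own choosing:

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in an n-member consensus cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the survivors still form a quorum."""
    return n - quorum(n)  # equals (n - 1) // 2

# Even member counts buy nothing: quorum(4) == 3, so a 4-node cluster
# still tolerates only one failure, same as a 3-node cluster.
```

This is why consensus clusters are sized with odd member counts, and why the scenario above (primary plus one more member down in a 3-node DCS) leaves Patroni unable to promote safely.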
Follow-up: Explain Patroni's supported DCS backends (etcd, Consul, ZooKeeper). Why delegate consensus to an external store rather than implement Raft inside PostgreSQL itself?
Patroni configuration drift: some replicas are lagging significantly. Check: are the replicas correctly configured? Is synchronous_commit causing the slowdown? Tune for performance.
Significant lag points to misconfigured or under-resourced replicas.

Check:
(1) Config drift: psql -h replica1 -c "SHOW synchronous_commit;" (and other key settings) should match the primary; Patroni normally keeps parameters in sync via the DCS.
(2) Replica lag, on replica1: SELECT now() - pg_last_xact_replay_timestamp() AS lag;. Lag over ~1s means the replica is falling behind.

Diagnose:
(1) Replica CPU/IO: top, iostat -x 1 5 on the replica. High utilization means the replica is under-resourced.
(2) Replication slots, on the primary: SELECT slot_name, restart_lsn FROM pg_replication_slots;. Compare restart_lsn with pg_current_wal_lsn(); a large gap means the replica is far behind and WAL is accumulating.
(3) synchronous_commit: SHOW synchronous_commit; on the primary. remote_apply makes every commit wait until replicas have applied the WAL, adding write latency.

Tuning:
(1) Relax synchronous_commit: ALTER SYSTEM SET synchronous_commit = 'remote_write'; SELECT pg_reload_conf(); (waits only for WAL to reach the replica's OS, not to be applied: faster than remote_apply, more durable than off).
(2) Resource the replicas: increase shared_buffers and I/O capacity on replicas so replay keeps up.
(3) Parallelism caveat: physical streaming replay is a single process and cannot be parallelized; parallel apply exists only for logical replication subscriptions (max_parallel_apply_workers_per_subscription, PG16+).
(4) max_wal_senders: the default is 10; lower it only if sender processes genuinely contend for resources: ALTER SYSTEM SET max_wal_senders = 5;.
(5) WAL retention: ALTER SYSTEM SET wal_keep_size = '4GB'; (PG13+, replacing wal_keep_segments; default 0; 4GB ≈ 256 16MB segments). More retained WAL lets a restarted replica resume streaming instead of needing a rebuild; replication slots achieve the same with automatic retention.

Performance goal: replica lag under ~100ms is typical. If lag stays above 1s: (1) likely synchronous_commit = remote_apply throttling the primary (switch to remote_write); (2) or the replica is under-resourced (upgrade); (3) or the network between primary and replica is the bottleneck (verify bandwidth).

Recommendation: for write-heavy workloads, use synchronous_commit = off (async replication) or remote_write on the primary. For workloads that must never read stale data (e.g. financial), remote_apply with well-resourced replicas.
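The lag check above can be folded into a small monitor. A sketch, assuming the two timestamps are fetched from the replica by any PostgreSQL driver; the 10 s threshold matches the alerting rule used elsewhere in these notes, and the function names are my own:

```python
from datetime import datetime, timedelta, timezone

# Alert threshold; tune to the application's read-staleness tolerance.
LAG_ALERT = timedelta(seconds=10)

def replication_lag(now: datetime, last_replay: datetime) -> timedelta:
    """Lag as a replica computes it via
    SELECT now() - pg_last_xact_replay_timestamp();"""
    return now - last_replay

def should_alert(lag: timedelta, threshold: timedelta = LAG_ALERT) -> bool:
    """True when replay has fallen further behind than the threshold."""
    return lag > threshold

# Usage: poll each replica periodically, compute the lag, and page
# (or remove the replica from the read pool) when should_alert fires.
```

One caveat worth encoding in a real monitor: pg_last_xact_replay_timestamp() stops advancing on an idle primary, so a quiet cluster can look "lagged"; production checks usually also compare WAL positions.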
Follow-up: Explain the synchronous_commit modes (off, local, on, remote_write, remote_apply) and how they interact with replicas. What are the RPO/RTO implications of each?
Stolon vs. Patroni: which for a new deployment? Design an HA PostgreSQL cluster with etcd consensus, automatic failover, and zero-downtime restarts.
Stolon vs. Patroni comparison:
- Stolon: uses etcd (or Consul) for consensus; a keeper daemon on each node manages PostgreSQL, with sentinel and proxy components. Simpler model, fewer features, less active development.
- Patroni: DCS-agnostic (etcd, Consul, ZooKeeper); richer feature set (REST API, custom callbacks, patronictl tooling).
Recommendation: use Patroni for new deployments; it is actively maintained with the larger ecosystem.

Setup, Patroni 3-node cluster:
(1) Install on all 3 nodes: pip install patroni[etcd3].
(2) Run an etcd cluster (ideally 3 members on dedicated nodes), e.g. etcd --listen-client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://node1:2379.
(3) Configure /etc/patroni/patroni.yml on each node: scope (pg_cluster), namespace (/patroni), node name, the etcd endpoints, the PostgreSQL data_dir (/var/lib/postgresql/14/main) and parameters (e.g. max_connections: 200, synchronous_commit: 'off'), and the failover tunables under bootstrap.dcs (ttl: 30, loop_wait: 10, retry_timeout: 10, maximum_lag_on_failover: 1048576 bytes).
(4) Start Patroni: systemctl start patroni on all nodes.
(5) Verify: patronictl -c /etc/patroni/patroni.yml list shows one leader and two replicas.

Zero-downtime restart:
(1) Leader: first move leadership away with a planned patronictl -c /etc/patroni/patroni.yml switchover, then restart the former leader: patronictl -c /etc/patroni/patroni.yml restart pg_cluster node1. (A plain restart of the leader without a switchover briefly interrupts writes.)
(2) Replica: patronictl -c /etc/patroni/patroni.yml restart pg_cluster node2. No impact on the primary.

Automatic failover:
(1) Primary fails: its leader key in etcd expires (TTL not renewed).
(2) Replicas race for the leader lock; only candidates within maximum_lag_on_failover are eligible (quorum required).
(3) The winner is promoted and takes over writes.
(4) Apps re-route via VIP/DNS.
(5) The failed primary, once recovered, rejoins as a replica (pg_rewind if its timeline diverged).

Design for resilience:
(1) 3-node PostgreSQL cluster (HA).
(2) etcd on separate consensus nodes (not co-located with PostgreSQL).
(3) VIP/DNS for app connectivity.
(4) Monitoring alerts for lag, missing leader, etc.
(5) Regular failover drills in staging (switch over, verify, switch back).
This is the recommended architecture for production PostgreSQL HA.
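The configuration described in the setup steps, expanded into a patroni.yml sketch. Hostnames, passwords, ports, and paths are illustrative placeholders; note that the failover tunables (ttl, loop_wait, retry_timeout, maximum_lag_on_failover) live under bootstrap.dcs, with the lag limit expressed in bytes:

```yaml
scope: pg_cluster          # cluster name, shared by all members
namespace: /patroni        # DCS key prefix
name: node1                # unique per node

etcd3:
  hosts: node1:2379,node2:2379,node3:2379

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

bootstrap:
  dcs:                     # cluster-wide settings, stored in etcd
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576   # bytes (1 MB)
    postgresql:
      parameters:
        max_connections: 200
        synchronous_commit: "off"

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/14/main
  authentication:
    superuser:   {username: postgres,   password: change-me}
    replication: {username: replicator, password: change-me}
```

Because the bootstrap.dcs block is written to etcd on first init, later changes to those values go through patronictl edit-config rather than this file.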
Follow-up: Explain the role of the DCS (Distributed Consensus Store). Why does Patroni rely on an external, already Raft-based store like etcd instead of implementing consensus inside PostgreSQL itself?