PostgreSQL Interview Questions

Major Version Upgrade Strategies


Upgrade PG14 to PG16 (500GB DB, <1 hour downtime SLA). Design upgrade strategy: pg_upgrade vs. logical replication vs. physical standby promotion.

Major version upgrades require a strategy that minimizes downtime. Three options:

(1) pg_upgrade (in-place): upgrades the data directory in place (copy or --link mode). Fastest path (roughly 20-30 min for 500GB in copy mode on SSD; minutes with --link) but requires a full restart. Best when a downtime window is tolerable.
(2) Logical replication: PG14 publisher -> PG16 subscriber. Near-zero downtime, but the initial sync is slow.
(3) Physical standby promotion: streaming replication cannot cross major versions, so the standby itself must be upgraded; this defers rather than avoids the work (see cascading upgrades below).

Strategies for the <1 hour SLA:

(1) Logical replication with a brief cutover window:
(a) Day before: set up a PG16 instance and sync from PG14 via logical replication.
(b) Cutover day: let the subscription catch up to near-current LSN.
(c) Brief maintenance window (~5 min): pause apps, let replay drain, switch traffic to PG16.
(d) Post-cutover: full ANALYZE and validation.
Total downtime: 5-10 min.

(2) Pure pg_upgrade (fastest to execute):
(a) Take a full backup of PG14 (pg_basebackup or pg_dump).
(b) Schedule the 1-hour maintenance window.
(c) Run check mode first; it validates compatibility without changing anything: pg_upgrade -b /usr/lib/postgresql/14/bin -B /usr/lib/postgresql/16/bin -d /var/lib/postgresql/14/data -D /var/lib/postgresql/16/data -c
(d) Run the actual upgrade (same command without -c). Typically 20-30 min for 500GB on SSD in copy mode; --link mode finishes in minutes but makes the old cluster unsafe to restart once the new one has started.
(e) Start PG16.
(f) Post-upgrade: run any scripts pg_upgrade generated, then vacuumdb --all --analyze-in-stages to rebuild planner statistics (pg_upgrade does not carry them over).
Downtime: 30-45 min total.

(3) Logical replication with near-zero downtime:
(a) Beforehand: set up an empty PG16 subscriber.
(b) On PG14 (requires wal_level = logical): CREATE PUBLICATION full_pub FOR ALL TABLES;
(c) On PG16: CREATE SUBSCRIPTION full_sub CONNECTION '...' PUBLICATION full_pub WITH (copy_data = true);
(d) Wait for the initial copy to complete (hours for 500GB).
(e) Cutover: quiesce writes, confirm lag = 0, switch apps via DNS change or VIP move.
Downtime: <1 min (connection switchover only). Caveats: logical replication does not replicate DDL, sequences need resynchronization, and tables need a replica identity for updates/deletes.

Post-upgrade validation:
(1) Run regression tests: psql -f test_suite.sql.
(2) Check for planner changes: EXPLAIN ANALYZE critical queries and compare with the old version.
(3) Verify extensions: SELECT * FROM pg_extension; some may need ALTER EXTENSION ... UPDATE.
(4) Test failover: ensure replicas replicate correctly.

Rollback plan:
(1) If pg_upgrade fails: check mode (-c) should catch most issues in advance; keep the old cluster backed up (and avoid --link if you want it untouched).
(2) If the app is incompatible with the new version: logical replication allows switching back to PG14 quickly (ideally with a reverse subscription carrying writes back).

Recommendation: for production, use logical replication for the cutover and rehearse pg_upgrade in staging in parallel. PG14 -> PG16 spans two major releases; such upgrades are usually smooth, but test on staging first.
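The downtime-budget trade-off above can be sketched as a small decision helper. The function name, timing model, and thresholds are illustrative assumptions derived from the rough figures in the text, not official tooling.

```python
# Hypothetical helper: pick an upgrade strategy from the rough timings in the
# text (copy-mode pg_upgrade ~30-45 min of downtime for 500 GB, logical
# replication ~minutes of downtime). All constants are illustrative.

def choose_upgrade_strategy(db_size_gb: float,
                            downtime_budget_min: float,
                            can_use_logical: bool = True) -> str:
    # Crude copy-mode pg_upgrade estimate: fixed cost plus per-GB copy time.
    pg_upgrade_est_min = 10 + 0.04 * db_size_gb
    if pg_upgrade_est_min <= downtime_budget_min:
        return "pg_upgrade"            # simplest path fits the window
    if can_use_logical:
        return "logical_replication"   # near-zero downtime, long sync time
    return "pg_upgrade_link_mode"      # --link avoids the data copy

print(choose_upgrade_strategy(500, 60))  # 500 GB with a 1 h SLA -> pg_upgrade
print(choose_upgrade_strategy(500, 5))   # 5 min budget -> logical_replication
```

In practice the estimate should come from a rehearsal on staging hardware, not a formula.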

Follow-up: Explain pg_upgrade internals. Why is it faster than dump/restore?

Post-upgrade: query plans changed. Some queries now slower. Root cause: Planner improvements in PG16. Diagnose and fix without data loss.

Major version upgrades improve the planner, so plans change. PG14 -> PG16 planner changes include: (1) better selectivity estimation (extended/multivariate statistics improvements), (2) parallel query improvements, (3) the Memoize plan node (added in PG14, refined since), which caches inner-side rows for parameterized nested loops, (4) cost model refinements. A change may be positive (faster) or negative (slower) for specific queries.

Diagnosis:
(1) Capture plans before the upgrade: EXPLAIN all critical queries on PG14 and save the output to a file.
(2) Immediately after the upgrade, re-run EXPLAIN on PG16 and compare plans.
(3) Identify queries whose actual time increased and investigate those.
(4) Root-cause analysis:
(a) Different join order: run with SET join_collapse_limit = 1 to force the written join order. If that is faster, join reordering changed.
(b) Different index usage: the planner may have switched to a seq scan, often because statistics are missing (pg_upgrade does not transfer them). Run ANALYZE big_table; and re-plan.
(c) Parallel workers: test with SET max_parallel_workers_per_gather = 0; if faster, parallel overhead exceeded the benefit.
(5) Fix options:
(a) Rewrite the query to steer the plan (e.g., a MATERIALIZED CTE to force evaluation order).
(b) Increase work_mem if a hash join now spills to disk.
(c) Tune the cost model: ALTER SYSTEM SET random_page_cost = ... closer to the PG14 value.
(d) Disable a new planner feature to confirm a regression: SET enable_memoize = off (Memoize was added in PG14).
(6) Validate: re-run the app workload and measure latency. If a regression exceeds ~10%, report it to the PostgreSQL community with both plans; they will help.

Prevention: (1) test the upgrade in staging with a representative workload first, (2) create a performance baseline before and after, (3) keep a rollback path (standby on the old version).

Recommendation: most upgrades are neutral or positive for performance; when a regression appears, update statistics first, then re-tune the cost model.
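The before/after plan comparison above can be partly automated by diffing EXPLAIN (FORMAT JSON) output. A minimal sketch, assuming plans captured as JSON strings; the sample plans and the 10% cost threshold are illustrative:

```python
import json

# Sketch: diff two EXPLAIN (FORMAT JSON) outputs captured before and after the
# upgrade. Flags estimated-cost increases and collects node types so scan/join
# changes stand out. EXPLAIN (FORMAT JSON) returns a one-element list whose
# "Plan" object nests children under "Plans".

def walk(node):
    yield node["Node Type"]
    for child in node.get("Plans", []):
        yield from walk(child)

def compare_plans(old_json: str, new_json: str, threshold: float = 1.10):
    old = json.loads(old_json)[0]["Plan"]
    new = json.loads(new_json)[0]["Plan"]
    return {
        "old_cost": old["Total Cost"],
        "new_cost": new["Total Cost"],
        "regression": new["Total Cost"] > old["Total Cost"] * threshold,
        "old_nodes": list(walk(old)),
        "new_nodes": list(walk(new)),
    }

# Illustrative captures: the planner switched from an index scan to a seq scan.
old_plan = '[{"Plan": {"Node Type": "Index Scan", "Total Cost": 8.3}}]'
new_plan = '[{"Plan": {"Node Type": "Seq Scan", "Total Cost": 155.0}}]'
report = compare_plans(old_plan, new_plan)
print(report["regression"], report["old_nodes"], report["new_nodes"])
```

Estimated cost is only a proxy; confirm any flagged query with EXPLAIN ANALYZE timings.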

Follow-up: Explain recent planner features: the Memoize node (added in PG14) and improved extended statistics. How do they affect plans?

pg_upgrade fails: "database format not compatible". Old version data can't be directly copied. Fallback to logical replication or dump/restore.

pg_upgrade fails if:
(1) The clusters are built incompatibly: pg_upgrade can skip major versions (PG16's pg_upgrade accepts old clusters back to 9.2), but both clusters must agree on compile-time and initdb options such as block size and checksum settings.
(2) Custom data types or extensions are not compatible with the target version.
(3) The on-disk format changed for a specific feature (rare).
(4) The PG14 data is corrupted.

Fallback strategies:
(1) Logical replication (near-zero downtime):
(a) Set up a PG16 instance.
(b) On PG14: ALTER SYSTEM SET wal_level = logical; restart; then CREATE PUBLICATION p FOR ALL TABLES;
(c) On PG16: CREATE SUBSCRIPTION s CONNECTION '...' PUBLICATION p WITH (copy_data = true);
(d) Wait for sync.
(e) Cut applications over to PG16.
Duration: hours of sync for large databases, but minimal downtime.
(2) Serial dump/restore (downtime): pg_dump -Fc -f /backup/db.dump mydb, then pg_restore -d mydb /backup/db.dump on PG16. Slower than pg_upgrade (CPU-bound) but safe. Duration: several hours to a day for 500GB.
(3) Parallel dump/restore:
(a) pg_dump -j 8 -Fd -f /backup/db_dir mydb (parallel dump to directory format).
(b) pg_restore -j 8 -d mydb /backup/db_dir (parallel restore).
Typically 2-10x faster than serial; actual duration depends on CPU, storage, and index rebuild time.
(4) Custom migration path: for very large databases or complex extensions, consider specialist help or tooling such as AWS DMS (Database Migration Service) or foreign data wrappers.

Post-recovery:
(1) Run ANALYZE to rebuild statistics.
(2) Validate data: compare per-table row counts with pre-upgrade values.
(3) Test the application.
(4) Reindex if needed.

Prevention:
(1) Always run pg_upgrade -c in staging first; check mode catches incompatibilities without modifying anything.
(2) Review extension compatibility with the target version.
(3) Keep up-to-date backups.
(4) Read the release notes for breaking changes.

Recommendation: prefer logical replication when near-zero downtime matters; parallel dump/restore is a solid fallback when CPU and SSD throughput allow.
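The post-recovery row-count validation above can be scripted. A minimal sketch: the counts would come from COUNT(*) queries (or pg_stat_user_tables) run before and after the migration; the table names and numbers below are illustrative.

```python
# Sketch: compare per-table row counts captured before and after a
# dump/restore migration. Returns tables whose counts changed and tables
# that appeared only on the new side. Sample data is illustrative.

def validate_counts(before: dict, after: dict):
    mismatched = {t: (before[t], after.get(t))
                  for t in before if after.get(t) != before[t]}
    unexpected = [t for t in after if t not in before]
    return mismatched, unexpected

before = {"orders": 1_200_000, "users": 55_000}
after  = {"orders": 1_200_000, "users": 54_998}

bad, extra = validate_counts(before, after)
print(bad)    # "users" lost two rows somewhere in the migration
print(extra)  # no unexpected tables
```

For a live source, snapshot both sides at a quiesced moment, otherwise ordinary write traffic shows up as false mismatches.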

Follow-up: Explain pg_upgrade compatibility requirements. Which PG versions support upgrading?

Extension compatibility post-upgrade: Extension version mismatch. PG14 had extension X v1.0, PG16 requires X v2.0 (breaking schema changes). Migration strategy.

Extensions may need their own upgrades across a major PG version change. Example: PG14 runs PostGIS 3.2, while the PG16 packages ship PostGIS 3.4 with schema changes. Migration:
(1) Check compatibility before the upgrade: SELECT extname, extversion FROM pg_extension; on PG14, then look up each extension's support for PG16.
(2) Pre-upgrade: if a newer extension version is available for PG14, update first: ALTER EXTENSION postgis UPDATE; (e.g., 3.2 -> 3.3 while still on PG14).
(3) During the upgrade: pg_upgrade keeps each extension at its installed version; the new cluster must have compatible extension binaries installed or the check phase fails and reports which extension is the problem. Where version updates are needed, pg_upgrade generates an update_extensions.sql script to run afterwards.
(4) Post-upgrade: on PG16, update each extension: ALTER EXTENSION postgis UPDATE TO '3.4';
(5) If an extension is not available on PG16:
(a) Check whether a newer release adds PG16 support.
(b) If deprecated: find a replacement or change app code.
(c) Last resort: DROP EXTENSION postgis CASCADE; (drops all dependent objects) and migrate without it.
(6) Validation: SELECT extname, extversion FROM pg_extension ORDER BY extname; should list all extensions at the expected versions. Smoke-test extension functions, e.g. SELECT ST_AsText(ST_GeomFromText('POINT(0 0)')); for PostGIS.

Prevention: (1) review each extension's changelog before the upgrade, (2) test the extension upgrade in staging first, (3) for critical extensions (PostGIS, Citus), build and test on the target PG version before production, (4) maintain an extension compatibility matrix (extension version vs. PG version).

Recommendation: most extensions update smoothly when staying within a supported version range (e.g., PostGIS 3.x supports PG14-16). Custom extensions need manual review.
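The compatibility-matrix check above can be automated against the output of SELECT extname, extversion FROM pg_extension. A minimal sketch; the matrix values and extension versions are illustrative assumptions, not an authoritative support table.

```python
# Sketch: compare installed extension versions (as reported by pg_extension)
# against a hand-maintained minimum-version matrix for the target PG release.
# Returns the extensions that must be updated. Matrix values are illustrative.

def version_tuple(v: str):
    # Numeric, dot-separated versions only; real-world tags may need more care.
    return tuple(int(p) for p in v.split("."))

def check_extensions(installed: dict, required_min: dict) -> dict:
    needs_upgrade = {}
    for ext, ver in installed.items():
        minimum = required_min.get(ext)
        if minimum and version_tuple(ver) < version_tuple(minimum):
            needs_upgrade[ext] = (ver, minimum)
    return needs_upgrade

installed = {"postgis": "3.2.1", "pg_stat_statements": "1.9"}
required  = {"postgis": "3.4"}   # assumed floor for the target PG version

print(check_extensions(installed, required))  # postgis flagged for upgrade
```

Each flagged entry maps to an ALTER EXTENSION ... UPDATE TO '...' to run before or after pg_upgrade, per the extension's docs.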

Follow-up: Explain extension update semantics. How does ALTER EXTENSION UPDATE change schema?

Standby replica can't upgrade: older PG version not compatible with new primary. Design cascading upgrade: upgrade standby first, then primary, avoiding replication break.

One caveat first: physical streaming replication cannot span major versions, so there is no state in which a PG16 standby follows a PG14 primary (or vice versa). A "standby-first" upgrade therefore works as a switchover: detach and upgrade the standby, move traffic to it, then upgrade the old primary and rejoin it as a standby. Writes on the old primary must be frozen (or bridged via logical replication) from the moment the standby is detached, or they are lost.

Steps:
(1) Setup: PG14 primary + PG14 standby via streaming replication. Goal: both at PG16.
(2) Detach and upgrade the standby:
(a) Freeze writes on the primary (maintenance mode or a read-only window).
(b) Let the standby fully catch up, then promote it: pg_ctl promote -D /standby/data_14. Promotion must come before the upgrade, because pg_upgrade refuses to run on a cluster that is still in recovery.
(c) Stop it cleanly and upgrade: pg_upgrade -b /usr/lib/postgresql/14/bin -B /usr/lib/postgresql/16/bin -d /standby/data_14 -D /standby/data_16.
(d) Start PG16 and verify: psql -c "SELECT version();" should show PG16.
(3) Cut over: point applications at the upgraded node (DNS change or VIP move). It is now the acting primary.
(4) Upgrade the original primary:
(a) Stop the PG14 primary; no app downtime, since traffic has already moved.
(b) Its timeline has diverged from the promoted node, so the simplest correct path is to rebuild it from the new primary rather than pg_upgrade its old data: pg_basebackup -D /primary/data_16 -h new_primary -U replication -Xs -R.
(c) Start it as a PG16 standby.
(5) Optional switchback: promote the original primary again if you want the old roles restored.

Downtime: only the write freeze in step 2 plus the cutover (minutes), rather than one long outage; each node is upgraded while the other serves traffic.

Post-upgrade: (1) verify replication: SELECT * FROM pg_stat_replication; on the acting primary, (2) run ANALYZE, (3) monitor for regressions. Recommended for mission-critical systems.
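The "let the standby fully catch up, then promote" gate in step 2 can be checked programmatically by comparing LSNs, e.g. the primary's flush LSN against the standby's replay LSN. A minimal sketch; the helper names and sample LSN values are illustrative, but the X/X hex format matches how PostgreSQL prints LSNs.

```python
# Sketch: decide whether the standby has replayed everything the primary
# flushed, using LSNs in PostgreSQL's "X/X" hex notation (e.g. from
# pg_stat_replication on the primary and pg_last_wal_replay_lsn() on the
# standby). Values below are illustrative.

def lsn_to_int(lsn: str) -> int:
    # An LSN "hi/lo" encodes a 64-bit position: (hi << 32) + lo.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def safe_to_promote(primary_flush_lsn: str, standby_replay_lsn: str) -> bool:
    # Promote only once the standby has replayed all flushed WAL.
    return lsn_to_int(standby_replay_lsn) >= lsn_to_int(primary_flush_lsn)

print(safe_to_promote("0/3000060", "0/3000060"))  # caught up
print(safe_to_promote("0/3000060", "0/2FFFFF8"))  # still lagging
```

In a real runbook this check loops until true (with a timeout) while writes are frozen, and only then issues pg_ctl promote.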

Follow-up: Explain promotion workflow. Can you promote standby while primary is upgrading?