PostgreSQL Interview Questions

Query Planner, EXPLAIN, and Optimization


A query took 10ms with 100k rows; now it takes 45 seconds with 1M rows. EXPLAIN shows the planner still uses a seq scan. Why did the plan not change, and how can you get the index used?

The planner uses table statistics to estimate costs. With 100k rows a seq scan was cheaper than index lookups; with 1M rows the index should win, but the planner may still pick a seq scan if statistics are stale. Check: run ANALYZE table_name to refresh, then rerun EXPLAIN SELECT * FROM table WHERE col = value. If the plan is still a seq scan after ANALYZE, the planner estimates the index path as more expensive. Likely reasons: (1) A high estimated row count for the filter (poor selectivity). (2) random_page_cost too high (default 4.0 tells the planner random index I/O is expensive; ~1.1 is realistic on SSDs). (3) A missing or unsuitable index. Fixes: (1) ALTER SYSTEM SET random_page_cost = 1.1; SELECT pg_reload_conf(); if on SSD. (2) Ensure the index exists: CREATE INDEX idx_col ON table(col). (3) Test the index path: SET enable_seqscan = off; then EXPLAIN (ANALYZE) SELECT * FROM table WHERE col = value to compare estimated vs. actual rows. If the estimate is 10k but the actual is 500k, statistics are wrong; run ANALYZE again. (4) Check column stats: SELECT n_distinct, most_common_vals FROM pg_stats WHERE tablename='table' AND attname='col'. n_distinct = -1 means the planner believes every value is unique and estimates one matching row per value; if the data is actually skewed, increase the sample: ALTER TABLE table ALTER COLUMN col SET STATISTICS 1000 then ANALYZE.
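The diagnostic sequence above can be sketched as a script; table and column names (orders, customer_id) are hypothetical stand-ins:

```sql
-- Refresh planner statistics, then compare estimated vs. actual rows.
ANALYZE orders;
EXPLAIN (ANALYZE)
SELECT * FROM orders WHERE customer_id = 42;

-- Inspect the per-column statistics the estimate is built from:
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

-- Temporarily rule the seq scan out to see the index path's cost:
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
RESET enable_seqscan;

-- If estimates were far off, raise the sample size and re-analyze:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```

Note that SET enable_seqscan = off is a session-local diagnostic, not a production fix; the durable fix is correct statistics, a suitable index, or a realistic random_page_cost.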

Follow-up: Explain the difference between estimated and actual rows in EXPLAIN ANALYZE. Why would estimates be off after growth?

EXPLAIN shows: "Seq Scan on orders (cost=0.00..10000.00 rows=50000)". Actual execution: 45 seconds. You suspect an I/O bottleneck during the seq scan. How do you measure and optimize?

The seq scan cost model charges seq_page_cost (default 1.0) per sequentially read page plus a small CPU cost per tuple, so cost=10000 corresponds to roughly 10000 page reads. If actual time is 45s, the I/O rate is slow or the data is not cached. Diagnose: (1) Check disk I/O: iostat -x 1 5 on Linux; look for high %util or await. (2) Check the buffer pool: SHOW shared_buffers. If the table is larger than shared_buffers, every seq scan re-reads from disk or the OS cache. (3) Measure cache behavior: EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM orders. "Buffers: shared hit=X read=Y" shows the cache hit ratio; if read >> hit, the data is not cached. Fix: (1) Increase shared_buffers: ALTER SYSTEM SET shared_buffers = '16GB'; this parameter requires a server restart (pg_reload_conf() is not enough for it). (2) Raise the effective_cache_size hint: ALTER SYSTEM SET effective_cache_size = '64GB'; SELECT pg_reload_conf(); (tells the planner how much data the OS cache can hold). (3) Enable parallel seq scan: ALTER SYSTEM SET max_parallel_workers_per_gather = 4; rerun the query. (4) Partition the table by date and scan only one partition. (5) If the predicate is selective, consider an index instead. Rerun EXPLAIN (ANALYZE, BUFFERS) to confirm the improvement. Staff note: seq scans are normal for bulk operations; only optimize scans that run frequently (check pg_stat_statements).
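A minimal sketch of the measurement and the settings changes, assuming an `orders` table; the sizes are illustrative, not recommendations:

```sql
-- BUFFERS shows how much of the scan came from cache vs. disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM orders;
-- Look for: Buffers: shared hit=... read=...
-- read >> hit means the table does not fit in shared_buffers.

-- shared_buffers needs a full restart; the others only a reload:
ALTER SYSTEM SET shared_buffers = '16GB';          -- applies after restart
ALTER SYSTEM SET effective_cache_size = '64GB';    -- planner hint only
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();
```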

Follow-up: Explain BUFFERS output in EXPLAIN ANALYZE. Why is shared hit vs. read important?

A complex JOIN query with 3 tables changed from 15ms to 2s overnight. No code or data change. EXPLAIN shows a different join order. Root cause and fix.

The planner considers many join orders and uses dynamic programming to find the lowest-cost path. Refreshed statistics (after autovacuum/autoanalyze) can change cost estimates and flip the join order. Compare yesterday's and today's plans for: EXPLAIN SELECT a.*, b.*, c.* FROM a JOIN b ON a.id=b.a_id JOIN c ON b.id=c.b_id WHERE a.status='active'. If the join order changed (e.g., from a->b->c to b->a->c), check statistics: SELECT tablename, attname, n_distinct FROM pg_stats WHERE tablename IN ('a','b','c'). The root cause is usually stale stats on one table: run ANALYZE a; ANALYZE b; ANALYZE c; and rerun EXPLAIN. If the plan is still wrong, the stats may be inherently unreliable (e.g., heavily skewed data). Fixes: (1) Increase the statistics target for the skewed column: ALTER TABLE a ALTER COLUMN status SET STATISTICS 1000; ANALYZE a. (2) Constrain the planner: SET join_collapse_limit = 1 forces joins in the order written in the FROM clause. (3) Hint with a CTE: on PostgreSQL 12+ use WITH cte_a AS MATERIALIZED (SELECT ... FROM a WHERE ...) so that a is evaluated first (plain CTEs have been inlined since PG12). (4) Add indexes on all join keys so any order is reasonably fast. For regression tracking: keep EXPLAIN output in version control; when a plan changes, correlate it with analyze timing in pg_stat_user_tables (last_autoanalyze).
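Both join-order pinning techniques can be sketched together; the tables a, b, c match the example above:

```sql
-- Collapse the planner's search space so joins run in written order:
SET join_collapse_limit = 1;
EXPLAIN
SELECT a.*, b.*, c.*
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id
WHERE a.status = 'active';
RESET join_collapse_limit;

-- On PostgreSQL 12+, a CTE only acts as an optimization fence
-- when declared MATERIALIZED:
WITH cte_a AS MATERIALIZED (
    SELECT * FROM a WHERE status = 'active'
)
SELECT *
FROM cte_a
JOIN b ON cte_a.id = b.a_id
JOIN c ON b.id = c.b_id;
```

Both are blunt instruments: they trade planner flexibility for predictability, so they belong on queries whose best order is known and stable.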

Follow-up: Explain join_collapse_limit and from_collapse_limit. How do they force join order?

EXPLAIN (ANALYZE) shows 99% of time in "Filter: col > 100" applied after seq scan. Plan scans 50M rows, filters to 500. Redesign this query to avoid post-scan filter.

A predicate applied as a post-scan Filter is inefficient; it should be applied during index access, where EXPLAIN labels it Index Cond. A predicate stays in Filter when: (1) it is not indexable as written (e.g., wrapped in a function call), (2) there is no index on the column, or (3) the planner chose a seq scan despite an index existing. Check: EXPLAIN SELECT * FROM big_table WHERE col > 100 AND func(col) = 5. With a plain index, col > 100 should appear as "Index Cond" (pushed into the index) while func(col) = 5 remains a post-scan "Filter". If col > 100 shows up as Filter, create the index: CREATE INDEX idx_col ON big_table(col); then ANALYZE big_table and rerun EXPLAIN. If it is still a Filter, test the index path: SET enable_seqscan = off; EXPLAIN SELECT .... If the index is now used, random_page_cost is the issue; lower it. For func(col) = 5, create an expression index (the function must be IMMUTABLE): CREATE INDEX idx_func ON big_table(func(col)). Verify: EXPLAIN should now show both predicates as Index Cond on their respective indexes; anything left in Filter is evaluated row by row after fetching. Staff pattern: push predicates into Index Cond; minimize post-scan Filters. Measure with EXPLAIN (ANALYZE, BUFFERS) and the "Rows Removed by Filter" line to confirm how much work the index eliminates.
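The two-index fix can be sketched end to end; big_table, col, and func are the hypothetical names from the example, and func must be declared IMMUTABLE for the expression index to be valid:

```sql
-- Plain index: lets "col > 100" become an Index Cond.
CREATE INDEX idx_big_table_col ON big_table (col);

-- Expression index: lets "func(col) = 5" be matched against the
-- index instead of being re-evaluated per fetched row.
CREATE INDEX idx_big_table_func ON big_table (func(col));

ANALYZE big_table;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table
WHERE col > 100 AND func(col) = 5;
-- Expect "Index Cond: (col > 100)" rather than a post-scan "Filter",
-- and a small "Rows Removed by Filter" count.
```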

Follow-up: Explain the difference between Index Cond and Filter in EXPLAIN output. Why is Index Cond faster?

Query plan changed between PostgreSQL 11 and 13. Same data, same indexes, different plan. EXPLAIN shows significant latency regression. How to diagnose the root cause and stabilize?

Major versions change planner heuristics, so plan changes across an upgrade are common. The biggest behavioral change between 11 and 13 is CTE inlining: in PG11 every CTE was an optimization fence; since PG12, CTEs are inlined into the outer query unless written AS MATERIALIZED. PG12 and 13 also extended multivariate statistics (MCV lists; support for OR and IN clauses in 13). Diagnose: (1) Capture EXPLAIN from both versions (explain_11.txt, explain_13.txt) and compare: diff -u explain_11.txt explain_13.txt. (2) If the query uses CTEs, add AS MATERIALIZED and see whether the PG11 plan returns. (3) Match old cost settings if they were customized (e.g., SET random_page_cost = 4.0). (4) Check whether extended statistics changed estimates: SELECT statistics_name FROM pg_stats_ext; drop suspects with DROP STATISTICS IF EXISTS stats_name, re-plan, and re-create them selectively if that was the cause. (5) Refresh statistics: pg_upgrade does not carry them over, so run vacuumdb --all --analyze-in-stages after upgrading. (6) As a last resort, use the pg_hint_plan extension (PostgreSQL has no native query hints): /*+ SeqScan(orders) */ SELECT .... For production: test representative queries on the new version before upgrading, set log_min_duration_statement to catch regressions after cutover, and keep a rollback path (a standby on the old version) until the upgrade is validated.
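The two stabilization levers can be sketched as follows; orders, customers, and stats_name are hypothetical, and the hint comment only works with the pg_hint_plan extension loaded:

```sql
-- In PG11 this CTE was an optimization fence; PG12+ inlines it.
-- Restoring the fence often restores the PG11 plan:
WITH recent AS MATERIALIZED (
    SELECT * FROM orders
    WHERE created_at > now() - interval '7 days'
)
SELECT *
FROM recent
JOIN customers c ON c.id = recent.customer_id;

-- List extended statistics, then drop one for an A/B test:
SELECT statistics_name FROM pg_stats_ext;
DROP STATISTICS IF EXISTS stats_name;

-- With pg_hint_plan installed, a scan method can be pinned:
/*+ SeqScan(orders) */
SELECT * FROM orders WHERE status = 'shipped';
```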

Follow-up: Explain multivariate statistics and when they help the planner. How do you create and drop them?

EXPLAIN (ANALYZE) shows a nested loop whose inner scan executes 100 times (loops=100). Cost balloons from a 10ms estimate to 30s actual. Root cause: the loop count isn't what you think. Diagnose.

Nested loops are efficient only if the inner scan is fast. The inner side runs once per outer row, so cost multiplies: if the outer scan returns 100 rows and the inner scan (a correlated subquery) executes for each one, EXPLAIN ANALYZE shows loops=100 on the inner node, and that node's "actual time" is the per-loop average. The planner estimates total cost as outer rows × per-loop inner cost; if it expects a cheap indexed lookup but the inner side is actually a seq scan with a correlated filter, runtime balloons to O(N*M). Check: EXPLAIN (ANALYZE, VERBOSE) SELECT a.*, (SELECT COUNT(*) FROM b WHERE b.a_id = a.id) AS cnt FROM a. If loops=100 and the inner node is "Seq Scan on b" with a Filter on a.id, every outer row triggers a full scan of b. Fix: (1) Rewrite the correlated subquery as a JOIN with aggregation: SELECT a.*, COUNT(b.id) FROM a LEFT JOIN b ON b.a_id = a.id GROUP BY a.id (valid when a.id is the primary key). (2) If the subquery shape must stay, pre-aggregate b once in a materialized CTE: WITH counts AS MATERIALIZED (SELECT a_id, COUNT(*) AS cnt FROM b GROUP BY a_id) SELECT a.*, COALESCE(counts.cnt, 0) FROM a LEFT JOIN counts ON counts.a_id = a.id. (3) Index the join key so the inner scan becomes an index lookup: CREATE INDEX idx_b_a_id ON b(a_id). (4) Use EXPLAIN (ANALYZE, BUFFERS) to confirm the index is used and the loop count matches expectations.
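The rewrite can be sketched side by side; a and b are the hypothetical tables from the example, with a.id as the primary key:

```sql
-- Correlated form: the subquery body runs once per row of a.
-- SELECT a.*,
--        (SELECT COUNT(*) FROM b WHERE b.a_id = a.id) AS cnt
-- FROM a;

-- JOIN rewrite: one pass over b, aggregated, then joined.
SELECT a.*, COALESCE(cnt.n, 0) AS cnt
FROM a
LEFT JOIN (
    SELECT a_id, COUNT(*) AS n
    FROM b
    GROUP BY a_id
) cnt ON cnt.a_id = a.id;

-- If the nested loop is kept, make the inner side an index lookup:
CREATE INDEX idx_b_a_id ON b (a_id);
```

With the index in place, a nested loop over 100 outer rows does 100 cheap index probes instead of 100 full scans of b, which is often fast enough that no rewrite is needed.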

Follow-up: Explain correlated subqueries and when they cause O(N^2) plans. How to rewrite as JOIN?

You inherit a query with 200 lines of nested CTEs and function calls. EXPLAIN is unreadable. Design a debugging approach to find the slow node without rewriting the entire query.

Complex queries need incremental analysis. Strategy: (1) Get machine-readable output: psql -XqAt -c 'EXPLAIN (ANALYZE, FORMAT JSON) <query>' | jq and filter plan nodes by their "Actual Total Time" field to find the slow ones (EXPLAIN output cannot be piped to jq from inside psql). (2) Isolate CTEs one at a time: take each CTE body and run EXPLAIN (ANALYZE) SELECT COUNT(*) FROM (<cte body>) sub; the one with the dominant runtime is the slow node. (3) Check intermediate cardinalities the same way: a CTE that unexpectedly returns millions of rows explains a slow downstream join. (4) During a debugging session, set log_min_duration_statement = 0 to log every statement with its duration, then analyze the log. (5) For functions, measure them directly: EXPLAIN (ANALYZE) SELECT func(...). Note that since PG12, plain CTEs are inlined into the main plan; write them AS MATERIALIZED if you want each CTE to appear as its own plan node. (6) Once the slow node is identified, optimize it separately, then re-integrate. Staff pattern: break the query into parts, optimize each independently, then compose. For expensive CTE results reused across queries, use a materialized view: CREATE MATERIALIZED VIEW cte1_view AS (SELECT ... FROM ...); REFRESH MATERIALIZED VIEW cte1_view; SELECT * FROM cte1_view WHERE ....
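The isolation steps can be sketched with stand-in bodies (SELECT 1 marks where the real 200-line query or CTE body goes; cte1_view is a hypothetical name):

```sql
-- JSON output is easier to mine per node than text output;
-- each node carries an "Actual Total Time" field.
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT 1;  -- stand-in for the full query

-- Run one CTE body in isolation to check its runtime and cardinality:
EXPLAIN (ANALYZE)
SELECT COUNT(*) FROM (SELECT 1 AS x) cte1;  -- stand-in for cte1's body

-- Cache an expensive, reused CTE as a materialized view:
CREATE MATERIALIZED VIEW cte1_view AS
SELECT 1 AS x;  -- stand-in for cte1's body
REFRESH MATERIALIZED VIEW cte1_view;
SELECT * FROM cte1_view;
```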

Follow-up: Explain materialized views vs. temporary tables vs. CTEs in terms of performance and maintainability.
