Your app reports "dirty read" bugs during A/B tests. Two transactions see different values for the same row within milliseconds. You check pg_stat_activity and see READ COMMITTED isolation. Walk through how xmin/xmax and transaction IDs prevent this—or fail to.
PostgreSQL MVCC assigns each transaction a unique XID (transaction ID) and stamps every tuple header with xmin (the XID that inserted it) and xmax (the XID that deleted or updated it, if any). A tuple is visible to a snapshot when its xmin committed before the snapshot was taken and its xmax is unset, aborted, or not yet visible to that snapshot. READ COMMITTED takes a new snapshot for each statement; REPEATABLE READ and SERIALIZABLE keep the snapshot taken at the transaction's first statement. True dirty reads are impossible in PostgreSQL: a tuple whose xmin belongs to an uncommitted transaction is invisible to every other session's snapshot. Inspect tuples with SELECT xmin, xmax, * FROM table WHERE id=123, and view the active snapshot with SELECT pg_current_snapshot() (PG 13+; txid_current_snapshot() on older versions). The reported "dirty read" is almost certainly a non-repeatable read (the same transaction sees different values across statements) caused by READ COMMITTED's per-statement snapshots. Use REPEATABLE READ or SERIALIZABLE to pin the snapshot for the whole transaction.
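A minimal psql sketch of the inspection steps above (the table name accounts and id value are hypothetical placeholders):

```sql
-- Session A: inspect tuple headers and the active snapshot.
BEGIN ISOLATION LEVEL READ COMMITTED;

-- xmin/xmax are hidden system columns present on every table.
SELECT xmin, xmax, * FROM accounts WHERE id = 123;

-- Snapshot format is xmin:xmax:xip_list. Tuples whose inserting XID is in
-- xip_list (in progress) or >= the snapshot xmax are invisible to us.
SELECT pg_current_snapshot();   -- PG 13+; txid_current_snapshot() before that

-- If another session commits an UPDATE to this row now, re-running the
-- SELECT returns the new value: READ COMMITTED re-snapshots per statement.
SELECT xmin, xmax, * FROM accounts WHERE id = 123;
COMMIT;
```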
Follow-up: A transaction holding REPEATABLE READ sees 500M rows in a table, then another transaction DELETEs 100M rows and commits. Does the first tx still see 500M rows? How does PostgreSQL avoid scanning dead tuples?
In production, you notice xmin inflation. A long-running transaction is 2 billion XID transactions old, and autovacuum can't clean dead tuples. Your monitoring fires. Explain the problem and how to fix it without killing the long txn.
Long-running transactions (e.g., a single-threaded pg_dump, forgotten idle-in-transaction sessions) hold the xmin horizon back: VACUUM must retain any dead tuple that the oldest snapshot might still need. Find the culprit with SELECT pid, usename, state, xact_start, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY age(backend_xmin) DESC, and track wraparound headroom with SELECT datname, age(datfrozenxid) FROM pg_database. Mitigations: (1) If the blocker is merely idle in transaction, ending that session is usually the real fix: SELECT pg_terminate_backend(pid) — this does kill the session, so it only satisfies the "don't kill it" constraint when the transaction is abandoned anyway. (2) Run VACUUM (FREEZE) on other tables to mark all-visible tuples frozen (modern PostgreSQL sets an infomask flag rather than overwriting xmin with FrozenXID 2) and advance relfrozenxid, buying headroom; it cannot, however, remove dead tuples the old snapshot still pins. (3) Set idle_in_transaction_session_timeout so abandoned transactions are ended automatically in future. (4) Check pg_replication_slots and hot_standby_feedback, which hold the horizon in the same way. If age(datfrozenxid) nears ~2 billion, PostgreSQL refuses to assign new XIDs and shuts down to prevent wraparound data loss; recovery requires single-user-mode VACUUM, which is why this is staff-engineer territory.
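The diagnostics described above can be run as one triage script; thresholds and LIMITs here are illustrative:

```sql
-- Who is holding the xmin horizon back? backend_xmin is the oldest XID the
-- backend's snapshot still needs; a large age() means vacuum is blocked.
SELECT pid, usename, state, xact_start,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

-- Replication slots pin the horizon the same way.
SELECT slot_name, active, age(xmin) AS slot_xmin_age
FROM pg_replication_slots
WHERE xmin IS NOT NULL;

-- Per-database wraparound headroom: alert well before ~2 billion.
SELECT datname, age(datfrozenxid) AS frozen_age
FROM pg_database
ORDER BY frozen_age DESC;
```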
Follow-up: What's the difference between age(xmin) and age(datfrozenxid)? When does VACUUM FREEZE become mandatory vs. optional?
You're running SERIALIZABLE isolation on a high-throughput API. Txns abort with "serialization failure" even though they don't touch the same rows. Investigate why SERIALIZABLE conflicts occur and propose a fix without dropping to READ COMMITTED.
SERIALIZABLE in PostgreSQL uses Serializable Snapshot Isolation (SSI), not two-phase locking. It tracks rw-dependencies with predicate locks (SIReadLocks): if transaction A reads data that concurrent transaction B writes and a dangerous dependency cycle forms, one transaction aborts with ERROR: could not serialize access due to read/write dependencies among transactions. Conflicts between transactions that touch different rows happen because predicate locks are taken at tuple, page, or whole-relation granularity; when per-transaction lock memory runs low (max_pred_locks_per_transaction), tuple locks are promoted to page or relation locks, so transactions hitting different rows on the same page or index range can still collide. Inspect with SELECT locktype, relation::regclass, page, tuple, pid FROM pg_locks WHERE mode = 'SIReadLock'. Fixes: (1) Reduce transaction scope — move non-conflicting work outside the transaction. (2) Raise max_pred_locks_per_transaction to avoid lock promotion. (3) Add retry logic with exponential backoff; serialization failures are expected under SSI and always safe to retry. (4) Profile with EXPLAIN (ANALYZE) to find sequential or wide range scans (which take coarse predicate locks) and add indexes to narrow them. (5) For specific hot rows, SELECT ... FOR UPDATE falls back to pessimistic row locking. (6) Accept some aborts as an SLO cost; monitoring must track the abort rate.
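A classic demonstration of an SSI abort where neither transaction updates the other's rows is write skew; a sketch using a hypothetical on_call(doctor, shift) table:

```sql
-- Write skew: each txn checks a predicate the other invalidates.

-- Session A
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call WHERE shift = 'night';     -- sees 2

-- Session B
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call WHERE shift = 'night';     -- also sees 2
UPDATE on_call SET shift = 'off' WHERE doctor = 'bob';
COMMIT;                                                 -- B commits first

-- Session A
UPDATE on_call SET shift = 'off' WHERE doctor = 'alice';
COMMIT;
-- ERROR:  could not serialize access due to read/write dependencies
--         among transactions
-- A and B wrote different rows; the rw-dependency is on the predicate
-- (shift = 'night'), which is exactly what SSI's SIReadLocks protect.
```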
Follow-up: A multi-statement SERIALIZABLE txn fails halfway. Is the partial state rolled back? How do you ensure idempotency?
A report shows read latency spiked when writes increased. You suspect contention on MVCC tuple headers (xmin/xmax updates). Use pg_stat_statements and lock monitoring to pinpoint the hot table, then propose an optimization.
Under MVCC, UPDATE never modifies a row in place: it writes a new tuple version and stamps the old tuple's xmax, while DELETE only sets xmax. Heavy write concurrency on one table therefore churns tuple headers, bloats the heap, and contends on buffer locks. Find the hot table with SELECT schemaname, relname, n_tup_upd, n_tup_hot_upd, seq_scan, seq_tup_read FROM pg_stat_user_tables ORDER BY n_tup_upd DESC, and the hot statements with SELECT queryid, query, calls, mean_exec_time FROM pg_stat_statements WHERE query LIKE '%UPDATE%' ORDER BY mean_exec_time DESC. Cross-reference with SELECT pid, relation::regclass, mode, granted FROM pg_locks WHERE NOT granted to find blocked sessions. Root causes: (1) missing indexes forcing sequential scans; (2) broad UPDATEs hitting millions of rows; (3) HOT (heap-only tuple) updates defeated because an updated column is indexed or pages have no free space. Fixes: (1) add covering indexes for reads (CREATE INDEX idx ON t(col1, col2) INCLUDE (col3)), but avoid indexing frequently updated columns so HOT stays possible; (2) lower fillfactor (e.g., ALTER TABLE t SET (fillfactor = 80)) to leave page room for HOT chains; (3) partition by the hot key to spread write load; (4) UNLOGGED tables only if durability allows (rare).
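A sketch of a HOT-friendly layout for a write-hot table (the counters table and its columns are hypothetical):

```sql
-- HOT applies only when no *indexed* column changes and the page has room.
CREATE TABLE counters (
    id     bigint PRIMARY KEY,
    label  text,
    hits   bigint NOT NULL DEFAULT 0
) WITH (fillfactor = 80);       -- reserve in-page space for HOT chains

-- Index lookup columns, but NOT the frequently updated one (hits),
-- so "UPDATE counters SET hits = hits + 1" can stay heap-only.
CREATE INDEX counters_label_idx ON counters (label);

-- Verify: n_tup_hot_upd should track n_tup_upd closely over time.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'counters';
```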
Follow-up: Explain HOT (Heap-Only Tuples) and why they reduce MVCC overhead. When is HOT disabled?
You're debugging a customer complaint: same query run twice in READ COMMITTED returns different results, even though no other transactions ran. Explain the cause and show me the psql commands to reproduce and diagnose.
READ COMMITTED takes a fresh snapshot for each statement, so if another session (often a background job the customer forgot about; autovacuum itself never changes query results) commits writes between the two statements, the second statement sees the new data. Reproduce with two psql sessions. Session A: BEGIN ISOLATION LEVEL READ COMMITTED; SELECT COUNT(*) FROM table; (note result, e.g., 1000). Session B: INSERT INTO table VALUES (...); COMMIT;. Session A: SELECT COUNT(*) FROM table; (sees 1001). Diagnose by comparing SELECT pg_current_snapshot(); before and after the two statements (XIDs drop out of the in-progress list as they commit) and by checking pg_stat_activity for concurrent writers. Root cause: MVCC's per-statement snapshots make non-repeatable reads legal at READ COMMITTED — documented behavior, not a bug. Fix in application code: wrap the whole operation in a REPEATABLE READ or SERIALIZABLE transaction, which reuses one snapshot for every statement.
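The two-session reproduction above, laid out as an interleaved transcript (t is a placeholder table):

```sql
-- Session A
BEGIN ISOLATION LEVEL READ COMMITTED;
SELECT count(*) FROM t;        -- e.g. 1000

-- Session B (separate connection)
BEGIN;
INSERT INTO t DEFAULT VALUES;
COMMIT;

-- Session A, same open transaction
SELECT count(*) FROM t;        -- 1001: each statement got a fresh snapshot
COMMIT;

-- Under BEGIN ISOLATION LEVEL REPEATABLE READ, the second count would
-- still return 1000, because the first statement's snapshot is reused.
```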
Follow-up: Why is non-repeatable read acceptable in ANSI SQL but not in financial txns? Show the ANSI isolation levels and PG's mapping.
Production alert: tuple visibility check latency increased 10x during the maintenance window. After investigation, you find hint bits are mostly unset across the table's pages. Explain hint bits, when they're set, and how to force-set them.
Hint bits are optimization flags in the tuple header's infomask (HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, HEAP_XMAX_INVALID) that cache the outcome of a commit-status lookup. The first time a tuple's visibility is checked, PostgreSQL consults the commit log (pg_xact, formerly pg_clog) and records the answer in the hint bits so later checks skip the lookup. When hint bits are mostly unset — typical right after a bulk load or a logical restore (pg_restore), since freshly inserted tuples carry no hints — every scan pays repeated pg_xact lookups, which degrades heavy scan workloads. Force-set them by reading the table: VACUUM (ANALYZE) sets hints as it scans, as does any full sequential read. Inspect hint-bit state with pageinspect: CREATE EXTENSION pageinspect; SELECT lp, t_infomask & 256 <> 0 AS xmin_committed FROM heap_page_items(get_raw_page('table', 0)); (0x0100 is HEAP_XMIN_COMMITTED). Production fix: schedule VACUUM during the maintenance window, or VACUUM (ANALYZE, FREEZE) to also freeze old pages. Monitor with SELECT schemaname, relname, last_vacuum, last_analyze FROM pg_stat_user_tables ORDER BY last_vacuum DESC NULLS LAST.
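A sketch for estimating hint-bit coverage over the first ten pages of a table with pageinspect (events is a hypothetical table name; 256 = 0x0100 = HEAP_XMIN_COMMITTED):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Count line pointers whose inserter is already hinted as committed.
SELECT count(*) FILTER (WHERE t_infomask & 256 <> 0) AS hinted,
       count(*)                                      AS total
FROM generate_series(0, 9) AS p,
     LATERAL heap_page_items(get_raw_page('events', p));

-- A low hinted/total ratio explains the pg_xact lookup overhead; a plain
-- sequential read or a VACUUM pass will set the bits.
```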
Follow-up: What's the difference between committed and frozen? Why would you freeze instead of just hinting?
Your SLA is 99.99% read latency. A user reports a stalled query; it's SELECT COUNT(*) on a 50GB table with no WHERE clause. EXPLAIN shows seq scan. You suspect xmin horizon stall from unvacuumed pages. Diagnose the blockage and show recovery steps.
An unvacuumed table forces PostgreSQL to evaluate xmin/xmax visibility for every tuple the seq scan touches; if pages haven't been hinted, each check also hits the commit log (pg_xact). Diagnose: SELECT schemaname, relname, last_vacuum, last_autovacuum, n_dead_tup FROM pg_stat_user_tables WHERE relname='table_name'. If the vacuum timestamps are old or NULL (never vacuumed) and n_dead_tup is high, that's the culprit. Run VACUUM (ANALYZE) table_name and retest. Immediate relief during the incident: (1) Cancel the stalled query if needed: SELECT pg_cancel_backend(pid). (2) Raise memory for the manual vacuum: SET maintenance_work_mem TO '4GB' then vacuum again. (3) Confirm autovacuum is enabled: SHOW autovacuum. (4) If autovacuum_naptime is too large, lower it: ALTER SYSTEM SET autovacuum_naptime = '10s'; SELECT pg_reload_conf();. Production hardening: alert on pg_stat_user_tables when vacuum lags by more than 24h or n_dead_tup crosses a threshold.
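The hardening advice above can be expressed as one alerting query; the thresholds (24 hours, 100k dead tuples) are illustrative, not canonical:

```sql
-- Tables whose vacuum is stale or whose dead-tuple count is high.
-- GREATEST ignores NULLs, so a table vacuumed only by autovacuum still
-- reports its most recent pass.
SELECT schemaname, relname, n_dead_tup,
       greatest(last_vacuum, last_autovacuum) AS last_any_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 100000
   OR coalesce(greatest(last_vacuum, last_autovacuum),
               'epoch'::timestamptz) < now() - interval '24 hours'
ORDER BY n_dead_tup DESC;
```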
Follow-up: How does VACUUM FREEZE differ from regular vacuum in multi-version visibility? Why use it before archiving a table?
You're migrating from Oracle (read consistency snapshot) to PostgreSQL. Oracle txns see one consistent view; the SQL-standard definition of REPEATABLE READ permits phantoms. A batch job depends on Oracle semantics. How do you provide them in PG without dropping to READ COMMITTED?
Oracle's SERIALIZABLE is snapshot isolation, which corresponds to PostgreSQL's REPEATABLE READ: both give a transaction one consistent snapshot for its entire duration, and PostgreSQL's implementation of REPEATABLE READ already prevents phantom reads in practice (the ANSI definition merely permits them at that level). For most Oracle batch jobs, BEGIN ISOLATION LEVEL REPEATABLE READ is therefore the faithful translation; differences surface mainly in concurrent-update handling (PG raises "could not serialize access" where Oracle raises ORA-08177). If the job needs true serializability, use PG SERIALIZABLE, which layers rw-dependency tracking on top of the snapshot. Explicit coarse locking is a blunter alternative: BEGIN; LOCK TABLE t1, t2, t3 IN SHARE MODE; ... COMMIT; blocks all concurrent writes to those tables — note that ACCESS SHARE would not, since it only conflicts with ACCESS EXCLUSIVE. Pessimistic row locking (SELECT ... FOR UPDATE) pins exactly the rows the batch reads. Migration practice: re-run the Oracle test suite's queries under PG REPEATABLE READ and compare result sets. Staff-note: most batch jobs don't need more than snapshot isolation; verify the SLA before over-engineering.
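A skeleton for the snapshot-consistent batch pattern (t1/t2/t3 and the queries are placeholders); taking the locks first matters, because the snapshot is only taken at the first snapshot-requiring query, i.e. after the locks are held:

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- Optional belt-and-braces: SHARE mode blocks concurrent
-- INSERT/UPDATE/DELETE on these tables while still allowing readers.
LOCK TABLE t1, t2, t3 IN SHARE MODE;

-- Every statement below sees the same snapshot:
SELECT count(*) FROM t1;
SELECT sum(amount) FROM t2;
-- ... batch work ...

COMMIT;
```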
Follow-up: What are predicates in SERIALIZABLE, and how do they prevent phantom reads? Show a phantom read scenario.