App hangs: pg_stat_activity shows waiting queries, but none is explicitly blocked by a lock. Diagnose using wait_event, blocking PIDs, and the lock graph. Root causes: lock waits, deadlock, or stalled I/O.
Hangs indicate processes waiting for something. Check:
(1) Lock waits: SELECT pid, usename, query_start, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event IS NOT NULL. If wait_event_type = 'Lock', the query is blocked on a table/row lock.
(2) Find the blocker. On PostgreSQL 9.6+ the simple way is SELECT pid, pg_blocking_pids(pid) FROM pg_stat_activity WHERE wait_event_type = 'Lock'. The classic pg_locks self-join also works: SELECT DISTINCT l2.pid AS blocking_pid FROM pg_locks l1 JOIN pg_locks l2 ON l2.locktype = l1.locktype AND l2.database IS NOT DISTINCT FROM l1.database AND l2.relation IS NOT DISTINCT FROM l1.relation AND l2.page IS NOT DISTINCT FROM l1.page AND l2.tuple IS NOT DISTINCT FROM l1.tuple AND l2.virtualxid IS NOT DISTINCT FROM l1.virtualxid AND l2.transactionid IS NOT DISTINCT FROM l1.transactionid AND l2.classid IS NOT DISTINCT FROM l1.classid AND l2.objid IS NOT DISTINCT FROM l1.objid AND l2.objsubid IS NOT DISTINCT FROM l1.objsubid AND l2.granted AND l2.pid != l1.pid WHERE NOT l1.granted ORDER BY 1 (complex but finds all blockers).
(3) Check for I/O stalls: SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type IN ('IO', 'BufferPin').
(4) True deadlocks never hang: PostgreSQL detects them and aborts one victim, so the logs show "ERROR: deadlock detected".
Root causes: (a) circular lock dependency: txn A locks X and waits for Y; txn B locks Y and waits for X. (b) Inconsistent lock ordering in the app. (c) A slow transaction holding locks.
Fix: kill the blocker with SELECT pg_terminate_backend(<blocking_pid>), or cancel just its current query with SELECT pg_cancel_backend(<blocking_pid>).
Prevention: SELECT * FROM table FOR UPDATE SKIP LOCKED skips already-locked rows (useful for queue processing); SELECT * FROM table FOR UPDATE NOWAIT raises an error instead of blocking.
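A sketch tying each waiting backend to the sessions blocking it, using pg_blocking_pids() (PostgreSQL 9.6+); the column selection is illustrative:

```sql
-- For every backend stuck on a lock, list the PIDs blocking it and what
-- each side is running (pg_blocking_pids requires PostgreSQL 9.6+).
SELECT w.pid                   AS waiting_pid,
       w.query                 AS waiting_query,
       pg_blocking_pids(w.pid) AS blocked_by,
       b.query                 AS blocking_query
FROM   pg_stat_activity w
JOIN   pg_stat_activity b
       ON b.pid = ANY (pg_blocking_pids(w.pid))
WHERE  w.wait_event_type = 'Lock';
```

An empty result here, combined with rows stuck in pg_stat_activity, points at I/O or buffer contention rather than lock waits.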
Follow-up: Explain lock types in PostgreSQL: AccessExclusiveLock, RowExclusiveLock, etc. Which cause blocks?
Deadlock detected: Two transactions, Txn A: UPDATE table1 then UPDATE table2. Txn B: UPDATE table2 then UPDATE table1. A waits for B, B waits for A. Show recovery and prevention.
Classic circular deadlock: lock order matters; both transactions must acquire locks in the same order. PostgreSQL detects the cycle and aborts one victim (ERROR: deadlock detected); the caller retries.
Recovery: (1) Immediate: retry with backoff on the app side (shell sketch): for attempt in {1..5}; do pg_query && break || sleep $((2**attempt)); done. (Note: in bash arithmetic, 2**attempt is exponentiation; 2^attempt is XOR.)
(2) Prevention: fix the lock order. Ensure every transaction locks table1 before table2: both A and B run BEGIN; UPDATE table1 ... ; UPDATE table2 ... ; COMMIT in the same order. A transaction with nothing to change in table1 can still pre-lock the relevant table1 rows with SELECT ... FOR UPDATE to preserve the ordering.
(3) SERIALIZABLE with explicit table locks: BEGIN ISOLATION LEVEL SERIALIZABLE; LOCK TABLE table1, table2 IN ACCESS EXCLUSIVE MODE; ... COMMIT. Heavy-handed: blocks all concurrent access to both tables.
(4) Reduce lock scope: pre-lock only the affected rows, in a consistent order, with SELECT 1 FROM table1 WHERE id IN (...) ORDER BY id FOR UPDATE (likewise for table2), then run the plain UPDATEs. (UPDATE takes its own row locks; FOR UPDATE is a SELECT clause and cannot be attached to UPDATE.)
(5) Advisory locks for application-level coordination: SELECT pg_advisory_lock(key); UPDATE ...; SELECT pg_advisory_unlock(key).
Monitoring: deadlock_timeout is how long a lock wait lasts before the deadlock detector runs: ALTER SYSTEM SET deadlock_timeout = '1s' (the 1s default is fine). The logs record every deadlock; review them regularly for recurring patterns.
Best practice: design application transactions to lock resources in a consistent order; document that order in code comments.
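A minimal sketch of the consistent-order fix, assuming illustrative table and column names; because both transactions take locks in the same sequence, neither can hold a lock the other needs:

```sql
-- Both Txn A and Txn B use this exact shape: lock table1 rows first,
-- then table2 rows, always in ascending id order (names illustrative).
BEGIN;
SELECT 1 FROM table1 WHERE id = 42 FOR UPDATE;  -- acquire table1 row lock first
SELECT 1 FROM table2 WHERE id = 42 FOR UPDATE;  -- then table2
UPDATE table1 SET col = 'new' WHERE id = 42;    -- rows already locked: no new waits
UPDATE table2 SET col = 'new' WHERE id = 42;
COMMIT;
```

The SELECT ... FOR UPDATE lines are the whole fix: they force the lock acquisition order regardless of which UPDATE each transaction actually needs.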
Follow-up: Explain deadlock_timeout. Why not set it very low?
Read-only SELECT queries are blocking each other. No deadlock. pg_stat_activity shows them all in 'active' state, but wait_event suggests I/O. Root cause and fix.
A plain SELECT takes no row locks, and under MVCC readers never block readers. If SELECTs appear to block each other, suspect:
(1) Nearby FK-driven writes: an UPDATE/DELETE on customers must check referencing rows in orders; without an index on the FK column that check scans the whole child table and holds its locks longer. Ensure FK indexes: CREATE INDEX idx_orders_customer_id ON orders(customer_id).
(2) I/O contention, not locks: all queries waiting on disk. Check: SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type = 'IO'. If many, add I/O capacity or spread the reads.
(3) SERIALIZABLE overhead: SHOW transaction_isolation. Under 'serializable', predicate-lock (SIReadLock) tracking adds bookkeeping and causes serialization failures (aborts, not waits) between reads and overlapping writes. If the workload doesn't need it: ALTER SYSTEM SET default_transaction_isolation = 'read committed'; SELECT pg_reload_conf(). (The server-wide parameter is default_transaction_isolation; transaction_isolation is session-scoped.)
(4) Shared-buffer churn: many readers evicting pages. Raise shared_buffers, e.g. ALTER SYSTEM SET shared_buffers = '32GB' (requires restart; commonly sized near 25% of RAM).
(5) Long-running transactions pin xmin and block VACUUM; readers then wade through dead tuples and contend on buffers. Kill stale sessions: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction'.
Root cause: most likely I/O or bloat, not actual lock blocks. Verify: SELECT COUNT(*) FROM pg_locks WHERE NOT granted; if 0, nothing is lock-blocked.
Solution: faster storage (SSD/NVMe), read replicas to spread load, or query optimization (indexes, partitioning).
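A quick triage sketch for this scenario: summarize what the active backends are actually waiting on, so I/O contention and lock waits are distinguished in one glance:

```sql
-- Histogram of wait events across active backends. Rows clustering under
-- wait_event_type 'IO' (e.g. DataFileRead) mean a disk bottleneck; rows
-- under 'Lock' mean real blocking; NULL wait_event means running on CPU.
SELECT wait_event_type, wait_event, count(*) AS backends
FROM   pg_stat_activity
WHERE  state = 'active'
GROUP  BY wait_event_type, wait_event
ORDER  BY backends DESC;
```

Run it a few times over a minute; a stable IO-dominated profile confirms the "I/O, not locks" diagnosis above.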
Follow-up: Explain phantom reads in SERIALIZABLE. How does SERIALIZABLE prevent them?
SELECT ... FOR UPDATE queries on a high-throughput queue table cause serialization bottleneck. Throughput capped at 100 txns/sec despite 8 CPU cores. Root cause and optimization.
FOR UPDATE locks rows; when every worker competes for the same head-of-queue row, they serialize because a row's lock wait queue is FIFO. Each dequeue (SELECT ... WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE) locks the same first row; every other transaction waits. Solutions:
(1) SKIP LOCKED: SELECT * FROM queue WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED. Each worker claims the next unlocked row without waiting, so parallel workers process different rows and throughput scales with concurrency.
(2) Partition the queue: CREATE TABLE queue (id SERIAL, worker_id INT, status TEXT) PARTITION BY HASH (worker_id). Each worker dequeues only from its own partition, so there is no cross-worker lock contention.
(3) Advisory locks: one pattern is SELECT id FROM queue WHERE status = 'pending' AND pg_try_advisory_lock(id) LIMIT 1; then UPDATE queue SET status = 'processing' WHERE id = <picked_id>. Advisory locks are cheaper than row locks but must be released with pg_advisory_unlock (or at session end).
(4) Shard across tables: queue_1, queue_2, ..., queue_8, routed by hash(task_id) % 8; each worker handles a subset of queues.
(5) Consume in one statement: DELETE FROM queue WHERE id = (SELECT id FROM queue WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING id; replaces the SELECT+UPDATE pair.
Testing: benchmark both variants, e.g. time for i in {1..10000}; do psql -c "SELECT id FROM queue WHERE status='pending' LIMIT 1 FOR UPDATE SKIP LOCKED"; done (crude: each psql call is its own connection and transaction). Expect 1000+ txns/sec with SKIP LOCKED vs. ~100 without.
Recommended: SKIP LOCKED for queue tables is the production standard.
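The claim step above is usually written as a single atomic statement; a sketch assuming the same illustrative queue(id, status) schema:

```sql
-- Atomically claim one pending job: the subquery locks the first unlocked
-- pending row (skipping rows other workers hold), and the outer UPDATE
-- marks it as ours in the same statement.
UPDATE queue
SET    status = 'processing'
WHERE  id = (SELECT id
             FROM   queue
             WHERE  status = 'pending'
             ORDER  BY id
             LIMIT  1
             FOR UPDATE SKIP LOCKED)
RETURNING id;
```

If RETURNING yields no row, the queue is empty (or fully claimed); the worker sleeps and retries. Committing after the work finishes releases the row lock and publishes the status change.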
Follow-up: Explain SKIP LOCKED vs. NOWAIT. When would you use each?
Row-level locking is slow for bulk UPDATE: UPDATE table SET col=val WHERE id IN (list_of_1000_ids) locks 1000 rows, causing a cascade of blocks. Optimize the locking strategy.
Bulk updates lock many rows at once; everyone else waits. Strategies:
(1) Lock the table once instead of 1000 rows: BEGIN; LOCK TABLE t IN ACCESS EXCLUSIVE MODE; UPDATE t SET col = val WHERE id IN (...); COMMIT. Downside: blocks all queries, even SELECTs.
(2) Batch in small chunks via an app-side loop (or a procedure on PG 11+; a plain function cannot COMMIT mid-run): for each 100-id slice, BEGIN; UPDATE t SET col = val WHERE id = ANY(slice); COMMIT. Each batch locks only 100 rows, briefly.
(3) Prepared statements cut execution time and thus lock-hold time: BEGIN; PREPARE stmt AS UPDATE t SET col = $1 WHERE id = $2; EXECUTE stmt('val', 123); ... COMMIT.
(4) Sort the IDs before updating. Concurrent bulk updates that touch rows in the same sorted order cannot deadlock against each other, and ordered access matches the physical page layout (fewer repeated page fetches).
(5) Defer to off-peak: schedule bulk updates during a low-concurrency maintenance window.
(6) Stage via an UNLOGGED table: CREATE UNLOGGED TABLE updates_staging (id INT, col TEXT); INSERT INTO updates_staging VALUES (...); UPDATE t SET col = s.col FROM updates_staging s WHERE t.id = s.id; DROP TABLE updates_staging. Faster (no WAL for the staging table) but staging data is lost on crash.
(7) Parallelism will not help here: PostgreSQL does not execute UPDATE in parallel; parallel_workers accelerates only the scan side of a query, not the write or its locking.
Best: batching + off-peak + monitoring. Measure lock queue depth: SELECT COUNT(*) FROM pg_locks WHERE NOT granted.
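A sketch of the batching strategy, assuming an illustrative big_table(id, col); each short transaction locks roughly 100 rows instead of all 1000, and the sorted id slices keep concurrent bulk updaters deadlock-free:

```sql
-- One batch of the app-driven loop. The application pre-sorts the 1000
-- target ids and feeds them in consecutive 100-id slices; ids shown are
-- illustrative placeholders for one slice.
BEGIN;
UPDATE big_table
SET    col = 'val'
WHERE  id = ANY (ARRAY[101, 102, 105, 110]);  -- next sorted slice of ids
COMMIT;  -- releases this batch's row locks; repeat with the next slice
```

Pausing briefly between batches (app-side sleep) lets queued readers and writers drain before the next slice takes its locks.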
Follow-up: Explain lock escalation in PostgreSQL. Why doesn't PG escalate to table-level locks automatically?
A transaction holds a row lock indefinitely (no explicit LOCK statement; the transaction is simply still running). A downstream app can't proceed. Diagnose: how old is the txn? What lock is it holding?
An indefinite hold usually means an app bug: an infinite loop, stalled I/O, or a forgotten COMMIT. Diagnose:
(1) Transaction age and state: SELECT pid, usename, backend_xmin, state, query, xact_start, query_start, now() - query_start AS elapsed FROM pg_stat_activity WHERE pid = <pid>.
(2) If query_start is hours old, the txn stalled mid-query; if state is 'idle in transaction' with an old xact_start, the app opened a transaction and never committed.
(3) Check wait_event: SELECT wait_event_type, wait_event FROM pg_stat_activity WHERE pid = <pid>. 'IO' or 'BufferPin' points at slow disk.
(4) Check for loops: inspect the query text; an app loop that never commits can hold locks indefinitely even though each statement finishes quickly.
(5) What it holds: pg_locks shows locktype (relation, tuple, transactionid, ...), relation (an OID; cast with relation::regclass for the table name), and granted (true once acquired).
Solution: (1) If the app is hung, kill the backend: SELECT pg_terminate_backend(<pid>). (2) Before killing, capture the query for debugging: psql -t -c "SELECT query FROM pg_stat_activity WHERE pid = <pid>". (3) Investigate the root cause: network timeout? App crash? Slow query? (4) Add logging: ALTER SYSTEM SET log_min_duration_statement = 60000 (log statements longer than 60s; value in ms). log_statement = 'all' also works but is very verbose.
Recovery: restart the app cleanly; re-run the work.
Prevention: (1) statement_timeout: ALTER SYSTEM SET statement_timeout = '5min' cancels any statement running longer than 5 minutes. (2) idle_in_transaction_session_timeout: ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min' terminates sessions sitting idle inside an open transaction. (3) App-side: set connection and query timeouts in the driver.
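The prevention settings above, gathered into one configuration sketch (values are examples to tune per workload, not recommendations; note these apply server-wide, so per-role or per-session overrides may be preferable for batch jobs):

```sql
-- Guardrails against stuck transactions (example values).
ALTER SYSTEM SET statement_timeout = '5min';                      -- cancel long statements
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';    -- kill forgotten open txns
ALTER SYSTEM SET log_min_duration_statement = '60s';              -- log anything slower than 60s
SELECT pg_reload_conf();  -- all three take effect without a restart
```

A batch role can opt out where needed: ALTER ROLE etl_user SET statement_timeout = 0.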
Follow-up: Explain statement_timeout vs. idle_in_transaction_session_timeout. When does each trigger?