PostgreSQL Interview Questions

Memory Configuration and Shared Buffers


PostgreSQL is running with 1GB shared_buffers on a 64GB server. Is that right? Describe a tuning strategy: set shared_buffers, effective_cache_size, maintenance_work_mem, and work_mem for optimal performance.

shared_buffers is PostgreSQL's shared-memory buffer pool (roughly analogous to Oracle's SGA buffer cache). Too low means frequent disk I/O; too high wastes memory the OS page cache would use more effectively. Rules of thumb:
(1) OLTP (frequent reads/writes): ~25% of RAM. On 64GB, shared_buffers = 16GB.
(2) OLAP (bulk scans): 10-15% of RAM, since large sequential scans benefit less from the buffer pool.
(3) Small servers (<4GB RAM): still ~25%, validated carefully.
The current 1GB is far too low for a 64GB server. To adjust: ALTER SYSTEM SET shared_buffers = '16GB'; then restart PostgreSQL. Note pg_reload_conf() is NOT sufficient for this parameter; it requires a full restart.
effective_cache_size: tells the planner how much data is likely cached across shared_buffers plus the OS page cache. Set to roughly 75-80% of total RAM: ALTER SYSTEM SET effective_cache_size = '50GB'. A higher value makes the planner assume index pages are likely cached, encouraging index scans. It allocates nothing; it is a planner estimate only.
work_mem: per-sort/per-hash buffer, so total usage multiplies with concurrency, and a single query can allocate work_mem several times. With 32 concurrent sorts at 512MB, that is already 16GB. Start conservative (e.g., 64-256MB depending on max_connections) and raise it per-session for specific heavy queries.
maintenance_work_mem: for VACUUM, CREATE INDEX, etc.: ALTER SYSTEM SET maintenance_work_mem = '4GB'. Remember each autovacuum worker gets its own allocation.
Verification: SHOW shared_buffers; SHOW effective_cache_size; SHOW work_mem; SHOW maintenance_work_mem. After the restart, monitor the cache hit ratio from pg_stat_database: SELECT blks_hit::float / nullif(blks_hit + blks_read, 0) FROM pg_stat_database WHERE datname = current_database(). If blks_read stays high despite the larger shared_buffers, the working set may exceed RAM and the system may simply be I/O-bound (storage upgrade needed).
Best practice for 64GB: shared_buffers=16GB, effective_cache_size=50GB, work_mem=128-512MB depending on concurrency, maintenance_work_mem=4GB. Tuning is system-specific; adjust based on workload and monitoring.
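The rules of thumb above can be sketched as a quick calculation. This is illustrative only: the percentages are the heuristics from the answer, not hard limits, and the function name and dictionary keys are made up for the sketch.

```python
GB = 1024 ** 3  # not used in the arithmetic below, shown for scale

def suggest_settings(total_ram_gb, workload="oltp"):
    """Starting-point suggestions for PostgreSQL memory parameters,
    following the heuristics in the answer (25% OLTP / ~12.5% OLAP
    for shared_buffers, ~75-80% for effective_cache_size)."""
    sb_fraction = 0.25 if workload == "oltp" else 0.125  # OLAP: 10-15%
    return {
        "shared_buffers_gb": round(total_ram_gb * sb_fraction),
        "effective_cache_size_gb": round(total_ram_gb * 0.78),
        # maintenance_work_mem commonly capped around a few GB
        "maintenance_work_mem_gb": min(4, total_ram_gb // 16),
    }

print(suggest_settings(64))
# {'shared_buffers_gb': 16, 'effective_cache_size_gb': 50,
#  'maintenance_work_mem_gb': 4}
```

For the 64GB server in the question this reproduces the recommended 16GB / 50GB / 4GB starting point; an OLAP workload would halve shared_buffers.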

Follow-up: Explain shared_buffers vs. OS page cache. Why not set shared_buffers=64GB?

shared_buffers was increased from 1GB to 16GB, which requires a restart. During the restart, the app's connections were refused. Design a graceful restart without app downtime.

Memory-related parameters such as shared_buffers require a full restart (pg_reload_conf() does not apply them), and a restart disconnects all clients. Strategies for near-zero downtime:
(1) Primary + standby failover:
(a) Use streaming replication (built in since PostgreSQL 9.0).
(b) Apply the new setting on the standby and restart it; replication catches up.
(c) Promote the standby (pg_ctl promote, or SELECT pg_promote(); on PG12+) and repoint the application via DNS, a virtual IP, or the connection string.
(d) Reattach the old primary as a standby (pg_rewind or a fresh pg_basebackup); optionally fail back later.
Downtime: the failover window, typically a few seconds of connection routing.
(2) Connection pooler (PgBouncer):
(a) Put PgBouncer between the app and PostgreSQL; the app connects to PgBouncer (port 6432), not PostgreSQL directly.
(b) Drain: on the PgBouncer admin console, issue PAUSE; new queries queue while in-flight transactions complete.
(c) Restart PostgreSQL.
(d) Issue RESUME; queued clients proceed.
Downtime: none for the app if the restart is quick; clients see added latency, not connection loss. Note a PgBouncer database entry points at a single host; routing across multiple backends needs HAProxy, DNS, or separate entries.
(3) Rolling restart with app coordination:
(a) Announce a maintenance window.
(b) Ensure the app has automatic connection-retry logic.
(c) Restart PostgreSQL; clients briefly get "connection refused".
(d) The app retries and reconnects.
Downtime: milliseconds to seconds of retry latency.
Setup sketch for (2): apt-get install pgbouncer; then in /etc/pgbouncer/pgbouncer.ini add a [databases] entry such as mydb = host=127.0.0.1 port=5432.
Recommended: PgBouncer with a brief PAUSE/RESUME around the restart is the production standard. Zero downtime requires the failover or pooling infrastructure to be in place beforehand.
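The app-side retry logic from strategy (3) can be sketched as follows. The connect() callable here is a stand-in for a real driver call (e.g., psycopg2.connect); the simulated fake_connect and its failure count are invented for the example.

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.01):
    """Retry a refused connection with exponential backoff, so a brief
    PostgreSQL restart window is survived transparently."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise  # restart took longer than our retry budget
            time.sleep(base_delay * 2 ** attempt)  # 10ms, 20ms, 40ms, ...

# Simulated restart: the first two attempts are refused, then PG is back.
state = {"calls": 0}
def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionRefusedError("server restarting")
    return "connection"

print(connect_with_retry(fake_connect))  # succeeds on the third attempt
```

In production the same pattern usually lives in the driver or pool configuration (retry/backoff options) rather than hand-rolled code.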

Follow-up: Explain PgBouncer and how it enables zero-downtime PG restarts. Pool modes and trade-offs?

shared_buffers was increased too far (40GB on a 64GB server). The OOM killer triggered and PostgreSQL crashed. Describe recovery and optimal tuning.

An OOM (Out Of Memory) kill indicates severe memory overcommitment. Causes:
(1) shared_buffers = 40GB plus work_mem allocations (e.g., 100 sessions × 2GB) easily exceeds 64GB; effective_cache_size does not allocate memory, but the real allocations do.
(2) No swap headroom; Linux overcommits memory, and when it runs out the OOM killer picks a large process, often a PostgreSQL backend or the postmaster.
Recovery:
(1) Lower shared_buffers and restart (this parameter needs a restart, not a reload): ALTER SYSTEM SET shared_buffers = '16GB'; systemctl restart postgresql. If PostgreSQL cannot start at all, edit postgresql.auto.conf or postgresql.conf directly.
(2) Verify: SHOW shared_buffers; (it reflects the new value only after the restart).
(3) Check available memory: free -h on Linux; expect comfortable headroom after startup.
Prevention:
(1) Tuning formula: shared_buffers ≈ 25% RAM, effective_cache_size ≈ 75% RAM, and size work_mem so that work_mem × expected concurrent sorts/hashes stays well inside the remainder.
(2) Monitor per-backend memory (e.g., ps aux filtered on postgres processes; note RSS double-counts shared memory across backends). Rough estimate: shared_buffers + (work_mem × concurrent operations) + overhead.
(3) Test in staging before tuning production.
(4) Lower swappiness: sysctl vm.swappiness=10 (default 60; lower means less swap use).
(5) Disable overcommit on a dedicated database server: sysctl vm.overcommit_memory=2, the setting recommended in the PostgreSQL docs, so allocations fail cleanly instead of triggering the OOM killer; additionally protect the postmaster (e.g., systemd OOMScoreAdjust).
Calculation for 64GB: shared_buffers = 16GB; OS and other processes ~5GB; remainder ~43GB for OS cache plus work_mem. 100 concurrent operations × 512MB = 50GB: too much. 100 × 128MB = 12.8GB: fine. Reduce work_mem or connections accordingly.
Recommended tuning for 64GB: shared_buffers=16GB, work_mem=128MB (256MB if concurrency is low), maintenance_work_mem=4GB, effective_cache_size=45GB. Conservative is safer; increase iteratively.
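The budget calculation above can be sketched as a small worst-case check. This is a rough estimate under the answer's assumptions (flat 5GB OS overhead, one work_mem allocation per concurrent operation); real usage varies, since one query can allocate work_mem several times.

```python
def fits_in_ram(ram_gb, shared_buffers_gb, work_mem_mb, concurrent_ops,
                os_overhead_gb=5):
    """Return (estimated_total_gb, fits) for a worst-case memory estimate:
    shared_buffers + work_mem x concurrency + OS overhead."""
    total = shared_buffers_gb + (work_mem_mb / 1024) * concurrent_ops + os_overhead_gb
    return total, total <= ram_gb

# The crashed config: 40GB buffers + 100 sessions x 2GB work_mem.
print(fits_in_ram(64, 40, 2048, 100))  # (245.0, False) -> OOM risk
# The recommended config: 16GB buffers + 100 x 128MB work_mem.
print(fits_in_ram(64, 16, 128, 100))   # (33.5, True) -> fits comfortably
```

Running the check before applying ALTER SYSTEM is a cheap way to catch the 40GB misconfiguration in this scenario before the OOM killer does.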

Follow-up: Explain Linux OOM killer. How to prevent PG from being killed?

Cache hit ratio is low (65%) despite high shared_buffers; the working set is larger than the cache. Design a strategy to improve the hit ratio without buying more RAM.

A low hit ratio (target >95% for OLTP) despite large buffers indicates:
(1) Working set > RAM (e.g., a table larger than 64GB that queries scan fully).
(2) Random access patterns with no locality, so caching is ineffective.
(3) Bad query plans: sequential scans where an index would do.
Diagnose:
(1) Overall hit ratio (note the cast; integer division would silently truncate to 0): SELECT sum(heap_blks_hit)::float / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) FROM pg_statio_user_tables;
(2) Worst tables: SELECT schemaname, relname, heap_blks_hit, heap_blks_read, round(100.0 * heap_blks_hit / (heap_blks_hit + heap_blks_read), 2) AS hit_ratio FROM pg_statio_user_tables WHERE heap_blks_read > 0 ORDER BY hit_ratio ASC LIMIT 10; (the column in this view is relname, not tablename).
(3) Expensive queries: SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10; (requires the pg_stat_statements extension; the column is mean_time before PG13). Slow queries are likely scanning full tables.
Fix:
(1) Add indexes: CREATE INDEX idx_orders_customer_id ON orders(customer_id); verify with EXPLAIN SELECT * FROM orders WHERE customer_id = 123, which should show an Index Scan, not a Seq Scan.
(2) Rewrite queries to use a covering index (INCLUDE columns) or a materialized view.
(3) Partitioning: PARTITION BY RANGE (created_at), so queries on recent data touch only the hot partitions and the effective working set shrinks.
(4) Denormalization: precompute aggregations into a separate table to avoid large scans.
(5) Caching layer (Redis): serve frequently accessed data outside PostgreSQL and reduce DB load.
Monitoring:
(1) Trend the hit ratio over time from pg_stat_database (blks_hit vs. blks_read).
(2) Alert if it drops below 95% (possible cache eviction or workload change).
(3) Re-profile quarterly.
Alternative: if the working set is simply much larger than RAM and a low hit ratio is unavoidable, focus on: (1) query optimization (scan fewer rows); (2) SSD/NVMe storage (fast random I/O makes the hit ratio matter less); (3) sharding to shrink the per-node working set.
Recommendation: for most OLTP workloads, a 95%+ hit ratio is achievable with proper indexing and query optimization. If stuck near 70% despite tuning, more RAM or an architectural change is likely needed.
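The hit-ratio arithmetic from the diagnosis step, with the integer-division pitfall handled explicitly, looks like this. The counter values are invented sample numbers, not real statistics.

```python
def hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from shared_buffers,
    mirroring blks_hit / (blks_hit + blks_read) from pg_stat_database."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None  # avoid division by zero

# 65% hit ratio as in the scenario: 65 hits per 100 block requests.
print(round(hit_ratio(650_000, 350_000), 2))    # 0.65 -> below target
print(round(hit_ratio(9_900_000, 100_000), 2))  # 0.99 -> healthy OLTP
```

The same guard matters in SQL: the counters are bigints, so without a ::float cast (and nullif for freshly reset stats) the ratio truncates to 0 or 1.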

Follow-up: Explain hit ratio vs. throughput trade-off. Why not maximize hit ratio at all costs?
