PostgreSQL JSONB vs MongoDB: Which for flexible schema (product metadata, user attributes)? Query performance, scalability, transactions, cost trade-offs.
JSONB (PostgreSQL): Flexible schema stored as decomposed binary JSON inside an ordinary relational table. Supports GIN indexing, rich query operators, and full ACID transactions. Setup: CREATE TABLE products (id SERIAL PRIMARY KEY, metadata JSONB); CREATE INDEX idx_metadata_gin ON products USING GIN (metadata). Query: SELECT * FROM products WHERE metadata->>'category' = 'electronics' AND (metadata->>'price')::NUMERIC > 100. Pros: ACID transactions, strong consistency, fast GIN-indexed containment queries, one system for relational and document data. Cons: Not fully schemaless (the table and its non-JSONB columns still need migrations); no built-in horizontal sharding; large documents incur TOAST overhead.

MongoDB: Document-oriented, schemaless by default. Query: db.products.find({ "metadata.category": "electronics", "metadata.price": { $gt: 100 } }). Pros: Very flexible documents, built-in horizontal scaling (sharding), high write throughput. Cons: Multi-document ACID transactions exist (since 4.0) but add latency and are best kept short; reads from secondaries can be stale (reads from the primary are strongly consistent); a second system to operate alongside any relational database.

Comparison: Use JSONB if: (1) Structured data with optional fields (products, user profiles). (2) ACID required across documents and relational rows (payments, orders). (3) Single node or HA replicas, where consistency matters more than scale-out. (4) Cost: managed PostgreSQL is typically cheaper than a comparable MongoDB Atlas deployment. Use MongoDB if: (1) Truly unstructured data (events, logs). (2) Horizontal scale-out beyond what a single PostgreSQL node handles comfortably. (3) Write-heavy workloads where stale secondary reads are acceptable. (4) Developer agility outweighs cross-collection consistency. Hybrid: Use JSONB for core business data (orders, payments); MongoDB for flexible logging and analytics.

Recommendation: JSONB for typical SaaS workloads (consistency + cost); MongoDB for large data platforms (scale + flexibility).
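The JSONB side of the comparison can be sketched end-to-end; table, column, and index names here are illustrative:

```sql
-- Illustrative schema: a products table with a flexible JSONB column.
CREATE TABLE products (
    id       SERIAL PRIMARY KEY,
    name     TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}'
);

-- GIN index with the default jsonb_ops operator class:
-- supports @>, ?, ?|, ?& on the whole document.
CREATE INDEX idx_products_metadata_gin ON products USING GIN (metadata);

-- Containment query that can use the GIN index:
SELECT id, name
FROM products
WHERE metadata @> '{"category": "electronics"}';

-- Scalar extraction with ->> returns TEXT, so numeric filters need a cast;
-- this predicate is NOT served by the whole-document GIN index.
SELECT id, name
FROM products
WHERE metadata->>'category' = 'electronics'
  AND (metadata->>'price')::NUMERIC > 100;
```

Because this is ordinary PostgreSQL, both queries run inside normal ACID transactions alongside relational columns.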
Follow-up: Explain JSONB vs. JSON data types. When use text vs binary?
JSONB query slow: SELECT ... WHERE metadata->'tags' @> '["urgent"]' on 100M rows. pg_stat_user_indexes shows idx_scan=0 (index never used). Why isn't the index used? How to make it fast?
JSONB queries bypass an index whenever the queried expression doesn't match the index definition. A GIN index on metadata (whole document, default jsonb_ops) serves top-level operators like metadata @> '{...}', but metadata->'tags' is a different expression, so WHERE metadata->'tags' @> '["urgent"]' cannot use it. Check: EXPLAIN SELECT * FROM products WHERE metadata->'tags' @> '["urgent"]'. If it shows a Seq Scan despite the index, likely causes and fixes:

(1) The index is on metadata, not on metadata->'tags'. Fix: CREATE INDEX idx_tags_gin ON products USING GIN ((metadata->'tags')) (expression index on the extracted path), or rewrite the query as top-level containment, WHERE metadata @> '{"tags": ["urgent"]}', which the existing index can serve. (2) Cost model: the planner overestimates random I/O. On SSDs, ALTER SYSTEM SET random_page_cost = 1.1 makes index scans look cheaper. (3) Selectivity: if @> matches a large fraction of the 100M rows (e.g. 50% of the table), a seq scan genuinely is faster and the planner is right to choose it. (4) jsonb_path_ops for faster containment: CREATE INDEX idx_tags_path ON products USING GIN (metadata jsonb_path_ops). Smaller and faster for @>, but it does not support the key-existence operators ?, ?|, ?&.

Further options: (1) Index other hot paths the same way: CREATE INDEX idx_metadata_category ON products USING GIN ((metadata->'category')). (2) Maintain a plain TEXT[] tags column (generated columns cannot call set-returning functions such as jsonb_array_elements_text, so populate it via trigger or application code) and index it: CREATE INDEX idx_tags_array ON products USING GIN (tags). Array GIN lookups are typically faster than JSONB ones. (3) Denormalize: extract tags to a separate table: CREATE TABLE product_tags (product_id INT, tag TEXT); CREATE INDEX idx_tag ON product_tags (tag). The query becomes a JOIN.

Tuning: EXPLAIN (ANALYZE) SELECT ... should show an Index Scan or Bitmap Index Scan on the new index. If it still shows Seq Scan, refresh planner statistics: ANALYZE products.
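The indexing fixes above can be sketched as follows; index names are illustrative:

```sql
-- Option 1: expression index on the extracted path; serves
--   WHERE metadata->'tags' @> '["urgent"]'
CREATE INDEX idx_products_tags_gin
    ON products USING GIN ((metadata->'tags'));

-- Option 2: keep the whole-document index and rewrite the query as
-- top-level containment, which the default jsonb_ops index can serve:
SELECT * FROM products WHERE metadata @> '{"tags": ["urgent"]}';

-- Option 3: jsonb_path_ops variant: smaller and faster for @>,
-- but no support for the key-existence operators (?, ?|, ?&).
CREATE INDEX idx_products_metadata_path
    ON products USING GIN (metadata jsonb_path_ops);

-- Always verify which plan the query actually gets:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE metadata @> '{"tags": ["urgent"]}';
```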
Follow-up: Explain jsonb_path_ops vs. standard GIN. When is each better?
JSONB data model: Storing user profiles with nested arrays (preferences: [item1, item2]). Schema: normalized (separate tables) vs. denormalized (JSONB). Query patterns and trade-offs.
Schema design choice: normalize vs. denormalize.

Normalized (3NF): CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT); CREATE TABLE user_preferences (user_id INT REFERENCES users(id), preference TEXT). Query: SELECT u.*, p.preference FROM users u LEFT JOIN user_preferences p ON u.id = p.user_id WHERE u.id = 123. Pros: clean, referential integrity, cheap row-level updates, easy to add/remove individual preferences. Cons: JOIN overhead, NULL handling for users with no preferences.

Denormalized (JSONB): CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT, preferences JSONB). Query: SELECT * FROM users WHERE preferences @> '["item1"]'. Pros: single-row fetch (no JOIN), fast point reads. Cons: no referential integrity, every change rewrites the whole JSONB value, preference data shared across users gets duplicated.

Recommendation by query pattern: (1) Few preferences per user, rarely changed: JSONB (denormalized). Typical: SELECT id, name, (preferences->>'email_notifications')::BOOLEAN AS notify FROM users WHERE id = 123. (2) Many preferences, frequent changes, shared across users: normalized. Typical: SELECT u.name, p.preference FROM users u JOIN user_preferences p ON u.id = p.user_id WHERE u.id = 123. (3) Hybrid: store immutable preferences in JSONB (onboarding choices), mutable ones in a separate table (settings). Mixed: CREATE TABLE users (id SERIAL PRIMARY KEY, static_prefs JSONB); CREATE TABLE user_settings (user_id INT, setting TEXT, value TEXT). The hybrid avoids rewriting the JSONB value on every settings change.

Performance tuning: (1) For a frequently read scalar key, prefer a B-tree expression index: CREATE INDEX idx_email_notif ON users ((preferences->'notifications'->>'email')). (2) For containment filters across many keys, use a GIN index: CREATE INDEX idx_prefs_multi ON users USING GIN (preferences jsonb_path_ops). (3) Monitor: EXPLAIN (ANALYZE) SELECT ... WHERE preferences @> ... should use an index, not a seq scan.

Verdict: start denormalized (JSONB) for an MVP; normalize the hot paths at scale when query patterns or update rates warrant it.
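The hybrid layout described above might look like this; table and column names are illustrative:

```sql
-- Rarely-changing, per-user attributes stay in JSONB on the user row.
CREATE TABLE users (
    id           SERIAL PRIMARY KEY,
    name         TEXT NOT NULL,
    static_prefs JSONB NOT NULL DEFAULT '{}'
);

-- Frequently-updated settings live in a narrow relational table, so an
-- update touches one small row instead of rewriting a JSONB value.
CREATE TABLE user_settings (
    user_id INT  NOT NULL REFERENCES users(id),
    setting TEXT NOT NULL,
    value   TEXT NOT NULL,
    PRIMARY KEY (user_id, setting)
);

-- Point lookup combining both sides:
SELECT u.name,
       u.static_prefs->>'onboarding_variant' AS variant,
       s.value AS theme
FROM users u
LEFT JOIN user_settings s
       ON s.user_id = u.id AND s.setting = 'theme'
WHERE u.id = 123;
```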
Follow-up: Explain pros/cons of storing arrays in JSONB vs. separate table. When should you normalize?
JSONB contains nested objects (events with metadata). Query: find events where metadata.source='api' AND metadata.severity > 8. Filtering on multiple nested keys is slow. Design a fast query strategy.
Nested JSONB queries are slow without matching indexes. Note the extraction operator matters: metadata->>'source' returns TEXT, while (metadata->'source')::TEXT keeps the JSON quotes (it yields '"api"', which never equals 'api'). Baseline query: SELECT * FROM events WHERE metadata->>'source' = 'api' AND (metadata->>'severity')::INT > 8. Without indexes, this is a seq scan.

Optimize: (1) B-tree expression indexes on the scalar paths (GIN serves containment/existence, not scalar comparisons): CREATE INDEX idx_source ON events ((metadata->>'source')); CREATE INDEX idx_severity ON events (((metadata->>'severity')::INT)). The planner can combine them with a BitmapAnd. (2) One GIN index for containment across all paths: CREATE INDEX idx_metadata_path ON events USING GIN (metadata jsonb_path_ops); it serves metadata @> '{"source": "api"}' but not range predicates. (3) Range queries (severity > 8) need B-tree: ALTER TABLE events ADD COLUMN severity_num INT GENERATED ALWAYS AS ((metadata->>'severity')::INT) STORED; CREATE INDEX idx_severity_btree ON events (severity_num). (4) Partial index for a dominant filter: CREATE INDEX idx_api_events ON events USING GIN (metadata jsonb_path_ops) WHERE metadata->>'source' = 'api'.

Query forms: (a) SELECT * FROM events WHERE metadata @> '{"source":"api"}' AND severity_num > 8 combines the GIN containment index with the B-tree. (b) SELECT * FROM events WHERE metadata->>'source' = 'api' AND (metadata->>'severity')::INT > 8 uses the expression indexes if present.

Tuning: (1) EXPLAIN should show Bitmap Index Scan or Index Scan on the new indexes; if it still shows Seq Scan, update statistics: ANALYZE events. (2) For ad-hoc exploration of key/value pairs, expand them with a lateral call: SELECT e.id, kv.key, kv.value FROM events e CROSS JOIN LATERAL jsonb_each_text(e.metadata) AS kv(key, value) WHERE kv.key = 'source' AND kv.value = 'api' (slow; use only for debugging).

Recommendation: extract frequently queried fields into generated columns with B-tree indexes; keep JSONB for sparse/variable fields.
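The extract-to-column strategy can be sketched as follows (generated columns require PostgreSQL 12+; index names are illustrative):

```sql
-- Promote the hot numeric path to a typed generated column (PG 12+).
ALTER TABLE events
    ADD COLUMN severity_num INT
        GENERATED ALWAYS AS ((metadata->>'severity')::INT) STORED;

-- B-tree indexes: an expression index for equality on source,
-- a plain index for the range predicate on severity_num.
CREATE INDEX idx_events_source   ON events ((metadata->>'source'));
CREATE INDEX idx_events_severity ON events (severity_num);

-- The filter now combines both indexes (typically via a BitmapAnd):
SELECT *
FROM events
WHERE metadata->>'source' = 'api'
  AND severity_num > 8;

-- Alternative without schema change: a partial GIN index for the
-- dominant source filter; severity is then rechecked per row.
CREATE INDEX idx_events_api
    ON events USING GIN (metadata jsonb_path_ops)
    WHERE metadata->>'source' = 'api';
```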
Follow-up: Explain generated columns (STORED vs. VIRTUAL). When use each with JSONB?
JSONB bulk update: UPDATE users SET preferences = preferences || '{"theme":"dark"}'::JSONB WHERE id IN (...). 1M rows slow. Optimize for bulk JSONB updates.
Bulk JSONB updates hold locks and generate heavy I/O: every updated row is rewritten (JSONB is stored as a whole value), plus WAL and index maintenance. The query UPDATE users SET preferences = preferences || '{"theme":"dark"}'::JSONB WHERE id IN (SELECT id FROM users WHERE preferences->>'language' = 'en' LIMIT 1000000) touches 1M rows in a single transaction: long lock hold, a huge WAL burst, and table bloat.

Optimization: (1) Batch into chunks of roughly 10k-100k rows by id range, committing between batches; PostgreSQL 11+ allows COMMIT inside a DO block or procedure, otherwise loop in the application. Shorter transactions mean shorter lock holds and let autovacuum keep up. (2) Parallel workers don't help: PostgreSQL parallelizes only the read side of a plan, never the row updates themselves. (3) Rebuild-and-swap for very large rewrites: (a) CREATE TABLE users_new AS SELECT id, name, preferences || '{"theme":"dark"}'::JSONB AS preferences FROM users; (b) recreate indexes, constraints, and grants on users_new, then swap: ALTER TABLE users RENAME TO users_old; ALTER TABLE users_new RENAME TO users; (c) DROP TABLE users_old. The swap needs a brief exclusive lock, and writes that land during the copy need a freeze or catch-up step. (4) Partial update of a single key: UPDATE users SET preferences = jsonb_set(preferences, '{theme}', '"dark"') WHERE ... (the row is still rewritten, but the merged value is built server-side). (5) Denormalize a frequently updated field to a real column: ALTER TABLE users ADD COLUMN theme TEXT; UPDATE users SET theme = 'dark' WHERE ...; updating a small typed column is cheaper than rewriting JSONB. To sync back periodically: UPDATE users SET preferences = jsonb_set(preferences, '{theme}', to_jsonb(theme)) (note to_jsonb(theme), not theme::JSONB, which fails because plain text like 'dark' is not valid JSON).

Monitoring: (1) Lock contention: SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event_type = 'Lock'. (2) I/O: iostat -x 1 5; high await indicates an I/O bottleneck, in which case throttle the batches or run them off-peak.

Recommendation: batch bulk updates in 10k-100k-row chunks with commits between them; denormalize frequently updated fields into plain columns.
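The chunked-update pattern can be sketched as a server-side loop. This assumes PostgreSQL 11+ (which allows COMMIT inside a DO block run outside an explicit transaction) and a dense integer id; the batch size is illustrative:

```sql
-- Update rows in 10k-id batches, committing after each batch so locks
-- are held briefly and WAL/vacuum pressure stays bounded.
DO $$
DECLARE
    batch_size CONSTANT INT := 10000;
    max_id     INT;
    lo         INT := 0;
BEGIN
    SELECT max(id) INTO max_id FROM users;
    WHILE lo <= max_id LOOP
        UPDATE users
        SET preferences = preferences || '{"theme": "dark"}'::JSONB
        WHERE id > lo AND id <= lo + batch_size
          AND preferences->>'language' = 'en';
        COMMIT;  -- release locks between batches (PG 11+)
        lo := lo + batch_size;
    END LOOP;
END $$;
```

An application-side loop issuing the same id-range UPDATEs works identically on older versions, and makes it easier to throttle or resume.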
Follow-up: Explain jsonb_set and jsonb_insert. When use each for partial updates?