You need to store 1 billion records with complex joins (users + orders + payments + reviews, typical 5-table JOIN) AND low P99 read latency (<=10ms). The workload is read-heavy (95% reads, 5% writes). Write throughput: 10K writes/sec. Recommend a database and justify against alternatives.
1B records + complex joins + sub-10ms reads + high throughput is a strong signal for **specialized analytics or in-memory database**, not traditional OLTP. Analysis: (1) **PostgreSQL (relational)**: can handle 1B records with proper indexing (B-tree on join keys). Complex JOINs are native. But P99 latency at scale: 1B records, 5-table JOIN → ~50ms even with indexes (CPU + I/O). Sharding helps (100 shards × 10M records), but cross-shard JOINs are slow. **Not viable for the <=10ms SLA**. (2) **MongoDB (NoSQL)**: denormalize data into documents. A single document nests users + orders + payments + reviews. Read is a single lookup (~1ms). But writes become complex (when an order status changes, every denormalized copy must be updated). At 10K writes/sec, many updates require multi-document transactions, and MongoDB transactions across shards carry overhead. Partial match: viable if you denormalize aggressively. (3) **CockroachDB (NewSQL)**: ACID + distributed SQL. Handles 1B records + JOINs well. Latency: ~5-15ms per query (depends on shard topology). If data is co-located (same shard), <5ms. If cross-shard, 10-50ms. **Viable but not guaranteed sub-10ms**. (4) **Snowflake/BigQuery (analytics)**: great for complex analytics, but write latency is high (micro-batched ingestion, minutes until visibility) and read latency is ~100-500ms (not sub-10ms). Eliminated. (5) **In-memory database (Redis, Memcached)**: sub-1ms reads, but 1B records × avg 1KB = 1TB of memory (~$50K/month on cloud). Complex JOINs require application logic, not SQL. (6) **Specialized: DuckDB or ClickHouse with a caching tier**: DuckDB is an in-process OLAP engine; its columnar format handles JOINs efficiently, and <5ms for simple queries on 1B records is achievable when the working set is local. Cache hot data in Redis. Best approach: **DuckDB (primary) + PostgreSQL read replica for fallback**. DuckDB: run on each node, 10-50GB per node, handle local JOINs in <5ms. PostgreSQL: eventually consistent replica, used if a DuckDB node goes down.
Cost: 100 nodes × DuckDB = $100K hardware, or use managed ClickHouse cloud (~$50K/month for 1B records). Alternative: **CockroachDB with geo-partitioning**. Partition data by region and co-locate join keys in the same shard; this can hold <10ms as long as the working set fits in node memory.
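The sizing figures above (1TB total, 10-50GB per node) follow from the stated assumptions. A quick sketch of the arithmetic, using the text's assumed ~1KB average record and 100 nodes (both inputs are assumptions to replace with measurements):

```python
# Back-of-envelope sizing for the options above. Inputs are the
# assumptions stated in the text (1B records, ~1KB average record,
# 100 nodes); swap in your own measurements.

RECORDS = 1_000_000_000
AVG_RECORD_BYTES = 1024       # ~1KB average record, as assumed above
NODES = 100

total_bytes = RECORDS * AVG_RECORD_BYTES
total_tb = total_bytes / 1e12            # decimal TB, as in the text
records_per_node = RECORDS // NODES
gb_per_node = total_bytes / NODES / 1e9

print(f"total data:       {total_tb:.2f} TB")
print(f"records per node: {records_per_node:,}")
print(f"data per node:    {gb_per_node:.1f} GB")
```

This is why the hybrid works: ~10GB per node fits comfortably in memory, which is what keeps local DuckDB JOINs under 5ms.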
Follow-up: DuckDB is in-process (embedded). If you run it on 100 nodes, how do you handle cross-node JOINs and failover?
Your social network has 500M users, each user can follow other users (many-to-many relationship). Your queries: "show me followers of user X" and "show me users that X follows." Traditional SQL: SELECT * FROM follows WHERE user_id = X OR follower_id = X. Estimate query latency on PostgreSQL vs graph database (Neo4j). Make a recommendation.
Social graph is a classic graph problem. 500M users × avg 500 followers each = 250B edges. PostgreSQL query: SELECT * FROM follows WHERE user_id = X. With a B-tree index on user_id, lookup is O(log 250B) + range scan = ~10ms to fetch 500 results. Latency: acceptable. But deeper queries (e.g., "mutual followers of X and Y") require a JOIN: SELECT f1.follower_id FROM follows f1 JOIN follows f2 ON f1.follower_id = f2.follower_id WHERE f1.user_id = X AND f2.user_id = Y. This scans both indexes and joins in memory = ~100-500ms. Deeper queries still (e.g., "shortest path from X to Y") are extremely slow in SQL (recursive CTE, potentially scanning large parts of the graph). Neo4j (graph DB): stores edges as first-class entities. Query: MATCH (follower:User)-[:FOLLOWS]->(user:User) WHERE user.id = X RETURN follower. Lookup is O(1) graph traversal (adjacency-list access), returns 500 results in ~1ms. Mutual followers: MATCH (u1:User)<-[:FOLLOWS]-(common:User)-[:FOLLOWS]->(u2:User) WHERE u1.id = X AND u2.id = Y RETURN common. Graph traversal is native, ~5ms. Shortest path: MATCH path = shortestPath((u1:User)-[:FOLLOWS*]-(u2:User)) WHERE u1.id = X AND u2.id = Y. Neo4j optimizes this with bidirectional search, ~10-100ms (depends on path length). Recommendation: (1) for simple queries (followers of user X): **PostgreSQL** is sufficient, ~10ms latency with an index on user_id. (2) for graph traversal queries: **Neo4j**. ~1ms for simple lookups, 5-100ms for complex paths. Scale: Neo4j can handle 500M nodes with sharding, but overhead is high (cross-shard traversal = network hops). Cost comparison: PostgreSQL ~$5K/month (managed Postgres with support) + application complexity (graph algorithms implemented in code). Neo4j ~$50K+/month (enterprise license) but native graph queries. Hybrid approach: **PostgreSQL + caching layer**. Cache the top 1K most-followed users (celebrities) and their followers in Redis (1K users × 500K followers ≈ 500M entries; at ~1KB each with payloads, ~500GB). Simple queries hit the cache (~1ms), complex queries hit PostgreSQL (~100ms). Cost: ~$10K/month (Postgres + Redis + cache management).
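The structural reason the graph store wins can be shown in a few lines: if followers are kept as adjacency sets, "followers of X" is a dictionary lookup and "mutual followers" is a set intersection, with no join scan at all. A minimal sketch with toy data (real systems keep these adjacency lists sharded and on disk):

```python
# Adjacency-set model of the follow graph. followers[u] is the set of
# users who follow u; follows[u] is the set of users u follows.

from collections import defaultdict

follows = defaultdict(set)    # follower_id -> followed user_ids
followers = defaultdict(set)  # user_id -> follower_ids

def follow(follower, followed):
    follows[follower].add(followed)
    followers[followed].add(follower)

# Tiny graph: users 1..4.
for a, b in [(1, 2), (3, 2), (4, 2), (1, 3), (4, 3)]:
    follow(a, b)

def followers_of(user):        # "show me followers of user X": O(1) lookup
    return followers[user]

def mutual_followers(x, y):    # intersection, no index join needed
    return followers[x] & followers[y]

print(followers_of(2))         # {1, 3, 4}
print(mutual_followers(2, 3))  # {1, 4}
```

This is essentially what Neo4j's adjacency storage gives you natively, versus PostgreSQL reconstructing the same sets from B-tree range scans on every query.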
Follow-up: Neo4j doesn't support complex JOINs between graph and relational tables (e.g., "followers AND their purchase history"). How do you query across both?
A time-series analytics platform receives 1M events/second from IoT sensors. Events: {sensor_id, timestamp, temperature, humidity, pressure}. Workload: real-time aggregations (avg temp by sensor per minute), historical queries (temp trends over weeks). Which database is best suited?
Time-series workload is distinct from OLTP/OLAP. 1M events/sec, append-only pattern, aggregation queries. Analysis: (1) **PostgreSQL**: writing 1M rows/sec would require ~1000 partitions (each handling 1K writes/sec). Per write: INSERT → WAL write + index update = ~1-5ms. P99 write latency = 50ms+. Aggregation: SELECT sensor_id, AVG(temperature) FROM events WHERE timestamp > now() - interval '1 minute' GROUP BY sensor_id. Scanning the last minute's ~60M rows takes ~1-5 seconds. Not suitable. (2) **InfluxDB (time-series DB)**: designed for time series. Write 1M events/sec via batching (1000 events/batch = 1K batches/sec). Each batch written in ~1ms. Write latency: P99 = 10ms. Aggregation: SELECT MEAN(temperature) FROM events WHERE time > now() - 1m GROUP BY sensor_id. The query engine uses columnar storage and downsampling, returning in ~50-100ms. Suitable. (3) **TimescaleDB (PostgreSQL extension)**: PostgreSQL + time-series optimizations. Hypertables auto-partition by time into chunks; 1M writes/sec is handled by parallel chunk inserts across a multi-node cluster. P99 write latency = 5-10ms. Aggregation: ~100-200ms (better than plain PostgreSQL but slower than InfluxDB). (4) **Cassandra (NoSQL)**: handles high write throughput (1M/sec across a cluster). Queries: SELECT AVG(temperature) FROM events WHERE sensor_id = 123 AND timestamp > ... — CQL aggregates only within a single partition scan, so cross-sensor rollups require client-side aggregation or Spark (slow). Not ideal. (5) **Amazon Timestream**: managed time-series DB. Auto-scales to 1M events/sec. Aggregations: ~1-5 seconds (serverless, variable latency). Recommendation: **InfluxDB for strict real-time (<100ms aggregations)**. 1M events/sec, P99 write latency <10ms, aggregation <100ms. Cost: ~$50K/month self-hosted or $10-20K/month on cloud. Alternative: **TimescaleDB** if you want the PostgreSQL ecosystem (ACID, complex queries). Latency: ~5-10ms writes, 100-200ms aggregations. Cost: ~$5K/month (Postgres with the TimescaleDB extension).
Trade-off: InfluxDB is faster but less flexible. TimescaleDB is flexible but slower.
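The "avg temp by sensor per minute" rollup that these engines compute natively is a tumbling-window aggregation: bucket each event by (sensor_id, epoch-minute) and average. A minimal sketch with the event shape from the question (timestamps here are plain epoch seconds for brevity):

```python
# One-minute tumbling-window average per sensor, the query the
# time-series engines above execute over columnar storage.

from collections import defaultdict

def minute_rollup(events):
    sums = defaultdict(lambda: [0.0, 0])  # (sensor_id, minute) -> [sum, count]
    for e in events:
        bucket = (e["sensor_id"], e["timestamp"] // 60)  # epoch-minute bucket
        acc = sums[bucket]
        acc[0] += e["temperature"]
        acc[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

events = [
    {"sensor_id": 1, "timestamp": 0,  "temperature": 20.0},
    {"sensor_id": 1, "timestamp": 30, "temperature": 22.0},
    {"sensor_id": 1, "timestamp": 90, "temperature": 25.0},
]
print(minute_rollup(events))  # {(1, 0): 21.0, (1, 1): 25.0}
```

The point of a time-series database is that it maintains these buckets incrementally at ingest (downsampling / continuous aggregates), so the query never rescans 60M raw rows.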
Follow-up: You need to correlate temperature spikes with sensor failures (JOIN events table with sensor_status table). InfluxDB doesn't support JOINs. How do you handle this?
Your e-commerce platform stores product catalog (100K products, frequently changing attributes like price, stock). Workload: read-heavy (1M reads/sec), writes are frequent (10K updates/sec on stock/price). Strong consistency required (users must see current price before paying). Which database handles this best?
Product catalog is OLTP-heavy with consistency requirements. 100K products, 1M reads/sec, 10K writes/sec. Consistency is critical (stale price = lost revenue + trust). Analysis: (1) **PostgreSQL (relational)**: strong consistency (ACID), handles 100K products easily. Index on product_id, read latency ~1-5ms. Write latency for a stock update: UPDATE products SET stock = stock - 1 WHERE product_id = 123 → row-level lock + update → ~5-10ms. A well-provisioned primary can sustain 10K simple single-row writes/sec, but hot rows (popular products) cause lock contention, and P99 can climb to ~100ms under bursts (queue buildup). Partial success. (2) **MongoDB (NoSQL)**: document per product. Read: db.products.findOne({_id: 123}) → ~1-3ms. Write: db.products.updateOne({_id: 123}, {$set: {stock: X}}) → ~5-10ms with w:"majority". Consistency: weaker than PostgreSQL (eventual consistency by default; w:"majority" restores durability but adds latency, pushing write P99 to 50-100ms). Scalable via sharding, but consistency guarantees remain weaker. (3) **DynamoDB (NoSQL serverless)**: auto-scales reads/writes. Strongly consistent reads cost double the read capacity of eventually consistent ones; at 1M strongly consistent reads/sec the bill lands around $500K/month (!) — too expensive. Eventually consistent reads are cheaper but violate the consistency requirement. Eliminated. (4) **CockroachDB (NewSQL)**: ACID + distributed. Strong consistency (serializable isolation). 100K products → distribute across 10 regions. Read: local read if the product is in the same region = ~1-3ms; cross-region, 10-50ms. Write: UPDATE with SERIALIZABLE isolation = consensus across a majority quorum = 10-30ms latency. At 10K writes/sec, scales to 100 nodes × 100 writes/sec/node. P99 write latency = 50-100ms (quorum overhead). Suitable. (5) **Redis (in-memory)**: product cache + DB fallback. Redis: 1M reads/sec across a cluster, P99 <1ms. Writes: INCR or SET → <1ms, and 10K writes/sec is well within a single instance's capacity. The problem is durability and consistency: Redis is a cache, not a system of record. Not suitable as primary.
Recommendation: **PostgreSQL with read replicas**. The PostgreSQL primary handles 10K writes/sec (sharded by product_id if hot-row contention demands it). Read replicas (20 nodes) handle 1M reads/sec (50K reads/sec per node). Consistency: general reads hit replicas (replication lag ~100ms), but the stock/price check before purchase hits the primary (strong consistency). Cost: ~$20K/month (hardware + managed service). Alternative: **CockroachDB** if you want true multi-region strong consistency. Latency trade-off: 50-100ms writes (vs ~10ms on PostgreSQL), but guaranteed strong consistency everywhere. Cost: ~$50K/month.
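The strong-consistency write path on the primary comes down to one guarded, atomic UPDATE: decrement stock only if enough units remain, so two concurrent buyers can't both succeed on the last unit. A sketch using sqlite3 as a stand-in for PostgreSQL (the SQL pattern is the same; table and IDs are illustrative):

```python
# Guarded atomic decrement: the WHERE clause makes check-and-decrement
# a single statement, so stock can never go negative.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, stock INTEGER)")
db.execute("INSERT INTO products VALUES (123, 1)")   # one unit left

def buy(product_id, qty=1):
    cur = db.execute(
        "UPDATE products SET stock = stock - ? "
        "WHERE product_id = ? AND stock >= ?",
        (qty, product_id, qty),
    )
    db.commit()
    return cur.rowcount == 1   # 0 rows updated -> insufficient stock

print(buy(123))  # True: last unit sold
print(buy(123))  # False: guarded UPDATE refuses to oversell
```

Routing this one statement to the primary is what preserves correctness even while the 1M reads/sec flow through slightly stale replicas.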
Follow-up: A user checks stock (reads replica, sees 100 units). Then user buys item. Stock check hits primary (sees 50 units). Purchase proceeds but inventory goes negative. How do you prevent?
You're building a multi-tenant SaaS platform. 10K customers, each with their own data (users, projects, tasks). Strong isolation required (customer A can't see customer B's data). Low-cost hosting. Which database architecture minimizes infrastructure while ensuring isolation?
Multi-tenant SaaS has two extreme architectures and two middle grounds: (1) **Database-per-tenant**: each customer has its own database. 10K DBs. Isolation: perfect (separate databases). Scaling: each DB can be tuned independently. Cost: 10K × ($100/month managed DB) = $1M/month. Not viable. (2) **Shared database with row-level security**: 1 database, shared tables across all customers. Row-level security (RLS) adds a tenant_id filter to every query. Isolation: enforced by the DB. Scaling: a single database can handle 10K tenants. Cost: 1 DB × $500/month = $500/month. Viable. (3) **Schema-per-tenant**: each tenant has its own schema in a shared PostgreSQL instance. CREATE SCHEMA tenant_123; CREATE TABLE tenant_123.users (...). Isolation: schema-level (access controlled by DB roles). Scaling: works for 100-1000 tenants per instance, then split across multiple instances. Cost: lower than database-per-tenant, higher than shared DB. ~10 instances × $500/month = $5K/month. (4) **Application-level isolation**: 1 database, all tables shared. Application code adds a WHERE tenant_id = X filter. Isolation: not enforced by the DB (risk: a developer forgets the filter → data leak). Cost: lowest. Viable only if QA is strong. Recommendation: **(2) shared database + row-level security**. Implementation: PostgreSQL with RLS policies. Policy: CREATE POLICY tenant_policy ON users AS RESTRICTIVE TO authenticated USING (tenant_id = current_tenant_id()). Every query on users is automatically filtered by tenant_id; developers don't need to remember filters. Cost: ~$500/month for a managed DB supporting 10K tenants. Isolation: DB-enforced. Trade-off: RLS adds ~5-10% query latency (the policy predicate is evaluated per row), but that's acceptable. If stronger isolation is required: **(3) schema-per-tenant** across multiple instances (10K tenants → 100 schemas × 100 instances). Cost: $5K/month. Isolation: schema-level. Scaling: split to new instances as needed.
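What the RLS policy buys you is that every read of the shared table is implicitly ANDed with the caller's tenant filter, so tenant A can never see tenant B's rows even through an unfiltered query. A toy illustration of that effect — sqlite3 has no RLS, so a query wrapper that always injects the filter stands in for the policy; in PostgreSQL the CREATE POLICY shown above does this inside the engine:

```python
# Emulating the effect of a RESTRICTIVE RLS policy: the tenant_id
# predicate is always ANDed onto every query against the shared table.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, tenant_id INTEGER, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?, ?)",
               [(1, 100, "alice"), (2, 100, "bob"), (3, 200, "carol")])

def scoped_query(tenant_id, where="1=1"):
    # The caller's WHERE clause runs, but only within its tenant's rows.
    sql = f"SELECT name FROM users WHERE tenant_id = ? AND ({where})"
    return [r[0] for r in db.execute(sql, (tenant_id,))]

print(scoped_query(100))   # ['alice', 'bob'] -- tenant 200's rows invisible
print(scoped_query(200))   # ['carol']
```

The difference from option (4) is exactly where this wrapper lives: in the application, a forgotten call leaks data; as a DB policy, there is no code path that bypasses it.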
Follow-up: One customer (tenant 123) runs aggregation query that scans entire table (SELECT COUNT(*) FROM users). RLS filter is applied, but query scans all rows, then filters. Latency: 30 seconds. How do you optimize?
You have an analytics use case: 500GB raw data (historical events), 1M new events/day added. Queries: complex aggregations across 100M rows, typically run at night (batch). Response time: <5 minutes acceptable. Which database minimizes costs while meeting SLA?
Analytics workload: large data volume, complex aggregations, batch queries, cost-sensitive. 500GB + append-only. Solutions: (1) **PostgreSQL**: can handle 500GB with tuning. Aggregation query on 100M rows: GROUP BY + aggregate functions, index on the group column. Query time: 10-60 seconds (depends on aggregation complexity). Scale up or partition to improve. Cost: managed service ~$5K/month (500GB data). Viable if the workload is simple aggregations. (2) **Snowflake**: cloud analytics optimized. 500GB in columnar storage ≈ 50GB compressed. Query on 100M rows: SELECT customer_id, COUNT(*), SUM(amount) FROM events GROUP BY customer_id. Query time: 1-5 seconds (columnar + parallel execution). Cost: $500/month compute + $100/month storage = $600/month. Viable and cost-effective. (3) **BigQuery**: similar to Snowflake. 500GB ≈ $10/month storage (active storage, ~$0.02/GB). Queries: $6.25 per TB scanned. An aggregation that scans the full 500GB costs $3.13 per query. 1 query/night × 30 days ≈ $94/month. Total ~$100/month. Most cost-effective. (4) **Data warehouse appliance (Teradata)**: specialized hardware, $50K+ setup cost. Overkill for this scale; eliminated. (5) **DuckDB (local)**: in-process, no managed-service cost. Sync the 500GB dataset locally and run aggregations in-process. Query time: <10 seconds (no network latency). Cost: free software, but compute is needed — 1 m5.2xlarge instance for 8 hours/day ≈ $500/month. (6) **Parquet files on S3 + Athena**: store the 500GB as Parquet (compressed to ~50GB). Queries run on Athena (serverless). Query cost: $5 per TB scanned; conservatively $2.50 per query (less if Parquet column pruning reads only ~50GB). 1 query/night = $75/month worst case. Cost: ~$50/month storage + $75/month queries = $125/month. Recommendation: **BigQuery** for simplicity and cost. Cost: ~$100/month. Query time: well under the <5 minute SLA. Alternative: **Athena + S3** for lower cost if you are already on AWS. Cost: ~$125/month. Query time: 1-5 minutes. If compute cost is the concern: **DuckDB on a spot instance**. Cost: ~$200/month (spot instance).
Query time with DuckDB: <10 seconds. Trade-off: DuckDB requires DevOps (instance management); BigQuery is serverless (simpler).
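The scan-cost comparison above reduces to simple arithmetic. A sketch using the per-TB prices assumed in the text (BigQuery ~$6.25/TB scanned, Athena ~$5/TB — both change over time, so recheck current pricing):

```python
# Monthly scan cost for one nightly full-table aggregation, using the
# per-TB prices assumed in the text. Storage costs are excluded here.

DATA_TB = 0.5              # 500GB of raw data
QUERIES_PER_MONTH = 30     # one nightly batch query

bigquery_scan = DATA_TB * 6.25 * QUERIES_PER_MONTH
athena_scan = DATA_TB * 5.00 * QUERIES_PER_MONTH

print(f"BigQuery scans: ${bigquery_scan:.2f}/month")  # $93.75
print(f"Athena scans:   ${athena_scan:.2f}/month")    # $75.00
```

This also shows why partitioning or clustering matters so much on pay-per-scan engines: halving the bytes scanned halves the bill directly.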
Follow-up: Your aggregation query runs nightly but takes 8 minutes (> 5 min SLA). BigQuery scans 500GB. Can you optimize without upgrading?
A financial trading system needs to store order books (bid/ask price levels, continuously updated). 100K trading pairs, each with 50 price levels (25 bids + 25 asks). Updates: average 500 updates/sec per trading pair (50M updates/sec total). Latency SLA: orders matched in <100ms. Which database can handle this?
Order book is extreme write throughput + very low latency + a specific query pattern. 100K pairs × 50 levels = 5M rows. 50M updates/sec (!). Latency: <100ms match time. Solutions: (1) **PostgreSQL**: 50M updates/sec would require ~50K shards (each handling 1K updates/sec); at that scale, coordination overhead kills performance. Update latency: 100-500ms+. Not viable. (2) **MongoDB**: similar sharding limitation. Latency: 50-200ms per update. Partial: only if matched orders are rare (most updates are fast inserts). (3) **Redis (in-memory)**: all data in memory. ZADD to a sorted set (price levels), ZRANGE to query levels. Redis handles ~1M commands/sec per instance; a 100-node Redis cluster reaches ~100M commands/sec, covering 50M updates/sec. Latency: <10ms per update. Suitable. But the data is volatile: the order book is lost on crash. Solution: Redis + AOF persistence or Redis + replication. Cost: $100K+/month for a 100-node Redis cluster. (4) **Specialized: SingleStore (formerly MemSQL) or VoltDB**: in-memory databases optimized for OLTP. VoltDB handles ~1M transactions/sec per node; 100 nodes = 100M transactions/sec, so 50M updates/sec is achievable. Latency: <50ms per order match. Cost: $200K+/month license + infrastructure. (5) **DuckDB**: embedded in-process OLAP engine, not designed for high-rate point updates. Not viable. (6) **Hybrid: Redis (hot data) + PostgreSQL (backup)**: the order book (active price levels) lives in Redis; matched orders are persisted to PostgreSQL asynchronously. Read: Redis <10ms. Write: Redis immediately, async persist to PostgreSQL. Cost: $10K/month (Redis) + $5K/month (PostgreSQL). Latency: <100ms for order matching (Redis), eventual persistence (PostgreSQL). Recommendation: **Redis cluster with replication** for the <100ms latency requirement. If persistence is critical: **Redis + Redis Sentinel for failover**. Cost: ~$15K/month. Latency: <10ms order match. For extreme scale (100M updates/sec): **VoltDB** or **SingleStore**. Cost: $200K+/month. Latency: <50ms.
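The data structure behind the Redis sorted-set model is just a price-to-quantity map per side, with matching possible whenever the best bid meets or crosses the best ask. A minimal sketch of that shape (real matching engines add per-level order queues, sequence numbers, and persistence):

```python
# Minimal order-book shape: one price->quantity map per side. A trade
# can match when best bid >= best ask (the book is "crossed").

class OrderBook:
    def __init__(self):
        self.bids = {}   # price -> total resting quantity
        self.asks = {}

    def add(self, side, price, qty):
        book = self.bids if side == "bid" else self.asks
        book[price] = book.get(price, 0) + qty

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def crossed(self):
        b, a = self.best_bid(), self.best_ask()
        return b is not None and a is not None and b >= a

book = OrderBook()
book.add("bid", 100.0, 5)
book.add("ask", 101.0, 5)
print(book.crossed())        # False: spread is open
book.add("bid", 101.0, 2)
print(book.crossed())        # True: 101.0 bid meets 101.0 ask
```

In Redis this maps to one sorted set per side per pair, with the price as the score — which is why each update is a single O(log n) ZADD and best-of-book is an O(1) range read.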
Follow-up: Redis cluster has node failure. 10% of order book data is unavailable for 30 seconds (failover time). Matched orders are lost. How do you prevent data loss?
You're designing a document management system (like Confluence): millions of documents, each with edit history (versions). Queries: "fetch doc version X" (frequent), "list all versions of doc Y" (occasional), "full-text search across doc history" (rare but critical). Which database minimizes complexity and cost?
Document management with versioning and full-text search requires: (1) fast point reads (fetch version X), (2) list queries (versions of a doc), (3) full-text indexing. Analysis: (1) **MongoDB**: store documents as { doc_id, version, content, created_at, edited_by, full_text }. Point read: db.documents.findOne({doc_id, version}) → ~1-3ms. List versions: db.documents.find({doc_id}) → ~10-50ms (depending on the number of versions). Full-text search: text index on the content field, query → ~100-500ms (scans all versions). Scale: sharding on doc_id. Cost: moderate (~$10K/month for 1M docs with history). (2) **PostgreSQL**: denormalized table (doc_id, version, content, created_at, full_text_vector). Point read: SELECT * WHERE doc_id = X AND version = Y → index lookup, ~1-3ms. List versions: SELECT * WHERE doc_id = X ORDER BY version DESC → range scan, ~10-50ms. Full-text search: SELECT * WHERE full_text_vector @@ to_tsquery(query) → uses a GIN index, ~50-200ms. Scale: sharding on doc_id. Cost: lower than MongoDB (~$5K/month). (3) **Elasticsearch**: denormalized documents with a nested versions array. Point read: GET /docs/_doc/doc_123, extracting version X (script or client-side) → ~5-10ms (Lucene index overhead). List versions: nested aggregation → ~50-100ms. Full-text search: query DSL → ~50-100ms (native full-text, better than SQL). Scale: sharding via the Elasticsearch cluster (routing on doc_id). Cost: ~$15K/month (self-hosted or cloud). (4) **Versioned blob storage (S3 + DynamoDB metadata)**: store the document blob in S3 (version = object version). Metadata in DynamoDB: { doc_id, version, s3_key, created_at }. Point read: fetch from DynamoDB (~5-10ms), then S3 GET (~50-100ms total). Full-text search: requires a separate index (e.g., an Elasticsearch ingestion pipeline). Cost: very cheap (~$2K/month S3 + DynamoDB), but complex (indexing pipeline). Recommendation: **PostgreSQL with full-text search** for simplicity. Fast point/list queries, built-in full-text with a GIN index.
Single database, no external dependencies. Cost: ~$5K/month. Alternative: **Elasticsearch** if full-text performance is critical (interactive search). Cost: ~$15K/month but better search UX. Hybrid: **PostgreSQL primary + Elasticsearch index** via Logstash. Writes update PostgreSQL, async sync to Elasticsearch. Cost: ~$8K/month, best of both worlds.
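The schema behind the PostgreSQL option is worth making concrete: a composite key (doc_id, version) makes "fetch version X" a single index lookup and "list all versions" a range scan on the same index. A sketch using sqlite3 as a stand-in (in PostgreSQL you would add a tsvector column with a GIN index for the full-text part, which this toy omits):

```python
# Versioned-document table keyed by (doc_id, version): point reads and
# version listings both ride the primary-key index.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE documents (
    doc_id INTEGER, version INTEGER, content TEXT,
    PRIMARY KEY (doc_id, version))""")
db.executemany("INSERT INTO documents VALUES (?, ?, ?)", [
    (1, 1, "first draft"),
    (1, 2, "second draft with roadmap"),
    (2, 1, "unrelated page"),
])

def fetch(doc_id, version):   # "fetch doc version X": index point read
    row = db.execute(
        "SELECT content FROM documents WHERE doc_id = ? AND version = ?",
        (doc_id, version)).fetchone()
    return row[0] if row else None

def versions(doc_id):         # "list all versions": range scan on the PK
    return [v for (v,) in db.execute(
        "SELECT version FROM documents WHERE doc_id = ? ORDER BY version DESC",
        (doc_id,))]

print(fetch(1, 2))    # 'second draft with roadmap'
print(versions(1))    # [2, 1]
```

Because both frequent query shapes hit the same composite index, this layout answers them in a millisecond or two — relevant to the follow-up about whether version switching even needs a cache.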
Follow-up: A user frequently switches between document versions (clicking "view version 5" then "view version 12"). Should you cache versions, or is the index lookup fast enough?