System Design Interview Questions

Horizontal vs Vertical Scaling Tradeoffs


Your database (PostgreSQL) runs on a 192-core, 3TB-RAM instance, the largest memory-optimized instance your cloud offers. CPU is at 90%, memory at 45%, disk I/O at 65%. Query latency is 200ms (p99). Your director says "just buy a bigger instance"; there is no bigger instance to buy. What now?

The Hard Limit: You've hit the hardware scaling ceiling; no larger instance exists to buy. More importantly, 90% CPU on a single instance usually means the workload itself is poorly structured, not under-provisioned.

Real Solution: Horizontal Sharding: (1) Profile queries: if 80% of CPU comes from a handful of expensive queries (full table scans, missing indexes), optimize first: add composite indexes, rewrite queries to use partition pruning. (2) Shard by tenant_id or user_id: split tables across 4-8 PostgreSQL instances (each 96 cores instead of 192). Each instance handles ~25% of the data, so each shard runs lightly loaded (~25% CPU, ~11% memory) and query latency drops. (3) Implement a query routing layer: the application sends each query to the correct shard via a hash of the shard key. Use Citus (a PostgreSQL extension) for automatic sharding if the schema is complex. (4) Add read replicas: replicate each shard to 2 read replicas, routing reads to the replicas and writes to the primary.
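Step (3)'s routing layer can be sketched in a few lines. The shard count and DSN names below are hypothetical; the one real constraint is that the hash must be stable across processes:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical: matches the 4-8 instance range above

def shard_for(tenant_id: str) -> int:
    """Stable shard assignment. Python's built-in hash() is randomized
    per process, so routing must use a real hash function."""
    digest = hashlib.sha1(tenant_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def dsn_for(tenant_id: str) -> str:
    # Hypothetical per-shard connection strings.
    return f"postgresql://db-shard-{shard_for(tenant_id)}.internal:5432/app"

# Same tenant, same shard, every time:
assert shard_for("tenant-42") == shard_for("tenant-42")
```

Modulo hashing keeps the router stateless, but resharding moves most keys; that is why consistent hashing or a shard lookup table is common once the shard count is expected to change.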

Why Not Vertical? Vertical scaling hits hard limits (physics, vendor catalogs) and diminishing returns: doubling cores doesn't double throughput when the bottleneck is lock contention or single-threaded work, so even a 1,000-core instance can sit at 90% CPU. Horizontal scaling adds capacity linearly: lightly loaded shards can simply be multiplied as load grows.

Result: 4x throughput. Latency drops to 50ms p99. CPU per instance falls to 22%.

Follow-up: After sharding, you notice shard #2 (containing your biggest customer's data) is at 88% CPU while other shards idle at 12%. How do you handle this hot shard?

You're designing a payment processing system. Transaction throughput: 10K tx/sec. Consistency requirement: serializable (ACID-strict). Database latency p99 must be <50ms. Should you use a single PostgreSQL with huge instance, or distributed system like CockroachDB?

Analysis: By Little's law, 10K tx/sec at 50ms p99 latency means roughly 500 transactions in flight at any moment (10,000 × 0.05 = 500). A single PostgreSQL server, even a massive one, handles ~5K-8K complex transactions/sec due to ACID overhead (lock contention, write-ahead log sync), so the 10K tx/sec target is 1.25-2x beyond a single node's ceiling.
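The capacity arithmetic can be checked with Little's law (in-flight work = arrival rate × time in system), using the scenario's numbers and an assumed ~8K tx/sec single-node ceiling:

```python
throughput = 10_000      # tx/sec, from the scenario
latency_p99 = 0.050      # seconds

# Little's law: transactions in flight = arrival rate x time in system.
in_flight = throughput * latency_p99          # 500 concurrent transactions

single_node_ceiling = 8_000                   # assumed complex-tx ceiling
overload = throughput / single_node_ceiling   # 1.25x one node's capacity

print(f"{in_flight:.0f} in-flight tx; {overload:.2f}x a single node's ceiling")
```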

Vertical Approach (Single Big Instance): Buy the largest memory-optimized instance available. Cost: $15-20K/month. Advantage: simple, no distributed-transaction complexity. Disadvantage: single point of failure, impossible to upgrade without downtime, and since a single node tops out around 8K tx/sec, the 10K tx/sec target already exceeds it (no runway).

Horizontal Approach (CockroachDB/Spanner model): Deploy 3 CockroachDB nodes on 96-core instances. Each node handles ~4K tx/sec independently. Distributed consensus (Raft) coordinates serializable transactions across nodes. Cost: 3 × $4K/month = $12K/month. Advantage: horizontal scaling (add nodes linearly), built-in failover, geo-distribution possible. Disadvantage: cross-node transactions have network latency (5-10ms added), slightly more operational complexity.

Hybrid Decision: If current volume is still below the single-node ceiling, start vertical (one big instance plus read replicas for reads) and buy yourself months of runway. Before hitting ~8K tx/sec, migrate to horizontal sharding (shard by account_id, route transactions to the correct shard). This delays the operational complexity and distributed-systems headaches until they're actually needed.

The Math: A single instance can handle ~8K tx/sec. A distributed system can handle 10K+ but adds ~10ms of latency per cross-node transaction (20% of your 50ms p99 budget). For payment systems, prefer vertical until you absolutely must go horizontal.

Follow-up: Your biggest customer (30% of transaction volume) needs global transaction processing (transfer money from Brazil account to Japan account atomically). How do you handle cross-shard transactions without sacrificing latency?

You operate a user recommendation engine. It processes 100M feature vectors (embedding dimension 768) for personalization. This data doesn't fit in RAM on any single instance (768 × 100M × 4 bytes = 300GB). Inference latency must be <10ms. Should you scale vertically (larger instance) or horizontally (vector database cluster)?

The Physics Problem: Raw vectors are ~300GB, but that's the floor, not the footprint: ANN index structures (e.g., HNSW graph links), query working memory, and replicas for availability multiply it several-fold. Serving from disk doesn't work either; recall from disk costs 1-10ms per request before network latency, which blows the 10ms budget.

Vertical Fails Here: A single huge-memory instance might barely fit the data, but it leaves no headroom and is a single point of failure. CPU scaling doesn't help because the bottleneck is memory capacity and bandwidth, not compute.

Horizontal: Vector Database Cluster: Deploy Milvus or Weaviate across 4 nodes. Each node holds 25M embeddings (75GB) in RAM. Architecture: (1) Client queries for embeddings of user_id=12345. (2) Router layer (API gateway) hashes user_id to node 2. (3) Node 2 performs HNSW (Hierarchical Navigable Small World) search: finds 1000 nearest neighbors in 2ms. (4) Return results to client in total <10ms (1ms network + 2ms search + 2ms return + 5ms jitter).
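Steps (1)-(2) of that walkthrough reduce to a stateless router plus a latency budget. A sketch with hypothetical node names:

```python
import hashlib

NODES = ["vec-node-0", "vec-node-1", "vec-node-2", "vec-node-3"]  # hypothetical

def node_for(user_id: int) -> str:
    """Hash the shard key so each user's embeddings live on exactly one node."""
    h = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# The <10ms budget from the walkthrough, itemized in milliseconds:
budget_ms = {"network_in": 1, "hnsw_search": 2, "network_out": 2, "jitter": 5}
assert sum(budget_ms.values()) <= 10

print(node_for(12345), "-", sum(budget_ms.values()), "ms budget")
```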

Scale Linearly: Add more users? Add more nodes. 200M embeddings? 8 nodes. 500M? 20 nodes. Each node stays at comfortable 20% CPU (room to breathe).

Cost Comparison (illustrative pricing): Vertical: 1 × 512GB-RAM instance (e.g., r7g.16xlarge) at ~$24K/month; it fits the raw vectors but leaves little room for index overhead or growth. Horizontal: 4 × 256GB nodes at ~$48K/month total, but with 2x capacity headroom, far better availability (lose 1 node, 3 survive), and a path to 100+ nodes if needed.

Follow-up: Your recommendation cluster experiences a hotspot—one user (celebrity with 100M followers) causes all 4 nodes to query simultaneously for their embeddings. CPU spikes 95% on all nodes simultaneously. How would you solve this without adding more nodes?

You're building a social media feed. 500M monthly users, each following ~200 accounts. Computing a personalized feed on demand (pull model) requires reading posts from 200 users and ranking them: 100-500ms per request. Most users refresh their feed 10x/day, so at scale this is billions of feed builds per day. Should you compute feeds on-demand (vertical) or pre-compute them (horizontal caching)?

On-Demand Vertical Approach: Every feed request triggers the ranking service, which needs ~1000 instances to absorb peak traffic because each request spins up a "feed rank" job: fetch posts from 200 followed accounts, rank by engagement, return. Cost: an enormous compute bill ($500K-1M/month). Latency: 200-500ms per feed load. Users complain about slow feeds.

Pre-Compute Horizontal Approach: (1) Batch jobs pre-compute feeds for all 500M users at ~2AM UTC (off-peak). Each user's feed is pre-ranked and stored in Redis. Cost per user: ~5ms CPU, so 500M × 5ms = 2.5M seconds = 694 hours of compute = 694 × $0.10 = $70/day. (2) When user opens app, fetch pre-computed feed from Redis in <10ms. Feed is 10 hours stale but users don't notice. (3) In real-time, inject fresh posts (posts from users they follow in the last hour) on top of cached feed. (4) Use probabilistic push: when a celebrity posts, broadcast to top 10K followers' Redis feeds immediately. Skip long-tail followers to save compute.
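Steps (2)-(3), cached feed plus fresh-post injection, can be sketched with plain dicts standing in for Redis (keys and post ids are invented):

```python
import time

# Stand-ins for Redis: the 2AM batch output and a recent-post stream.
precomputed = {"feed:12345": ["p9", "p7", "p4", "p2"]}
recent = [("p11", time.time()), ("p10", time.time() - 7200)]  # (post_id, posted_at)

def serve_feed(user_id: str, now: float, fresh_window: float = 3600) -> list:
    """Read path: posts from the last hour go on top of the cached ranking."""
    cached = precomputed.get(f"feed:{user_id}", [])
    fresh = [p for p, ts in recent if now - ts <= fresh_window]
    seen, feed = set(), []
    for post in fresh + cached:           # fresh first, then pre-ranked tail
        if post not in seen:
            seen.add(post)
            feed.append(post)
    return feed

print(serve_feed("12345", time.time()))   # p11 is fresh; p10 (2h old) is not
```

The cached tail stays cheap to serve no matter how expensive ranking was; only the small fresh head is computed per request.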

Cost Difference: Pre-compute = $70/day ($25K/year). On-demand = $15K/day ($5.5M/year). Savings: $5.5M/year.

Trade-off: Pre-compute adds operational complexity (batch job orchestration, cache invalidation logic) but saves massive compute cost. On-demand is simpler but financially unsustainable at scale.

Follow-up: Your pre-compute batch job fails (software bug, timeout after 12 hours). 500M feeds don't update for 24 hours. Users see old feeds. How do you handle this failure gracefully?

You run a time-series monitoring system. Ingestion: 100M metrics/sec, each metric is 50 bytes. Storage: retain 90 days of data. Compression: 5:1 average. Query latency p99 must be <100ms. Should you use a single big database or sharded cluster?

Data Volume: 100M metrics/sec × 50 bytes = 5 GB/sec, which is 432 TB/day raw. At 5:1 compression that's ~86 TB/day, and 90 days of retention means ~7.8 PB on disk. That's impossible on any single machine.
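A quick script keeps the units honest (TB vs PB slips are exactly what this question probes):

```python
ingest = 100_000_000        # metrics/sec
size = 50                   # bytes per metric
compression = 5             # 5:1 average
retention_days = 90

raw_per_day = ingest * size * 86_400                              # bytes/day
raw_per_day_tb = raw_per_day / 1e12                               # 432 TB/day
retained_pb = raw_per_day * retention_days / compression / 1e15   # ~7.8 PB

print(f"{raw_per_day_tb:.0f} TB/day raw; {retained_pb:.2f} PB retained")
```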

This Mandates Horizontal Sharding: You must shard by metric_id or timestamp. (1) Shard by time: each shard holds 1 week of data, 13 shards total (90 days / 7). Each shard is ~3 PB uncompressed (~600 TB compressed). (2) Store each shard on a TSDB cluster (InfluxDB, TimescaleDB, or a Prometheus-compatible store). Scale: 4 nodes per shard × 13 shards = 52 total nodes. (3) Sharding strategy: metric ingestion streams → consistent hash → shard router → 13 TSDB clusters. Query path: a query broadcasts to all 13 shards in parallel, each shard scans its week, the results merge in memory, and the whole thing returns in <100ms.
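The query path in step (3) is a scatter-gather: fan out to every time shard, then k-way merge the sorted partial results. A toy sketch with invented shard contents:

```python
import heapq

# 13 week-shards, each returning (timestamp, label) rows already
# sorted by timestamp, as a TSDB shard would.
shards = [[(week * 7 + day, f"shard-{week}") for day in range(3)]
          for week in range(13)]

def scatter_gather(shards):
    """In production the fan-out is parallel RPCs; the merge is the same."""
    return list(heapq.merge(*shards))   # k-way merge by timestamp

rows = scatter_gather(shards)
assert rows == sorted(rows)             # globally time-ordered result
print(len(rows), "rows merged from", len(shards), "shards")
```

heapq.merge never materializes all shard results at once, which matters when each shard's partial answer is large.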

Cost Reality: 52 nodes × $8K/month (storage-heavy SSD instances) = $416K/month. No vertical option exists at all; no single machine holds ~7.8 PB of hot, queryable data.

Operations: Sharding adds complexity: multi-shard queries, rebalancing when nodes fail, coordinating compaction across shards. But it's the only viable architecture at this scale.

Follow-up: Your cardinality of metrics is extremely high (metric_id, host_id, region, environment, service_name = potentially 10B unique combinations). Time-based sharding doesn't help cardinality explosion. How would you prevent the TSDB from being crushed by cardinality?

You're designing backend infrastructure for a mobile game. 5M concurrent users, each sending position updates 10x/sec. That's 50M updates/sec. Latency requirement: player movement appears within 500ms to other nearby players. Should you use a massive central server or distributed edge servers?

Vertical (Single Central Server): A massive server handles all 50M updates/sec. Seems simple, but the network math fails: 50M updates × 100 bytes = 5GB/sec of ingress alone, and each update must fan out to every nearby player, multiplying egress 10-100x beyond that, far past the ~25GB/sec (200Gbps) the largest instances offer on paper (10-15GB/sec in practice). Plus latency: if all players connect to us-east-1, a player in Sydney sees ~300ms of network latency alone. And a single server is a single point of failure.

Horizontal (Edge Servers): (1) Deploy game server clusters in 6 regions: us-east, us-west, eu-west, ap-southeast, ap-northeast, sa-east. Each region handles ~800K concurrent users. (2) Each region gets 8 game server nodes (100K users/node). (3) Position updates sent to nearest region (~50ms latency), processed locally, then replicated to other regions asynchronously. (4) For nearby player visibility, only players within 100m distance get position updates at 10Hz. Distant players get updates at 1Hz (networking optimization).
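Step (4)'s interest management is the big bandwidth lever. A sketch with illustrative thresholds and rates:

```python
def update_hz(distance_m: float) -> int:
    """Nearby players get full-rate updates; distant ones a 1Hz trickle."""
    return 10 if distance_m <= 100 else 1

def egress_bytes_per_sec(neighbor_distances, update_size=100):
    # One player's outbound traffic: one update stream per visible neighbor.
    return sum(update_hz(d) * update_size for d in neighbor_distances)

# One player with 20 nearby neighbors and 200 distant ones:
naive = 220 * 10 * 100                                   # everyone at 10Hz
tiered = egress_bytes_per_sec([50.0] * 20 + [500.0] * 200)
print(tiered, "B/s tiered vs", naive, "B/s naive")       # 5.5x less egress
```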

Scale Math: Each node handles 100K users × 10 updates/sec = 1M updates/sec. A 96-core server provides 96 core-seconds of CPU per wall-clock second, a budget of ~96µs per update; with only a few microseconds of real work per update, that's plenty of headroom.

Result: The 500ms budget is met with room to spare: ~50ms to the nearest region, local processing, and ~50ms back covers same-region visibility, while cross-region replication adds ~100-200ms and still fits. 50M updates/sec are handled by ~48 nodes (6 regions × 8) instead of one colossus.

Follow-up: Players in region A and region B interact (player A shoots player B). Shots require <300ms round-trip latency to feel real-time. But region A→region B replication is 100ms one-way = 200ms round-trip minimum. How do you handle this latency without sacrificing consistency?

You're upgrading a monolithic API running on a single beefy server (384 cores, 1.5TB RAM, $40K/month). It's at 70% CPU. You want to scale horizontally with 10 cheaper nodes. But you realize the monolith has 50 background workers (batch jobs, message consumers) baked into the same process. How do you split this?

The Monolith Problem: That 384-core server isn't at 70% CPU because HTTP traffic is heavy; it's because background workers and batch jobs consume half the cores. If you naively spin up 10 nodes, each one is wasteful if it still runs all 50 background workers.

Decomposition Strategy: (1) Extract background workers into separate service: create a "worker pool" microservice. This runs just the batch jobs, message consumers, async tasks. Deploy 5 worker nodes (96 cores each). Cost: 5 × $4K/month = $20K/month. (2) API service now runs just HTTP handlers. Deploy 8 API nodes (96 cores each). Cost: 8 × $4K/month = $32K/month. (3) Routing: HTTP traffic → API load balancer (8 nodes). Background jobs → worker load balancer (5 nodes). Both share infrastructure but aren't resource-constrained by each other.
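The decoupling in step (1) hinges on a queue between the two services. A minimal in-process sketch (a real deployment would use a broker such as SQS or RabbitMQ):

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for the message broker
completed = []

def api_handler(request_id):
    """API node: enqueue the slow work and return immediately."""
    job_queue.put(f"send-receipt:{request_id}")
    return "202 Accepted"

def worker_loop():
    """Worker node: drains jobs at its own pace, scaled independently."""
    while (job := job_queue.get()) is not None:   # None = shutdown sentinel
        completed.append(f"done:{job}")

worker = threading.Thread(target=worker_loop)
worker.start()
status = api_handler("req-1")
job_queue.put(None)
worker.join()
print(status, completed)
```

Because the only contract is the queue, API nodes and worker nodes can be deployed, scaled, and restarted without touching each other.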

Result Comparison: Old: 1 × $40K/month. New: 13 × $4K = $52K/month (slightly more expensive initially). But: (A) API now has 8x redundancy (lose 1 node, 7 remain). (B) Can scale API and workers independently. (C) Can upgrade workers without affecting API SLA. (D) Workers can be bursty (spin up 20 workers at 2AM, kill them at 3AM for off-peak batch jobs).

Hidden Benefit: Observability improves. You now have two separate service metrics instead of one muddy 384-core metric.

Follow-up: After splitting, you realize API and workers share a connection pool to the database (max 100 connections). Now they're competing: workers consume 80 connections, API starves at 20. How do you allocate pool fairly?

Your data warehouse (Redshift) runs 1000 concurrent queries for reporting dashboards. Query time averages 30-60 seconds. Your cluster has 128 nodes (ds2.xlarge, 2TB each = 256TB total). CPU sits at 40%, memory at 30%, yet queries are slow. Your finance team asks: why not just double the cluster to 256 nodes? What's your actual answer?

The Red Herring: Low CPU/memory utilization doesn't mean capacity is the bottleneck. Queries are slow because of one of these: (1) Query plans are terrible (full table scans instead of index lookups). (2) Data distribution (skew) causes uneven load across nodes. (3) Network bandwidth between nodes is saturated. (4) Disk I/O is slow (magnetic vs SSD mismatch).

Investigation: Run query profiling. Discovery: 80% of CPU time is spent in "Scan" operations (sequential table reads). The problem is access patterns, not compute. Adding 128 more nodes solves nothing; you'd still scan everything, just on more machines.

Real Fix (Horizontal Architecture, Not Scaling): (1) Redesign the schema: move from a normalized OLTP schema to a star schema (fact/dimension tables). Queries now join small dimension tables (thousands to millions of rows) against a single large fact table instead of chaining joins across everything. (2) Add sort keys: sort large tables by date + region (the common query predicates). Queries now scan ~10% of the data instead of 100%. (3) Implement materialized views: pre-aggregate data for common dashboards. A dashboard query now reads pre-computed aggregates (~100GB) instead of the raw table (~10TB). Time drops from 60s to 2s.
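Step (3)'s materialized-view idea in miniature: aggregate once at batch time so dashboard reads never touch the raw facts (table contents invented):

```python
from collections import defaultdict

# Tiny stand-in for the fact table: (date, region, revenue).
facts = [
    ("2024-01-01", "us", 100), ("2024-01-01", "eu", 80),
    ("2024-01-02", "us", 120), ("2024-01-02", "eu", 90),
]

# "Materialized view": built once by the batch job, tiny next to the facts.
daily_revenue = defaultdict(int)
for date, region, revenue in facts:
    daily_revenue[(date, region)] += revenue

def dashboard_query(date, region):
    """Read path hits the aggregate: a lookup instead of a full fact scan."""
    return daily_revenue[(date, region)]

print(dashboard_query("2024-01-02", "us"))   # 120
```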

Cost Impact: Old: 128 nodes. New: 64 nodes (after optimization). Cost drops 50%. Query latency drops 97%. Adding hardware would have been waste.

The Pattern: If utilization is low but latency is high, the problem is architecture, not capacity. Fix architecture first, then scale horizontally as needed.

Follow-up: After optimization, dashboards run in 2s. But one dashboard (executive summary) queries 5 different fact tables and joins them. That specific query is still 45s. The dashboard owner says "it's a critical query, just make it fast." How do you handle this?
