System Design Interview Questions

Back-of-Envelope Estimation


Your product team wants to launch a real-time analytics dashboard serving 100M DAU. You need to estimate monthly storage for raw events and bandwidth costs. Assume: 50% of DAU active daily, 8 hours/day, 10 events/user/hour, each event 500B. What's your estimate breakdown and which costs concern you most?

Monthly storage: 100M × 50% = 50M active users; 50M × 8h × 10 events/h = 4B events/day × 500B = 2TB/day ≈ 60TB/month raw. With 7:1 compression (typical for time-series JSON), store ~8.5TB/month in the lake; a year of retention accumulates ~720TB raw (~100TB compressed). Bandwidth: 4B events spread over the 8-hour active window ≈ 139K events/sec × 500B ≈ 69.5 MB/sec ingestion. Egress (dashboard queries + APIs): assume a 10% read ratio ≈ 7 MB/sec ≈ 6TB/month. Monthly cost at end of year one: S3/blob ~$17K if stored raw (720TB × $0.023/GB), ~$2.4K compressed; transfer out ~$720 (6TB × $0.12/GB); compute for compression and indexing likely a few tens of $K and the real driver. At these volumes storage accumulation, not egress, is the concern—tiered retention (hot 7d, warm 90d, cold archive) keeps the bill flat, and CDN caching covers dashboard fan-out.
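A quick sanity check of the arithmetic, using only the rates assumed in the prompt (50% of 100M DAU, 8 h/day, 10 events/user/hour, 500B/event, 7:1 compression):

```python
# Back-of-envelope: analytics event storage and ingestion bandwidth.
active_users = 100e6 * 0.5                      # 50M active
events_per_day = active_users * 8 * 10          # 4B events/day
raw_bytes_day = events_per_day * 500            # 2 TB/day
raw_tb_month = raw_bytes_day * 30 / 1e12        # ~60 TB/month raw
compressed_tb_month = raw_tb_month / 7          # ~8.6 TB/month stored
ingest_mb_s = raw_bytes_day / (8 * 3600) / 1e6  # events land in the 8h window
print(round(raw_tb_month), round(compressed_tb_month, 1), round(ingest_mb_s, 1))
```

Note the units: 2 TB/day of raw events is four orders of magnitude below a petabyte-scale figure, which is why the storage bill stays in thousands, not millions, of dollars.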

Follow-up: If you implement 95% cache hit rate on your dashboard tier, how does that reshape your egress costs? What sampling strategy for the remaining 5% misses?

A payment processor needs to estimate infrastructure for a Black Friday spike: 10x normal 500K TPS baseline. Each transaction: 2KB request, 1KB response, requires 3 DB round-trips. Infrastructure runs on $0.10/compute-hour and storage grows 20% during sale week. Current monthly bill: $2.8M. What's your spike cost delta and how do you justify it to CFO?

Normal: 500K TPS × 3 round-trips × 3KB avg = 4.5GB/sec DB load. Storage baseline: ~100TB (estimated from the $2.8M/30d bill structure). Spike 10x: 5M TPS × 3 × 3KB = 45GB/sec DB load. Compute scales roughly linearly with TPS: if baseline = 1,000 instances, the spike needs ~10,000; provision 12,000 (20% headroom) = 11,000 extra instances. Cost over the 72h Fri–Sun window: 11,000 × 72h × $0.10 ≈ $79K. Storage adds 20TB for the sale weeks—negligible on object storage (<$1K), perhaps $5K on fast block tiers. Total delta with multi-region failover and DDoS capacity: ~$100–150K. Pitch: "Customer lifetime value spikes 4x during Black Friday. A ~$150K investment protects $8–12M of incremental revenue. Break-even in hours." Include: reserved-capacity pre-buy (30% of the spike at ~40% discount), multi-region failover cost, and DDoS mitigation as explicit line items.
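The spike delta can be checked in a few lines, assuming linear compute scaling plus a 20% headroom factor (the 1,000-instance baseline is the illustrative assumption from the answer, not a given):

```python
# Black Friday spike delta under linear scaling + 20% headroom.
baseline_tps, spike_factor = 500_000, 10
db_gb_s_spike = baseline_tps * spike_factor * 3 * 3_000 / 1e9  # 3 trips x ~3KB
baseline_instances = 1_000                          # assumed fleet size
extra = round(baseline_instances * (spike_factor * 1.2 - 1))   # 11,000 extra
compute_delta = extra * 72 * 0.10                   # 72h at $0.10/instance-hour
print(db_gb_s_spike, extra, round(compute_delta))
```

The sensitivity to the headroom factor is the point of the follow-up: over-provisioning 20% vs 50% moves the delta by tens of thousands of dollars, which is why reserved pre-buys matter.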

Follow-up: How do you model the 30% reserved discount? What if your surge prediction is off by 40%?

A video platform needs to estimate CDN costs for 50M MAU watching 1 hour/day average, 4 Mbps stream bitrate, 30% of users on mobile (lower bitrate). Current CDN: Cloudflare. You have regional data centers but limited peering. Estimate monthly CDN egress cost and identify optimization levers.

Watch time: assume 30% of MAU are active on a given day → 15M viewers × 1h = 15M hours/day = 450M hours/month. Convert bitrate to bytes: 4 Mbps ≈ 1.8 GB/h; mobile at 2 Mbps ≈ 0.9 GB/h. Mobile (30%): 450M × 0.3 × 0.9 GB ≈ 121.5PB/month. Desktop (70%): 450M × 0.7 × 1.8 GB ≈ 567PB/month. Total: ~688PB egress/month. List rates ($0.12/GB) would imply an absurd $80M+/month; at this volume you negotiate, and ~$0.0075/GB committed puts the bill at ~$5.2M/month. Optimization levers: (1) codec upgrade (VP9/AV1) + ABR ladder tuning = 25–30% bitrate reduction, saves ~$1.3–1.5M—note that protocol tweaks like HTTP/3 or text compression barely move video bytes, the codec is the lever; (2) regional origin caches + direct peering in Asia/EU cut paid transit ~40%, saves up to ~$2M; (3) per-title encoding trims another 10–15%. The savings overlap, so realistic total potential is ~50% ≈ $2.5M/month. Present as: regional-cache capex $1.2M one-time vs ~$2.5M/month savings = payback inside the first month.
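The bitrate-to-bytes conversion is where these estimates usually go wrong, so here is the chain spelled out (the 30% daily-active ratio and ~$0.0075/GB negotiated rate are the assumptions stated above):

```python
# CDN egress estimate: Mbps -> GB/hour, then hours -> PB -> dollars.
dau = 50e6 * 0.30                             # 15M viewers on a given day
hours_month = dau * 1 * 30                    # 450M watch-hours/month
gb_per_hour = lambda mbps: mbps * 3600 / 8 / 1000   # 4 Mbps -> 1.8 GB/h
mobile_pb = hours_month * 0.3 * gb_per_hour(2) / 1e6
desktop_pb = hours_month * 0.7 * gb_per_hour(4) / 1e6
total_pb = mobile_pb + desktop_pb
cost_m = total_pb * 1e6 * 0.0075 / 1e6        # assumed negotiated $/GB
print(round(mobile_pb, 1), round(desktop_pb, 1), round(total_pb, 1), round(cost_m, 2))
```

Divide-by-8 (bits to bytes) and the hours-per-month factor are the two steps to double-check in an interview.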

Follow-up: How do you measure if regional caching actually saves you 40%, and what's your deployment risk if it underperforms?

A SaaS platform (250K customers, $50/month ARPU) experiences 20% YoY growth. Infrastructure cost today: $8M/year. CFO wants cost-per-customer to drop 15% while handling growth. Current infrastructure: 500 bare-metal servers at ~$1.3k/month each (≈$8M/year). Estimate required infrastructure spend post-growth, then propose optimization strategy.

Current: 250K customers × $50/month = $12.5M MRR ≈ $150M ARR; infrastructure at $8M/year is ~5% of revenue. Cost-per-customer today: $8M ÷ 250K = $32/customer/year. Target: $32 × 0.85 = $27.20/customer. With 20% growth: 250K × 1.2 = 300K customers, target spend = 300K × $27.20 = $8.16M. Naive linear growth would cost $8M × 1.2 = $9.6M (100 extra servers ≈ $130K/month). Gap to close: $1.44M. Strategy: (1) migrate to cloud with reserved instances/savings plans—a 40% unit-cost reduction is optimistic but plausible with aggressive right-sizing, ~$3.2M at full run-rate; (2) containerize and autoscale to bin-pack the growth onto existing headroom, avoiding most of the 100-server build-out (~$1.6M); (3) query/schema optimization to cut DB-tier resources ~20% (~$160K). These savings overlap, but even partial success lands total spend near $6M for 300K customers ≈ $20/customer, well under the $27.20 target.
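The target math is mechanical and worth having cold; a minimal sketch of the unit-economics chain (server cost per the corrected ~$1.3k/month figure):

```python
# Cost-per-customer target: today's unit cost, the 15% target, and the gap.
customers, growth = 250_000, 1.2
infra = 8_000_000                       # $/year today
cpc_now = infra / customers             # $32/customer/year
target_cpc = cpc_now * 0.85             # $27.20
target_spend = customers * growth * target_cpc
naive_spend = infra * growth            # linear scaling
gap = naive_spend - target_spend        # what optimization must close
print(cpc_now, round(target_cpc, 2), round(target_spend), round(gap))
```

Framing the ask as "close a $1.44M gap" rather than "cut 15%" makes each optimization lever directly comparable to the goal.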

Follow-up: Your cloud migration takes 6 months but growth is now. How do you bridge the gap without blowing budget?

A real-time messaging platform (Telegram-scale) needs to estimate infrastructure for 500M MAU. Assume: 30M concurrent peak users, 50 messages/user/day average, 100B per message, 99.99% availability SLA. Estimate: database throughput (ops/sec), storage 1-year retention, and replica strategy.

Daily messages: 500M × 50 = 25B messages/day ≈ 1B/hour average, ~2.5B/hour peak (2.4x). Peak write throughput: 2.5B ÷ 3600 ≈ 700K messages/sec—call it ~1M ops/sec with metadata and index writes (reads are mostly cache hits). Storage: 25B messages/day × 100B = 2.5TB/day × 365 ≈ 0.9PB/year uncompressed, realistically ~300–400TB with compression + dedup. Replica strategy for 99.99% (52.6 min downtime/year): 3x replication minimum (1 primary + 2 standbys); if the primary fails, promote a standby in <30 sec. Geo-distribution: US (primary + replica), EU (replica), APAC (replica). Database: a huge Cassandra/DynamoDB fleet works but is costly; sharded PostgreSQL is viable at this write rate—e.g. 9 shards × 3 replicas = 27 nodes gives 9 primaries × ~100K writes/sec ≈ 900K ops/sec, roughly at peak, so plan ~27 shards (81 nodes) for 3x headroom. Infrastructure on the order of $5M/year.
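The two numbers that anchor everything else, peak write rate and yearly raw storage, follow from the prompt's figures plus one assumed peak-to-average ratio (~2.4x):

```python
# Messaging scale: peak writes/sec and raw retention, 25B msgs/day at ~100 B.
msgs_day = 500e6 * 50                        # 25B messages/day
peak_writes_s = msgs_day * 2.4 / 86_400      # assumed 2.4x peak-to-average
raw_pb_year = msgs_day * 100 * 365 / 1e15    # ~0.9 PB/year uncompressed
print(round(peak_writes_s / 1e3), round(raw_pb_year, 2))
```

Getting the storage exponent right matters here: 25B × 100B is 2.5e12 bytes/day, i.e. terabytes, so a year lands under one petabyte, not hundreds.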

Follow-up: How do you handle a 2-hour regional outage in EU? Walk through failover, data consistency, and time-to-repair.

A fintech platform processes 100K payment transactions/hour with 2-second response SLA. Each transaction: 10KB request, validation calls 4 microservices (avg 50ms each), writes to 3 databases. Estimate backend instance count, network bandwidth, and failure scenarios (assume 99.95% uptime target per service).

TPS: 100K/hour ≈ 28 TPS steady, ~100 TPS peak (3.5x burst factor). Latency budget within the 2-sec SLA: request (100ms) + 4 validations (200ms) + DB writes (300ms) + network/overhead (300ms) = 900ms; headroom 1,100ms. Concurrency (Little's law: arrival rate × time in system): 28 × 0.9s ≈ 25 in-flight requests steady, ~90 at peak—one API instance covers that by throughput, so run 4 (1 active + 3 for redundancy and zero-downtime deploys). Validation microservices: each transaction calls each service once, so 100 req/sec per service at peak; at 50 req/sec per instance that's 2 per service, 3 with an N+1 spare = 12 total. Database: 3 DBs at ~100 writes/sec each against 500 req/sec capacity = 1 primary + 1 replica per DB = 6 instances. Network: 28 TPS × 10KB = 280KB/sec ≈ 2.24 Mbps steady, ~8 Mbps peak. Failure resilience: at 99.95% per service, probability that at least one of the 4 validators is down at any moment = 1 − 0.9995⁴ ≈ 0.2%. Circuit breakers + retries against a standby push effective availability to ~99.98%. Cost: 4 + 12 + 6 = 22 instances × $50/month ≈ $1.1K/month.
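The sizing step most often botched here is Little's law, which gives in-flight requests, not thousands of connections. A minimal check, with the 50 req/sec-per-instance capacity as the stated assumption:

```python
# Little's law: concurrency = arrival rate x time-in-system.
peak_tps, service_time_s = 100, 0.9
concurrent = peak_tps * service_time_s        # ~90 in-flight requests at peak
per_service_rps = peak_tps                    # each txn calls each validator once
instances_per_service = -(-per_service_rps // 50) + 1  # ceil division + N+1 spare
print(concurrent, instances_per_service)
```

At 90 concurrent requests, the instance counts are driven by redundancy policy, not throughput—a useful thing to say out loud in the interview.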

Follow-up: One validation service (fraud check) now takes 200ms instead of 50ms. What's the new latency tail and do you auto-scale?

An ad tech platform must estimate query latency for real-time ad bidding. 100K ad requests/sec, 50ms budget for winner selection (bid database lookup + targeting rules engine). Assume: uniform random distribution, 99th percentile latency target 40ms. Estimate required index strategy, caching layer, and query execution plan.

50ms budget: network (10ms) + DB lookup (20ms) + rules evaluation (15ms) + reply (5ms)—the lookup is the critical path. B-tree index on (ad_id, user_segment) over N = 10M active ads: O(log N) ≈ 24 comparisons, microseconds of CPU, dominated by one ~4ms SSD random read ≈ 4ms per uncached lookup. Cache layer (Redis): 99.5% hit rate on the hot 1M-ad working set, ~1ms per hit. The 0.5% misses (500/sec at 100K req/sec) are trivial load for the DB, but they own the tail: the slowest 1% of requests is half misses, so the lookup step's p99 sits on the miss path. Fixes: (1) Bloom filter pre-filter (~1µs) kills negative lookups, cutting DB misses ~50%; (2) L1 in-process cache on each bid server (100K hot ads, ~10MB) lifts the effective hit rate to ~99.9%; (3) parallel rule evaluation (SIMD or GPU-accelerated targeting) drops rules from 15ms to ~5ms. Result: p99 ≈ 10ms network + 4ms lookup + 5ms rules + 5ms reply = 24ms—26ms of headroom against the 50ms budget and comfortably under the 40ms p99 target.
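The tail-latency reasoning, why a 0.5% miss rate dominates p99, can be made explicit. The hit/miss latencies here are the illustrative figures assumed above, not measurements:

```python
# Which path does p99 ride? If misses exceed 1% - (1 - p99) of traffic... no:
# p99 covers the slowest 1%; any miss fraction >= (1 - 0.99)/2 = 0.005 means
# misses fill at least half of that slowest 1%, so p99 lands on the miss path.
hit_rate = 0.995
hit_ms, miss_ms = 1.0, 4.0                  # assumed Redis hit vs SSD miss
p99_lookup = miss_ms if (1 - hit_rate) >= 0.005 else hit_ms
p99_total = 10 + p99_lookup + 5 + 5         # network + lookup + rules + reply
print(p99_lookup, p99_total)
```

The general rule: a cache hit rate only protects a percentile if the miss fraction is smaller than that percentile's tail—99.5% hits protect p99 only marginally, but 99.9% does.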

Follow-up: Your Bloom filter has 0.1% false positive rate. How many phantom DB queries per second, and does that break your SLA?

You're designing storage for a machine learning platform: 500K experiments/year, each generates 100GB intermediate data (checkpoints, logs) and 10GB of final model artifacts. Estimate: monthly storage bill, retention policy trade-offs, and cost optimization strategy. Assume S3 ($0.023/GB) with intelligent-tiering.

Daily experiments: 500K ÷ 365 ≈ 1,370/day. Accrual: intermediates 1,370 × 100GB = 137TB/day ≈ 4.1PB/month; final artifacts 1,370 × 10GB = 13.7TB/day ≈ 410TB/month; total ≈ 4.5PB/month (~150TB/day). With no deletion, S3 standard holdings reach ~54PB by end of year one ≈ $1.25M/month at $0.023/GB—and keep climbing. Retention trade-offs: (1) keep intermediates hot for 7 days only (most re-runs happen within a week): steady state 7 × 137TB ≈ 1PB ≈ $22K/month; transition to Glacier ($0.004/GB, hours-long retrieval) for 90 more days: 90 × 137TB ≈ 12.3PB ≈ $49K/month; then delete. (2) Keep final artifacts 2 years: 730 × 13.7TB ≈ 10PB; intelligent-tiering moves cold objects to an archive-instant tier (~$0.004/GB) ≈ $40K/month. Total optimized: ~$110K/month, flat, versus an unbounded $1M+/month. The win comes from capping retention, not from the per-GB rate.
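Steady-state holdings under the tiered policy reduce to retention-days × daily-accrual per tier. A sketch, treating the aged-artifact rate as ~$0.004/GB (a simplification of intelligent-tiering's archive-instant pricing):

```python
# Steady-state storage under tiered retention (GB-month pricing).
runs_day = 500_000 / 365                    # ~1,370 experiments/day
hot_pb = runs_day * 100 * 7 / 1e6           # intermediates, 7 days hot
glacier_pb = runs_day * 100 * 90 / 1e6      # then 90 days in Glacier
artifacts_pb = runs_day * 10 * 730 / 1e6    # finals kept 2 years
monthly = (hot_pb * 0.023 + (glacier_pb + artifacts_pb) * 0.004) * 1e6
print(round(hot_pb, 2), round(glacier_pb, 1), round(artifacts_pb, 1), round(monthly))
```

Because each tier's holdings are capped by its retention window, the bill stops growing once every window fills—the key contrast with the no-deletion baseline.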

Follow-up: A researcher retrieves a 2-month-old intermediate dataset (Glacier cold, $200 retrieval fee + 12-hour wait). How do you prevent this UX pain at scale?
