Your e-commerce search indexes 500M products with full-text search and facets (color, brand, size, price). Search must return in <200ms (p99). Autocomplete queries return within 50ms for product names as user types. Your Elasticsearch cluster currently uses 300 nodes (8TB each = 2.4PB storage, $400K/month). Is this over-provisioned or under-provisioned?
Capacity Analysis: 500M products × 2KB average document size = 1TB raw data. Elasticsearch stores both inverted index and source documents. Index size is 5-10x original data = 5-10TB. With 3 replicas for resilience, 5TB × 4 (original + 3 replicas) = 20TB working set. Your 2.4PB cluster can hold that 120x over. That's massive overprovisioning.
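The sizing arithmetic above can be checked with a quick back-of-envelope script (the 5x expansion factor is the low end of the 5-10x range quoted above):

```python
# Back-of-envelope sizing using the assumptions stated in the analysis above.
DOCS = 500_000_000
DOC_SIZE_KB = 2

raw_tb = DOCS * DOC_SIZE_KB / 1e9      # 1e9 KB per TB
index_tb = raw_tb * 5                  # 5x index expansion (low end of 5-10x)
working_set_tb = index_tb * 4          # primary + 3 replicas
cluster_tb = 300 * 8                   # 300 nodes x 8TB each
headroom = cluster_tb / working_set_tb

print(raw_tb, working_set_tb, headroom)   # 1.0 20.0 120.0
```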
Root Cause Diagnosis: (1) Cardinality explosion: if your facets are high-cardinality (exact combinations of color × size × brand), Elasticsearch builds large in-memory aggregation structures (fielddata, global ordinals) that don't compress well; the effective per-document footprint can balloon from 2KB to tens of KB. (2) Replicas: 3 replicas mean every write is performed 4 times (primary + 3 copies). (3) Refresh interval: refreshing the index every 1 second creates a constant stream of tiny segments that must then be merged = wasted CPU and I/O.
Optimization Strategy: (1) Reduce replicas: use 1 replica instead of 3 (2x data instead of 4x). Availability drops from roughly 99.99% to 99.9%, but for e-commerce search (non-critical), that's acceptable. (2) Tune refresh interval: search is read-heavy, not real-time. Refresh every 30 seconds instead of 1 second. This doesn't shrink the index much, but it slashes segment churn: far fewer tiny segments to create and merge, so indexing CPU and I/O drop sharply. (3) Optimize aggregations: pre-compute facet counts in a separate, tiny index. The main search index is text-only, with no aggregation overhead; facet queries hit the small index separately. (4) Use appropriate mappings: if filtering by exact color (red, blue, green), map the field as keyword (not analyzed). If searching by product name (full-text), use the standard analyzer. Don't analyze everything.
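The replica and refresh changes map to a single settings update. A sketch of the payload (the index name "products" is an assumption), which would be sent as `PUT /products/_settings`:

```python
import json

# Hypothetical settings payload for the tuning above; the setting names follow
# the Elasticsearch index-settings API, the index name "products" is assumed.
settings = {
    "index": {
        "number_of_replicas": 1,    # was 3: halves the storage footprint
        "refresh_interval": "30s",  # was 1s: far fewer tiny segments to merge
    }
}

# Sent as: PUT /products/_settings
print(json.dumps(settings, indent=2))
```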
New Sizing: 5TB index × 2 (1 replica) × 1 (no aggregation overhead) = 10TB working set. With headroom for growth, segment merges, and shard balance, 50 nodes instead of 300. Cost drops to $50K/month. Same search latency.
Follow-up: After cutting replicas to 1, a node crashes. One shard replica is lost. Rebalancing takes 6 hours, during which search is slower (unbalanced). How do you balance availability with cost when node failures are expected?
Autocomplete query: user types "iph" (3 characters) → must return suggestions (iPhone, iPhone Pro Max, etc.) within 50ms. Your current setup uses Elasticsearch with prefix queries. As product catalog grows, queries slow down. At 500M products, "iph" prefix query scans millions of documents. How do you fix this?
The Performance Wall: Prefix query "iph*" requires scanning every product that starts with "iph". With inverted index, Elasticsearch can narrow it down, but at scale (millions of matches), returning top-10 suggestions still requires scoring, sorting = 100ms+. This breaks the 50ms autocomplete SLA.
Solution #1: Trie-Based Autocomplete Index: (1) Use a trie (prefix tree) data structure dedicated to autocomplete. Build offline: for each product name, insert into the trie. "iPhone" → add nodes: i → p → h → o → n → e. (2) Query: search "iph" → traverse the trie (3 node hops) → get all products starting with "iph" → return top-10 by popularity. Latency: <10ms even for 500M products. (3) Alternatively, store in Redis or in Elasticsearch with an edge_ngram analyzer (prefix grams, not plain ngrams): each name is indexed as "suggestions": ["i", "ip", "iph", "ipho", "iphon", "iphone"], so the query term "iph" becomes a cheap exact term lookup. Elasticsearch's completion suggester is another purpose-built option. (4) Add popularity ranking: products with higher sales get priority in suggestions (sort by sales_count desc).
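A minimal sketch of the trie idea, with each node caching its top-k suggestions by popularity so a lookup is just a prefix walk (product names and popularity scores are made up):

```python
class TrieNode:
    __slots__ = ("children", "products")
    def __init__(self):
        self.children = {}
        self.products = []   # (popularity, name), trimmed to top-k

class AutocompleteTrie:
    """Prefix trie; every node caches its top-k completions by popularity."""
    def __init__(self, k=10):
        self.root = TrieNode()
        self.k = k

    def insert(self, name, popularity):
        node = self.root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.products.append((popularity, name))
            node.products.sort(reverse=True)
            del node.products[self.k:]   # keep only the top-k per node

    def suggest(self, prefix):
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return [name for _, name in node.products]

trie = AutocompleteTrie()
trie.insert("iPhone", 900)
trie.insert("iPhone Pro Max", 800)
trie.insert("iPad", 700)
print(trie.suggest("iph"))   # ['iPhone', 'iPhone Pro Max']
```

Because each node stores its own pre-ranked top-k, query cost is O(prefix length), independent of catalog size; the price is paid at build time.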
Solution #2: Dedicated Autocomplete Service: Deploy a small, fast service (e.g., Meilisearch, Typesense) optimized for autocomplete. It's 10-50x faster than Elasticsearch for this workload. 500M products take ~50GB in Meilisearch (vs 1TB in Elasticsearch). Cost: $20K/month vs $200K/month.
Hybrid Approach: Main search (full-text, complex queries) → Elasticsearch. Autocomplete → Meilisearch or Trie-based Redis. Best of both worlds.
Follow-up: Your autocomplete index is in Redis. When new products are added (1000/minute), the trie must be updated. Updates are slow (rebuilding trie = 30s). During rebuild, autocomplete is stale (doesn't suggest new products for 30s). How do you handle this?
You're searching product inventory with multiple filters applied: (category: "electronics") AND (brand: "Apple") AND (price: $500-$2000) AND (in_stock: true). With Elasticsearch, each filter narrows the result set, and the intersections are combined. Your query takes 500ms for 5M matching documents. How do you optimize?
Query Optimization: (1) Filter order matters: start with the most selective filter first. "in_stock: true" eliminates 80% of products immediately (narrowing 500M to 100M). Then "brand: Apple" (narrow to 1M). Then the price range (narrow to 100K). Finally "category" (narrow to 50K). Reorder filters by selectivity (most restrictive first); recent Elasticsearch versions do some of this reordering automatically, but the principle still drives index and query design. (2) Use the filter cache: Elasticsearch caches filter results as bitsets. If you run (in_stock: true) 1000 times/day, the first query pays 100ms; subsequent queries reuse the cached bitset (~5ms). (3) Separate indexing strategy: instead of a single index with all documents, partition by in_stock status. Index 1 = in-stock products (100M). Index 2 = out-of-stock (400M). Queries hit only Index 1, eliminating 80% of documents up front. (4) Write queries efficiently: put term/range clauses in the filter context of a bool query (fast, cacheable, no scoring) and avoid script queries (slow).
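The filter plan above, expressed as an Elasticsearch bool query with every clause in filter context (field names follow the question; the exact mapping is assumed):

```python
import json

# Hypothetical bool query for the plan above: all clauses sit in "filter"
# context, so they are cacheable and skip scoring; most selective first.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term":  {"in_stock": True}},                       # most selective
                {"term":  {"brand": "Apple"}},
                {"range": {"price": {"gte": 500, "lte": 2000}}},
                {"term":  {"category": "electronics"}},
            ]
        }
    }
}

# Sent as: GET /products/_search
print(json.dumps(query))
```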
Result: 500ms → 100ms (10x faster) by reordering filters + caching. 100ms → 20ms by partitioning index.
Trade-off: Partitioned index adds operational complexity (maintain 2 indices instead of 1). But for scale, worth it.
Follow-up: Your in_stock status changes constantly: products sell out, new stock arrives. Index partitions become stale within minutes. If user filters by in_stock=true but index is 2 hours stale, they see out-of-stock products. How do you keep partitions fresh?
Your search uses Elasticsearch. You want to add faceted search: "Narrow results by color". You have 10K unique colors across 500M products. Elasticsearch aggregation query for facets takes 2 seconds (must scan all matching documents, build histogram). Search latency is now 2s instead of 200ms. How do you fix this?
The Aggregation Cost: User searches "iphone" → 5M results. Then asks for color facets. Elasticsearch must: (1) Filter to 5M results matching "iphone". (2) Build aggregation on color field for all 5M documents. (3) Count how many are red, blue, green, etc. (4) Return facet counts. This is expensive because it touches 5M documents.
Solution #1: Separate Facet Index: (1) Create a "facet summary" table: (product_id, color, brand, size, price_range). This is denormalized and tiny (1KB per product instead of 10KB). (2) Main search query hits large index (5M documents). Facet query hits small facet index (counts per color). (3) Queries: Full-text search returns product IDs. Facet query reads separate index: "for these 5M product IDs, show color breakdown." Much faster because data is optimized for aggregation.
Solution #2: Pre-Computed Facets: (1) For each search term (iPhone, Samsung Galaxy, etc.), pre-compute facet counts offline: {color_red: 1.2M, color_blue: 800K, ...}. Store in Redis. (2) When user searches "iPhone", fetch pre-computed facets from Redis (instant). Update them with any new products added in last hour. (3) Trade-off: facets are ~1 hour stale, but search is sub-100ms.
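A sketch of Solution #2's pre-computed facet lookup, using a plain dict as a stand-in for Redis (a real deployment would use hset/hgetall with the same key shapes; all names here are hypothetical):

```python
# Dict standing in for Redis; keys like "facets:iphone" map to count hashes.
facet_store = {}

def precompute_facets(term, products):
    """Offline job: count colors across all products matching `term`."""
    counts = {}
    for p in products:
        counts[p["color"]] = counts.get(p["color"], 0) + 1
    facet_store[f"facets:{term}"] = counts

def get_facets(term):
    """Online path: O(1) lookup instead of aggregating millions of docs."""
    return facet_store.get(f"facets:{term}", {})

catalog = [
    {"name": "iPhone 15", "color": "red"},
    {"name": "iPhone 15", "color": "blue"},
    {"name": "iPhone 14", "color": "red"},
]
precompute_facets("iphone", catalog)
print(get_facets("iphone"))   # {'red': 2, 'blue': 1}
```

The staleness trade-off lives entirely in how often the offline job reruns; the online path never touches the main index.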
Solution #3: Elasticsearch-Side Optimization: Accept approximate counts to limit aggregation work. The cardinality aggregation's precision_threshold setting trades exactness for memory, and terms aggregations can run with a smaller shard_size. Counts may be off by a few percent, but CPU drops significantly.
Result: 2s → 200ms facet query. Search remains 200ms. Total: 400ms instead of 2.2s.
Follow-up: After implementing separate facet index, you notice facet counts disagree with search results. Search returns 500 "red iPhones", but facet index says there are 495. Why the discrepancy?
Your Elasticsearch cluster runs 100 nodes. Indexing 500M products. During bulk import (1M products/hour), cluster CPU jumps to 95%, search latency spikes to 5 seconds. After bulk import completes, cluster settles back to 40% CPU, latency <200ms. How do you isolate bulk indexing from search traffic?
The Problem: Resource Contention: Indexing requires: (1) parsing documents, (2) analyzing text, (3) building segment files (disk I/O), (4) merging segments. This competes with search queries for CPU, disk I/O, memory.
Solution: Dedicated Index Nodes: (1) Partition cluster into two groups: (A) 60 dedicated index nodes: handle bulk indexing. (B) 40 dedicated search nodes: handle read queries. (2) When indexing, index nodes process 1M products/hour at comfortable CPU (70%). Search nodes remain idle for search queries (0% CPU). (3) After indexing completes, copy indexed data from index nodes to search nodes (replication). Search nodes now have fresh data. (4) Index nodes are ready for next bulk import. Search nodes serve queries with zero interference from indexing.
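One way to realize the two-tier split in Elasticsearch is shard allocation filtering: tag each node with a custom attribute (e.g. `node.attr.tier: index` or `search` in elasticsearch.yml — the attribute name "tier" is our choice, not a built-in), pin the index to the index tier during the bulk load, then flip the setting so shards relocate to the search tier:

```python
import json

# Shard allocation filtering sketch. Assumes nodes are tagged in
# elasticsearch.yml with a custom attribute, e.g.  node.attr.tier: index
# ("tier" is an assumed attribute name, not a built-in).
pin_to_index_tier = {"index.routing.allocation.require.tier": "index"}

# After the bulk import completes, flip the same setting so the cluster
# relocates the shards onto search-tier nodes:
move_to_search_tier = {"index.routing.allocation.require.tier": "search"}

# Each would be sent as: PUT /products/_settings
print(json.dumps(move_to_search_tier))
```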
Trade-off: you run 100 nodes where a shared cluster might need only ~50, roughly doubling cost. But you get: (A) Predictable search latency (never spiked by indexing). (B) The ability to scale indexing and search independently. (C) Upgrades to index nodes without affecting the search SLA.
Alternative: Index Throttling: Instead of dedicated nodes, throttle indexing rate: instead of 1M/hour, do 100K/hour spread over 10 hours. This keeps CPU at 60% continuously (no spike). Search latency stays at 300ms (acceptable, slightly slower than ideal). Cost: same 50 nodes, but indexing takes longer. Better for scenarios where time is flexible.
Follow-up: With dedicated index nodes, you notice search latency is now 150ms (faster than before!). But index nodes are creating new segment files every 30 seconds. Segment merging (background process) competes with new writes. Index nodes CPU is at 80%. How do you accelerate indexing without adding more nodes?
You're migrating search infrastructure from Elasticsearch to OpenSearch (AWS fork). You need zero downtime. Both systems must be queried simultaneously during migration (canary + stability period). After 2 weeks, fully migrate to OpenSearch. How do you design this?
Dual-Write Migration Strategy: (1) Phase 0 (pre-migration): Set up OpenSearch cluster (100 nodes, identical to Elasticsearch setup). Test it for 1 week. (2) Phase 1 (week 1-2): Dual write. Application sends all index requests to BOTH Elasticsearch and OpenSearch simultaneously. (3) Verify: for 10% of queries, read from OpenSearch and compare results to Elasticsearch. If they differ, log discrepancy. Debug. (4) Phase 2 (week 3): Cutover 50% of read traffic to OpenSearch, 50% to Elasticsearch. Monitor latency and correctness. (5) Phase 3 (week 4): Migrate remaining 50% to OpenSearch. Elasticsearch still receives writes but serves 0% of reads. (6) Phase 4 (week 5): Stop writing to Elasticsearch. Delete Elasticsearch cluster. OpenSearch is now sole source of truth.
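A minimal dual-write sketch for Phase 1 (the clients here are fake stand-ins; a real version would wrap the elasticsearch-py and opensearch-py clients). A failure on either side is recorded for later replay rather than failing the whole write:

```python
# Failures are logged for replay instead of aborting the write to the
# healthy cluster; all class and function names are hypothetical.
failed_writes = []

def dual_write(es_client, os_client, doc_id, doc):
    """Index into both clusters; record per-cluster failures for replay."""
    for name, client in (("elasticsearch", es_client), ("opensearch", os_client)):
        try:
            client.index(doc_id, doc)
        except Exception as exc:
            failed_writes.append((name, doc_id, doc, str(exc)))

class FakeCluster:
    """In-memory stand-in for a search cluster client."""
    def __init__(self, fail=False):
        self.docs, self.fail = {}, fail
    def index(self, doc_id, doc):
        if self.fail:
            raise RuntimeError("indexing failed")
        self.docs[doc_id] = doc

es, osr = FakeCluster(), FakeCluster(fail=True)
dual_write(es, osr, "p1", {"name": "iPhone"})
print(len(es.docs), len(failed_writes))   # 1 1
```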
Rollback Plan: If OpenSearch has issues at any phase, revert: redirect reads back to Elasticsearch. This is fast if the switch lives in a load balancer or application config flag; a pure DNS change is slower, since cached records persist until TTLs expire.
Data Consistency: During dual-write (phase 1), Elasticsearch and OpenSearch might diverge if indexing fails in one but not the other. Mitigation: track all index requests in transaction log. If OpenSearch is behind, replay logs from Elasticsearch to OpenSearch. By phase 2, both are consistent.
Follow-up: During dual-write phase, write latency increases (now writing to 2 systems). Latency goes from 50ms to 150ms. Customers complain. How do you reduce this without sacrificing consistency?
Your search system must return results in <100ms for 99th percentile. You use Elasticsearch with sharded indices (100 shards, 5 replicas = 600 total copies across 100 nodes). Query fan-out: each query hits all 100 shards. Some shards are slower than others (data skew), causing p99 latency to be 500ms. How do you reduce p99 without hitting all shards?
The Tail Latency Problem: If one shard out of 100 is slow (due to hot data, disk contention, GC pause), the entire query waits for that shard. This is "tail latency amplification"—adding more shards makes p99 worse, not better.
Solution #1: Adaptive Query Routing: (1) Track latency per shard: measure how long each shard takes to respond. (2) For queries, route to fastest N shards (maybe 80% of shards) instead of all 100. (3) For high-traffic queries, use probabilistic routing: "99% of searches hit shards 1-80 (fast), 1% hit all shards" (accurate but slower). (4) If you need 99.99% accuracy, hit all shards. For 99% accuracy (acceptable for search), hit 80 shards. Trade latency for accuracy.
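A sketch of the latency tracking in Solution #1: keep an exponentially weighted moving average per shard and query only the fastest fraction (all names and numbers are illustrative):

```python
# Latency-aware shard selection sketch; shard_latency_ms holds an
# exponentially weighted moving average (EWMA) of response time per shard.
shard_latency_ms = {}

def record(shard_id, latency_ms, alpha=0.2):
    """Fold a new observation into the shard's EWMA."""
    prev = shard_latency_ms.get(shard_id, latency_ms)
    shard_latency_ms[shard_id] = (1 - alpha) * prev + alpha * latency_ms

def pick_shards(all_shards, fraction=0.8):
    """Return the fastest `fraction` of shards (at least one)."""
    ranked = sorted(all_shards, key=lambda s: shard_latency_ms.get(s, 0.0))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

for shard in range(10):
    record(shard, 20.0)
record(9, 500.0)                      # one slow shard (GC pause / hot data)
print(pick_shards(list(range(10))))   # shard 9 is excluded
```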
Solution #2: Shard Rebalancing: (1) Identify slow shards (tail latency > 200ms). (2) Rebalance data: move data from slow shards to fast shards. If slow shard has 100GB of data, split it to 2 smaller shards (50GB each) on different nodes. Latency should drop. (3) Use forced merge before rebalancing: reduces number of segments, speeds up queries.
Solution #3: Caching Hot Results: (1) 80% of search volume comes from 20% of queries (Zipfian distribution). Cache results: first time "iPhone" is searched, result is cached. 99 subsequent "iPhone" searches return from cache (~5ms). (2) Use a distributed cache (Redis, Memcached) in front of Elasticsearch.
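The caching layer in Solution #3 is a plain cache-aside pattern; a sketch with a dict standing in for Redis and a counter showing the backend is hit once per distinct term (the backend function is a hypothetical stand-in for the Elasticsearch query):

```python
# Cache-aside sketch: check the cache first, fall through to the backend
# once, then serve all repeats from the cache.
cache = {}
backend_calls = 0

def search_backend(term):
    """Stand-in for the expensive Elasticsearch query (~200ms)."""
    global backend_calls
    backend_calls += 1
    return [f"{term} result"]

def cached_search(term):
    if term in cache:
        return cache[term]          # ~5ms path for repeat queries
    result = search_backend(term)   # slow path, taken once per term
    cache[term] = result
    return result

cached_search("iPhone")
cached_search("iPhone")
print(backend_calls)   # 1
```

A production version would add a TTL so cached results expire, bounding staleness for price and stock changes.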
Result: p99 latency drops from 500ms to 120ms (adaptive routing) or 100ms (caching).
Follow-up: After implementing adaptive routing (hitting only fastest 80 shards), you notice some search results are missing (top-10 products don't include ones that should rank high). Why does hitting fewer shards cause incorrect results?
Your search infrastructure handles 1M queries/sec globally. One data center (us-east-1) goes down. All those queries must route to other data centers (us-west-2, eu-west-1, ap-southeast-1). Each remaining center is designed to handle 400K queries/sec. Aggregate capacity is now 1.2M queries/sec. But query latency spikes 10x (cross-data-center latency is 100ms vs 5ms local). How do you handle this?
Capacity Overflow Management: (1) When us-east-1 fails, Route53 stops routing traffic there (though some requests still arrive briefly from cached DNS and mobile-app retry logic). us-east-1's ~250K queries/sec share of the 1M total is spread across the other 3 centers. (2) Each remaining center now receives ~250K (its own local share) + ~83K (redirected) ≈ 333K queries/sec, which fits under its 400K design capacity. (3) But latency increases, because queries from us-east-1 users now travel cross-region to eu-west-1 or us-west-2 (100ms instead of 5ms local).
Mitigation Strategy: (1) Implement circuit breaker: if a query takes >500ms due to cross-region routing, return cached result from 1 hour ago instead of timeout. Users see slightly stale data but fast response. (2) Implement request coalescing: if 100 users search "iPhone" from us-east-1, coalesce into 1 request to eu-west-1. Return same result to all 100. Reduces traffic 100x. (3) Graceful degradation: if eu-west load is >90%, reject 10% of cross-region requests (send error to client: "Please try again later"). Prevents cascade.
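A single-threaded sketch of the request coalescing in point (2): identical in-flight queries share one backend call (a real service would guard the map with a lock or use asyncio futures; all names are hypothetical):

```python
class Coalescer:
    """Identical in-flight queries share one backend call.
    Single-threaded sketch: request() registers waiters, complete()
    performs the one fetch and fans the result out to all of them."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.in_flight = {}   # term -> callbacks waiting on the pending call

    def request(self, term, callback):
        if term in self.in_flight:
            self.in_flight[term].append(callback)   # piggyback, no new call
            return False
        self.in_flight[term] = [callback]
        return True   # caller owns the fetch; run complete() when done

    def complete(self, term):
        result = self.fetch(term)                   # one cross-region request
        for cb in self.in_flight.pop(term):
            cb(result)

calls, results = [], []
c = Coalescer(lambda t: calls.append(t) or f"results:{t}")
c.request("iPhone", results.append)
c.request("iPhone", results.append)   # arrives while the first is in flight
c.request("iPhone", results.append)
c.complete("iPhone")
print(len(calls), len(results))   # 1 3
```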
Design for Failure: During normal operation, the 1M queries/sec load is spread across 4 centers (~250K each), but each center is provisioned for 400K. If one center fails, the remaining 3 offer 1.2M queries/sec of capacity against 1M of load. The rule: to survive the loss of any one of N centers, provision each for total/(N-1), not total/N — the classic "N+1" redundancy rule.
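The sizing rule reduces to one line; a tiny helper (the function name is made up) showing why a 400K-per-center design is safe for this workload:

```python
# Survivable-failure sizing: with n centers and f tolerated failures,
# each center must be provisioned for total/(n - f), not total/n.
def per_center_capacity(total_qps, n_centers, failures_tolerated=1):
    return total_qps / (n_centers - failures_tolerated)

need = per_center_capacity(1_000_000, 4)
print(round(need))   # 333333 qps each, so 400K design capacity is safe
```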
Follow-up: Your circuit breaker returns cached results from 1 hour ago. But products have changed (prices updated, in_stock status changed). Users see old data. After us-east-1 recovers (30 minutes later), users refresh and see current data, causing jarring UX. How do you smooth this?