Your Redis cache hit rate drops from 95% to 40% immediately after a schema migration (added 3 new fields to user object). No code changes to cache logic. Requests still hit cache keys, but values are now invalid (fields missing). Walk through the root cause and recovery.
Schema migration without cache invalidation leaves stale serializations in place. Old cached objects: {"id": 1, "name": "Alice", "email": "..."}. The new schema adds fields: {"id": 1, "name": "Alice", "email": "...", "phone": null, "address": null, "updated_at": null}. The cache still holds the old format; code accessing the new fields (phone, address) gets undefined values, application logic treats the entry as invalid, and a cache miss is reported. Hit rate drops from 95% (old format accepted) to 40% (mixed old/new formats; most old-format entries rejected). Root cause: **no cache versioning or TTL tied to the schema migration**. When the schema changes, old cached objects become invalid. Solutions:

(1) **Cache versioning with prefix**: include the schema version in the cache key. Old key: "user:1". New key: "user:v2:1". The migration switches to the new prefix; old keys expire naturally (TTL). During the transition, new requests miss, fetch fresh data, and cache it in the new format, while the old format gradually ages out. Cost: temporarily higher miss rate (5-10 minutes), but it converges.

(2) **Reduce TTL before migration**: set TTL = 5 minutes on all cache entries shortly before the migration. Entries expire and get refreshed with the new schema. Cost: the hit rate naturally drops pre-migration as entries expire.

(3) **Event-driven invalidation**: the migration script broadcasts a "schema_version_change" event; the app subscribes. On the event, the cache key prefix changes (v1 → v2) and all clients miss. Cost: requires an event system, adds complexity.

(4) **Dual-read during transition**: the app checks whether a cached object has the new fields. If they are missing, treat it as a miss, fetch fresh data, and update the cache. Cost: code changes required.

Best approach: **(1) cache versioning**. Implementation: (a) pre-migration, reduce TTL to 5 minutes. (b) At migration start, switch to the new version prefix. (c) New requests use the new prefix, miss, and fetch fresh data. (d) Requests still on the old prefix may hit the cache (if not expired), so the app gracefully handles missing fields (null-coalesce). (e) Within ~30 minutes, old-prefix entries have mostly expired. Gradual transition, minimal outage.
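The versioned-key approach can be sketched as follows. This is a minimal illustration, not the original implementation: a plain dict stands in for Redis (real code would also set a TTL), and fetch_user is a hypothetical DB call.

```python
# Minimal sketch of cache versioning: the schema version is part of the
# cache key, so bumping SCHEMA_VERSION at migration time makes every request
# miss once and repopulate in the new format; old-format keys are never read
# again and age out via TTL.
SCHEMA_VERSION = "v2"   # bumped when the migration starts
cache = {}              # stand-in for Redis (real code would set a TTL)

def fetch_user(user_id):
    # Hypothetical DB fetch returning the post-migration shape.
    return {"id": user_id, "name": "Alice", "email": "a@example.com",
            "phone": None, "address": None, "updated_at": None}

def get_user(user_id):
    key = f"user:{SCHEMA_VERSION}:{user_id}"
    if key in cache:
        return cache[key]            # hit: guaranteed new-format value
    value = fetch_user(user_id)      # miss: fetch fresh, cache new format
    cache[key] = value
    return value
```

Because the version lives in the key, no reader can ever observe a mixed-format value: either it finds a new-format entry or it misses.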
Follow-up: If phone field is critical (payment verification), and old cache hits are now returning null, how do you prevent transaction failures during transition?
A product team frequently updates feature flags, which are cached for performance. Flag cache TTL = 1 hour. Feature flag update takes 5 minutes to propagate to all users because of cache staleness. Business wants updates to propagate in <30 seconds. Changing TTL to 30 seconds reduces cache hit rate from 90% to 78% (more DB hits). Suggest a solution that keeps hit rate high and propagates changes fast.
TTL = 1 hour is too long for feature flags (propagation is slow), but 30 seconds is too short (too many DB hits). The tradeoff exists because TTL is a blunt tool. Better solutions:

(1) **Event-driven invalidation**: a feature flag update publishes an event to Redis pub-sub. All cache nodes subscribe and immediately invalidate the affected keys. Propagation: ~100ms. Implementation: flag update → emit event "flag_updated:feature_X" → Redis pub-sub → cache nodes receive → delete cache key. No TTL needed; entries live until explicitly invalidated. Cost: event infrastructure, and handling of network failures (a cache node that misses an event keeps stale data).

(2) **Hybrid: TTL + events**: keep TTL = 1 hour as a safety net and layer event-driven invalidation on top. Fast propagation via events (<1 second); eventual safety via TTL (at most 1 hour stale if an event is lost).

(3) **Versioned cache with polling**: the cache includes a version number; a flag update increments it. Cache nodes poll for version changes every 10 seconds (lightweight) and invalidate on change. Cost: up to 10 seconds of polling latency.

(4) **Hierarchical cache**: L1 in-process cache (TTL = 30 sec), L2 Redis cache (TTL = 1 hour). Invalidation affects both: event → invalidate Redis → in-process caches re-check Redis on their next miss.

(5) **Read-through cache with listener**: the application subscribes to the flag change stream and proactively evicts cache entries as soon as a change is published. Near-zero propagation latency.

Recommended: **(1) event-driven** for real-time requirements, with TTL = 5 minutes as a disaster-recovery fallback. Flag update → emit event → cache nodes delete the key instantly. Hit rate: ~90% (similar to the 1-hour TTL, because invalidation rather than expiry keeps the cache fresh). Propagation: <1 second.
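The event-driven path can be sketched as follows. This is a hedged illustration: a list of callbacks stands in for a Redis pub-sub channel, and the db and flag names are made up for the example.

```python
# Minimal sketch of event-driven flag invalidation: every cache node
# subscribes to the channel and deletes the affected key the moment the
# update event arrives, so propagation is one network hop, not one TTL.
subscribers = []   # stand-in for a Redis pub-sub channel

def publish(flag_name):
    for callback in subscribers:
        callback(flag_name)          # real pub-sub: ~100ms network hop

class CacheNode:
    def __init__(self):
        self.local = {}              # this node's cached flags
        subscribers.append(self.invalidate)

    def invalidate(self, flag_name):
        self.local.pop(f"flag:{flag_name}", None)   # delete stale key

def update_flag(db, name, value):
    db[name] = value                 # persist the new flag value
    publish(name)                    # broadcast invalidation to all nodes
```

The next read on each node misses, fetches the fresh flag, and re-caches it; the hit rate stays high because entries are evicted only when they actually change.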
Follow-up: Your feature flag is deployed to 1000 cache nodes (distributed Redis cluster). Event propagation takes 5 seconds (network delay). Do some nodes still serve stale data?
You implement write-through cache: every write goes through cache first, then DB. A user updates their profile. Write succeeds in cache but fails in DB (connection timeout). Cache now has newer data than DB. User sees update (cache hit), but it's not persisted. DB recovers in 5 minutes. What's the recovery strategy?
Write-through failure causes cache/DB divergence: the write succeeds in the cache but fails in the DB, so the cache is ahead of the DB. The user sees the update (cache hit), but it is not persisted; when the DB recovers, the cache still holds newer data. This is the classic write-through problem of **asymmetric success**: if the DB write fails, the cache should be rolled back or marked stale. Solutions:

(1) **Rollback cache on DB failure**: write → cache (success) → DB (fail) → cache.delete(key). The user then sees a cache miss plus a DB read and gets the old data again. Cost: an extra delete operation, potential race condition.

(2) **Dual-write with compare-and-swap (CAS)**: write version V1 to the cache, then V1 to the DB. If the DB write fails, a later read sees cache V1 while the DB lacks it; application logic resyncs when DB version < cache version. Cost: schema changes (add a version), complex recovery logic.

(3) **Cache as write buffer + async sync**: writes go to the cache only and succeed immediately. An async job syncs cache → DB every 1 second, retrying on failure. During the failure window the cache is ahead of the DB (acceptable for soft data like profiles); on DB recovery the job drains the accumulated writes. Cost: the cache is a temporary source of truth (risky for critical data).

(4) **Write-through with rollback of both**: write to cache and DB in parallel, wait for both to succeed, and roll back both if either fails. Requires a distributed transaction or saga pattern. Cost: complexity, potential deadlocks.

(5) **DB-first write**: write to the DB first. If the DB write fails, return an error immediately and do not cache. The hit rate drops during DB issues, but consistency is guaranteed; on retry, write → DB (success) → cache. Cost: more cache misses while the DB is unhealthy.

The best approach depends on data criticality: (a) **soft data (profiles)**: use (3) write buffer + async sync; the cache is a temporary buffer. (b) **Critical data (payments)**: use (5) DB-first; never cache a value the DB rejected. (c) **High-availability requirements**: use (2) versioned cache; accept the complexity for the recovery guarantees. Recommended for production: **(5)**. If the DB write fails, return an error immediately, don't cache, and retry at the application level. Benefit: clearer semantics, simpler recovery.
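The recommended DB-first order can be sketched as follows. The db_write parameter is a hypothetical persistence call that raises on failure; the dict cache is a stand-in for Redis.

```python
# Minimal sketch of DB-first write-through: the DB write happens first, and
# the cache is only updated after it succeeds, so a failed DB write can
# never leave the cache ahead of the DB.
cache = {}

def write_profile(db_write, key, value):
    try:
        db_write(key, value)     # DB is the source of truth; may raise
    except ConnectionError:
        cache.pop(key, None)     # drop any older cached copy, then surface
        raise                    # the error so the caller can retry
    cache[key] = value           # cache only after the durable write
```

Ordering the writes this way trades a few extra misses during DB trouble for the guarantee that a cache hit always reflects persisted data.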
Follow-up: Your async sync job (option 3) syncs 1000 buffered writes to the DB after recovery. This takes 10 minutes. Do users see inconsistent data during the replay?
A Redis cache key is accessed by multiple services (a user-profile cache shared by the auth, payment, and notification services). One service updates a user profile and invalidates the cache. But another service is in the middle of reading the cache (streaming the response). That service gets partial data (partially invalidated). How do you prevent cache invalidation during a read?
Race condition: service A is reading the cache (streaming a response) while service B invalidates the key mid-stream. Service A receives partial data (the first 50% of the response; the rest is gone). Consequences: an incomplete user profile; the payment service sees a missing payment method and the charge fails. Root cause: **no locking or versioning during cache reads**. Solutions:

(1) **Read lock**: before a cache read, acquire a shared lock; invalidation requires an exclusive lock and blocks until readers finish. Implementation: Redis SET key:lock value EX 100 NX. Cost: lock acquisition latency (~5-10ms per request), deadlock risk if the lock timeout is mis-tuned.

(2) **Copy-on-write semantics**: invalidation does not delete the key; it increments a version number. Readers record the version before reading; if the version changed mid-read, they restart the read against the new version. Cost: complex, version tracking required.

(3) **Snapshot isolation**: each reader acquires a snapshot ID (e.g., a sequence number) and all of its reads use that snapshot. Invalidation creates a new snapshot; readers on the old snapshot complete against it. Cost: memory overhead (multiple snapshots), an eventual-consistency window.

(4) **Atomic replace, not delete**: instead of deleting the key, replace the value atomically. Readers get either the old or the new value, never a partial one. Cost: until the new value is ready, readers still see stale data (acceptable for soft data).

(5) **Versioned cache entries**: the cache key includes a version suffix (user:123:v1). On update, write the new version (user:123:v2), then redirect clients to v2. The old v1 stays readable by in-flight readers until its TTL expires. Cost: multiple versions in memory, garbage collection of old versions.

Best approach: **(5) versioned cache** for high-availability systems. Invalidation → write the new version → the old version remains readable → in-flight readers complete safely → garbage-collect the old version after ~10 minutes. New readers immediately see v2 (the updated data). Cost: up to ~2x memory for versioned keys, but zero dropped requests.
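The versioned-entry scheme can be sketched with a "current" pointer key, a layout invented for this illustration (a dict stands in for Redis):

```python
# Minimal sketch of versioned cache entries: a "current" pointer names the
# live version; an update writes the new version first, then flips the
# pointer, so readers see either a complete old value or a complete new one,
# never a partial entry.
cache = {"user:123:current": "v1",
         "user:123:v1": {"name": "Alice", "payment": "visa"}}

def read(user_id):
    version = cache[f"user:{user_id}:current"]
    return cache[f"user:{user_id}:{version}"]     # complete, never partial

def update(user_id, value):
    current = cache[f"user:{user_id}:current"]
    new = f"v{int(current[1:]) + 1}"
    cache[f"user:{user_id}:{new}"] = value        # write new version first
    cache[f"user:{user_id}:current"] = new        # then redirect readers
    # the old version stays readable for in-flight readers and is
    # garbage-collected later (e.g. by TTL or a sweep job)
```

An in-flight reader that resolved the pointer to v1 before the update keeps reading a complete v1 value, since the update never touches the v1 key.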
Follow-up: Versioned cache now contains user:123:v1, v2, v3, v4 (4 versions live). How do you prevent unbounded version growth?
A cache warming job pre-loads user profiles into Redis before serving traffic. 10 million profiles × 1KB each = 10GB loaded over 10 minutes. During loading, a user logs in and requests their profile. Cache miss (profile not pre-loaded yet). User's request is expensive (DB hit, slow response). But 30 seconds later, cache warming loads that profile. Next request is fast. How do you coordinate cache warming with live requests?
Cache warming creates a temporal inequality: users whose profiles load early get fast responses; users whose profiles load late get slow ones. The issue compounds if warming stalls or is incomplete. Solutions:

(1) **Prioritized cache warming**: rank user profiles by access frequency (using historical data) and load hot profiles first (10% of users ≈ 80% of traffic). Warm 1M hot profiles in the first minute, then slow-warm the remaining 9M. Most users hit the cache within a minute. Cost: requires a profiling phase, adds complexity.

(2) **On-demand warming**: on a miss, hit the DB, respond to the user, then cache asynchronously so the next request hits. Cost: the first request is slow, but subsequent requests are fast. Mitigates the thundering herd (if many users request the same profile, only the first is slow; the rest are fast once the cache warms).

(3) **Tiered warming**: Level 1 (critical profiles only: admins, VIPs): 100K profiles in 30 seconds. Level 2 (active profiles): 1M profiles in 5 minutes. Level 3 (lazy warming): on demand. Reduces slow responses for critical users. Cost: multi-tier logic, requires segmentation.

(4) **Predictive loading**: use user context (geographic region, time of day) to predict who will log in next and pre-load their profiles, e.g. morning US time → load US profiles. Cost: depends on prediction accuracy.

(5) **Hybrid: read-aside + warming**: warming runs in the background while live requests use read-aside (miss → DB → cache); both run concurrently. Cost: some DB hits during warming, but no request ever stalls waiting for the warmer.

Recommended combination: **(1) + (5)**. Priority-warm the top 10% of users in the first minute (covering ~80% of traffic); warm the remaining profiles in the background; serve live requests read-aside. Net effect: <5% of users see a slow response during warmup (those requesting not-yet-warmed profiles). After 10 minutes the cache is fully warm and all requests are fast.
Follow-up: If 5% of users see slow response (DB hit during warmup), and each DB hit costs $0.01 in compute, what's the cost/benefit of prioritized warming vs no warming?
A cache invalidation strategy uses pub-sub: when data updates, broadcast invalidation message. Cache nodes subscribe, delete stale keys. But during a deployment, a cache node crashes and restarts. It missed invalidation messages published while down. Now it has stale data. Users see old values. How do you prevent this?
Pub-sub invalidation is non-durable for down nodes: a node crashes, misses messages, and comes back online stale. Root cause: **pub-sub assumes all subscribers are always connected**; messages to offline subscribers are simply dropped. Solutions:

(1) **Durable queue instead of pub-sub**: use a message queue (Kafka, RabbitMQ) that retains messages. On restart, a node replays the messages it missed. Cost: queue infrastructure, slightly higher latency (10-50ms vs <5ms for pub-sub).

(2) **Cache version tracking**: every data update increments a version number in a central version table; the cache stores the version alongside the data. On a cache miss or periodic refresh, compare the cached version to the current version and invalidate on mismatch. Cost: version-tracking overhead, extra DB queries.

(3) **Heartbeat + sync**: each cache node periodically heartbeats with a version hash; on mismatch it re-syncs from the master. Cost: periodic network traffic, an eventual-consistency window (sync takes seconds).

(4) **Background refresh near expiry**: cached entries carry a TTL; when an entry nears expiry, refresh it asynchronously from the DB, and invalidate immediately if the version has changed. Cost: background refresh overhead.

(5) **Dual invalidation: pub-sub + TTL**: use pub-sub for real-time invalidation and also set a cache TTL (e.g., 15 minutes). If a node misses a pub-sub message, the entry still expires naturally. Cost: transient stale data (up to the TTL), but bounded.

Recommended: **(1) durable queue** for critical data; **(5) pub-sub + TTL** for soft data. Implementation: (a) use Kafka instead of Redis pub-sub for invalidation messages. (b) Cache nodes subscribe via a consumer group; on restart, the consumer resumes from its committed offset and replays the messages it missed. (c) Also set cache TTL = 30 minutes as a safety net. Cost: ~50ms additional latency vs pure pub-sub, but staleness is bounded at 30 minutes.
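The durable-log idea can be sketched as follows. An append-only list stands in for a Kafka topic and a plain integer for the committed consumer offset; the names are illustrative.

```python
# Minimal sketch of durable invalidation: the log retains every message, and
# each node tracks its consumer offset, so a node that was down replays
# every invalidation it missed when it comes back.
log = []                             # retained invalidation messages

def publish_invalidation(key):
    log.append(key)                  # durable: survives subscriber downtime

class CacheNode:
    def __init__(self):
        self.local = {}
        self.offset = 0              # committed consumer offset

    def consume(self):               # run on restart and periodically
        for key in log[self.offset:]:
            self.local.pop(key, None)    # replay missed invalidations
        self.offset = len(log)
```

The contrast with pub-sub is the offset: pub-sub has no memory of what a disconnected subscriber missed, while the log plus offset makes the missed window replayable.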
Follow-up: Your Kafka queue for invalidation has 1M pending messages (node was down for 1 hour). Replay takes 5 minutes. During replay, cache is inconsistent (partial invalidation). How do you handle requests?
A lazy-load cache strategy: cache misses trigger DB fetch + async cache-set. But the async cache-set is slow (network overhead). Meanwhile, second request comes in, also misses, also triggers async cache-set. 100 concurrent requests → 100 async cache-set operations, overwhelming Redis. Cache thundering herd. How do you deduplicate?
Thundering herd: many requests miss the cache simultaneously and all rush to the DB and cache-set. Root cause: **no deduplication of in-flight requests**. Solutions:

(1) **Request coalescing (wait group pattern)**: the first request to miss takes a lock on the key; other requests see the lock and wait instead of hitting the DB. The first request fetches the DB, caches, and unlocks; the waiters then see a cache hit and proceed. Cost: requires a lock mechanism (Redis SET NX), adds wait overhead (~50-100ms).

(2) **Probabilistic early refresh**: at ~80% of an entry's TTL, a single request triggers an async refresh while the others keep serving the (slightly stale) cached value. Only the first request pays the refresh cost. Cost: serves stale data briefly (acceptable for soft data).

(3) **Batched cache-set**: in-flight requests buffer their cache-set operations and batch 100 of them into one Redis MSET, reducing network round-trips. Cost: added latency while the batch accumulates.

(4) **Single-flight (memoization)**: an in-process map of in-flight fetches. Request 1 misses and starts the DB fetch as a promise; requests 2-100 find the promise and await the same fetch. One DB query serves all 100. Cost: requires promise/future semantics.

(5) **Sharded locks**: lock per shard (user:123 → shard 5) rather than per cache, so different keys fetch concurrently while the same key is serialized. Cost: moderate lock contention.

Recommended: **(1) request coalescing**. Implementation: (a) request 1 misses and runs SET cache:user:123:lock value EX 10 NX; it succeeds. (b) Request 2 misses, tries the same SET, fails, waits 100ms, then GETs cache:user:123; if set, return; if not, wait and retry. (c) Request 1 fetches the DB, caches the value, deletes the lock. (d) Requests 2-100 see a cache hit and proceed. Cost: ~50ms lock overhead; benefit: 100 concurrent requests become 1 DB hit + 100 cache reads, instead of 100 DB hits + 100 cache writes.
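The coalescing steps above can be sketched single-process. This is an illustration, not production code: try_lock mimics Redis SET NX semantics, dicts stand in for Redis, and fetch_db is a hypothetical slow query.

```python
# Minimal sketch of request coalescing: only the first miss acquires the
# lock and hits the DB; concurrent misses poll the cache until the value
# appears (or fall back after a timeout).
import time

cache, locks = {}, {}
db_hits = 0

def fetch_db(key):
    global db_hits
    db_hits += 1
    return f"value-for-{key}"

def try_lock(key):                   # SET key:lock value NX equivalent
    if key in locks:
        return False
    locks[key] = True
    return True

def get(key):
    if key in cache:
        return cache[key]
    if try_lock(key):                # first miss: fetch and fill
        try:
            cache[key] = fetch_db(key)
        finally:
            locks.pop(key, None)     # release (Redis: DEL, plus EX safety)
        return cache[key]
    for _ in range(100):             # lock held elsewhere: wait on cache
        if key in cache:
            return cache[key]
        time.sleep(0.01)
    return fetch_db(key)             # lock-timeout fallback: go to DB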
Follow-up: Lock timeout is 10 seconds. If request 1 crashes mid-DB-fetch, lock stays alive. Requests 2-100 wait 10 seconds for lock timeout. Is this acceptable?
A mobile app uses a local SQLite cache (users table, 10K rows). Server pushes incremental updates via WebSocket (user X's profile changed). App must sync local cache with server state. But network is flaky (connections drop/reconnect). App can't reliably receive all push updates. Design a sync strategy that keeps cache eventually consistent.
Mobile cache sync is tricky: unreliable networks, offline periods, and stale push events. Root cause: **push is fire-and-forget (UDP-like), with no ACK**; if the app is offline during a push, the event is lost. Solutions:

(1) **Timestamp-based sync**: the server records an update timestamp (UTC) per row. On reconnect, the app asks the server: "give me all updates since timestamp T." The server returns the delta (users modified after T) and the app applies it to the local cache. Cost: the app must track last_sync_timestamp and the server must index by timestamp. Sync time: ~1-5 seconds, depending on delta size.

(2) **Version vector per entity**: each user profile carries versions, e.g. {server_version: 5, app_version: 3}. On sync, compare versions; if the server version is newer, fetch fresh data. Cost: per-entity version tracking, complex logic.

(3) **Hash-based verification**: periodically compute a hash of the local cache (user count, checksum of user IDs) and compare with the server's hash. On mismatch, trigger a full sync. Cost: expensive (full hash), but detects a stale cache.

(4) **Background sync job**: the app syncs every 5 minutes (or on app resume), querying the server for all user updates since the last sync (by timestamp) and applying the delta. Cost: battery drain from periodic network use, but reliable.

(5) **Hybrid: push + background sync**: push for real-time updates while online; background sync every 5 minutes as a fallback to catch missed pushes. The app prioritizes push and falls back to sync. Cost: extra network calls, but high reliability.

(6) **Partial sync with priority**: on app resume, don't sync all 10K users; sync only recently viewed users (a priority list of ~100) and complete the full sync gradually in the background. Cost: fast resume (UX improvement), but eventual consistency for the remaining users.

Recommended: **(4) + (5) hybrid**. Push for real-time updates when online; background sync every 5 minutes (disabled in battery-saver mode); on app resume, sync if the last sync was more than 5 minutes ago. Cost: ~1-2% battery drain (periodic sync), acceptable network calls. Time to eventual consistency: <5 minutes. Alternative for strict consistency: **(1) timestamp-based sync** immediately on reconnect. Time to consistency: ~1-5 seconds. Cost: higher network traffic.
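The timestamp-based delta sync can be sketched as follows. The server_rows store, its integer timestamps, and the delta_since endpoint are illustrative stand-ins for the real server.

```python
# Minimal sketch of timestamp-based sync: the app remembers last_sync and
# asks the server only for rows modified after it, then applies the delta
# to its local store.
server_rows = {
    1: {"name": "Alice", "updated_at": 100},
    2: {"name": "Bob",   "updated_at": 205},
}

def delta_since(ts):                 # server endpoint: "updates since T"
    return {uid: row for uid, row in server_rows.items()
            if row["updated_at"] > ts}

class LocalCache:                    # stands in for the SQLite users table
    def __init__(self):
        self.rows = {}
        self.last_sync = 0

    def sync(self, now):             # run on reconnect / app resume
        for uid, row in delta_since(self.last_sync).items():
            self.rows[uid] = row     # apply the delta locally
        self.last_sync = now
```

Because the delta is computed from server-side timestamps, the sync is idempotent: re-running it after a missed push simply re-applies rows the app already has.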
Follow-up: During background sync, user opens app and manually edits a user's profile (offline). App has local change, but server also has concurrent change. Sync now sees conflict. How do you resolve?