Your ElastiCache Redis cluster hit 100% memory. Evictions are spiking to 10K/sec, and application latency has increased from 50ms to 500ms. You have two choices: (1) scale up (add memory), or (2) scale out (add nodes). Redis is configured as a pure cache (no persistence). Walk through the decision: memory cost vs eviction performance.
ElastiCache Redis memory/eviction trade-off analysis: (1) Symptom diagnosis: (a) 100% memory + 10K evictions/sec = aggressive cache thrashing. (b) Evictions mean data is removed to make room for new keys; client requests for evicted data → cache miss → query origin (DB), slow. (c) Latency 50ms → 500ms is consistent with a sharp jump in the miss rate (each miss costs an extra ~100-200ms DB query vs a sub-ms cache hit). (2) Scale-up analysis (add more memory): (a) Current: cache.r6g.xlarge (26GB memory) = $0.47/hour ≈ $340/month. (b) At 100% capacity with heavy evictions, the working set is at least 26GB. (c) Scale to cache.r6g.2xlarge (52GB) = $0.94/hour ≈ $680/month. Double the cost. (d) Eviction rate should drop sharply (perhaps to ~5K/sec if the working set keeps growing), but not to zero. (e) Pros: simple, no code change. Cons: expensive, temporary fix (workload keeps growing). (3) Scale-out analysis (add more nodes): (a) Enable cluster mode and add a second cache.r6g.xlarge as a separate shard. Total: 52GB across 2 shards (26GB each). Cost: ≈$680/month (same as scale-up). (b) The keyspace is hash-partitioned across shards, so usable capacity really is 52GB; note that adding replicas instead would NOT add capacity (a replica holds a full copy of the primary, for HA and read scaling only). (c) Eviction rate: drops because each shard holds half the keys. (d) Throughput: 2 primaries serve requests in parallel. (e) Pros: better fault tolerance, a horizontal scaling path. Cons: requires a cluster-mode-aware client; multi-key operations (MGET, transactions) are restricted across hash slots. (4) Redis-specific considerations: (a) Redis is (mostly) single-threaded per node; in a non-clustered setup all writes go to one primary. (b) Adding read replicas helps read-heavy workloads only (reads can be distributed across replicas); it does nothing for write throughput. (c) Write-heavy workloads (many SET commands) need scale-up or cluster-mode sharding (writes split across shard primaries). (d) Diagnosis: check Redis command stats (`INFO commandstats`). If GETs >> SETs, replicas or shards both help; if SETs >> GETs, scale up or shard. (5) Working set estimation: (a) Query Redis: `INFO memory` → used_memory_peak. (b) If used_memory_peak is pinned at maxmemory (26GB), the working set is at least 26GB; the eviction rate hints at how much larger. Scaling to 52GB gives roughly 2x headroom.
(c) If the working set grows 20% per month, 52GB is only a ~5-month solution. Plan long-term. (6) Eviction policy impact: (a) Current policy: ElastiCache defaults to volatile-lru; pure caches usually run allkeys-lru (evict least-recently-used keys). (b) Alternative: allkeys-lfu (evict least-frequently-used). Keeps genuinely hot keys under scan-heavy traffic, at slightly higher bookkeeping cost. (c) Or: volatile-ttl (only evict keys with a TTL set), which protects keys meant to be permanent. (d) Testing: try different policies, measure eviction rate + latency. (7) Recommendation: (a) Short-term (1 week): scale to cache.r6g.2xlarge. Latency should drop to 100-150ms. (b) Measure eviction rate + read/write patterns. (c) If read-heavy and evictions are still high, scale out to 2x r6g.xlarge shards (better than a single larger node). (d) If write-heavy, stick with scale-up (single large node) unless you adopt cluster mode. (e) Long-term: implement a cache invalidation strategy (TTL, event-driven updates) to reduce working set size naturally. (8) Cost-benefit: adding 26GB memory (scale-up) = +$340/month. Benefit: dropping miss latency from ~500ms to ~100ms saves ~400ms on each of 10K misses/sec ≈ 4,000 request-seconds of user wait per second of load; even a rough conversion to revenue/productivity ($4K-5K/month) makes ROI positive within a month.
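The (4)(d) diagnosis can be scripted. A minimal sketch, assuming the standard `INFO commandstats` output format; the ratio threshold of 5 for "read-heavy" is an arbitrary assumption:

```python
def read_write_ratio(commandstats: str) -> float:
    """Parse `INFO commandstats` output and return the GET/SET call ratio.

    Returns float('inf') when there are no SET calls at all.
    """
    calls = {}
    for line in commandstats.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, fields = line.split(":", 1)
        cmd = name[len("cmdstat_"):]
        for field in fields.split(","):
            key, _, value = field.partition("=")
            if key == "calls":
                calls[cmd] = int(value)
    gets = calls.get("get", 0) + calls.get("mget", 0)
    sets = calls.get("set", 0) + calls.get("setex", 0)
    return gets / sets if sets else float("inf")

sample = """\
cmdstat_get:calls=900000,usec=1800000,usec_per_call=2.00
cmdstat_set:calls=100000,usec=450000,usec_per_call=4.50
"""
ratio = read_write_ratio(sample)
print("scale-out (read-heavy)" if ratio > 5 else "scale-up (write-heavy)")
# → scale-out (read-heavy)
```

Run this against a periodic `INFO commandstats` dump before committing to a topology, since the ratio shifts with traffic patterns.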
Follow-up: You scaled to 2x r6g.xlarge. Evictions dropped to 2K/sec. Latency is 150ms (better but still elevated). Did you pick the wrong scaling strategy, or is there another bottleneck?
You're deciding between Redis and Memcached for a new caching layer. Requirements: (1) cache user sessions (10K concurrent users), (2) cache product recommendations (100M items, hot 1M in memory), (3) cache API responses (TTL-based). Which tool and why?
Redis vs Memcached for 3 use cases: (1) User sessions (10K concurrent): (a) Redis: per-key TTL (sessions auto-delete after 30 min), rich data types (strings, hashes, lists — a hash per session is natural), atomic operations (INCR, LPUSH) for counters (view count, click count), and optional replication/persistence so sessions survive a node failure. (b) Memcached: simple key-value store. It does support per-item TTL and INCR/DECR/CAS, but values are flat blobs and there is no replication or persistence — losing a node logs those users out. Lighter weight, less flexible. (c) Winner: Redis (data types + durability options make session management easier). (2) Product recommendations (100M items, hot 1M in memory): (a) Redis: in-memory store. 1M items × 1KB each ≈ 1GB. Cost: cache.r6g.xlarge ≈ $340/month. Can iterate keys with SCAN (cursor-based iteration), handy for an ML pipeline. (b) Memcached: same memory usage (~1GB), multithreaded, lower per-key overhead. Cost: comparable cache.m6g.xlarge ≈ $250/month (cheaper). (c) Winner: Memcached (lighter weight, lower cost; recommendations are stateless/recomputable, so durability isn't needed). (3) Cache API responses (TTL-based): (a) Redis: per-key TTL plus fine-grained invalidation (delete by pattern via SCAN, key versioning). Good for dynamic content (product prices, inventory). (b) Memcached: also has per-item TTL, but you can only delete exact keys or flush everything — no way to enumerate or pattern-invalidate. More error-prone when responses must be purged early. (c) Winner: Redis (invalidation flexibility matters for API response caching). (4) Decision: use both (Redis + Memcached hybrid): (a) Redis: sessions + API responses (small, TTL-managed). ~100MB Redis. Cost: ~$50/month. (b) Memcached: recommendations (large, simple). ~1GB Memcached. Cost: ~$250/month. (c) Total: ~$300/month, optimized for each workload. (5) Operationally: (a) Redis: more complex (persistence options, replication, cluster mode). Requires ops expertise. (b) Memcached: simpler, fire-and-forget. Lighter ops burden. (c) If ops bandwidth is low, go 100% Memcached (accept flat values and no replication for sessions). If ops-rich, use both. (6) Alternative: use DynamoDB for sessions (pay per request, no upfront cluster cost), and Memcached for recommendations (cheap, simple).
Hybrid: DynamoDB + Memcached. (7) Cost-benefit: Redis + Memcached = $300/month. vs DynamoDB (sessions) + Memcached = $200/month (DynamoDB on-demand cheaper than Redis cluster). Trade-off: DynamoDB is slower (10-30ms vs 1-5ms Redis), but acceptable for sessions (not latency-critical). Recommendation: if latency is critical, Redis + Memcached. If cost-optimizing, DynamoDB + Memcached.
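The sizing estimate above (1M items × 1KB ≈ 1GB) can be made explicit before picking an instance type. A small sketch; the 25% metadata-overhead factor is an assumption, not a measured value:

```python
def cache_sizing(items: int, avg_item_bytes: int, overhead: float = 1.25) -> float:
    """Estimate cache memory need in GiB.

    `overhead` accounts for per-key metadata and fragmentation
    (assumption: ~25%; measure on your real workload).
    """
    raw_gib = items * avg_item_bytes / 1024**3
    return raw_gib * overhead

# 1M recommendation entries at ~1KB each, as in the discussion above
need_gib = cache_sizing(1_000_000, 1024)
print(f"{need_gib:.2f} GiB needed")  # → 1.19 GiB needed
```

Doing this arithmetic first avoids paying for a 26GB node when the hot set fits comfortably in a much smaller (or shared) one.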
Follow-up: You split Redis (sessions) + Memcached (recommendations). But a developer uses Redis for recommendations too (misuse). Redis memory grows unexpectedly. How do you enforce per-cache policies?
Your Redis cluster has 99.9% cache hit rate (excellent) but latency is still 100ms P99. Load test shows: PING latency = 1ms, but typical GET latency = 50ms, batch GET (mget) latency = 100ms. The variance suggests: (1) network round-trips are not the bottleneck, (2) serialization overhead might be. Investigate and optimize.
Diagnose Redis latency despite high cache-hit rate: (1) Latency breakdown: (a) PING = 1ms (pure network round-trip). (b) GET = 50ms (network 1ms + ~49ms of client-side overhead). (c) mget (1000 items) = 100ms (network 1ms + ~99ms parsing/decoding). (d) Root cause: serialization/parsing overhead, not network. (2) Serialization analysis: (a) Client serialization (before sending to Redis): JSON.stringify() can be slow for large objects (>10KB). (b) Redis-side parsing of the RESP protocol is cheap but nonzero. (c) Client deserialization (after receiving): JSON.parse() dominates. (d) Rough budget: serialize (~10ms) + deserialize (~20ms) ≈ 30ms of the 50ms GET is codec overhead. (3) Optimization #1: binary serialization (MessagePack, Protobuf, BSON): (a) Replace JSON with MessagePack: ~50% smaller payload, several-fold faster decoding. (b) GET latency: 50ms → ~30ms. (c) Implementation sketch (Node.js, node-redis v4 + msgpack5; run inside an async function):

```javascript
const msgpack = require('msgpack5')();
const { createClient } = require('redis');

const client = createClient({ url: 'redis://cache.example.com:6379' });
await client.connect();

// Before: JSON.stringify / JSON.parse
// After: msgpack.encode / msgpack.decode (base64 keeps the value string-safe;
// a Buffer-aware client configuration avoids even that step)
const user = { user_id: 123, name: 'Alice' };
await client.set('user:123', msgpack.encode(user).toString('base64'));

const encoded = await client.get('user:123');
const decoded = msgpack.decode(Buffer.from(encoded, 'base64'));
```

(d) Cost: negligible (CPU savings outweigh codec overhead). (4) Optimization #2: connection pooling (reduce round-trips): (a) If the client opens a new connection per request: ~10ms handshake overhead. (b) Use a persistent connection pool: create 10 connections, reuse them. (c) Benefit: eliminates the per-request handshake. But PING latency is already 1ms, so this is likely not the issue here. (5) Optimization #3: pipelining (batch requests): (a) Instead of sequential GET requests (50ms each), send multiple GETs in one batch (pipeline or MGET). (b) Redis processes all in one write, returns results in one read. (c) Example: 100 sequential GETs = 5 sec. One MGET = ~100ms. 50x faster.
(d) Implementation:

```javascript
// Sequential: one round-trip per key
const values = [];
for (let i = 0; i < 100; i++) {
  values.push(await client.get(`key-${i}`)); // ~50ms each → ~5000ms total
}

// Pipelined: one round-trip for all keys (mGet in node-redis v4)
const keys = Array.from({ length: 100 }, (_, i) => `key-${i}`);
const batched = await client.mGet(keys); // one command, ~100ms total
```

(e) Result: mget latency ~100ms for 1000 items vs ~50,000ms sequential. (6) Optimization #4: compression (reduce payload): (a) If values are large (>5KB), compression saves transfer and parse time. (b) gzip typically cuts payload size ~80%, at a small CPU cost. (c) Benefit: if the payload shrinks 80%, transfer/parse time shrinks roughly in proportion. (d) Example: a 100KB payload → 20KB (gzip); transmission 100ms → 20ms. (e) Trade-off: ~5ms of compression CPU for ~80ms of savings. Net: ~75ms faster. (7) Recommended optimizations (in order): (a) First: pipelining (mget). Up to 50x throughput improvement. Cost: code change. (b) Second: binary serialization (MessagePack). ~1.5x latency improvement. Cost: a dependency. (c) Third: connection pooling (ensure persistent connections). 5-10% improvement. Cost: config. (d) Fourth: compression (if payloads >5KB). 1.5-2x improvement. Cost: CPU. (8) Monitoring: (a) Client-side latency from application logs/metrics. (b) Redis SLOWLOG: commands exceeding the configured threshold (slowlog-log-slower-than, e.g. 1000µs). Check for expensive operations (SCAN, KEYS on a large keyspace). (c) Expected: after mget + MessagePack, latency 20-30ms P99. (9) Cost-benefit: pipelining is ~1 day of dev for up to 50x improvement on batch paths (competitive advantage for real-time features). ROI: high.
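One practical wrinkle with MGET: a single huge batch can itself hold up Redis's single thread and inflate tail latency, so batches are usually bounded. A Python sketch; `FakeRedis` is a hypothetical stand-in — any async client exposing `mget(*keys)` (e.g. redis-py asyncio) fits:

```python
import asyncio

async def chunked_mget(redis, keys, chunk_size=500):
    """Fetch keys with MGET in bounded batches.

    One giant MGET blocks the server for its full duration; bounded
    chunks keep per-command latency predictable. chunk_size=500 is an
    assumption to tune against your payload sizes.
    """
    values = []
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        values.extend(await redis.mget(*chunk))
    return values

# Stand-in client for illustration (hypothetical):
class FakeRedis:
    def __init__(self, data):
        self.data = data
    async def mget(self, *keys):
        return [self.data.get(k) for k in keys]

async def main():
    client = FakeRedis({f"key-{i}": i for i in range(1000)})
    values = await chunked_mget(client, [f"key-{i}" for i in range(1000)])
    print(len(values))  # → 1000

asyncio.run(main())
```

The chunked helper preserves result order, so callers can zip results back against the requested keys exactly as with a single MGET.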
Follow-up: You implemented mget + MessagePack. Latency dropped to 30ms. But one service is experiencing timeout (>5 sec latency on single GET). Other services are fine. How do you isolate the slow client?
Your Memcached cluster is 40% utilized (plenty of free memory), but hit rate dropped from 80% to 60% this week. You didn't add new data. Cache should be stable. Check: is this a tuning issue, a client code bug, or something else?
Investigate Memcached cache hit rate drop with excess memory: (1) Hit rate drop (80% → 60%) with 40% free memory suggests a client-side issue, not cache sizing. (2) Potential causes: (a) Increased request volume to uncached keys. (b) TTL changes (keys expiring sooner). (c) Client hash ring rebalancing (consistent hashing bug). (d) Cache invalidation storm (mass delete). (e) Client code regression (not setting cache). (3) Diagnosis steps: (a) Memcached stats: `echo stats | nc <memcached-host> 11211` → compare get_hits vs get_misses (confirms the drop server-side) and cmd_get volume (rules out a traffic surge to new, uncached keys). (b) Check evictions and expired_unfetched: evictions should be ~0 at 40% utilization; a spike in expired items points at shortened TTLs. (c) `stats items` / `stats slabs`: item sizes shifting between slab classes can starve one class even with global free memory. (d) Audit deploys from the past week: diff cache TTLs and key-construction code (a changed key prefix silently misses every existing entry). (e) Verify every app node uses the identical server list and hashing config (mismatched rings send the same key to different nodes).
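The first diagnosis step can be scripted against the raw `stats` output. A minimal parser sketch, assuming the standard `STAT <name> <value>` line format Memcached returns:

```python
def memcached_hit_rate(stats_text: str) -> float:
    """Compute hit rate from `stats` output lines like `STAT get_hits 480000`."""
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0

sample = """STAT get_hits 600000
STAT get_misses 400000
STAT evictions 0
"""
print(f"hit rate: {memcached_hit_rate(sample):.0%}")  # → hit rate: 60%
```

Note these counters are cumulative since server start; diff two snapshots taken minutes apart to see the *current* hit rate rather than the lifetime average.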
Follow-up: You found the culprit: code deployed 1 week ago changed cache TTL from 3600s (1 hour) to 60s (1 minute). Rolled back. Hit rate restored to 80%. But customer reports they see stale data (yesterday's product prices). How do you prevent stale cache while keeping hit rate high?
Your application caches database query results in Redis with 1-hour TTL. A product manager changes a product's price. Application queries the DB immediately (sees new price), returns it to customer. But customer's next request hits cache (old price, hasn't expired yet). This is confusing (price jumped back and forth). How do you ensure consistency without breaking cache?
Cache invalidation + TTL balance for data consistency: (1) Problem: TTL-only invalidation (1-hour TTL) is lazy. Data changes in the DB but the cache isn't updated until the TTL expires. (2) Solution: event-driven invalidation (cache-aside pattern): (a) When the price updates: application updates the DB, then immediately deletes the cache key. (b) Next read: cache miss, re-query DB, cache repopulated with the new price. (c) Benefit: consistency achieved within ~10ms (cache delete latency). (d) Drawback: if the update succeeds but the cache delete fails, stale data persists. (3) Implementation (event-driven cache invalidation; `db`, `cache`, and the queue URL come from application context):

```python
def update_product_price(product_id, new_price):
    # 1. Update DB (transactional)
    db.execute('UPDATE products SET price=? WHERE id=?', (new_price, product_id))
    # 2. Delete cache (best-effort; a failure here leaves stale data until TTL)
    try:
        cache.delete(f'product:{product_id}')
    except Exception:
        logger.error('Cache delete failed for product %s', product_id)
    # 3. Publish event (async; triggers invalidation on other services)
    sqs.send_message(
        QueueUrl=PRODUCT_UPDATES_QUEUE_URL,
        MessageBody=json.dumps({'product_id': product_id, 'new_price': new_price}),
    )
```

(4) Event-driven propagation (for distributed systems): (a) Publish an event to SQS/SNS when a product updates. (b) Other services (caches, replicas) consume the event and invalidate their caches. (c) Benefit: invalidation propagates across all instances (not just the local cache). (d) Consistency: eventual (all caches invalidated within ~100ms). (5) Fallback strategy (if cache delete fails): (a) Implement cache versioning: the cache key includes a version. (b) When invalidating, increment the version. (c) Example: `product:123:v1` → `product:123:v2`. (d) The old version is never read again; the new version gets fresh data. (e) TTL cleanup: old version keys age out via TTL. (6) Hybrid TTL + invalidation: (a) Combine a short TTL (5 min) + event-driven invalidation. (b) If invalidation succeeds: old cache gone, new data served (consistent). (c) If invalidation fails: stale cache serves for at most 5 min (acceptable for most apps).
(d) Best of both: invalidation handles the normal case, TTL handles the edge cases. (7) Implementation (best practice; `ttl=` is generic — redis-py spells it `ex=`):

```python
CACHE_TTL = 300  # 5 minutes

def get_product(product_id):
    cache_key = f'product:{product_id}'
    # Try cache first; treat cache errors as a miss
    try:
        value = cache.get(cache_key)
        if value:
            return json.loads(value)
    except Exception:
        pass
    # Cache miss or error: query DB (parameterized, never f-string SQL)
    value = db.query('SELECT * FROM products WHERE id=?', (product_id,))
    cache.set(cache_key, json.dumps(value), ttl=CACHE_TTL)
    return value

def update_product(product_id, data):
    # Update DB (columns elided)
    db.execute('UPDATE products SET ... WHERE id=?', (product_id,))
    # Invalidate cache (best-effort; TTL caps staleness at 5 min)
    try:
        cache.delete(f'product:{product_id}')
    except Exception:
        pass
    # Publish event for other services
    sns.publish(TopicArn=PRODUCT_UPDATES_TOPIC_ARN,
                Message=json.dumps({'product_id': product_id}))
```

(8) Testing strategy: (a) Unit test: update product, verify the cache key is deleted. (b) Integration test: update product, query immediately, verify the latest price is returned. (c) Chaos test: simulate a cache delete failure, verify TTL saves the day (no permanent stale data). (9) Monitoring: (a) Track cache invalidation success rate. Alert if <99%. (b) Spot-check prices served from cache vs DB queries. Alert if deviation >5% over 1 hour. (c) Customer complaints: correlate with cache/invalidation failures. (10) Cost: event-driven invalidation adds operational complexity (~20% more code), but delivers consistency without long TTLs. Worth it for price/inventory data (consistency-sensitive).
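The key-versioning fallback from (5) can be sketched without a real Redis; a dict stands in for the cache here (in production the version pointer would have no TTL while the data keys do):

```python
class VersionedCache:
    """Cache-versioning sketch: invalidation = bump the version pointer.

    Old data keys become unreachable immediately and age out via TTL;
    no delete of the data key is needed, so a failed delete can't
    strand stale data.
    """
    def __init__(self):
        self.store = {}  # stand-in for Redis

    def _version(self, entity: str) -> int:
        return self.store.get(f'{entity}:version', 1)

    def get(self, entity: str):
        return self.store.get(f'{entity}:v{self._version(entity)}')

    def set(self, entity: str, value):
        self.store[f'{entity}:v{self._version(entity)}'] = value

    def invalidate(self, entity: str):
        self.store[f'{entity}:version'] = self._version(entity) + 1

cache = VersionedCache()
cache.set('product:123', {'price': 10})
cache.invalidate('product:123')   # bump pointer instead of deleting
print(cache.get('product:123'))   # → None: old version unreachable
cache.set('product:123', {'price': 12})
print(cache.get('product:123'))   # → {'price': 12}
```

The trade-off versus plain delete: every read costs an extra version lookup (or the version is embedded in data the client already has, e.g. a deploy hash).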
Follow-up: You implemented event-driven invalidation. Price update → cache delete succeeds. But between update and delete (100ms race condition), customer's request reads DB (new price) while other customer reads cache (old price). Inconsistent responses. How do you handle this micro-race condition?
Your Redis cluster has 3 nodes (1 primary, 2 replicas). The primary node dies. Failover takes 30 seconds. During failover, requests arriving at 1K/sec are buffered in the client connection pool. When failover completes, ~30K buffered requests flood the new primary at once. The new primary (promoted replica) is overloaded; latency spikes to 10 sec. How do you smooth failover traffic?
Graceful failover with traffic smoothing for Redis: (1) Problem: buffered requests on the client side flood the new primary during failover. (a) Primary down: clients buffer requests in the connection pool (~30 sec of buffering ≈ 30K requests at 1K/sec). (b) Failover completes: all buffered requests are released at once (thundering herd). (c) The new primary can't absorb 30K simultaneous requests → latency spikes → timeouts → errors. (2) Solution: client-side backpressure + gradual ramp: (a) Client detects connection loss (primary unavailable). (b) Instead of buffering requests, reject them gracefully: return "service unavailable" or queue with a bounded size. (c) After failover completes, gradually resume. (3) Implementation sketch (StackExchange.Redis; the exception type surfaced to callers is illustrative):

```csharp
var options = ConfigurationOptions.Parse(
    "redis.example.com,connectTimeout=5000,syncTimeout=5000");
// Space reconnect attempts out so recovering clients don't all
// reconnect in the same instant
options.ReconnectRetryPolicy = new LinearRetry(5000);

var conn = await ConnectionMultiplexer.ConnectAsync(options);
var db = conn.GetDatabase();

// Graceful degradation on failover: fail fast instead of buffering
try
{
    var value = await db.StringGetAsync(key);
}
catch (RedisConnectionException)
{
    // Primary down: don't queue the request, surface a retryable error
    throw new ServiceUnavailableException("Cache unavailable, please retry");
}
```

(4) Application-side backpressure: (a) When Redis is unavailable, fail fast (don't buffer). (b) Return 503 Service Unavailable to the API client. (c) The client (e.g., web browser) retries with exponential backoff. (d) Benefit: retries are spread over time instead of arriving as a thundering herd. (e) Example: retry after 1 sec, then 2 sec, then 4 sec; requests spread across the 30-sec failover window, not all at once. (5) Circuit breaker pattern (to avoid overload): (a) After detecting the primary is down, open the circuit.
(b) Circuit open: reject all cache requests immediately (fast fail). (c) Every 5 sec, attempt a health check: PING the primary. (d) When the health check succeeds, ramp up gradually: admit ~10% of requests, doubling on success until 100%. (e) Ramp-up duration: 10-30 sec (smooth recovery). A cleaned-up sketch:

```python
import random
import time

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    """closed → open on failure; open → half_open after a cooldown;
    half_open ramps admitted traffic back up as calls succeed."""

    def __init__(self, cooldown=5.0):
        self.state = 'closed'
        self.cooldown = cooldown
        self.opened_at = 0.0
        self.admit_ratio = 0.1  # start by admitting 10% of requests

    def execute(self, func, *args):
        if self.state == 'open':
            if time.time() - self.opened_at < self.cooldown:
                raise CircuitBreakerOpen()
            self.state = 'half_open'  # cooldown elapsed: start probing
            self.admit_ratio = 0.1
        if self.state == 'half_open' and random.random() > self.admit_ratio:
            raise CircuitBreakerOpen()  # shed load during ramp-up
        try:
            result = func(*args)
        except Exception:
            self.state = 'open'  # any failure re-opens the circuit
            self.opened_at = time.time()
            raise
        if self.state == 'half_open':
            self.admit_ratio = min(1.0, self.admit_ratio * 2)  # 10% → 20% → ... → 100%
            if self.admit_ratio >= 1.0:
                self.state = 'closed'
        return result
```

(6) Redis Cluster topology (automatic failover): (a) Use Redis Cluster (not just primary + replicas). (b) Redis Cluster does automatic failover + resharding. (c) Client libraries (redis-py, StackExchange.Redis) follow the topology change transparently. (d) Benefit: failover handling is largely built in, though application-level backpressure is still your responsibility. (7) Monitoring: (a) CloudWatch metric: failover events. (b) Alert if failover occurs >2x per month (indicates instability). (c) Track the request rejection rate during failover. Goal: <1% of requests rejected (the rest succeed or retry). (8) Testing: (a) Kill the primary Redis node. Observe: (i) how long until failover is detected (30 sec)? (ii) how many requests are rejected? (iii) the latency spike magnitude. (b) Load test during failover: simulate 1K req/sec, kill the primary, measure latency. (c) Goal: latency spikes <1 sec (acceptable) vs the current 10 sec (unacceptable).
(9) Cost: circuit breaker logic adds ~100 lines of code. Benefit: prevents cascading failures (failover incidents reduced 50%). ROI: high.
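The exponential-backoff retry from (4) is what spreads buffered demand across the failover window; "full jitter" (a random delay up to the exponential cap) spreads it further so clients never synchronize. A sketch; the base/cap values are illustrative:

```python
import random

def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 30.0, seed=None):
    """Exponential backoff with full jitter.

    Delay for attempt n is drawn uniformly from [0, min(cap, base * 2**n)],
    so retries from many clients land at different instants instead of
    hammering the freshly promoted primary together.
    """
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_schedule(5, seed=42)
print([round(d, 2) for d in delays])  # 5 jittered delays, each within its cap
```

The deterministic schedule (1s, 2s, 4s without jitter) still synchronizes clients that failed at the same moment, which is exactly the failover scenario; full jitter is the standard fix.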
Follow-up: Circuit breaker is working, traffic ramped smoothly during failover. But one service doesn't respect the circuit breaker (hardcoded no retries). Requests fail during failover. How do you enforce circuit breaker across all services?