Redis Interview Questions

Capacity Planning and Benchmarking


You baseline Redis performance with redis-benchmark: single connection, 100K SET commands, reports 50K QPS. Your production app runs at 100K QPS with 100 concurrent connections. Performance is fine today, but you're planning for 3x growth. Should you provision 3x Redis capacity, or is there more to consider?

redis-benchmark doesn't match production perfectly. Factors to consider: (1) concurrency: redis-benchmark -c 1 (single connection) shows single-client throughput, not peak capacity. A real app with 100 concurrent connections behaves differently because of: (a) command serialization (Redis executes commands on a single thread, so extra concurrency doesn't multiply throughput; it mostly adds queuing latency), (b) network overhead (more connections = more OS resource usage), (c) pipelining: redis-benchmark does not pipeline unless you pass -P <numreq>; if your benchmark pipelines and your app doesn't (or vice versa), the numbers diverge badly. (2) command mix: redis-benchmark tests a single command type (SET). A real app has a mix (SET 30%, GET 50%, INCR 20%, etc.), and different commands have different latencies: ZRANGE on a large sorted set is slower than GET. (3) key distribution: redis-benchmark uses a simple sequential key pattern unless you pass -r for randomization. A real app might have hot keys or a skewed distribution; hot keys concentrate load. (4) data size: redis-benchmark uses small values (default 3 bytes; override with -d). A real app may have 1KB-10MB values, and large values increase latency and bandwidth. (5) replication: redis-benchmark runs against a standalone target; production carries replication overhead. Capacity planning: (1) measure the production baseline: redis-cli INFO stats for actual QPS (instantaneous_ops_per_sec, or total_commands_processed / uptime). Monitor p50/p99 latency with redis-cli --latency-history, or enable the LATENCY subsystem (set latency-monitor-threshold, then LATENCY LATEST). (2) conduct a production-like benchmark: run redis-benchmark with the same concurrency (-c), command mix (-t), value size (-d), and key randomization (-r) as prod. (3) measure headroom: if current load is 100K QPS at p99 < 10ms and max capacity is ~200K QPS (before p99 > 50ms), you have 100K QPS of headroom. For 3x growth, 100K * 3 = 300K QPS > 200K max, so you need at least 2x your current maximum capacity. (4) capacity overhead: run Redis at no more than ~70% of measured capacity to leave headroom for spikes and failover; for 300K QPS of demand, provision for ~430K. (5) scale strategy: Cluster sharding (more nodes) or vertical scaling (larger instance). Cluster is better for very high throughput. For immediate next steps: (1) run redis-benchmark with -c 100 -n 1000000 -q (plus -d and -r matched to prod) to measure realistic throughput.
(2) measure latency: redis-cli --latency-history to see tail latency. (3) project 3x growth: if latency increases >2x at 3x load, you need more capacity.
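The headroom arithmetic above can be sketched as a quick calculation; the 100K/200K QPS figures and the 70% utilization ceiling are the scenario's assumptions, not universal constants:

```python
def plan_capacity(current_qps: float, max_qps: float,
                  growth_factor: float = 3.0,
                  target_utilization: float = 0.7) -> tuple[float, float]:
    """Project load growth and size the deployment so the projected load
    sits at or below the target utilization of provisioned capacity."""
    projected = current_qps * growth_factor   # e.g. 100K -> 300K QPS
    needed = projected / target_utilization   # headroom for spikes and failover
    return needed, needed / max_qps           # capacity to provision, multiple of today's max

needed, multiple = plan_capacity(100_000, 200_000)
print(f"provision for ~{needed:,.0f} QPS ({multiple:.1f}x current max capacity)")
# -> provision for ~428,571 QPS (2.1x current max capacity)
```

This reproduces the "~430K capacity" figure from the answer: 300K projected demand divided by the 0.7 utilization ceiling.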

Follow-up: If redis-benchmark shows 50K QPS but production measures only 30K QPS, what's the gap and how do you bridge it?

Your Redis cluster has 64GB RAM, 32 vCPU, currently using 50GB memory and 60% CPU. You're at "comfortable" capacity now. Projecting 1 year forward, you estimate memory demand will be 150GB (3x). You can't just add 100GB to existing node (too expensive). Should you add more nodes (Cluster sharding) or upgrade instance?

Two options, each with trade-offs: (1) vertical scaling (upgrade instance): 64GB -> 256GB node. Single larger node. Pros: simpler (no sharding logic), no resharding downtime, familiar ops. Cons: cost can rise faster than capacity (large instances are expensive per-GB), single point of failure (node failure = total outage until failover), and a hard ceiling at one machine (typically well under 1TB of practical Redis memory). (2) horizontal scaling (Cluster with sharding): add nodes, e.g., four 64GB nodes = 256GB total. Pros: better availability (one node failure affects 1/4 of the keyspace, not 100%), throughput scales (nodes process in parallel), more resilient. Cons: complexity (resharding, rebalancing), client code must be Cluster-aware, operational overhead (manage 4 nodes vs 1). Recommendation depends on: (1) if memory demand is >200GB and still growing: Cluster is better. (2) if it will plateau comfortably below one large node's RAM: vertical scaling is simpler. (3) if CPU throughput is the bottleneck: Cluster helps (parallel processing); if not, vertical scaling is sufficient. (4) if you need HA (high availability): Cluster is better. For your scenario: (1) measure the growth rate: 50GB now, projected 150GB in 1 year = ~100GB/year; check whether growth is linear or accelerating. (2) plan for headroom: assume ~200GB in 1.5 years. (3) decide: 200GB of data on a 256GB node is ~78% utilization, above the comfortable ~70% line and leaving thin headroom for copy-on-write during BGSAVE forks, so you'd want either the next instance size up or Cluster; if the projection exceeds any single available instance, Cluster is the only option. (4) consider cost: get quotes for (a) one 256GB node and (b) four 64GB nodes, and compare total cost of ownership (TCO) over 3 years (instance cost + ops cost). Prevention: (1) implement data archival: delete old data (>1 year) and set TTLs on keys to slow growth. (2) compress values to reduce memory footprint. (3) shard at the application layer: even without Redis Cluster, the app can shard data (key-space sharding); each shard is a smaller instance, reducing per-instance memory. Implementation for your decision: (1) benchmark: run redis-benchmark against a 256GB node (if available in staging) and against a 4-node Cluster; compare latency/throughput.
(2) pilot: if considering Cluster, test resharding in staging; measure the time and impact. (3) plan: document the chosen approach, timeline, and testing procedure. (4) execute: migrate during a maintenance window and verify consistency after migration.
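The TCO comparison in step (4) can be sketched as below; the instance and ops prices are placeholders for illustration, not real quotes:

```python
def tco_3yr(monthly_instance_cost: float, node_count: int,
            monthly_ops_cost_per_node: float) -> float:
    """3-year total cost of ownership: instance rental plus operational overhead."""
    return 36 * node_count * (monthly_instance_cost + monthly_ops_cost_per_node)

# Placeholder prices -- substitute real quotes from your provider/region.
vertical = tco_3yr(2_000, node_count=1, monthly_ops_cost_per_node=200)   # one 256GB node
horizontal = tco_3yr(450, node_count=4, monthly_ops_cost_per_node=200)   # four 64GB nodes
print(f"vertical: ${vertical:,.0f}  horizontal: ${horizontal:,.0f}")
```

Note that per-node ops cost hits the 4-node option four times over, which is why TCO (not just instance list price) should drive the decision.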

Follow-up: If you choose Cluster sharding but data distribution is uneven (some shards 3x larger than others), what's the remediation?

Your Redis instance is configured for 100K connections max (the maxclients directive). You're seeing 95K connections currently. You project 2x growth in clients (190K connections). Supporting 200K connections is expensive: each connection consumes memory and a file descriptor, so you'd need a larger instance. Besides increasing maxclients, what's your capacity planning strategy?

Connections cost memory and file descriptors, so they scale with instance size; 190K raw connections could mean a 3-4x larger instance. Better approach: (1) connection pooling on the client side: instead of each user/request opening a connection, use a connection pool (10 connections can serve 1,000 request handlers, since each command holds a connection only for its round-trip). This can cut the connection count ~100x. (2) reduce idle timeout: if clients hold idle connections, disconnect them after 60 seconds: CONFIG SET timeout 60. Idle clients reconnect when needed. (3) use a proxy/multiplexer: deploy a Redis proxy (e.g., twemproxy, or Envoy's Redis filter) that multiplexes many client connections onto a few Redis connections. (4) pipelining: clients batch commands (10 commands per round-trip instead of 10 round-trips), so each connection does more work. (5) migrate to Cluster: connections distribute across nodes; 190K / 4 nodes = 47.5K per node, within the 100K limit. (6) graceful connection shedding: as connections approach the limit, identify idle clients (CLIENT LIST reports per-connection idle time) and close them (CLIENT KILL), with clients backing off before reconnecting. Implementation: (1) client-side pool: with redis-py, create a redis.ConnectionPool(max_connections=10) per app instance; a small pool serves many concurrent request handlers because each connection is held only for the duration of a command. (2) configure timeout: CONFIG SET timeout 60 (disconnect idle connections after 60 seconds). (3) test: simulate 190K clients; measure peak connections. (4) if peak < 100K (with pooling/timeout): the current instance is sufficient. (5) if peak >= 100K: deploy Cluster or a proxy. For capacity planning: (1) measure current connections: INFO clients > connected_clients; monitor the p99 over a month. (2) project growth: 95K now and 2x growth = 190K raw, but pooling could reduce that to ~19K, so the current instance suffices. (3) implement pooling ASAP: a ~100x reduction is the biggest lever. (4) load test with 190K simulated clients and measure peak connections. (5) if still > 100K, add a second Redis instance (Cluster) or a proxy; two instances cost far less than one sized for every raw connection.
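The pooling math behind step (2) of the capacity plan: server-side connection count becomes a function of app-instance count and pool size, not user count. The fleet size below is a hypothetical chosen to match the scenario's 190K/19K figures:

```python
def server_side_connections(app_instances: int, pool_size: int) -> int:
    """Server connections when each app instance keeps a fixed-size pool,
    rather than opening one connection per user or request."""
    return app_instances * pool_size

# Hypothetical fleet: 1,900 app instances that would otherwise open ~100
# raw connections each (190K total). With a pool of 10 per instance:
print(server_side_connections(1_900, 10))  # -> 19000, well under maxclients=100000
```

The same function also gives the failure mode to watch for: scaling the app fleet 10x scales pooled connections 10x, so revisit pool size whenever the fleet grows.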

Follow-up: If you use connection pooling but need per-connection state (e.g., WATCH for transactions), how does pooling affect this?

You want to estimate how many Redis nodes you need for a cache of 1B items (average 1KB each = 1TB). Your target latency is p99 < 10ms. Single Redis node: 256GB, 100K QPS. What's the minimum cluster size needed?

Capacity planning: (1) memory: 1TB / 256GB per node = 4 primaries minimum; with one replica each, 8 nodes. (2) QPS: if the cache handles 1M QPS peak and a single node does 100K QPS, you need 10 primaries for throughput. (3) latency: p99 < 10ms is achievable with proper sharding; a GET is an O(1) in-memory hash lookup taking tens of microseconds, so network RTT (~0.5-1ms in-region) and queuing dominate. (4) redundancy: for HA, add replica nodes: 10 primary + 10 replica = 20 nodes. Final capacity estimation: (1) memory bound: 1TB with replication = 2TB total; at 256GB per node that's 8 nodes (4 primary + 4 replica). (2) throughput bound: 1M QPS at 100K per node = 10 primaries; with replicas, 20 nodes. (3) take the max: 20 nodes (throughput-limited). This also helps memory: 1TB across 10 primaries is ~100GB per 256GB node (~40% utilization), leaving headroom for BGSAVE copy-on-write and fragmentation. (4) latency check: server-side execution is sub-millisecond; with ~1-2ms of network RTT and queuing, p99 < 10ms is achievable at 100K QPS per node. Verification: (1) run redis-benchmark against a similar cluster setup in staging; measure p99 latency. (2) if p99 > 10ms, add nodes or optimize queries. (3) test failover: simulate node failure, measure the impact on p99 latency; it should spike temporarily and then recover. For implementation: (1) provision a 20-node Cluster (10 primary + 10 replica). (2) sharding: Redis Cluster distributes keys across hash slots automatically (CRC16 of the key mod 16384); use hash tags like {user123} only where related keys must land on the same shard. (3) load test: inject 1B items, run 1M QPS load, measure p99 latency, and tune if needed. (4) monitor: alert if p99 > 15ms (buffer over the 10ms target); if triggered, investigate CPU usage, network, or slow commands, or add nodes.
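The sizing logic above, where primary count is set by the tighter of the memory and throughput bounds and then multiplied for replicas, can be sketched as:

```python
import math

def min_cluster_nodes(dataset_bytes: int, node_ram_bytes: int,
                      peak_qps: int, qps_per_node: int,
                      replicas_per_primary: int = 1) -> int:
    """Primary count is the max of the memory-bound and throughput-bound
    estimates; replicas multiply the total node count."""
    mem_bound = math.ceil(dataset_bytes / node_ram_bytes)   # 1TB / 256GB = 4
    qps_bound = math.ceil(peak_qps / qps_per_node)          # 1M / 100K = 10
    return max(mem_bound, qps_bound) * (1 + replicas_per_primary)

TB, GB = 1024**4, 1024**3
print(min_cluster_nodes(1 * TB, 256 * GB, 1_000_000, 100_000))  # -> 20
```

A refinement worth adding in practice: divide node_ram_bytes by a target utilization (e.g., 0.7) before the ceiling, so the memory bound already includes snapshot and fragmentation headroom.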

Follow-up: If latency p99 is 15ms (exceeds 10ms target), what's the first optimization to try before adding nodes?

You're benchmarking Redis for a cost estimate. On a t3.xlarge AWS instance (4 vCPU, 16GB RAM), redis-benchmark shows 100K QPS. You extrapolate: 1M QPS = 10 instances = 10 * $X/month = $10X/month cost. But you're not accounting for replication, failover, network. What's missing from your capacity estimate?

Incomplete cost estimate. Missing factors: (1) replication overhead: each primary needs a replica; 10 primary + 10 replica = 20 instances (2x compute cost). (2) network: inter-node and cross-AZ traffic. AWS charges roughly $0.01/GB in each direction for cross-AZ transfer. At 1M QPS with 1KB values, client traffic alone is ~1GB/sec; with replication streams and clients spread across AZs, perhaps ~1.5GB/sec crosses AZ boundaries, which is ~3.9M GB/month ≈ $39K/month. This can dominate the bill. (3) storage (backups): RDB snapshots for each instance, 16GB * 20 = 320GB in S3 at $0.023/GB ≈ $7/month: negligible, though retention of multiple generations adds up. (4) management overhead: Cluster deployment and monitoring; operational cost (DBA time, automation). (5) failover traffic: during failover, clients reconnect; the temporary latency spike has a user-impact cost (retries, fallback to the database). Revised estimate (us-east-1 on-demand rates): (1) instances: 20 t3.xlarge at $0.1664/hour ≈ $121/month each ≈ $2.4K/month total. (2) network: ~$39K/month cross-AZ as above. (3) storage: negligible. Total: roughly $40-45K/month, dominated by network rather than compute, which the naive per-instance extrapolation missed entirely. For cost optimization: (1) keep replicas in the same AZ if availability requirements allow: cutting cross-AZ traffic ~90% drops network cost to ~$4K/month and the total under $10K/month, at the cost of AZ-failure resilience. (2) right-size instances: if a t3.medium (2 vCPU, ~$30/month) sustains 50K QPS, then 40 of them (20 primary + 20 replica) ≈ $1.2K/month of compute vs $2.4K, though more nodes means more replication streams; verify with redis-benchmark first. (3) use managed Redis (AWS ElastiCache): higher per-node list price, but it includes replication, automatic failover, and patching; compare TCO, not list price. (4) benchmark more carefully: measure the actual peak QPS rather than assuming 1M; if the real peak is 500K, halve the fleet. Implementation: (1) build a detailed cost model: instance type, count, network, storage, management. (2) benchmark: measure real peak QPS in staging with load-testing tools (Locust, k6).
(3) optimize: managed services, AZ placement, instance sizing. (4) validate: compare the estimate against the actual AWS bill monthly; alert if they diverge >10%.
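The revised numbers can be captured in a small cost model. The rates are ballpark us-east-1 on-demand figures, and the 1.5GB/sec cross-AZ estimate is an assumption; substitute measured values from your own bill:

```python
def monthly_cost(instances: int, hourly_rate: float,
                 cross_az_gb_per_sec: float,
                 cross_az_rate_per_gb: float = 0.01,
                 backup_gb: float = 0.0,
                 s3_rate_per_gb: float = 0.023) -> float:
    """Rough monthly cost = compute + cross-AZ transfer + snapshot storage."""
    hours = 730                                                  # average hours per month
    compute = instances * hourly_rate * hours
    network = cross_az_gb_per_sec * 3600 * hours * cross_az_rate_per_gb
    storage = backup_gb * s3_rate_per_gb
    return compute + network + storage

# 20 t3.xlarge nodes, ~1.5GB/sec crossing AZs, 320GB of RDB snapshots in S3.
total = monthly_cost(20, 0.1664, 1.5, backup_gb=320)
print(f"${total:,.0f}/month")  # network dominates compute by an order of magnitude
```

Re-running the model with cross_az_gb_per_sec=0.15 (single-AZ replication, ~90% less cross-AZ traffic) shows how much of the bill is placement rather than hardware.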

Follow-up: If your cost estimate comes in well over a $20K/month budget, what architectural changes would you consider?
