Redis Interview Questions

Latency Debugging and SLOWLOG


SLOWLOG GET 10 shows 5 GET commands taking 100ms each during peak traffic. You check the data: each key is 10MB. No Lua scripts, no complex operations. GET is simple, but slow because of the value size. How do you optimize?

GET on a 10MB value spends time in two places: copying the value out of memory into the client output buffer, and transmitting it over the network. Network usually dominates: 10MB is 80Mbit, so at 1Gbps the transfer alone takes ~80ms; add 10-20ms of Redis-side copying and serialization and ~100ms total is expected. Note that SLOWLOG records command execution time only (it excludes network I/O), so client-observed latency can be even higher than the logged duration.

Optimizations:
(1) Compression: store the value compressed and decompress on the client after GET (Redis does not decompress for you). 10MB might compress to 2MB (5x), cutting network time to ~20ms plus client-side decompression.
(2) Split large values: instead of one 10MB key, store 100 keys of 100KB each. The client can fetch parts in parallel, or fetch only the parts it needs.
(3) Client-side caching: if the 10MB value rarely changes, cache it locally (in-process, memcached, browser cache) and keep a TTL on the Redis copy.
(4) Confirm GET is really the bottleneck: run LATENCY DOCTOR for Redis's own diagnosis, and redis-cli --latency-history to measure actual RTT. If RTT is high even for tiny commands, the problem is the network, not Redis. Fix by (a) upgrading bandwidth, (b) reducing value size via compression, (c) reducing request volume (client-side batching, caching).

For debugging:
(1) SLOWLOG GET 10 shows duration but no breakdown; measure client-side as well (redis-cli MONITOR shows the exact commands and their timing).
(2) Check INFO stats (instantaneous_input_kbps / instantaneous_output_kbps) to see whether network I/O is saturated.
(3) Profile with: time redis-cli GET mykey > /dev/null. If wall-clock time is far above the logged SLOWLOG duration, network transfer is the cost.
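To estimate whether compression pays off before touching production, a self-contained sketch (the payload below is synthetic and repetitive, so it compresses far better than most real data; treat the ratio as an upper bound):

```python
import zlib

# Hypothetical 10MB payload. Repetitive JSON-like text compresses extremely
# well; real ratios depend entirely on the data.
payload = (b'{"user_id": 12345, "status": "active"}' * 280000)[:10 * 1024 * 1024]

compressed = zlib.compress(payload, level=6)
ratio = len(payload) / len(compressed)

# Transfer time at 1 Gbps, in milliseconds: bytes * 8 bits / 1e9 bps * 1000
before_ms = len(payload) * 8 / 1e9 * 1000
after_ms = len(compressed) * 8 / 1e9 * 1000
print(f"ratio {ratio:.1f}x, transfer {before_ms:.0f}ms -> {after_ms:.0f}ms")
```

Run this against a sample of real values: if the measured ratio times the network savings exceeds the CPU cost of decompression on the client, compression is worth it.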

Follow-up: If compressing values adds CPU overhead on GET, how would you measure if it's worth the network savings?

SLOWLOG GET shows LRANGE commands taking 200-500ms. Each command returns ~100K elements. Redis CPU is fine. Network is fine. The latency is pure processing time inside LRANGE. Is there a way to speed up LRANGE or should you change data structure?

LRANGE over 100K elements is O(N): Redis must walk the list and serialize every element to the client, so 200-500ms is expected.

Bottleneck:
(1) iteration: O(N) with N=100K, ~2-5µs per element = 200-500ms.
(2) serialization: encoding 100K replies in the Redis protocol adds overhead.
(3) LRANGE itself cannot be made faster; it is fundamentally O(N). The options are to fetch less or restructure.

Alternatives:
(1) fetch less: LPOS finds an element by value and LINDEX fetches by position, both far cheaper than returning the whole list.
(2) split the list: store list:1, list:2, ..., list:10 with 10K elements each; the client fetches only the chunks it needs.
(3) Sorted Set: ZRANGE / ZRANGEBYSCORE is O(log N + M), so range queries returning a small M are much cheaper than scanning a list.
(4) Streams (Redis 5.0+): XRANGE seeks by entry ID and pays mainly for the M entries returned, a good fit for time- or ID-ordered data.
(5) pagination: instead of LRANGE mylist 0 -1 (everything), fetch LRANGE mylist 0 999, then 1000 1999, and so on. Note LRANGE is O(S+M) where S is the start offset, so very deep pages get slower on a long list.

For your scenario:
(1) identify the actual use case: if you truly need all 100K elements every time, the total cost is unavoidable; paginate so no single call blocks for 500ms.
(2) if you only need a subset, use LPOS/LINDEX or a range-query structure.
(3) benchmark alternatives end-to-end, including serialization: LRANGE as baseline, ZRANGE if score-based, XRANGE/XREAD if event-based.
(4) monitor: SLOWLOG GET 100 filtered for LRANGE. If it fires more than ~1/sec, restructure; at <1/min an occasional ~200ms may be acceptable.
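The pagination approach can be sketched as follows; the lrange stand-in below simulates Redis LRANGE semantics (inclusive indices) so the example runs without a server. With redis-py you would pass `lambda s, e: r.lrange("mylist", s, e)` instead:

```python
def paginate(lrange, page_size=1000):
    """Yield pages from an LRANGE-style callable.

    `lrange(start, stop)` mimics Redis LRANGE: inclusive indices,
    returning an empty list once start is past the end of the list.
    """
    start = 0
    while True:
        page = lrange(start, start + page_size - 1)
        if not page:
            return
        yield page
        start += page_size

# Stand-in for a Redis list so the sketch is self-contained.
data = list(range(2500))
fake_lrange = lambda s, e: data[s:e + 1]

pages = list(paginate(fake_lrange, page_size=1000))
print([len(p) for p in pages])  # [1000, 1000, 500]
```

Each call stays small, so no single LRANGE blocks the event loop for hundreds of milliseconds, at the cost of more round trips.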

Follow-up: If you switch to pagination (LRANGE in smaller batches), how would you handle concurrent modifications to the list?

Your SLOWLOG shows occasional EVAL commands taking 5+ seconds. The script is deterministic and has no infinite loops, but during occasional server-side pauses (you suspect garbage collection) the script exceeds lua-time-limit and you end up killing it, which breaks client logic. How do you design scripts to survive such pauses?

First, two corrections to the premise: Redis itself is written in C and has no garbage collector, and lua-time-limit (default 5000ms; check with CONFIG GET lua-time-limit) does not kill scripts automatically. Once the limit is exceeded, Redis starts answering other clients with BUSY errors and accepts SCRIPT KILL, which only succeeds if the script has not yet performed a write (otherwise the only escape is SHUTDOWN NOSAVE). Pauses that push an otherwise-fast script past the limit typically come from fork() for BGSAVE or AOF rewrite, transparent huge pages, swapping, or the Lua interpreter's own incremental GC.

Mitigations:
(1) raise lua-time-limit if scripts are legitimately long: CONFIG SET lua-time-limit 10000 (10 seconds). Trade-off: a genuinely stuck script now blocks everyone longer before you can intervene.
(2) break long scripts into smaller chunks: several short scripts instead of one complex one, each finishing well under the limit.
(3) handle BUSY/kill errors on the client: catch the error and retry with exponential backoff.
(4) optimize the script: remove unnecessary redis.call() inside loops; batch with MGET/MSET.
(5) remove the pause source: disable transparent huge pages, avoid swap, and watch fork cost.

To diagnose:
(1) enable latency monitoring (CONFIG SET latency-monitor-threshold 100) and use LATENCY HISTORY / LATENCY DOCTOR; pause events that coincide with script timeouts identify the culprit.
(2) check latest_fork_usec in INFO stats; a fork costing hundreds of milliseconds or more lines up with BGSAVE/AOF-rewrite stalls.
(3) use redis-cli MONITOR to see script timing relative to other commands.

Prevention:
(1) log and alert on BUSY/script-kill errors; more than ~1 per minute is a red flag.
(2) load-test scripts under realistic concurrency and measure how often they approach the limit.
(3) keep each script's work small enough that even a worst-case pause leaves headroom under the limit.
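A client-side retry-with-backoff wrapper might look like this (a sketch: the stand-in function raises plain RuntimeError, where real code would catch redis-py's ResponseError carrying a BUSY message):

```python
import time
import random

def call_with_backoff(fn, retries=5, base_delay=0.05):
    """Retry fn() on BUSY-style errors with exponential backoff plus jitter.

    Assumption for this self-contained sketch: errors surface as RuntimeError.
    With redis-py you would catch redis.exceptions.ResponseError and check
    for the "BUSY" prefix instead.
    """
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulate a script call that hits a busy server twice, then succeeds.
calls = {"n": 0}
def flaky_eval():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("BUSY Redis is busy running a script")
    return "OK"

print(call_with_backoff(flaky_eval))  # prints OK
```

The jitter spreads retries from many clients so they do not all hammer the server the instant the script finishes.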

Follow-up: If you can't optimize the script and can't increase lua-time-limit, what architectural change would you make?

SLOWLOG shows KEYS commands occasionally taking 2-10 seconds. KEYS pattern matches ~100K keys. The latency is due to pattern matching on all keys in the database. You're using KEYS for monitoring (not in production app code). But the monitoring script runs every 10 seconds, blocking Redis during pattern matching. How do you monitor without blocking?

KEYS is O(N) where N = total keys in the database: it scans every key and tests the pattern, and on 100K keys that can take seconds, all of it blocking Redis's single-threaded event loop.

Better approaches:
(1) use SCAN instead: cursor-based, incremental iteration. SCAN cursor MATCH pattern COUNT 1000 examines roughly COUNT keys per call (COUNT is a hint for work per call, not the number of keys returned) and yields back to the event loop between calls, so other clients never stall for long.
(2) external tooling: redis-cli --scan --pattern "user:*" drives SCAN under the hood and paginates for you.
(3) use INFO for aggregates: DBSIZE gives the total key count and INFO keyspace shows keys per database, often all a monitoring job needs without scanning anything.

Implementation:
(1) switch the monitoring script from KEYS "pattern" to a SCAN loop (or redis-cli --scan).
(2) reduce monitoring frequency: every 60 seconds instead of every 10 if the data allows.
(3) run monitoring out of the critical path: a separate process that SCANs and pushes results to an external system (Prometheus, Grafana).

For your scenario:
(1) measure the stall: run a client load test and compare client latency while KEYS runs vs while the SCAN loop runs.
(2) set expectations: covering 100K keys with COUNT 1000 takes on the order of 100 SCAN calls, each a few milliseconds. Total wall time is similar to KEYS, but spread out, so clients see negligible impact instead of 2-10 second stalls.
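The SCAN loop can be sketched like this; fake_scan stands in for a server so the example runs standalone (note real SCAN cursors are opaque tokens, not sequential offsets as in the stub, but the termination contract, cursor 0 means done, is the same):

```python
def scan_keys(scan, match=None, count=1000):
    """Collect all matching keys via a SCAN-style callable.

    `scan(cursor, match, count)` mimics redis-py's r.scan(): it returns
    (next_cursor, keys), and next_cursor == 0 signals end of iteration.
    """
    cursor, keys = 0, []
    while True:
        cursor, batch = scan(cursor, match, count)
        keys.extend(batch)
        if cursor == 0:
            return keys

# Stub keyspace so the sketch is self-contained. With redis-py:
#   scan = lambda c, m, n: r.scan(cursor=c, match=m, count=n)
keyspace = [f"user:{i}" for i in range(2500)]
def fake_scan(cursor, match, count):
    batch = keyspace[cursor:cursor + count]
    nxt = cursor + count
    return (0 if nxt >= len(keyspace) else nxt), batch

print(len(scan_keys(fake_scan)))  # 2500
```

SCAN guarantees every key present for the whole iteration is returned at least once, but keys added or removed mid-scan may or may not appear, and duplicates are possible; monitoring code should tolerate both.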

Follow-up: If you're using SCAN to monitor key patterns, how would you detect if keys are being deleted faster than you can scan?

SLOWLOG reveals that during replication, SYNC commands (which initiate a full resync) take 30-60 seconds, and other clients see stalls of up to 30 seconds while this happens. Is this normal, and can you reduce SYNC latency?

A full resync is expensive: the master runs BGSAVE (fork plus RDB snapshot), streams the snapshot to the replica, then sends the writes buffered in the meantime. But note what actually blocks: only the fork() stalls the master (copying page tables, roughly proportional to dataset size), while snapshotting and transfer run in the child process and in the background. If clients see multi-second stalls throughout the 30-60 seconds, look for disk or network saturation, swapping, or very large fork times (latest_fork_usec in INFO stats).

Latency components:
(1) fork time: grows with dataset size.
(2) RDB generation: larger dataset = longer BGSAVE.
(3) network transfer: RDB size / bandwidth; a 10GB RDB at 1Gbps is ~80 seconds on its own.

To optimize:
(1) avoid full resyncs in the first place: since Redis 2.8, PSYNC lets a reconnecting replica partially resync from the replication backlog. Increase repl-backlog-size so brief disconnects don't escalate to a full SYNC.
(2) diskless replication: CONFIG SET repl-diskless-sync yes streams the RDB directly to the replica, skipping disk I/O.
(3) batch replicas: repl-diskless-sync-delay 5 waits 5 seconds so replicas that connect together share a single stream.
(4) shrink the RDB: compress values, delete dead keys, or shard the data so each node's RDB is smaller.
(5) add replicas gradually: each new replica can trigger a BGSAVE; add one at a time rather than 10 at once.
(6) use Cluster: resharding moves slots incrementally instead of transferring the full dataset.

Prevention:
(1) track BGSAVE duration via INFO persistence (rdb_last_bgsave_time_sec); if it keeps growing, plan capacity before SYNC latency becomes a problem.
(2) monitor replication lag: INFO replication shows master_repl_offset and each replica's offset; a growing delta means the replica is falling behind.
(3) rehearse full syncs in staging against a production-sized dataset and measure end-to-end.
For production: run planned syncs in low-traffic windows, enable diskless replication, and alert whenever a full resync happens at all, since with a healthy backlog they should be rare.
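The relevant settings as a redis.conf sketch (the values are illustrative; size the backlog to cover your write rate multiplied by the longest disconnect you want to survive):

```
# Stream the RDB directly to replicas instead of writing it to disk first
repl-diskless-sync yes
# Wait 5s so replicas that connect close together share one RDB stream
repl-diskless-sync-delay 5
# Large backlog so short disconnects resolve as partial resyncs (PSYNC),
# never escalating to a full SYNC
repl-backlog-size 256mb
```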

Follow-up: If diskless replication causes replica to OOM during SYNC, how would you recover?

Your production Redis sees p99 latency of 200ms, while p50 is <1ms. SLOWLOG shows only a few commands >100ms, but monitoring shows consistent p99 spikes. This suggests tail latency is caused by many small operations, not a few slow commands. Where's the hidden latency?

Tail latency (p99) far above the median points at many small delays, not a few slow commands:
(1) traffic bursts: small commands queue behind each other on the single event loop; each burst inflates the tail.
(2) background pauses: fork() for BGSAVE/AOF rewrite, AOF fsync stalls, transparent huge pages, or swapping briefly pause Redis between commands (Redis is written in C and has no garbage collector, so look for these rather than "GC"). Each pause is below the SLOWLOG threshold, but p99 captures them.
(3) network congestion: not Redis's fault, but the client experiences it as latency.
(4) OS scheduling: the Redis process preempted by neighbors (noisy VMs, CPU steal).
(5) SLOWLOG blind spot: the default threshold is 10,000µs (10ms), so commands under 10ms are never logged.

To diagnose:
(1) redis-cli --latency over time to see when spikes occur client-side.
(2) LATENCY DOCTOR for Redis's own diagnostic report.
(3) redis-cli --stat to watch QPS and check whether spikes align with load.
(4) compare redis-benchmark -c 100 (concurrent clients) against -c 1 (single client); if p99 is much worse with concurrency, it's queueing.
(5) lower the SLOWLOG threshold temporarily: CONFIG SET slowlog-log-slower-than 1000 (1ms), and raise slowlog-max-len so entries aren't rotated out before you read them.

Prevention:
(1) connection pooling on the client to reuse connections.
(2) pipelining to amortize RTT across many commands.
(3) cap concurrent clients and per-connection QPS to limit queueing.
(4) scale horizontally: more instances, less load each.

Implementation: record p99 now, apply one fix at a time, and remeasure; the expected improvement varies with the root cause. Alert when p99 exceeds your threshold (e.g., 50ms) and investigate on trigger.
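A back-of-the-envelope model of what pipelining buys (the 50µs per-command server cost is an assumption for illustration, not a measured figure):

```python
def total_latency_ms(n_commands, rtt_ms, server_us=50, pipeline=1):
    """Rough model of client-observed time for n commands.

    Without pipelining, every command pays a full round trip; with batches
    of `pipeline` commands, one round trip is amortized across the batch.
    """
    round_trips = -(-n_commands // pipeline)  # ceiling division
    return round_trips * rtt_ms + n_commands * server_us / 1000

# 1000 small commands over a 1ms-RTT link:
print(total_latency_ms(1000, rtt_ms=1.0))                # one RTT each: 1050.0
print(total_latency_ms(1000, rtt_ms=1.0, pipeline=100))  # batched: 60.0
```

The model shows why RTT, not server time, dominates small-command workloads: batching 100 commands per pipeline cuts the modeled total from ~1s to ~60ms.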

Follow-up: If you lower SLOWLOG threshold to 1ms, does it impact Redis performance due to extra logging overhead?

SLOWLOG repeatedly shows a specific key operation taking much longer than others (e.g., HGETALL user:big-user takes 500ms while HGETALL user:normal takes 5ms). This suggests the key itself is large or has specific properties. How do you analyze why one key is slow?

Key-specific slowness usually comes down to size or shape:
(1) value size: large values take longer to serialize. Check with MEMORY USAGE key.
(2) cardinality: if one hash has 1M fields and another 10, HGETALL does ~100,000x the work; a mere 100x latency gap would actually be modest.
(3) encoding: OBJECT ENCODING key. A small hash in ziplist/listpack encoding behaves differently from a large hashtable-encoded one.
(4) fragmentation: very large values in fragmented memory are slower to copy.

To diagnose:
(1) MEMORY USAGE user:big-user vs user:normal; if big-user is 100x larger, a proportional latency gap is expected.
(2) HLEN user:big-user vs the normal user to compare field counts.
(3) OBJECT ENCODING on both keys.
(4) SLOWLOG RESET, rerun the operation, and read the fresh entry.
(5) time a single call from an otherwise idle client (time redis-cli HGETALL user:big-user > /dev/null) to isolate the key's cost from concurrent load.

Prevention:
(1) data design: don't store 1M fields in a single hash; shard into hash:1, hash:2, ....
(2) enforce a field-count budget per hash (e.g., a few hundred fields).
(3) read incrementally: HSCAN iterates fields in chunks, and HMGET fetches only the fields you need, instead of materializing the whole hash with HGETALL.
(4) monitor: alert if HGETALL exceeds 100ms on any key; that usually signals a data-design issue.

Implementation:
(1) audit large keys: redis-cli --bigkeys (and --memkeys on newer versions) finds oversized keys.
(2) quantify each: HLEN, LLEN, SCARD, MEMORY USAGE.
(3) refactor: split large keys into shards; the client hashes the field to pick a shard and merges results.
(4) test: benchmark before/after the refactor; splitting a genuinely huge key commonly improves its tail latency by 10-100x.
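The shard-splitting idea can be sketched as a key-routing helper (names like shard_key and the choice of MD5 are illustrative; any stable hash works):

```python
import hashlib

def shard_key(base, field, shards=16):
    """Route a hash field to one of `shards` smaller Redis hashes.

    Deterministic: the same field always maps to the same shard, so reads
    know exactly which hash to query, and no single hash grows unbounded.
    """
    h = int(hashlib.md5(field.encode()).hexdigest(), 16)
    return f"{base}:{h % shards}"

# All operations on user:42's fields go through the sharded key, e.g.:
#   r.hset(shard_key("user:42", "last_login"), "last_login", ts)
#   r.hget(shard_key("user:42", "last_login"), "last_login")
print(shard_key("user:42", "last_login", shards=16))
```

Commands that need all fields (like the old HGETALL) now fan out over the 16 shard keys and merge, but each individual call stays small and fast.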

Follow-up: If you can't refactor a large key (due to API constraints), how would you optimize latency without changing data structure?
