Redis Interview Questions

Lua Scripting and Atomicity

You're implementing atomic bank transfer: debit account A, credit account B. You run two separate Redis commands: DECRBY account:A 100, then INCRBY account:B 100. Between the two commands, a network error occurs and the second command never reaches Redis. Account A is debited but B isn't credited—money vanishes! How does Lua scripting prevent this?

Lua scripts execute atomically: Redis runs the whole script on its single command thread, and no other client's commands can interleave. From the server's perspective execution is all-or-nothing: if the network dies client-side mid-execution (before the response arrives), the script still runs to completion on the Redis side. Use EVAL: EVAL 'redis.call("DECRBY", KEYS[1], ARGV[1]); redis.call("INCRBY", KEYS[2], ARGV[1]); return 1' 2 account:A account:B 100. To verify: run the EVAL from redis-cli and kill the connection with CTRL+C while it is in flight; reconnect and query the balances—the transfer will have completed. The atomicity guarantee: (1) no client can observe partial state (A debited but B not credited), (2) either every write in the script is applied or none is, (3) you get transaction-like semantics within the script's scope. However, Lua does not protect against a crash of Redis itself. If Redis crashes mid-script: (1) use AOF persistence to recover—depending on the Redis version, AOF records either the EVAL invocation or the script's effects, and either way the writes are replayed as a unit on restart; a crash before the AOF write can lose the whole transfer, but never half of it. (2) make scripts idempotent so a replayed or retried transfer produces the same result: have the script first SET a unique per-transfer marker key with NX and abort if it already exists, preventing double execution. Test: write a script that checks both account balances before and after, and verify the sum is unchanged.
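The idempotency guard can be modeled outside Redis. This is a minimal Python sketch: a plain dict stands in for Redis, the transfer() helper and the marker-key name are assumptions for illustration, and the whole function body corresponds to what would be one atomic Lua script in production:

```python
def transfer(db, tx_id, src, dst, amount):
    """Model of an idempotent atomic transfer script.

    db stands in for Redis; in a real deployment this whole body
    would be a single Lua script, so no other client can interleave.
    """
    marker = f"transfer:{tx_id}"        # hypothetical per-transfer marker key
    if marker in db:                    # SET ... NX failed: already executed
        return False
    db[marker] = "done"
    db[src] = db.get(src, 0) - amount   # DECRBY
    db[dst] = db.get(dst, 0) + amount   # INCRBY
    return True

db = {"account:A": 500, "account:B": 100}
transfer(db, "tx-1", "account:A", "account:B", 100)   # applies once
transfer(db, "tx-1", "account:A", "account:B", 100)   # replay is a no-op
```

Replaying the same transaction ID is harmless, which is exactly the property you want when AOF replay or a client retry re-runs the script.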

Follow-up: If the Lua script calls a command that's very slow (e.g., FLUSHDB on a large database), how would you prevent the entire Redis from hanging?

Your Lua script hits a runtime error: redis.call("LRANGE", "mylist", 0, -1) returns a Lua array table, but the script treats it like a hash: local val = result["field"] yields nil, and the code that uses val raises. The script aborts with an error reply, and because scripts block the event loop, other clients were stalled while it ran (and would see BUSY replies past lua-time-limit). How do you debug and fix?

A Lua runtime error aborts the script and returns an error reply to the caller—and any writes the script already made are kept (Redis does not roll back a partially executed script). While any script runs, Redis can't process other commands, and other clients receive BUSY replies once lua-time-limit is exceeded. Debug: (1) test the script locally: run redis-cli --eval script.lua against a dev instance, or single-step it with the built-in Lua debugger (redis-cli --ldb --eval script.lua), and inspect the error output. (2) validate types before use: EVAL 'local result = redis.call("LRANGE", "mylist", 0, -1); if type(result) == "table" then ... else return redis.error_reply("wrong type") end' 0. (3) use redis.pcall instead of redis.call to catch command errors gracefully: redis.pcall returns a table with an err field instead of raising, so the script can inspect result.err and retry or return a clean error. To unstall blocked clients: (1) run SCRIPT KILL from another connection to abort the runaway script—this only works while the script has not yet performed a write. (2) if the script has already written, SCRIPT KILL refuses, and the only options are SHUTDOWN NOSAVE or killing the Redis process. (3) prevent recurrence: (a) test scripts thoroughly before production, (b) understand that lua-time-limit (default 5000 ms) does not abort slow scripts—it only marks the point where Redis starts answering other clients with BUSY and accepts SCRIPT KILL, (c) add explicit error checking in the script. Verify under load: redis-benchmark can drive an arbitrary command, e.g. redis-benchmark -c 100 -n 10000 EVAL "<script>" 0, to confirm the script stays fast with many concurrent clients.
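The errors-as-values shape of redis.pcall can be modeled in a few lines. A minimal Python sketch (pcall and lrange_as_hash are illustrative names, not Redis APIs) showing why the caller can branch instead of crashing:

```python
def pcall(fn, *args):
    """Model of redis.pcall semantics: return an error table
    instead of raising, so the caller can inspect and branch."""
    try:
        return fn(*args)
    except Exception as exc:
        return {"err": str(exc)}

def lrange_as_hash(result):
    # The buggy access from the question: treating an array like a hash.
    return result["field"]

reply = pcall(lrange_as_hash, ["a", "b", "c"])
if isinstance(reply, dict) and "err" in reply:
    # Script would return a clean error reply here instead of dying.
    pass
```

In the real script, checking result.err after redis.pcall plays the same role as the isinstance check above.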

Follow-up: If a script repeatedly times out (lua-time-limit exceeded), how would you optimize it without rewriting the entire script?

Your Lua script uses redis.call inside a loop: for i=1, 1000 do redis.call("SET", "key"..i, "value") end. The script runs for 5 seconds, blocking all clients. A monitoring system detects the block and alerts: "Redis under heavy load". You want to keep atomicity but reduce blocking time. What's the tradeoff?

Lua scripts block Redis's single-threaded event loop: a long script means no other client's commands execute. This isn't an atomicity problem (all 1000 SETs complete atomically) but a fairness/latency problem (other clients starve). The tradeoffs: (1) break it into multiple smaller scripts: instead of one script with 1000 SETs, run 10 scripts of 100 SETs each. Each is separately atomic, but state between scripts is partial (not atomic across all 1000 keys)—fine where partial updates are acceptable (e.g., caches). (2) use MSET instead of a Lua loop: MSET key1 val1 ... key1000 val1000 is still one atomic command and much faster (no script overhead). Limitations: each argument is capped by proto-max-bulk-len (512MB by default), very large commands strain network and client buffers, and you must build the whole payload client-side. (3) pipeline client-side: send 100 SETs in one batch without waiting for each reply. Not atomic across the batch, but atomic per command—fastest for non-transactional workloads. (4) shard with Redis Cluster: spread keys across nodes so each node processes a subset. Note that a script in cluster mode may only touch keys in a single hash slot, so this means one script per slot and rewriting application logic. Recommendation: use MSET for bulk SETs (fastest), client-side pipelining where per-command atomicity suffices, and Lua scripts only when true atomicity across multiple keys is required. Monitor with redis-cli --latency-history to detect script-induced blocking, and SLOWLOG GET to identify slow commands (tune slowlog-log-slower-than and alert on entries over 100 ms).
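The chunking tradeoff in option (1)/(2) can be sketched client-side. A minimal Python helper (chunked_mset_args is a hypothetical name) that splits key/value pairs into flat argument lists, each sized for one MSET call:

```python
def chunked_mset_args(pairs, chunk_size=100):
    """Split key/value pairs into flat MSET argument lists of at most
    chunk_size pairs each. Each chunk is one atomic MSET, but the
    sequence as a whole is NOT atomic across chunks."""
    items = list(pairs.items()) if isinstance(pairs, dict) else list(pairs)
    for i in range(0, len(items), chunk_size):
        flat = []
        for key, value in items[i:i + chunk_size]:
            flat.extend([key, value])
        yield flat

pairs = [(f"key{i}", f"value{i}") for i in range(1, 251)]
batches = list(chunked_mset_args(pairs, chunk_size=100))
# 250 pairs -> batches of 100, 100, and 50 pairs
```

Each yielded list would be passed to one MSET; smaller chunks mean shorter blocking per command at the cost of weaker cross-chunk atomicity.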

Follow-up: If you can't avoid a long-running Lua script, how would you prevent it from blocking critical operations like PING health checks?

You deploy a Lua script to increment a counter, but during a rolling update you deploy a new script with a different SHA1 hash. Old clients (using the old SHA) and new clients (using the new SHA) run simultaneously. Each script is individually atomic, but the two versions apply different logic to the same counter—non-deterministic results follow. How do you safely version Lua scripts?

Different script SHAs mean different code paths: each version is individually atomic, but there is no cross-version coordination. Scenario: the old script does INCRBY counter 1, the new one INCRBY counter 2 (the algorithm changed); both run against the same key and produce inconsistent results. Prevention: (1) use SCRIPT LOAD to pre-register versions: load the new script first with redis-cli SCRIPT LOAD "$(cat new-script.lua)" and record its SHA1. Update the client config to use the new SHA and deploy clients gradually (blue-green). Note there is no per-script removal command—SCRIPT FLUSH clears the entire cache, after which clients must handle NOSCRIPT by re-issuing EVAL or SCRIPT LOAD. (2) versioning via keys: append a version to the key (counter:v1, counter:v2) so each script version operates on its own data; after migration, RENAME to merge. (3) backward compatibility: ensure the new script produces results compatible with the old—run both on test data and compare output before deploying. (4) use SCRIPT EXISTS <sha> to check availability: it returns 0 for SHAs not in the cache, and the client can reload on demand. Blue-green rollout: (1) SCRIPT LOAD the new script on Redis (leave the old cached). (2) deploy the new client version with the new SHA in config. (3) monitor both counters (counter:v1 and counter:v2) and alert if they diverge. (4) once all traffic is on the new version, clean up with SCRIPT FLUSH—safe once every client knows how to reload on NOSCRIPT. Test by running both scripts concurrently and verifying the final counter is consistent.
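The EVALSHA-then-reload pattern that makes SCRIPT FLUSH safe can be sketched with a fake client. A minimal Python model (FakeRedis, NoScriptError, and run_script are illustrative stand-ins, and the fake "SHA" is just the script text, not a real digest):

```python
class NoScriptError(Exception):
    """Stands in for Redis's NOSCRIPT error reply."""

class FakeRedis:
    """Tiny stand-in for a Redis server's script cache."""
    def __init__(self):
        self.cache = {}
    def script_load(self, script):
        sha = f"sha({script})"      # real Redis returns the SHA1 hex digest
        self.cache[sha] = script
        return sha
    def evalsha(self, sha):
        if sha not in self.cache:
            raise NoScriptError(sha)
        return f"ran {self.cache[sha]}"

def run_script(client, script):
    """Try EVALSHA first; on NOSCRIPT, SCRIPT LOAD and retry once."""
    sha = f"sha({script})"
    try:
        return client.evalsha(sha)
    except NoScriptError:
        sha = client.script_load(script)
        return client.evalsha(sha)

client = FakeRedis()
result = run_script(client, "return 1")   # first call loads, then runs
```

With this wrapper on every client, flushing the cache causes one extra round trip per script instead of request failures.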

Follow-up: If you deployed a buggy new script and need to immediately roll back without deploying client code, how would you do this?

Your Lua script acquires a distributed lock using SET with NX and EX, then does work, then releases the lock with DEL. But if the release is performed as a client-side GET (to compare the lock token) followed by a separate DEL, another client can acquire the lock between those two steps—the TTL fires while the holder is slow—and your DEL then deletes a lock it no longer owns. Clients see inconsistent lock state. How do you ensure lock release is atomic?

The fix is to make the token check and the delete one atomic unit inside a single script: EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) else return 0 end' 1 lock:resource <token>. This is atomic—either the token matches and the lock is deleted, or it doesn't match and nothing happens; no other client can interleave between the GET and the DEL. Further hardening: (1) keep scripts short: do the actual work outside Redis, between a short acquire script and a short release script, so nothing runs long enough to hit lua-time-limit (which, note, does not abort scripts—it only lets other clients see BUSY and enables SCRIPT KILL). (2) always acquire with a TTL: SET lock:resource <token> NX EX 30, so a crashed holder's lock expires on its own rather than requiring forced release. (3) implement a lock heartbeat: while the work runs, a background thread periodically extends the lock with the same token check: EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("EXPIRE", KEYS[1], ARGV[2]) else return 0 end' 1 lock:resource <token> 30. If the holder dies, the heartbeat stops and the lock expires naturally—safer than forcing immediate release. When the work actually completes, release cleanly with the atomic check-and-delete script above.
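The check-and-delete release can be modeled in a few lines. A minimal Python sketch (release_lock is an illustrative name; the dict stands in for Redis, and the function body is assumed to run without interleaving, as the Lua script would):

```python
def release_lock(db, key, token):
    """Model of the atomic check-and-delete release script.
    Only the holder of the matching token may delete the lock."""
    if db.get(key) == token:
        del db[key]
        return 1      # lock released
    return 0          # not our lock (expired and re-acquired): no-op

db = {"lock:resource": "token-A"}
release_lock(db, "lock:resource", "token-B")   # wrong token: nothing happens
release_lock(db, "lock:resource", "token-A")   # owner releases cleanly
```

The crucial property is that a stale client (token-B) cannot delete a lock now owned by someone else, which the naive GET-then-DEL sequence cannot guarantee.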

Follow-up: If you can't modify the script (e.g., third-party code) and it times out frequently, what operational workaround would you implement?

Your Lua script iterates over a set using SMEMBERS and performs complex calculations on each member. The script is deterministic in intent, but SMEMBERS returns members in no guaranteed order, and the order may differ between primary and replica. Under verbatim script replication this causes divergence: primary and replicas execute the same script and produce different final results. How do you ensure the script replicates correctly?

Script non-determinism is dangerous under verbatim (whole-script) replication: the primary computes X, a replica replaying the same script computes Y, and the datasets diverge. Common causes: (1) commands with no guaranteed output order (SMEMBERS, HGETALL, KEYS), (2) redis.call("RANDOMKEY"), (3) writing values derived from redis.call("TIME"). Fixes: (1) sort explicitly: local members = redis.call("SMEMBERS", KEYS[1]); table.sort(members); ... forces a deterministic iteration order regardless of what Redis returns. (2) avoid randomness: if you need to pick keys, use SCAN with a recorded cursor rather than RANDOMKEY. (3) prefer commands with defined ordering where possible (e.g. ZRANGEBYSCORE with an explicit range over unordered set reads). (4) rely on effects replication: Redis 5 and later replicate a script's effects (the actual resulting writes) rather than the script itself, which eliminates this class of divergence; older versions replayed scripts verbatim and refused writes after non-deterministic commands. Prevention: (1) keep scripts deterministic anyway—it makes them testable and portable. (2) version scripts and test exhaustively before deploying to production. (3) use the Lua debugger (redis-cli --ldb) to single-step the script and inspect its execution flow. (4) run redis-cli MONITOR on primary and replica simultaneously and diff the command streams; if they diverge, inspect the script. For already-diverged data: (1) force the replica to full-resync from the primary's RDB snapshot. (2) export keys from primary and replica with redis-cli --csv, diff them, and fix remaining divergence manually.
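Why sorting first matters can be shown with any order-sensitive computation. A minimal Python sketch (checksum is an illustrative order-sensitive fold, not a Redis function) where two nodes seeing the same members in different orders agree only after sorting:

```python
def checksum(members):
    """Order-sensitive fold: nodes iterating in different orders
    would disagree on the result unless the input is sorted first."""
    acc = 0
    for i, member in enumerate(members):
        acc = (acc * 31 + i * len(member)) % 1_000_003
    return acc

unordered_a = ["cherry", "apple", "banana"]   # one node's SMEMBERS order
unordered_b = ["banana", "cherry", "apple"]   # another node's order
# Sorting makes the computation independent of the returned order:
assert checksum(sorted(unordered_a)) == checksum(sorted(unordered_b))
```

table.sort(members) in the Lua script plays exactly the role of sorted() here.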

Follow-up: If a Lua script calls redis.call("FLUSHALL"), can replicas be prevented from also executing this script?

Your app uses Lua scripts heavily: 10K distinct scripts loaded across 100 Redis instances. The script cache grows without bound—Redis never evicts cached scripts, and INFO memory shows used_memory_scripts climbing. SCRIPT FLUSH would clear ALL scripts, causing a burst of NOSCRIPT errors for every client. You need to retire old scripts without breaking clients. How do you manage script lifecycle?

Redis doesn't provide SCRIPT DELETE for individual scripts—only SCRIPT FLUSH for everything (with an ASYNC option since Redis 6.2 to avoid blocking), and the cache is never evicted automatically. Managing 10K scripts requires strategy: (1) consolidate: 10K distinct scripts usually means values are being inlined into script text. Move the variable parts into KEYS and ARGV so many variants collapse into a few parameterized scripts—this is the real fix for unbounded cache growth. (2) track usage client-side: keep a Redis SET of active SHAs (SADD active-scripts <sha> on load, SREM when retired), so you know which scripts are still referenced before deciding to flush. (3) make every client NOSCRIPT-safe: when EVALSHA fails with NOSCRIPT, the client should SCRIPT LOAD (or fall back to plain EVAL, which also populates the cache) and retry. Once all clients do this, SCRIPT FLUSH ASYNC becomes a safe, routine operation: it clears everything without blocking, and clients transparently reload what they need on the next call. (4) monitor: INFO memory reports used_memory_scripts and number_of_cached_scripts; alert when they grow unexpectedly. For your scenario: verify clients handle NOSCRIPT, run SCRIPT FLUSH ASYNC, and watch the reload traffic settle.
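The consolidation point can be made concrete by hashing script text the way Redis keys its cache. A minimal Python sketch showing how inlined values produce one cached script per value, while a KEYS/ARGV-parameterized script produces exactly one SHA:

```python
import hashlib

def sha1(script):
    """Redis keys its script cache by the SHA1 of the script text."""
    return hashlib.sha1(script.encode()).hexdigest()

# Anti-pattern: inlining values yields a distinct cached script per value.
inlined = {sha1(f'redis.call("SET", "user:{i}", "active")') for i in range(100)}

# Parameterized: one script; values are passed via KEYS/ARGV at call time.
parameterized = sha1('redis.call("SET", KEYS[1], ARGV[1])')
```

One hundred near-identical scripts become a single reusable entry, and the cache stops growing with traffic.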

Follow-up: If you have a script that's used by 1000 concurrent clients and you want to upgrade it, how would you roll out the new version without dropping requests?

Your Lua script performs no I/O to external systems (no HTTP calls), but it reads from Redis, performs 1000 operations, then writes results. If the instance is a read-only replica (replica-read-only yes), redis.call("SET") inside the script fails with a READONLY error. The script itself has no clean way to detect replica mode and adapt. How do you handle read-only replicas?

Replicas are read-only by default (replica-read-only yes), so any script that writes will fail there with READONLY. Fixes: (1) detect the role client-side before dispatching write scripts: redis-cli INFO replication shows role:master or role:slave; send write scripts only to the primary. (2) declare read-only scripts as such: Redis 7 adds EVAL_RO/EVALSHA_RO, which guarantee the script performs no writes and may be served by replicas; Redis 7 scripts can also carry a no-writes flag in a shebang header. (3) route by operation: all EVAL commands that write go to the primary; pure-read scripts may run on replicas to offload it. (4) fail fast inside the script if you must: redis.call("INFO", "replication") returns a bulk string, so check it with string.find(info, "role:master") and return redis.error_reply("replica is read-only") when it isn't the primary—this errors cleanly instead of dying mid-script (the INFO reply is non-deterministic, so restrict this to read paths). (5) for read-heavy workloads: run the read-only portion on a replica, then batch the resulting writes to the primary to amortize overhead. Verify: execute a write script against a real replica and confirm it errors gracefully rather than partially applying, and compare read latency once reads are moved to replicas.
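The routing rule in (3) can be sketched as a small dispatcher. A minimal Python model (Node, route, and the writes flag are illustrative stand-ins for a real client's topology awareness):

```python
class ReadOnlyError(Exception):
    """Stands in for Redis's READONLY error reply."""

class Node:
    """Stand-in for a Redis connection with a fixed role."""
    def __init__(self, role):
        self.role = role
        self.log = []
    def eval_script(self, script, writes):
        if writes and self.role != "master":
            raise ReadOnlyError("READONLY You can't write against a read only replica.")
        self.log.append(script)
        return "OK"

def route(primary, replicas, script, writes):
    """Send write scripts to the primary, read scripts to a replica."""
    target = primary if writes else replicas[0]
    return target.eval_script(script, writes)

primary, replica = Node("master"), Node("slave")
route(primary, [replica], 'redis.call("SET", ...)', writes=True)
route(primary, [replica], 'return redis.call("GET", ...)', writes=False)
```

Classifying each script as read or write at registration time (matching EVAL_RO's contract) lets the router make this decision without parsing script bodies.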

Follow-up: If you have a complex Lua script that must execute on both primary and replica (for consistency), how would you ensure both produce identical results?
