Redis Interview Questions

Distributed Locks and Redlock Algorithm


Your distributed lock (key: "lock:payment-123") is acquired with TTL 10 seconds. A long-running operation (payment processing) takes 15 seconds. The lock expires after 10 seconds, allowing a competing process to acquire it. Both processes now execute the payment simultaneously—double charge! How do you prevent this with distributed locks?

This is the lock-TTL vs operation-duration mismatch. Solutions: (1) size the TTL generously: estimate operation time (15s) and set the TTL to roughly 2x that (30s). But if the operation then takes 40s, you're back to square one. (2) implement lock refresh (a "watchdog"): spawn a background thread that renews the lock every few seconds while the operation runs. Renewal must be token-checked so only the holder can extend it, e.g. EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("EXPIRE", KEYS[1], ARGV[2]) else return 0 end' 1 lock:payment-123 <token> 10. If a renewal fails, abort the operation before doing anything irreversible. (3) use Redlock (a distributed lock across 3+ independent Redis instances): run SET lock:payment-123 <token> NX EX 30 on each instance and proceed only if a majority (2 out of 3) succeeds. Even if one Redis crashes, the lock survives on the other two. (4) make the operation idempotent: design payment so that a second attempt detects the work was already done (via a unique payment ID) and returns success without re-charging. Implementation: acquire lock, check whether payment_id already exists in the audit log; if it exists, return success; if not, process the payment and log it. Verify: exercise acquisition with redis-cli --eval lock-test.lua and measure acquisition time; audit who holds a lock by reading its stored token with GET lock:payment-123.
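
The refresh approach in point (2) can be sketched end to end. This is a minimal, self-contained illustration: FakeRedis is an in-memory stand-in for the server (its method names are invented here), and against real Redis the token-checked renewal would be the single Lua EXPIRE script shown above.

```python
import threading
import time
import uuid

# In-memory stand-in for Redis so the sketch runs anywhere; in production
# these would be client calls (SET ... NX EX, and a Lua script for renewal).
class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> (token, expires_at)

    def set_nx_ex(self, key, token, ttl):
        """SET key token NX EX ttl: succeed only if absent or expired."""
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[1] > now:
            return False
        self.store[key] = (token, now + ttl)
        return True

    def expire_if_owner(self, key, token, ttl):
        """Token-checked renewal; atomic in real Redis via a Lua script."""
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[0] == token and entry[1] > now:
            self.store[key] = (token, now + ttl)
            return True
        return False

def run_with_refresh(r, key, ttl, operation):
    """Acquire, renew from a watchdog thread, signal loss if renewal fails."""
    token = str(uuid.uuid4())
    if not r.set_nx_ex(key, token, ttl):
        raise RuntimeError("lock busy")
    lost = threading.Event()

    def watchdog():
        while not lost.wait(ttl / 3):          # renew at 1/3 of the TTL
            if not r.expire_if_owner(key, token, ttl):
                lost.set()                     # renewal failed: signal and stop
                return

    threading.Thread(target=watchdog, daemon=True).start()
    try:
        # The operation receives a callable to poll whether it still owns the lock.
        return operation(lambda: not lost.is_set())
    finally:
        lost.set()                             # stop the watchdog
```

The operation should check its ownership callable before each irreversible step (charging the card, committing the batch) and abort on loss.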

Follow-up: If lock refresh fails (Redis unreachable), how should the operation behave to prevent double-execution?

You implement Redlock with 3 Redis instances for critical operations. Redlock acquires lock on 2 out of 3. But during acquisition, a network partition isolates 1 Redis instance from the other 2. Both client-1 and client-2 independently acquire the lock (each believing they have majority). Both think they own the lock and execute critical operations. Split-brain with locks! How is this possible with Redlock?

This is a subtle Redlock failure mode: its safety depends on timing assumptions, not just on majorities. Scenario: (1) client-1 acquires the lock on instances {A, B} (majority). (2) a network partition forms: P1 = {A, B}, P2 = {C}. (3) client-1 holds the lock and operates normally. (4) client-2 tries to acquire, can only reach instance C (a single instance), and fails to get a majority. This is correct behavior; client-2 waits. (5) the split-brain appears when the lock vanishes early on one majority member: if A's clock jumps forward, or A restarts without persistence and comes back empty, the key expires or disappears on A while client-1 still believes it holds the lock, and client-2 can then acquire {A, C} (a majority). Both clients now think they own the lock. Prevention: (1) Redlock assumes bounded clock drift across instances; discipline clocks with NTP and avoid manual clock steps. (2) respect the validity window: treat the lock as held only for TTL minus the time spent acquiring minus a drift allowance, and abort if that window is too small for the operation. (3) use ZooKeeper or Consul instead of Redlock if you need strong consistency (they run real consensus protocols, ZAB and Raft). For detection: (1) write metadata (start-time, process-id) alongside each lock; SCAN lock keys periodically and alert if two processes claim the same resource. (2) use a unique token per client: acquire with SET lock <token> NX EX <ttl>; if the SET returns nil, someone else holds the lock, so back off and retry, and always verify the token before releasing. Test with chaos engineering: inject network delays and partitions between the Redis instances and verify lock exclusivity. Run redis-cli MONITOR on all 3 instances to observe acquisition order.
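
The majority acquisition with a validity window can be sketched as follows. Plain dicts stand in for the independent Redis instances; against real servers each assignment would be a SET key token NX PX ttl, and the drift allowance follows the formula from the published Redlock description (TTL times a drift factor, plus a small constant).

```python
import time
import uuid

CLOCK_DRIFT_FACTOR = 0.01  # per the Redlock description

def try_acquire(instances, key, ttl_ms):
    """Acquire on a majority of instances; return (token, validity_ms) or (None, 0)."""
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = 0
    for inst in instances:
        if key not in inst:        # stands in for SET key token NX PX ttl succeeding
            inst[key] = token
            acquired += 1
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * CLOCK_DRIFT_FACTOR + 2
    validity_ms = ttl_ms - elapsed_ms - drift_ms   # how long the lock is trustworthy
    if acquired >= len(instances) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    for inst in instances:         # failed: release whatever we did grab
        if inst.get(key) == token:
            del inst[key]
    return None, 0
```

The caller must finish its critical section within validity_ms, not within the raw TTL; that subtraction is what absorbs acquisition latency and clock drift.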

Follow-up: Given Redlock's limitations, when would you recommend Zookeeper over Redis for distributed locks?

Your lock-releasing code uses UNLINK lock:job-123 to delete the lock, and a reviewer objects: UNLINK is documented as asynchronous (it defers work to a background thread), so could client-2 acquire the lock in a window between UNLINK returning and the deletion actually taking effect, leaving client-1 (holding the "old" lock) and client-2 (holding the "new" lock) executing at the same time? Is async delete dangerous here?

UNLINK's asynchrony concerns memory reclamation, not keyspace visibility: the key is removed from the keyspace synchronously before UNLINK returns, and only freeing the value's memory is deferred to a background thread. So once UNLINK returns, client-2 acquiring the lock is correct behavior, not a race. The real danger with both DEL and UNLINK is unconditional deletion: a client whose lock TTL already expired can delete the lock now held by someone else. Prevention: (1) never delete blindly; verify ownership first. (2) make verify-and-delete atomic with a Lua script: EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) else return 0 end' 1 lock:job-123 <token>. A client-side GET followed by DEL is not atomic; the lock could change hands between the two commands. (3) a client that fails to acquire can read the stored token with GET lock:job-123 to see who holds the lock, then back off and retry. (4) an alternative to deleting is versioned lock keys (lock:job-123:v1, lock:job-123:v2): release by writing the next version, with every client targeting the current version, though this adds bookkeeping for little gain over the Lua release. Whether the script uses DEL or UNLINK makes no correctness difference; UNLINK is simply gentler on latency for large values. Verify with an integration test that races two clients on the same lock key while redis-cli MONITOR runs on the server.
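
The ownership-checked release deserves emphasis, since it is the fix for every unconditional-delete variant. A sketch: the Lua text is what real Redis would execute in one step via EVAL, and the Python function mirrors its logic against a plain dict standing in for the keyspace.

```python
# The Lua script real Redis would run atomically (shown for reference).
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

def release_if_owner(keyspace, key, token):
    """Mirror of the script's logic: delete only if the stored token matches.

    In real Redis this check-then-delete MUST be one EVAL; a client-side
    GET followed by DEL leaves a window where the lock changes hands.
    """
    if keyspace.get(key) == token:
        del keyspace[key]
        return 1
    return 0
```

A holder whose release returns 0 has already lost the lock and should treat any work it did after expiry as suspect.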

Follow-up: If you use the Lua script approach, how do you prevent the lock token from being spoofed or guessed?

You're using Redis locks to coordinate batch jobs. Job-A acquires lock "batch:process". While executing, the lock TTL expires (job takes 12 seconds, lock is 10 seconds). Job-B acquires the lock and starts processing overlapping data. Job-A still has 8 seconds left to execute. After Job-A finishes, it releases the lock (deletes key). Job-B is still running but now loses its lock silently. Data corruption results. How do you fix this?

The problem: Job-A's lock expired mid-run, Job-A kept running anyway, and then it released Job-B's lock. This is ownership ambiguity. Fix: (1) use a unique token per lock holder. Acquire with SET lock:batch:process <client-token> NX EX 10. Release only if the stored token matches yours, atomically: EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) else return 0 end' 1 lock:batch:process <client-token>. (2) extend the TTL while the operation runs: a background thread executes EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("EXPIRE", KEYS[1], ARGV[2]) else return 0 end' 1 lock:batch:process <client-token> 10 every 5 seconds. If the extension returns 0 (lock expired or someone else holds it), abort the operation immediately. (3) implement a "lock guard": before each critical step within Job-A, GET the lock and compare the token; on mismatch, raise an error. (4) size the TTL from measured operation time: if the job regularly takes 12s but the lock is 10s, raise the TTL to 20s or use dynamic extension. For testing: race two jobs against the same lock key in an integration test, watching commands with redis-cli MONITOR, and alert if lock TTL drops below 50% of expected operation time.
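
The "lock guard" from point (3) can be sketched as a small wrapper. The class and method names are invented for illustration; the dict stands in for Redis, where check() would be a GET on the lock key compared against the holder's token.

```python
import uuid

class LockLostError(Exception):
    """Raised when the lock key no longer carries our token."""

class LockGuard:
    def __init__(self, keyspace, key):
        self.ks, self.key = keyspace, key
        self.token = str(uuid.uuid4())   # unique per holder

    def acquire(self):
        """Stands in for SET key token NX EX ttl."""
        if self.key in self.ks:
            return False
        self.ks[self.key] = self.token
        return True

    def check(self):
        """Stands in for GET key; raise if we no longer own the lock."""
        if self.ks.get(self.key) != self.token:
            raise LockLostError(self.key)

def process_batch(guard, steps):
    """Run each step only after confirming lock ownership."""
    done = []
    for step in steps:
        guard.check()        # abort before the step if ownership was lost
        done.append(step())
    return done
```

If Job-A's lock expires and Job-B re-acquires, Job-A's next check() raises instead of silently corrupting Job-B's data.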

Follow-up: If Job-A crashes (SIGKILL) while holding the lock, how do you ensure the lock is released for Job-B to acquire?

You're implementing a mutex for critical section. Multiple threads within the same process compete for redis-lock. Thread-1 acquires lock, but the thread is paused for 5 seconds by the garbage collector (GC pause). Lock expires. Thread-2 acquires lock and enters critical section. Thread-1's GC unpauses—Thread-1 is now in critical section with Thread-2. Race condition! How do you prevent this?

GC pauses and OS scheduling delays are unpredictable, and a lock TTL cannot account for them (it would have to be so long that it defeats the purpose of the lock). Better approaches: (1) use in-process locks for within-process synchronization and Redis locks only between processes: Java ReentrantLock, Python threading.Lock, C# lock(). Reserve Redis locks for coordinating separate processes and machines. (2) if you must use Redis locks for threads within the same process: set a short TTL (1-2 seconds) but refresh aggressively (every 100ms) from a background thread, so that a thread paused for more than 2 seconds loses the lock and the others can detect it. However, if the whole runtime pauses (a JVM-wide GC pause stalls the refresher too), this does not help. (3) implement a heartbeat: the lock holder periodically refreshes the lock's TTL; if it stops heart-beating, the lock simply expires and a monitor can alert on the expiry. (4) accept the limitation and design for lock loss: use compare-and-swap or a monotonically increasing version number (a fencing token) that the protected resource checks, rejecting writes from a holder whose token has been superseded. Prevention summary: (1) prefer language-native locks (Java ReentrantLock, Python threading.Lock) for same-process concurrency. (2) consider Redis Streams or Pub/Sub for cross-process coordination instead of locks. (3) measure GC pause times in production: if pauses exceed your lock TTL, Redis locks cannot save you, and a lower-latency runtime (Go, Rust) may be warranted.
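
Point (4) is usually called a fencing token: the lock service hands out a monotonically increasing number, and the protected resource rejects writes carrying a token older than the newest one it has seen. A minimal sketch (all names invented; the resource-side check is the part a Redis lock alone cannot give you):

```python
class LockService:
    """Hands out a token that grows with every acquisition."""
    def __init__(self):
        self.counter = 0

    def acquire(self):
        self.counter += 1
        return self.counter

class FencedResource:
    """The protected resource enforces the fencing check itself."""
    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, fencing_token, value):
        if fencing_token < self.highest_token:
            return False              # stale holder (e.g. woke up after a GC pause)
        self.highest_token = fencing_token
        self.writes.append(value)
        return True
```

Thread-1 acquires token 1, stalls in GC, its lock expires, Thread-2 acquires token 2 and writes; when Thread-1 resumes and writes with token 1, the resource rejects it.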

Follow-up: If you can't avoid GC pauses and must use Redis locks, how would you detect that you've lost lock ownership and safely abort?

You're using Redis locks with Lua script to ensure atomic release (release only if token matches). But during a network partition, the Lua script fails to execute (Redis unreachable). The lock remains on Redis side (held by the client that's now unreachable). After partition heals, lock is stale but still held. How do you recover without manual intervention?

A stale lock is one held far longer than expected (likely by a dead or unreachable process). Recovery: (1) always give locks a TTL so they auto-expire: acquire with SET lock:resource <token> NX EX 30. Even if the release fails, the lock expires in 30s; no manual cleanup needed. (2) if a fixed TTL is problematic (too short and it expires mid-operation, too long and a stale lock lingers), use adaptive TTL: set an initial TTL from the operation estimate, then renew periodically while the operation runs. (3) implement stale-lock detection: a periodic job (every minute) runs SCAN with MATCH lock:* and checks each lock's metadata (client-id, creation timestamp); if a lock is older than 2x the expected operation time, force-release it with DEL. (4) if the lock is stored as a hash with token and created fields, a Lua script can combine forced expiry with ownership-checked release: EVAL 'local now = tonumber(redis.call("TIME")[1]); local created = tonumber(redis.call("HGET", KEYS[1], "created")); if now - created > tonumber(ARGV[1]) then redis.call("DEL", KEYS[1]); return "expired" elseif redis.call("HGET", KEYS[1], "token") == ARGV[2] then redis.call("DEL", KEYS[1]); return "released" else return "owned-by-other" end' 1 lock:resource <max-age-seconds> <token>. (Note: because the script writes after calling the non-deterministic TIME command, it relies on effect-based script replication, the default since Redis 5.) Test with a network partition: cut the network between client and Redis, verify the lock expires after its TTL, then restore the network and verify a client can acquire a fresh lock. Monitor: alert if any lock older than 10 minutes exists.
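
The stale-lock sweeper from point (3) reduces to a small scan-and-expire pass. A sketch with a dict mapping key to (token, created_at); real code would SCAN with MATCH lock:* and read the creation timestamp from each lock's metadata:

```python
import time

def sweep_stale_locks(locks, max_age_s, now=None):
    """Force-release locks older than max_age_s; return the keys removed.

    `locks` maps key -> (token, created_at). Against real Redis this would
    iterate SCAN results and DEL each stale key.
    """
    now = time.time() if now is None else now
    removed = []
    for key, (_token, created_at) in list(locks.items()):
        if now - created_at > max_age_s:
            del locks[key]
            removed.append(key)
    return removed
```

Run it from a cron-like worker and feed the removed keys into an alert, since a force-released lock usually means a crashed or wedged holder worth investigating.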

Follow-up: If you set a short TTL to auto-expire locks, how would you prevent legitimate operations from being interrupted by premature lock expiration?

Your lock implementation uses GETSET to atomically get old token and set new token. But GETSET is deprecated in newer Redis (replaced by SET with GET). During a migration, you have code using both GETSET and SET...GET. Inconsistency results in lock corruption. How do you ensure backward-compatible, correct lock release?

GETSET (deprecated since Redis 6.2) and SET ... GET do the same job, atomically swapping in a new value and returning the old one, but SET ... GET also composes with the NX/EX options. To avoid corruption during migration: (1) standardize on a Lua script that works on both old and new Redis: EVAL 'local current = redis.call("GET", KEYS[1]); if current == ARGV[1] then redis.call("DEL", KEYS[1]); return 1 else return 0 end' 1 lock:resource <token>. Lua sidesteps the deprecated command entirely. (2) if you must use raw commands, branch on the server version: redis-cli INFO server | grep redis_version and call the appropriate command. (3) wrap access in a client library (ioredis, go-redis, Jedis) that abstracts the difference away. (4) never mix GETSET and SET ... GET on the same key; pick one and stick with it. Version the lock schema if needed: lock:resource:v1 uses GETSET, lock:resource:v2 uses SET ... GET. Migrate gradually: (a) deploy code writing v2, (b) dual-write v1 and v2 during the transition, (c) remove v1 references, (d) clean up v1 keys. For corruption that already occurred: (1) identify affected locks: SCAN lock:* keys and cross-check command history (e.g. MONITOR captures or client-side logs) for mixed GETSET/SET usage. (2) invalidate them: DEL the affected locks and force clients to re-acquire. (3) verify the fix with an integration test that acquires and releases a lock, then confirms the key is truly deleted. Commit message: "fix: migrate lock release from GETSET to SET ... GET to support Redis 6.2+".
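
The version-branching idea in point (2) amounts to parsing redis_version out of INFO and picking a command family. A sketch (pure string handling, no live server needed; GETSET was deprecated in 6.2, the same release that added the SET ... GET option):

```python
def parse_version(info_text):
    """Extract redis_version from INFO output as a comparable tuple."""
    for line in info_text.splitlines():
        if line.startswith("redis_version:"):
            return tuple(int(p) for p in line.split(":")[1].strip().split("."))
    raise ValueError("redis_version not found")

def release_command(info_text):
    """Choose the command family for a get-and-swap style release.

    SET ... GET exists from Redis 6.2; older servers only have GETSET.
    """
    return "SET-GET" if parse_version(info_text) >= (6, 2) else "GETSET"
```

In practice prefer option (1), the Lua script, which makes this branching unnecessary; the version check is a fallback for codebases stuck on raw commands.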

Follow-up: If you have production locks in flight during the migration from GETSET to SET...GET, how would you safely roll back if the new code has bugs?

You deploy Redlock across 5 Redis instances. During a failover (1 instance goes down, a new one starts), Redlock client acquires lock on 3 out of 5 (majority). But the new instance has no persistence (fresh start), so if it crashes again, the lock is lost on that instance. Now only 2 out of 5 are holding the lock—below majority! Client thinks it has lock, but it doesn't. How do you ensure Redlock lock durability?

Redlock's safety argument assumes an instance that restarts does not come back without the locks it had granted. If a fresh instance starts with no persistence, locks are silently lost. Fix: (1) enable persistence on all 5 instances: CONFIG SET appendonly yes (and for true crash durability, appendfsync always; everysec can still lose the last second of lock grants). (2) alternatively, tune RDB snapshots (CONFIG SET save "10 1", a checkpoint after 10s with at least 1 change) for frequent checkpoints at the cost of disk I/O, accepting a small loss window. (3) verify persistence before trusting Redlock: run redis-cli LASTSAVE on all 5 instances; a stale timestamp means snapshots aren't happening. (4) after a failover, verify the lock still holds a majority: SCAN for lock:* on every instance, count how many hold your token, and abort the operation if fewer than 3 of 5 retain it. (5) implement delayed restarts, the mitigation from the Redlock design itself: when an instance restarts without durable data, keep it out of lock acquisition for at least the maximum lock TTL, so any lock it had granted has expired everywhere before it can grant again. To test durability: (1) acquire a Redlock across 5 instances, (2) kill one instance (kill -9), (3) immediately verify the lock still exists on the remaining 4 (majority), (4) restart the killed instance and verify the lock is restored from disk (run redis-cli BGSAVE and wait for it to complete before restarting).
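
The majority re-check in point (4) is a one-liner worth writing down. A sketch with dicts standing in for the five servers; against real Redis each probe would be a GET of the lock key compared to your token:

```python
def still_holds_majority(instances, key, token):
    """True only if a strict majority of instances still store our token.

    Call this before each irreversible step when an instance may have
    restarted without persistence; abort the operation if it returns False.
    """
    holders = sum(1 for inst in instances if inst.get(key) == token)
    return holders >= len(instances) // 2 + 1
```

A False result after a failover means the client's belief that it holds the lock is no longer backed by a quorum, exactly the scenario in the question.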

Follow-up: If you have a Redlock acquisition that holds the lock across 5 instances but 2 instances have failed replicas, how do you ensure lock persistence?
