Your distributed lock (key: "lock:payment-123") is acquired with a 10-second TTL. A long-running operation (payment processing) takes 15 seconds. The lock expires after 10 seconds, allowing a competing process to acquire it. Both processes now execute the payment simultaneously: double charge! How do you prevent this with distributed locks?
This is the lock-expiration vs. operation-duration mismatch. Solutions: (1) extend the lock TTL upfront: estimate the operation time (15s) and set the TTL to 2x that (30s), e.g. SET lock:payment-123 <token> NX PX 30000. But if the operation takes 40s, you're back to square one. (2) implement lock refresh (a watchdog): spawn a background thread that renews the lock every 5 seconds while the operation is running, using a Lua script that extends the TTL only if the stored token still matches (sketch below).
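A minimal sketch of option (2), assuming redis-py; the class name RefreshingLock and the 5-second cadence are illustrative:

    import threading
    import uuid

    import redis

    # Lua: extend the TTL only if we still own the lock (token matches).
    REFRESH = (
        'if redis.call("GET", KEYS[1]) == ARGV[1] then '
        'return redis.call("PEXPIRE", KEYS[1], ARGV[2]) else return 0 end'
    )
    # Lua: delete only if the token matches (atomic compare-and-delete).
    RELEASE = (
        'if redis.call("GET", KEYS[1]) == ARGV[1] then '
        'return redis.call("DEL", KEYS[1]) else return 0 end'
    )

    class RefreshingLock:
        def __init__(self, client, key, ttl_ms=10_000, refresh_secs=5.0):
            self.client, self.key, self.ttl_ms = client, key, ttl_ms
            self.refresh_secs = refresh_secs
            self.token = str(uuid.uuid4())  # unique per holder
            self._stop = threading.Event()
            self._refresh = client.register_script(REFRESH)

        def acquire(self):
            if not self.client.set(self.key, self.token, nx=True, px=self.ttl_ms):
                return False
            threading.Thread(target=self._keep_alive, daemon=True).start()
            return True

        def _keep_alive(self):
            # Renew well before expiry; stop if we ever lose ownership.
            while not self._stop.wait(self.refresh_secs):
                if self._refresh(keys=[self.key], args=[self.token, self.ttl_ms]) == 0:
                    break  # token mismatch: TTL lapsed and someone else took the lock

        def release(self):
            self._stop.set()
            self.client.eval(RELEASE, 1, self.key, self.token)

Usage: lock = RefreshingLock(redis.Redis(), "lock:payment-123"); if lock.acquire() succeeds, run the payment inside try/finally and call lock.release() in the finally block.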
Follow-up: If lock refresh fails (Redis unreachable), how should the operation behave to prevent double-execution?
You implement Redlock with 3 Redis instances for critical operations. A client acquires the lock on 2 of 3 instances (a majority). But around acquisition time, a network partition isolates 1 Redis instance from the other 2, and both client-1 and client-2 end up believing they hold the lock. Both execute the critical operation. Split-brain with locks! How is this possible with Redlock?
This is a subtle Redlock failure mode: around the time of a partition, both sides can come to believe they hold the lock. Scenario: (1) client-1 acquires the lock on instances {A, B} (majority). (2) a network partition occurs: partition P1 = {A, B}, partition P2 = {C}. (3) client-1 holds the lock and operates normally. (4) client-2 tries to acquire the lock but can only reach instance C (a single instance), so it fails to get a majority. This is correct behavior: client-2 waits. (5) but if the network is flaky and client-2 reaches {A, B} just as client-1's keys there expire (clock drift plus delays having eaten client-1's validity window), client-2 also acquires a majority. Both now hold the lock. Prevention: (1) Redlock assumes bounded clock drift across instances; use NTP to keep all nodes closely synchronized. (2) increase the lock validity margin: choose the TTL so that even in the worst case (maximum clock drift + network delay) the remaining validity window is still safely positive. (3) use Zookeeper or Consul instead of Redlock if you need strong consistency (consensus-based locking). For immediate detection: (1) have each process write metadata (start-time, process-id) when acquiring the lock, then SCAN all locks periodically and alert if two processes claim the same lock. (2) use a unique token per client: acquire with SET lock:<resource> <unique-token> NX PX <ttl> and verify the token before every critical action and on release.
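For reference, a minimal sketch of the Redlock acquisition round itself (redis-py assumed; clients is a list of connections to N independent instances). It shows the majority count and the validity-window check that clock drift and network delay eat into:

    import time
    import uuid

    RELEASE = (
        'if redis.call("GET", KEYS[1]) == ARGV[1] then '
        'return redis.call("DEL", KEYS[1]) else return 0 end'
    )

    def redlock_acquire(clients, key, ttl_ms):
        token = str(uuid.uuid4())
        quorum = len(clients) // 2 + 1
        start = time.monotonic()
        acquired = 0
        for c in clients:
            try:
                if c.set(key, token, nx=True, px=ttl_ms):
                    acquired += 1
            except Exception:
                pass  # an unreachable instance simply counts as a failure
        elapsed_ms = (time.monotonic() - start) * 1000
        # Per the Redlock paper: subtract elapsed time plus a drift allowance
        # (~1% of the TTL plus a small constant) to get the remaining validity.
        validity_ms = ttl_ms - elapsed_ms - (ttl_ms * 0.01 + 2)
        if acquired >= quorum and validity_ms > 0:
            return token, validity_ms
        for c in clients:  # failed: release any partial acquisitions
            try:
                c.eval(RELEASE, 1, key, token)
            except Exception:
                pass
        return None, 0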
Follow-up: Given Redlock's limitations, when would you recommend Zookeeper over Redis for distributed locks?
Your lock-releasing code uses UNLINK lock:job-123 to quickly delete the lock. UNLINK is billed as asynchronous (it does its work in a background thread), so you worry: between UNLINK returning and the lock actually being deleted (say 50ms later), client-2 acquires the same lock, and both client-1 (holding the "old" lock) and client-2 (holding the "new" lock) are executing. Is async delete really dangerous here?
Not in the way described: UNLINK removes the key from the keyspace synchronously, exactly like DEL; only the memory reclamation happens in a background thread. A GET issued after UNLINK returns will not see the key, so that 50ms visibility window does not exist in standard Redis. The real danger in this code is unconditional deletion: if your lock's TTL lapsed and client-2 re-acquired it, your DEL or UNLINK destroys client-2's lock. Safe release: (1) never delete unconditionally. (2) use a Lua script for atomic compare-and-delete: EVAL 'if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) else return 0 end' 1 lock:job-123 <token>. (3) treat a return value of 0 as "the lock was no longer yours": log and investigate rather than assume success.
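A sketch of the safe-release pattern, assuming redis-py; do_job stands in for the critical section, and secrets.token_hex gives a 128-bit random token, which also bears on the follow-up below (it is infeasible to guess):

    import secrets

    import redis

    RELEASE = (
        'if redis.call("GET", KEYS[1]) == ARGV[1] then '
        'return redis.call("DEL", KEYS[1]) else return 0 end'
    )

    def do_job():
        ...  # hypothetical critical section (payment, batch step, etc.)

    r = redis.Redis()
    token = secrets.token_hex(16)  # cryptographically random, unguessable
    if r.set("lock:job-123", token, nx=True, px=10_000):
        try:
            do_job()
        finally:
            if r.eval(RELEASE, 1, "lock:job-123", token) == 0:
                # The lock was no longer ours at release time (TTL lapsed).
                print("warning: lock lost before release")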
Follow-up: If you use the Lua script approach, how do you prevent the lock token from being spoofed or guessed?
You're using Redis locks to coordinate batch jobs. Job-A acquires lock "batch:process". While executing, the lock TTL expires (the job takes 12 seconds, the lock is 10 seconds). Job-B acquires the lock and starts processing overlapping data. Job-A still has 2 seconds left to execute. After Job-A finishes, it releases the lock (deletes the key). Job-B is still running but has now lost its lock silently. Data corruption results. How do you fix this?
The problem: Job-A's lock expired but Job-A kept running, then released Job-B's lock. This is ownership ambiguity. Fix: (1) use a unique token per lock holder: acquire with SET lock:batch:process <unique-token> NX PX 10000 and release with the atomic compare-and-delete script above, so Job-A's stale release becomes a no-op instead of deleting Job-B's lock. (2) add lock refresh (as in the first question) so Job-A's lock survives the full 12-second run in the first place.
Follow-up: If Job-A crashes (SIGKILL) while holding the lock, how do you ensure the lock is released for Job-B to acquire?
You're implementing a mutex for a critical section. Multiple threads within the same process compete for a Redis lock. Thread-1 acquires the lock, but the thread is then paused for 5 seconds by the garbage collector (a GC pause). The lock expires. Thread-2 acquires the lock and enters the critical section. Thread-1's GC pause ends, and Thread-1 is now in the critical section alongside Thread-2. Race condition! How do you prevent this?
GC pauses and OS scheduling delays are unpredictable; a lock TTL can't account for them (it would have to be so long it defeats the purpose of the lock). Better approaches: (1) use in-process locks (mutexes) for within-process synchronization (Python threading.Lock, C# lock, Java ReentrantLock) and Redis locks only for coordination between processes on separate machines. (2) if you must use Redis locks for threads within the same process: set a very short lock TTL (1-2 seconds) but refresh aggressively (every 100ms) from a background thread, so that a thread paused for more than 2 seconds loses the lock and other threads can detect it. However, if all threads are paused together (e.g. a stop-the-world GC pause affecting the whole JVM), this doesn't help. (3) implement a heartbeat: each lock holder sends heartbeats to a monitor; if heartbeats stop, the monitor revokes the lock by deleting it or shortening its TTL (e.g. EXPIRE lock:<resource> 1). A detection sketch follows.
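A sketch of the detection side (Python, redis-py assumed; LockLost and assert_owner are illustrative names): re-check ownership immediately before each irreversible side effect and abort on loss. Note the residual race: a pause can still land between the check and the write, so only fencing tokens enforced by the downstream storage close the window completely:

    import redis

    class LockLost(Exception):
        pass

    def assert_owner(r: redis.Redis, key: str, token: str) -> None:
        # If a GC pause let the TTL lapse and another holder took over,
        # the stored token no longer matches ours: abort instead of writing.
        if r.get(key) != token.encode():
            raise LockLost(key)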
Follow-up: If you can't avoid GC pauses and must use Redis locks, how would you detect that you've lost lock ownership and safely abort?
You're using Redis locks with a Lua script to ensure atomic release (release only if the token matches). But during a network partition, the Lua script can't execute (Redis is unreachable from the holder). The lock remains set in Redis, held by a client that can no longer reach it. After the partition heals, the lock is stale but still held. How do you recover without manual intervention?
A stale lock is one that's been held too long, likely by a dead or unreachable process. Recovery: (1) always acquire with a TTL so the lock auto-expires: SET lock:resource <token> NX PX 30000. Even if the release never runs, the lock frees itself after 30 seconds and no manual cleanup is needed. (2) on the client side, treat a failed release as lost ownership: stop any work that depends on the lock rather than retrying the release indefinitely, and let the TTL do the cleanup.
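As a safety net, a sketch of a hypothetical janitor (redis-py assumed): any lock:* key that somehow lacks a TTL would never auto-expire, so give it one:

    import redis

    def reap_ttl_less_locks(r: redis.Redis, default_ttl_ms: int = 30_000) -> None:
        for key in r.scan_iter(match="lock:*"):
            if r.pttl(key) == -1:  # key exists but has no expiry set
                r.pexpire(key, default_ttl_ms)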
Follow-up: If you set a short TTL to auto-expire locks, how would you prevent legitimate operations from being interrupted by premature lock expiration?
Your lock implementation uses GETSET to atomically get old token and set new token. But GETSET is deprecated in newer Redis (replaced by SET with GET). During a migration, you have code using both GETSET and SET...GET. Inconsistency results in lock corruption. How do you ensure backward-compatible, correct lock release?
GETSET and SET ... GET are functionally equivalent (atomically set a new value and return the old one) but have different APIs. To avoid corruption during the migration: (1) standardize on a Lua release script that works with both old and new Redis: EVAL 'local current = redis.call("get", KEYS[1]); if current == ARGV[1] then redis.call("del", KEYS[1]); return 1 else return 0 end' 1 lock:resource <token>. Lua has been available since Redis 2.6, so old (GETSET-era) and new (SET ... GET) code paths can share this one release routine regardless of which command style they use to acquire. A sketch of both sides follows.
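A sketch of both sides of the migration (assuming redis-py 4+ and Redis 6.2+ for the GET option on SET); both acquisition styles share the one Lua release:

    import redis

    RELEASE = (
        'local current = redis.call("get", KEYS[1]); '
        'if current == ARGV[1] then redis.call("del", KEYS[1]); '
        'return 1 else return 0 end'
    )

    r = redis.Redis()

    # Old style (deprecated):   GETSET lock:resource <new-token>
    old = r.getset("lock:resource", "new-token")

    # New style (Redis >= 6.2): SET lock:resource <new-token> GET
    old = r.set("lock:resource", "new-token", get=True)

    # Either way, release goes through the same script, so a mixed fleet
    # of old and new code cannot corrupt each other's locks on release.
    r.eval(RELEASE, 1, "lock:resource", "new-token")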
Follow-up: If you have production locks in flight during the migration from GETSET to SET...GET, how would you safely roll back if the new code has bugs?
You deploy Redlock across 5 Redis instances. During a failover (1 instance goes down, a new one starts), Redlock client acquires lock on 3 out of 5 (majority). But the new instance has no persistence (fresh start), so if it crashes again, the lock is lost on that instance. Now only 2 out of 5 are holding the lock—below majority! Client thinks it has lock, but it doesn't. How do you ensure Redlock lock durability?
Redlock assumes all Redis instances are durable (locks survive a restart). If an instance restarts with no persistence, its locks are lost. Fix: (1) enable persistence on all 5 instances: CONFIG SET appendonly yes; for locks to survive a hard crash, also set CONFIG SET appendfsync always (the default everysec can lose up to one second of acquisitions). (2) if you rely on RDB snapshots instead, keep the interval tight, e.g. CONFIG SET save "10 1" (snapshot after 10s with at least 1 change), accepting the disk I/O trade-off; anything acquired after the last snapshot is still lost on a crash. (3) verify persistence before deploying the Redlock client: run redis-cli LASTSAVE on all 5 instances; if any shows an old timestamp, persistence isn't working. (4) during failover: check whether the failed instance held lock keys. Run SCAN 0 MATCH lock:* on all instances and verify a majority still holds each lock. (5) implement a restart grace period: when an instance (re)joins, delay its participation in new lock acquisitions for at least the maximum lock TTL (the Redlock paper's delayed-restart recommendation), so it cannot grant locks it should still remember; gate this with a health check. To test durability: (1) acquire a Redlock across 5 instances, (2) kill one instance (kill -9), (3) immediately verify the lock still exists on the remaining 4 (majority), (4) restart the killed instance and verify the lock is restored from disk (for RDB, run redis-cli BGSAVE before the kill so the lock is on disk). A pre-flight check follows.
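A pre-flight check matching points (1) and (3), assuming redis-py; verify_persistence is a hypothetical helper run against all 5 instances before enabling the Redlock client:

    import redis

    def verify_persistence(clients):
        for c in clients:
            if c.config_get("appendonly").get("appendonly") != "yes":
                raise RuntimeError("AOF disabled on an instance; "
                                   "locks will not survive a restart")
            fsync = c.config_get("appendfsync").get("appendfsync")
            if fsync != "always":
                print(f"warning: appendfsync={fsync}; "
                      "a crash can lose the most recent lock writes")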
Follow-up: If you have a Redlock acquisition that holds the lock across 5 instances but 2 instances have failed replicas, how do you ensure lock persistence?