Redis Interview Questions

Replication, Consistency, and Split Brain


Your Redis primary and replica are in sync (replication offset matches). You perform a failover, promoting the replica to primary. The new primary accepts writes while the old primary goes offline for maintenance. 2 hours later, the old primary comes back online and clients can still connect to it directly (not through service discovery). It's now accepting writes independently. You now have write traffic to both. How do you prevent split-brain?

This is split-brain: two primaries accepting writes independently.

Prevention:
1. Client routing: never allow clients to connect directly to a primary. Use service discovery (Consul, Kubernetes endpoints) that routes writes to exactly one primary at a time.
2. Replica read-only: set replica-read-only yes on all replicas so they reject writes. After a manual failover, the old primary must be explicitly demoted before it rejoins.
3. Sentinel/Cluster: use Sentinel or Redis Cluster to automate demotion. When the old primary comes back online, Sentinel reconfigures it with REPLICAOF pointing at the new primary.

Immediate fix:
1. Verify write traffic distribution: check which clients are hitting which primary using CLIENT LIST on both.
2. Reconcile data: export keys from both primaries, identify divergence (keys that exist on one but not the other, or that differ), then merge manually (e.g., keep the latest write by timestamp).
3. Force demotion: on the old primary, run REPLICAOF <new-primary-host> <port>. This makes it a replica of the new primary; note the ensuing full resync discards its divergent writes, so capture them first.
4. Verify consistency: check DBSIZE and INFO replication on both before allowing clients to resume.

Longer term: if the workload must tolerate split-brain, use CRDT-based active-active replication (conflict-free replicated data types) that merges automatically on heal. For payment-style writes, Redlock-style coordination ensures only one side can hold the lock at a time. Monitor with: (1) the ROLE command on all instances every 10s, alerting if more than one reports master; (2) replication offset comparison between the primaries, alerting if the delta is nonzero.
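The monitoring step above can be sketched as a small check: collect the ROLE reported by every instance and alert when more than one claims to be master. A minimal sketch, assuming you fetch the role strings yourself (the instance addresses here are hypothetical):

```python
def detect_split_brain(roles):
    """Given {instance_address: role_string}, return the addresses
    claiming to be master. More than one entry means split-brain."""
    return [addr for addr, role in roles.items() if role == "master"]

# Healthy topology: one primary, two replicas.
healthy = {"10.0.0.1:6379": "master",
           "10.0.0.2:6379": "slave",
           "10.0.0.3:6379": "slave"}
assert len(detect_split_brain(healthy)) == 1

# Split-brain: the old primary came back and still claims master.
split = {"10.0.0.1:6379": "master", "10.0.0.2:6379": "master"}
masters = detect_split_brain(split)
if len(masters) > 1:
    print("ALERT: split-brain, masters =", masters)
```

In production the role strings would come from running ROLE (or INFO replication) against each instance on a 10-second loop.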

Follow-up: If both primaries received writes during split-brain and the write sets diverge, which writes should you keep?

Your replica is consistently 100MB behind the primary in replication offset (not advancing). The replica is stuck in "partial resync" state and won't move to full resync. INFO replication shows second_replconf_ack is stalled. The replica can't catch up. Primary's replication backlog is full and old data is being discarded. How do you unstick this?

Partial resync is stuck, most likely because the replica's replication offset has fallen outside the primary's backlog window (the incremental changes it needs were already discarded).

First, check:
- On the primary: redis-cli -p 6379 INFO replication | grep repl_backlog_first_byte_offset
- On the replica: redis-cli -p 6380 INFO replication | grep slave_repl_offset

If slave_repl_offset < repl_backlog_first_byte_offset, partial resync is impossible; a full resync is required.

To fix:
1. Increase the replication backlog: CONFIG SET repl-backlog-size 100mb (a larger buffer of recent changes kept for replicas).
2. On the replica, force a fresh sync: redis-cli -p 6380 REPLICAOF NO ONE (stop replication), then REPLICAOF <primary-host> 6379 (restart). The replica attempts a partial resync and automatically falls back to a full resync, which is slower but guaranteed to catch up. Monitor with INFO replication, field master_sync_in_progress.
3. If the primary struggles during resync (BGSAVE for the snapshot): schedule the resync off-peak, or use repl-diskless-sync yes to stream the RDB directly without touching disk.
4. Check the network: if replication runs over a WAN, high latency (>100ms) can stall it. Measure with redis-cli --latency-history.

Prevention: size repl-backlog-size to cover your write rate multiplied by the longest disconnect you expect to survive, and set repl-backlog-ttl 3600 to keep the backlog for at least 1 hour after the last replica disconnects. Verify the replica is catching up: run redis-cli -p 6380 ROLE and check the reported offset advances every 10 seconds. Alert if the offset is unchanged for >60 seconds.
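The offset check above can be expressed as a small predicate: partial resync is possible only if the replica's offset still falls inside the primary's backlog window (repl_backlog_first_byte_offset through first_byte_offset + repl_backlog_histlen, both real INFO fields). A sketch, assuming you parse those values out of INFO replication yourself:

```python
def partial_resync_possible(slave_repl_offset,
                            backlog_first_byte_offset,
                            backlog_histlen):
    """Partial resync works only while the replica's offset is still
    covered by the primary's replication backlog window."""
    backlog_end = backlog_first_byte_offset + backlog_histlen
    return backlog_first_byte_offset <= slave_repl_offset <= backlog_end

# Replica offset 5000, backlog covers 4000..6000: partial resync OK.
assert partial_resync_possible(5000, 4000, 2000)

# Replica offset 3000 is older than the backlog start: full resync needed.
assert not partial_resync_possible(3000, 4000, 2000)
```

Running this check before restarting replication tells you whether to expect a cheap incremental catch-up or a full RDB transfer.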

Follow-up: During full resync, if the replica crashes midway, how would you resume without starting over from zero?

You have a primary in US-East and a replica in US-West (cross-region). Replication lag is 200ms (normal). A developer accidentally runs DEL on the primary. The deletion is replicated to US-West 200ms later. Clients in both regions lose access to the key. How do you design for this scenario without affecting normal operations?

This is data loss from an accidental deletion, and replication faithfully propagates it.

Prevention:
1. ACLs: restrict DEL to the operations team. Enable auth with CONFIG SET requirepass <password>, then redis-cli ACL SETUSER appuser on ><password> -@all +get +set +incr to allow application users only safe commands.
2. Soft deletes / versioning: instead of DEL, mark keys for deletion (e.g., SET <key>:deleted true with a 24-hour TTL). Apps check this tombstone flag before reading, and the value survives for the recovery window.
3. A recovery replica: set replica-read-only yes everywhere so only the primary accepts writes, and keep one extra replica that applications never touch; if an accidental delete is detected in time, detach it (REPLICAOF NO ONE) before the deletion replicates, preserving a copy.
4. Archival: archive data to S3 instead of deleting; keep versioned copies (version 1, version 2, ...) in Redis or cold storage.

Immediate recovery:
1. If the deletion just happened (realistically, only within the ~200ms lag window), halt replication to US-West: redis-cli -h us-west-replica REPLICAOF NO ONE. The key may still exist there.
2. Restore the key from a backup or from the detached replica: RESTORE <key> 0 <serialized-value> (if you have the value from DUMP).
3. Verify divergence: DBSIZE on primary vs replica should now differ.
4. Resync: after the manual fix, run REPLICAOF <primary-host> <port> on the replica to resume replication.

Ongoing prevention: run periodic backups of critical keys to a separate location (S3, DynamoDB). Test recovery: simulate a DEL and verify restoration takes <5 minutes.
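The soft-delete pattern from point 2 can be sketched as a tiny in-memory model: DEL becomes a tombstone with an expiry instead of an actual removal, leaving a recovery window. This is an illustration of the pattern, not a Redis client; in real Redis the tombstone would be SET <key>:deleted true EX 86400 checked by the application before reads.

```python
import time

class SoftDeleteStore:
    """Soft-delete sketch: deletes write a tombstone with a recovery
    window instead of removing the value."""

    def __init__(self, recovery_window_secs=86400):
        self.data = {}
        self.tombstones = {}  # key -> tombstone expiry time
        self.window = recovery_window_secs

    def set(self, key, value):
        self.data[key] = value
        self.tombstones.pop(key, None)  # a new write clears any tombstone

    def delete(self, key):
        # Instead of removing, mark the key deleted for the window.
        self.tombstones[key] = time.time() + self.window

    def get(self, key):
        if key in self.tombstones:
            return None  # reads treat tombstoned keys as gone
        return self.data.get(key)

    def undelete(self, key):
        # Recovery: drop the tombstone, the original value reappears.
        return self.tombstones.pop(key, None) is not None

store = SoftDeleteStore()
store.set("user:123:profile", "Alice")
store.delete("user:123:profile")          # accidental DEL
assert store.get("user:123:profile") is None
store.undelete("user:123:profile")        # recovery within the window
assert store.get("user:123:profile") == "Alice"
```

A background job would still have to purge tombstoned values after the window expires, which is what the TTL on the tombstone key handles in the Redis version.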

Follow-up: If you had a backup of the deleted key, how would you restore it using RESTORE without losing other changes made after deletion?

You're scaling your Redis setup: 1 primary, 2 replicas. Primary handles 100K QPS. After adding replica #3, you notice replication lag spikes to 2 seconds on replica #2 (normal is <10ms). The primary's CPU is fine, but network I/O is saturated. You're now losing writes from your app because replica reads are delayed. How do you fix?

Adding a replica increases the primary's replication fan-out: it must send the full replication stream to every replica. If the NIC is saturated, replica output buffers back up and lag grows.

Check: INFO replication on the primary, and redis-cli CLIENT LIST (look for replica connections, cmd=psync) to see per-replica output buffer sizes. If the primary's network limit (e.g., 1 Gbps) is the bottleneck:

1. Increase network bandwidth: move to an instance type with more network capacity, or use NIC bonding.
2. Diskless replication: repl-diskless-sync yes streams the RDB snapshot directly over the socket instead of writing it to disk first (saves disk I/O at some CPU cost), and repl-diskless-sync-delay 5 waits 5 seconds so multiple replicas starting a sync can share one stream.
3. Deprioritize a replica for failover: redis-cli CONFIG SET replica-priority 50 so Sentinel prefers other replicas during election; that replica can then lag without hurting recovery quality.
4. Chained replication: create a replica-of-replica: primary -> replica-A -> replica-B. Replica-B pulls from replica-A, reducing the primary's fan-out. Configure: redis-cli -p 6380 REPLICAOF <primary-host> 6379 on replica-A, then redis-cli -p 6381 REPLICAOF <replica-a-host> 6380 on replica-B.
5. Buffering: set repl-backlog-size 100mb on the primary so short spikes don't force full resyncs.

Also run redis-cli --bigkeys to identify large keys whose writes dominate the stream. Verify the fix by watching replica offsets (ROLE on the primary lists each replica's offset); alert if lag exceeds 100ms.
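The bottleneck above is simple arithmetic: in a star topology every write byte is re-sent once per replica, so outbound replication bandwidth scales linearly with replica count. A sketch of the sizing check (the 500-byte average write size is an assumption for illustration):

```python
def replication_bandwidth_mbps(write_qps, avg_write_bytes, n_replicas):
    """Outbound replication bandwidth the primary must sustain in a
    star topology: every write is re-sent to every replica."""
    return write_qps * avg_write_bytes * n_replicas * 8 / 1e6

# 100K writes/s of ~500 bytes each:
two = replication_bandwidth_mbps(100_000, 500, 2)    # 800 Mbps
three = replication_bandwidth_mbps(100_000, 500, 3)  # 1200 Mbps

NIC_MBPS = 1000  # a 1 Gbps link
print(f"2 replicas: {two:.0f} Mbps, fits 1 Gbps: {two < NIC_MBPS}")
print(f"3 replicas: {three:.0f} Mbps, fits 1 Gbps: {three < NIC_MBPS}")
```

This is exactly the scenario in the question: two replicas fit under the NIC limit, the third pushes the primary past saturation, so either the link grows or a replica moves to a chained hop.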

Follow-up: If you implement replica-of-replica, how do you handle failover? If replica-A dies, what happens to replica-B?

Your primary suddenly dies (SIGKILL). A replica is promoted by Sentinel. But 2 seconds before the crash, a client SET on the primary that was never replicated to the replica. After failover, the replica doesn't have this key, so users lose their session. How do you prevent this data loss?

This is the fundamental asynchronous-replication problem: writes acknowledged by the primary may not have reached any replica before failover.

1. Bound the loss window: set min-replicas-to-write 1 and min-replicas-max-lag 10 on the primary. The primary then rejects writes whenever fewer than 1 replica is connected and acknowledging (via REPLCONF ACK) within the last 10 seconds. Note this does not make replication synchronous: an individual write can still be lost if the primary dies before the replica receives it; it only stops the primary from accepting writes while effectively unreplicated.
2. Per-write guarantees: for critical writes such as sessions, call WAIT 1 <timeout-ms> after the write. WAIT blocks until at least 1 replica has acknowledged the write's replication offset, at the cost of one replication round trip per write.
3. Dual-write for session data: write to two independent stores (session-store-A and session-store-B) and acknowledge the client only when both succeed, coordinated in application code or with Redlock.
4. Detect loss: after failover, compare the crashed primary's last known offset with the promoted replica's offset at promotion. If the delta is positive, some writes were lost; force affected clients to re-authenticate (accept the session loss explicitly rather than silently).
5. Checkpointing: periodically snapshot critical keys to S3 with metadata (write time, replication offset) and replay after failover if needed.

Monitor: INFO stats fields sync_full and sync_partial_err to detect failed resyncs, and alert on them.
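The min-replicas-to-write gate from point 1 is a simple admission check, and modeling it makes the trade-off in the follow-up concrete: one slow replica can flip the primary into rejecting all writes. A sketch, with lags in seconds:

```python
def accept_write(replica_ack_lags, min_replicas=1, max_lag=10):
    """min-replicas-to-write semantics sketch: the primary accepts a
    write only if at least min_replicas replicas sent a REPLCONF ACK
    within the last max_lag seconds. This bounds (not eliminates) the
    loss window -- replication itself stays asynchronous."""
    healthy = [lag for lag in replica_ack_lags if lag <= max_lag]
    return len(healthy) >= min_replicas

# One fast replica is enough to keep accepting writes.
assert accept_write([0.5, 12.0])

# Every replica stale (or none connected): writes are rejected.
assert not accept_write([12.0, 15.0])
assert not accept_write([])
```

With min_replicas=1 a single healthy replica keeps writes flowing; raising it to 2 trades availability for a tighter loss bound, which is the throughput impact the follow-up question is probing.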

Follow-up: If you enable min-replicas-to-write but a replica becomes slow, how does this impact your write throughput?

You have a complex replication topology: primary in AZ-1, replica-1 in AZ-2 (replicates from primary), replica-2 in AZ-3 (replicates from replica-1). A network partition isolates AZ-1 from AZ-2 and AZ-3. Now replica-1 can't reach primary, and replica-2 can't reach replica-1. Both replicas are stuck (can't receive replication). Meanwhile, primary is still accepting writes (isolated from replicas). After partition heals, you have massive inconsistency. How do you prevent this?

Cascading replication (primary -> replica-1 -> replica-2) amplifies partition failures: one broken hop starves everything downstream.

Prevention:
1. Star topology: connect every replica directly to the primary, not to each other. Configure on both replicas: redis-cli REPLICAOF <primary-host> 6379. All replicas then receive updates within one replication hop.
2. Stop the isolated primary from diverging: set min-replicas-to-write 2 on the primary. If it loses contact with both replicas (as in this partition), it rejects writes instead of accumulating changes the replicas never see.
3. Reduce partition risk: use dedicated cross-AZ links (e.g., AWS Direct Connect, Azure ExpressRoute) rather than the shared network path.
4. If partitions must be tolerated: use CRDT-based active-active replication (e.g., Redis Enterprise), which merges automatically when the partition heals.

Recovery after the partition heals:
1. Replicas retry their upstream connection automatically and resync once it is reachable (a full resync if the divergence exceeded the backlog, which may take time).
2. Verify consistency: check DBSIZE and INFO replication on all nodes.
3. If the cascade was in place, replica-2 may be far behind replica-1. Repoint it directly at the primary: redis-cli -p 6381 REPLICAOF <primary-host> 6379.

Monitor: measure end-to-end replication lag (time for a write to reach all replicas) and alert if it exceeds 1 second.
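The topology audit implied by point 1 can be automated: given each node's configured upstream (from INFO replication's master_host, collected however you like), flag any replica that is more than one hop from the primary. A sketch with hypothetical node names:

```python
def cascaded_replicas(replication_map):
    """Given {node: upstream_node or None for the primary}, return
    replicas whose upstream is itself a replica (more than one hop
    from the primary) -- these amplify partition failures and should
    be repointed directly at the primary."""
    return [node for node, upstream in replication_map.items()
            if upstream is not None
            and replication_map.get(upstream) is not None]

cascade = {"primary": None,
           "replica-1": "primary",
           "replica-2": "replica-1"}   # two hops: flagged
star = {"primary": None,
        "replica-1": "primary",
        "replica-2": "primary"}        # all direct: clean

assert cascaded_replicas(cascade) == ["replica-2"]
assert cascaded_replicas(star) == []
```

Running this as part of a periodic health check catches topology drift, e.g. a replica that was temporarily chained during a bandwidth crunch and never repointed.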

Follow-up: If you detect a partition is healing, how would you verify it's safe to resume writes before it fully heals?

You're implementing a disaster recovery strategy: backup primary A in region-1, backup primary B in region-2 (geographically remote). You want both to act as independent primaries for local reads/writes, but periodically sync. However, you're concerned about write conflicts: if user X updates their profile in region-1, then immediately updates it in region-2, which write should win? How do you design for this?

This is active-active replication with write conflicts. Plain Redis replication offers no conflict resolution between independent primaries: whichever write syncs last simply overwrites, so the outcome depends on timing.

Options:
1. Distributed coordination: use Redlock or ZooKeeper so each write first acquires a lock held by a majority, ensuring only one region writes a given key at a time. Trade-off: every write pays lock RTT plus replication latency.
2. Last-write-wins with explicit timestamps: store each value with its write timestamp and only overwrite if the incoming timestamp is newer. In Redis this can be enforced atomically with a Lua script, e.g. EVAL "local new_ts = tonumber(ARGV[1]); local old_ts = tonumber(redis.call('HGET', KEYS[1], 'ts') or '0'); if new_ts > old_ts then redis.call('HSET', KEYS[1], 'val', ARGV[2], 'ts', ARGV[1]) end" 1 user:123:profile <ts> <value>.
3. Conflict-free replicated data types (CRDTs): use mergeable structures (max-timestamp registers for "most recent value", grow-only counters for += operations) so concurrent updates combine deterministically instead of conflicting.
4. Event sourcing: instead of storing state, append every change to a per-region log using Redis Streams: XADD region1:events * user 123 event profile-update name Bob. During sync, read each region's stream (XRANGE/XREAD), merge the events into a single deterministic order, and replay them identically on both primaries.
5. Versioning: each write creates a new version (user:profile:v1, user:profile:v2); on conflict, keep both and let the application merge.

Test: simulate concurrent writes of different values to the same key in both regions, run the sync, and verify the final state converges to the same value in both.
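The last-write-wins merge from point 2 can be sketched independently of Redis: each region holds a snapshot mapping key -> (value, timestamp), and the merge keeps the higher timestamp, breaking ties deterministically so both regions converge to the same result regardless of merge order.

```python
def lww_merge(snapshot_a, snapshot_b):
    """Last-write-wins merge of two region snapshots, each mapping
    key -> (value, timestamp). The higher timestamp wins; ties break
    on the value itself so both regions converge identically."""
    merged = {}
    for snapshot in (snapshot_a, snapshot_b):
        for key, (val, ts) in snapshot.items():
            current = merged.get(key)
            if current is None or (ts, val) > (current[1], current[0]):
                merged[key] = (val, ts)
    return merged

# User X updated their profile in region-1 at t=100, then region-2 at t=105.
region1 = {"user:X:profile": ("name=Alice,city=NYC", 100)}
region2 = {"user:X:profile": ("name=Alice,city=SF", 105)}

# The later write wins, and the merge is symmetric.
assert lww_merge(region1, region2) == lww_merge(region2, region1)
assert lww_merge(region1, region2)["user:X:profile"][1] == 105
```

The symmetry assertion is the important property: if merging A-then-B and B-then-A gave different answers, the two regions would diverge again on every sync. LWW still silently drops the losing write, which is why event sourcing or CRDT counters are preferable when both updates carry information.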

Follow-up: If you choose event sourcing, how would you handle events that fail to apply on the secondary region (e.g., due to data corruption or schema changes)?
