Your Redis primary server crashes at 2 AM. Sentinel detects it within 30 seconds and promotes a replica. However, 50 clients are still trying to connect to the old primary (DNS still resolves to the old IP). For 3 minutes, they get connection refused. How do you design failover to minimize this impact?
This is classic failover discovery lag. Use Redis Sentinel's SENTINEL GET-MASTER-ADDR-BY-NAME command in client code: instead of hardcoding the primary IP, query Sentinel (which learns about failovers immediately). Implement this via a Sentinel-aware connection pool in the client library. In the primary's redis.conf (not Sentinel's config), set min-replicas-to-write 1 and min-replicas-max-lag 10 so the primary stops accepting writes when no sufficiently caught-up replica exists, bounding potential data loss. On replicas, set replica-serve-stale-data no so they reject reads while disconnected from the primary. In sentinel.conf, set sentinel down-after-milliseconds mymaster 5000 for quick detection and sentinel parallel-syncs mymaster 1 so replicas resync with the new primary one at a time after failover, avoiding a thundering herd. For the 3-minute lag: implement client-side retry with exponential backoff, re-querying Sentinel on each attempt. ioredis and go-redis have built-in Sentinel support. Monitor with SENTINEL MASTERS to verify failover completes, and run CLIENT LIST to inspect connection states during failover.
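A minimal sketch of the retry-with-backoff pattern described above, with the Sentinel lookup injected as a callable so it can be tested without a live cluster. In real code `get_master_addr` would wrap a client call such as redis-py's `Sentinel.discover_master`; the helper names here are hypothetical:

```python
import time
from typing import Callable, Optional, Tuple

def backoff_schedule(base: float, cap: float, attempts: int) -> list:
    """Exponential backoff delays: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def connect_with_retry(
    get_master_addr: Callable[[], Tuple[str, int]],
    connect: Callable[[str, int], object],
    attempts: int = 5,
    base_delay: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Re-query Sentinel for the current primary before EVERY attempt,
    so a failover that happens mid-retry is picked up immediately."""
    for delay in backoff_schedule(base_delay, cap, attempts):
        host, port = get_master_addr()  # fresh Sentinel lookup each time
        try:
            return connect(host, port)
        except ConnectionError:
            sleep(delay)
    return None
```

The important detail is that the Sentinel query sits inside the loop: retrying against a cached address would just hammer the dead primary.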
Follow-up: If the replica you promoted has uncommitted writes (in the replication backlog but not synced), what data loss occurs and how do you prevent it?
Sentinel is monitoring 3 Redis primaries. After a network partition, 2 Sentinels can't reach the majority quorum. They declare the primaries down, but they're still running and accepting writes from the partition where they live. Now you have split-brain: 2 primaries writing independently. How do you prevent and recover from this?
Split-brain happens when Sentinel loses quorum. Prevent by: (1) running 5 Sentinel nodes (3 is the minimum; 5 tolerates two failures), (2) setting the quorum to 3 in the monitor directive (sentinel monitor mymaster <ip> <port> 3) — and note that regardless of quorum, actually performing a failover requires authorization from a majority of all Sentinels, (3) deploying Sentinels across isolated failure zones. During a partition, the side with fewer than 3 Sentinels cannot reach majority and won't promote a replica; with min-replicas-to-write set on the primaries, an isolated primary that can no longer see its replicas stops accepting writes, limiting divergence. Recovery: (1) heal the network partition, (2) Sentinels re-sync state and converge on a single primary, (3) Sentinel reconfigures the stale primary as a replica of the winner. To detect split-brain during the event: poll SENTINEL MASTERS on every Sentinel node and alert if they disagree on the primary's address. For forensics: use LASTSAVE on each primary to identify which partition had writes (later timestamp), then export keys from the winning primary and replay missed writes from the losing partition if critical. Long-term: implement a read-from-replica pattern for non-critical data to reduce failover blast radius.
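The "alert if Sentinels disagree" check can be a small pure function: feed it each Sentinel's answer to SENTINEL GET-MASTER-ADDR-BY-NAME for a given master name and flag disagreement. A sketch (function and names are illustrative, not a standard API):

```python
from collections import Counter
from typing import Dict, Optional, Tuple

def detect_split_brain(views: Dict[str, Tuple[str, int]]) -> Optional[str]:
    """views maps each Sentinel's name to the primary address it reports.
    Returns an alert string if the Sentinels disagree, else None."""
    addrs = Counter(views.values())
    if len(addrs) <= 1:
        return None  # unanimous (or no data): no split-brain signal
    majority_addr, _ = addrs.most_common(1)[0]
    dissenters = sorted(s for s, a in views.items() if a != majority_addr)
    return f"split-brain suspected: {dissenters} disagree with majority view {majority_addr}"
```

Run this once a minute from a host that can reach all Sentinels; during a real partition you may only be able to poll one side, so absence of an alert is not proof of health.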
Follow-up: During split-brain, if both partitions have replicas, how does Sentinel decide which primary to promote in each partition?
Sentinel has been running for 6 months and works fine, but recently 10-20% of CLIENT SETNAME commands fail with errors during connection. After restarting Sentinel, failover is slow (30 seconds instead of 5). You check SENTINEL REPLICAS for the master and see slave-repl-offset is not advancing on one replica. What's happening?
Sentinel's connection state or the replica's replication link is unhealthy. First, check the Sentinel log (e.g. /var/log/redis/sentinel.log, or wherever the logfile directive points) for reconnection errors and +sdown/-sdown flapping. If the replica's offset isn't advancing, replication is blocked. On the lagging replica, run INFO replication and check role:slave, master_link_status, and slave_repl_offset. If slave_repl_offset is frozen or master_link_status is down, the replica's connection to the primary is broken (confirm with CLIENT LIST on the primary). Temporary fix: restart Sentinel (SHUTDOWN NOSAVE on the Sentinel instance), which clears its cached connection state. Permanent fix: (1) in sentinel.conf, set sentinel deny-scripts-reconfig yes and a sane sentinel failover-timeout (e.g. 180000 ms) so repeated failover attempts back off, (2) poll SENTINEL REPLICAS <master-name> every minute and alert if any replica's slave-repl-offset stops increasing for >30 seconds, (3) graph replication lag by diffing the primary's master_repl_offset against each replica's offset over time. During testing, SUBSCRIBE to Sentinel's Pub/Sub events (+sdown, +odown, +switch-master) to watch failover decisions live.
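The offset-stall alert in step (2) boils down to remembering when each replica's offset last advanced. A minimal sketch (class name and threshold are hypothetical; feed it slave-repl-offset samples from SENTINEL REPLICAS):

```python
import time
from typing import Dict, Tuple

class OffsetStallDetector:
    """Track slave-repl-offset samples per replica and flag any replica
    whose offset hasn't advanced within `max_stall_seconds`."""

    def __init__(self, max_stall_seconds: float = 30.0):
        self.max_stall = max_stall_seconds
        # replica name -> (last offset seen, timestamp of last advance)
        self._last: Dict[str, Tuple[int, float]] = {}

    def sample(self, replica: str, offset: int, now: float = None) -> bool:
        """Record a sample; return True if the replica is stalled."""
        now = time.time() if now is None else now
        prev = self._last.get(replica)
        if prev is None or offset > prev[0]:
            self._last[replica] = (offset, now)  # offset advanced: reset clock
            return False
        return (now - prev[1]) > self.max_stall
```

A frozen offset with an open connection usually means the replica is blocked (e.g. loading an RDB or stuck on disk I/O), so pair the alert with a check of the replica's own INFO output.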
Follow-up: If Sentinel kept attempting failover every 30 seconds but failed each time, how would you break the retry loop without restarting?
You run Sentinel with a notification script that alerts ops when failover starts. After 500 failovers in production, the script is called but your PagerDuty/Slack integration doesn't trigger consistently. Sentinel's log shows 500 failover events, but only 200 alerts arrived. What's broken?
Sentinel notification scripts can fail or block, causing notifications to be lost. Check sentinel.conf for the sentinel notification-script <master-name> <script-path> directive and verify the script exists and is executable. Then look at the script's failure modes: Sentinel kills any script that runs longer than 60 seconds, and a script exiting with status 1 is retried only a limited number of times (up to 10) before the event is dropped. A slow or flaky network call to PagerDuty/Slack inside the script explains alerts silently disappearing at scale. The fix: make the script do almost nothing — write the event to a local spool (file, Redis list, or message queue) and exit 0 immediately, then have a separate delivery daemon ship spooled events to the alerting system with its own retries and timeouts.
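A sketch of the spool-and-exit pattern: Sentinel invokes the script with the event type and description as arguments; the script appends one JSON line to a local spool and exits 0 immediately. The spool path and record format are arbitrary examples:

```python
#!/usr/bin/env python3
"""Sentinel notification script: spool the event locally, exit fast.
Sentinel calls it as: <script> <event-type> <event-description>.
A separate delivery daemon ships the spool to PagerDuty/Slack with retries."""
import json
import sys
import time

SPOOL = "/var/spool/redis-alerts/events.jsonl"  # hypothetical path

def spool_event(event_type: str, description: str, path: str = SPOOL) -> dict:
    """Append one JSON record per event; appending a single line is cheap
    and keeps the script well under Sentinel's 60-second kill limit."""
    record = {"ts": time.time(), "type": event_type, "desc": description}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

if __name__ == "__main__":
    # Always exit 0 so Sentinel never retries or gives up on us;
    # delivery reliability is the spool daemon's job, not this script's.
    try:
        spool_event(sys.argv[1], " ".join(sys.argv[2:]))
    except Exception:
        pass
    sys.exit(0)
```

The delivery daemon can then batch, dedupe, and retry against the alerting API without any of that latency touching Sentinel's event loop.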
Follow-up: If a notification script needs to make a network call to your alerting system and the network is slow, how would you prevent it from blocking Sentinel?
You have 3 Sentinels, but Sentinel 1 keeps thinking the primary is down (pfail every 10 minutes) while Sentinels 2 and 3 say it's healthy. This causes flapping: Sentinel 1 proposes failover, gets rejected by quorum, then goes quiet for 5 minutes. SENTINEL MASTERS shows inconsistent state across Sentinels. How do you troubleshoot and fix?
This is clock skew or a flaky network between Sentinel 1 and the primary. First, check NTP sync on all nodes: ntpstat and date -u to verify clocks agree within a second across the Sentinel and Redis hosts. If Sentinel 1 is ahead/behind, use timedatectl set-ntp true to re-sync; note that Sentinel enters TILT mode when it detects a large jump in its local clock (look for +tilt in its log) and distrusts its own failure detection until timing stabilizes. If clocks are in sync, suspect packet loss from Sentinel 1 to the primary. Run ping -i 0.2 <primary-ip> from the Sentinel 1 host for several minutes and read the packet-loss summary (mtr shows per-hop loss on multi-hop paths). If loss correlates with the pfail events, fix that network path; as a mitigation, raise sentinel down-after-milliseconds for the master on Sentinel 1 so brief loss no longer trips pfail.
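To automate the ping check, parse the loss percentage out of the Linux ping summary line. A small helper, assuming the usual iputils summary format:

```python
import re
from typing import Optional

def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Parse the summary line of Linux `ping` output, e.g.
    '100 packets transmitted, 93 received, 7% packet loss, time 9912ms',
    returning the loss percentage, or None if no summary is found."""
    m = re.search(r"([\d.]+)% packet loss", ping_output)
    return float(m.group(1)) if m else None
```

Log this value alongside Sentinel 1's pfail timestamps; if spikes in loss line up with the pfail events, you have your correlation.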
Follow-up: If fixing clock skew requires a reboot and you can't restart Sentinel during business hours, what's a temporary fix?
A replica was promoted to primary via SENTINEL FAILOVER. The old primary comes back online 2 hours later but Sentinel doesn't automatically resync it back as a replica. It's still a primary, creating risk of split-brain if another failover happens. How do you safely resync it?
The old primary doesn't know it's been replaced. Normally Sentinel reconfigures a returning old primary as a replica of the new one automatically; if that hasn't happened (e.g. Sentinel's state was lost, or the node came back at an address Sentinel no longer monitors), demote it manually. First, verify topology: run SENTINEL GET-MASTER-ADDR-BY-NAME <master-name> to confirm which node Sentinel considers the current primary, then INFO replication on the old primary to see its state. If it still reports role:master, run REPLICAOF <new-primary-ip> <new-primary-port> (SLAVEOF on pre-5.0 versions) on it. The full resync that follows discards its local dataset, so snapshot it first (BGSAVE, then copy dump.rdb) if its divergent writes might matter.
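The manual demotion check can be scripted: parse INFO replication output and emit the REPLICAOF command only if the node still claims to be a master. A sketch (helper names are illustrative; in practice you'd issue the command via a Redis client rather than print it):

```python
from typing import Dict, Optional, Tuple

def parse_info(info_text: str) -> Dict[str, str]:
    """Parse `INFO replication` output (colon-separated key:value lines,
    '#'-prefixed section headers) into a dict."""
    fields = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            k, _, v = line.partition(":")
            fields[k.strip()] = v.strip()
    return fields

def demote_command(info_text: str, new_primary: Tuple[str, int]) -> Optional[str]:
    """If this node still reports role:master, return the REPLICAOF command
    that demotes it under the new primary; otherwise None (already demoted)."""
    if parse_info(info_text).get("role") == "master":
        host, port = new_primary
        return f"REPLICAOF {host} {port}"
    return None
```

Gating on the parsed role makes the script idempotent: re-running it against an already-demoted node is a no-op.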
Follow-up: If the old primary has writes that weren't replicated to the new primary (due to the network partition), how do you recover that data?
Your Sentinel setup monitors 5 Redis primaries. Sentinel detects primary 3 is down and successfully fails over to replica A. But 1 week later, you run MONITOR on primary 3 and see it's accepting commands again. Sentinel still lists it as the old primary (not the current primary). Replication is confused. How did this happen and how do you fix it?
Primary 3 was restarted (likely auto-restarted by systemd or container orchestration) and came back online. Sentinel should have reconfigured it as a replica of replica A when it reappeared; if the restart also discarded Sentinel's rewritten config (e.g. an immutable container image shipping the original sentinel.conf), Sentinel reverts to its pre-failover view of the topology and the two states diverge. Primary 3 is also a week stale. Check: (1) LASTSAVE on primary 3 to see its last persistence time, (2) INFO replication on replica A (the new primary) to see its master_repl_offset, (3) INFO replication on primary 3 to compare. If primary 3's offset is far behind, it's stale. To fix: run REPLICAOF <new-primary-ip> <new-primary-port> on primary 3 — the full resync that follows replaces its dataset with the new primary's, so FLUSHALL isn't needed; snapshot its RDB first (BGSAVE, copy dump.rdb) if any stale data might be worth recovering. Then ensure Sentinel's rewritten config survives restarts by persisting sentinel.conf on a durable volume.
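A heuristic staleness check, comparing the returning node's offset and LASTSAVE time against the current primary's. Thresholds are arbitrary examples, and offsets from divergent replication histories are only roughly comparable, so treat the result as a signal rather than proof:

```python
from typing import Tuple

def staleness_report(
    old_node_offset: int,        # master_repl_offset from INFO on the returning node
    new_primary_offset: int,     # master_repl_offset from INFO on the current primary
    old_node_lastsave: int,      # LASTSAVE on the returning node (unix seconds)
    new_primary_lastsave: int,   # LASTSAVE on the current primary (unix seconds)
    offset_slack: int = 0,
) -> Tuple[bool, str]:
    """Return (is_stale, human-readable explanation)."""
    behind = new_primary_offset - old_node_offset
    if behind > offset_slack:
        save_gap = new_primary_lastsave - old_node_lastsave
        return True, (f"node is ~{behind} bytes behind; its last save is "
                      f"{save_gap}s older than the primary's")
    return False, "node offset is not behind the current primary"
```

If the report says stale, snapshot the node's RDB before demoting it, since the resync will overwrite everything local.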
Follow-up: If primary 3's data was critical and you need to preserve writes that happened after it went down, what's the recovery strategy?
You're implementing failover in a microservices environment. 10 services use Redis: 5 use Sentinel connection pooling, 5 use a hardcoded primary IP in config. After a failover, the 5 services with hardcoded IPs experience a 5-minute outage while you update config and redeploy. The 5 with Sentinel pooling recover in 10 seconds. What's the long-term solution for this heterogeneous setup?
Migrate all services to Sentinel-aware clients, but do it incrementally. Immediate: (1) have service configs read the primary IP from a service discovery system (Consul, a Kubernetes ConfigMap) that is updated on failover, (2) use a DNS alias (primary.redis.local) that ops can repoint during failover. Intermediate (2-3 months): (1) retrofit the hardcoded-IP services to query Sentinel on startup and periodically (every 60 seconds) to refresh the primary address, (2) use connection pooling with health checks so pools detect primary changes. Long-term: (1) enforce Sentinel-aware clients at the infrastructure level (ioredis, go-redis, and Jedis all support Sentinel), (2) apply anti-affinity so Sentinel and Redis run on different machines/pods and a single host failure can't take out both. For the migration itself: (1) deploy a local Sentinel-aware proxy on each host that services connect to over localhost, letting the proxy follow failovers, or (2) adopt Redis Cluster instead of Sentinel if you want failover handled inside the data plane. Test with chaos engineering: trigger a manual failover with SENTINEL FAILOVER <master-name> during business hours in staging and measure recovery time per service.
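The periodic-refresh retrofit can be a tiny cache: re-query Sentinel at most once per TTL, so every consumer converges on the new primary within one refresh interval. A sketch with the Sentinel query injected as a callable (names are hypothetical; `query` would wrap SENTINEL GET-MASTER-ADDR-BY-NAME in a real client):

```python
import time
from typing import Callable, Optional, Tuple

class PrimaryAddressCache:
    """Cache the primary's address, refreshing from Sentinel at most every
    `ttl` seconds. Services read from this instead of a hardcoded IP."""

    def __init__(self, query: Callable[[], Tuple[str, int]], ttl: float = 60.0):
        self._query = query            # wraps the Sentinel address lookup
        self._ttl = ttl
        self._addr: Optional[Tuple[str, int]] = None
        self._fetched_at = float("-inf")

    def get(self, now: float = None) -> Tuple[str, int]:
        now = time.time() if now is None else now
        if self._addr is None or (now - self._fetched_at) >= self._ttl:
            self._addr = self._query()  # fresh Sentinel lookup
            self._fetched_at = now
        return self._addr
```

Worst case, a service chases the dead primary for one TTL after failover; pairing this with connection-error-triggered refresh closes that gap further.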
Follow-up: If you're running Redis in Kubernetes and using a Helm chart for Sentinel, how would you automate failover discovery for services without modifying their config?