You're implementing Raft consensus for a distributed key-value store. Your 5-node cluster is running fine in steady state. But during a network partition (2 nodes isolated from the other 3), the isolated pair still accepts writes. Later, when the network heals, you have conflicting data. Walk through what went wrong and how Raft should have prevented this.
This is a classic problem: the isolated 2-node pair must not be able to commit writes during the partition. Here's what should happen: (1) Initial state: 5-node cluster, leader is Node A. (2) Network partition occurs: Node A + B on one side, Node C + D + E on the other. (3) Correct behavior: Node C's group (3 nodes) has quorum (a majority of 5). They detect leader A is dead (no heartbeat), elect a new leader (say Node C). Node C can now accept and commit writes. Nodes A + B are in the minority—even though A still thinks it's the leader, it can't reach 3 nodes, so nothing it appends can ever be committed. A may append entries to its local log, but it must never acknowledge them to clients as committed. (4) What went wrong: Node A didn't enforce the quorum check before acknowledging. Raft requires that a leader replicate an entry to a quorum before treating it as committed. If A can only reach B (2 of 5 nodes), it's a minority and must fail or time out client writes. Instead, A evidently acknowledged writes after replicating only to B (or to no one). (5) Root cause in code: likely missing logic to count quorum acks. When a leader gets a write, it should send the entry to all followers and wait for quorum ACKs before committing and responding to the client. If it can't get quorum (only B responds; C/D/E don't), the leader must not report success. (6) After the partition heals: both sides have acknowledged writes. Data conflict. This is catastrophic—it violates safety. Prevention: (1) Strict quorum logic: never acknowledge a write until it is replicated to a quorum. (2) Configuration change handling: if the cluster size is unknown or misconfigured, enforce majority-of-all-configured-nodes. (3) Testing: chaos-test network partitions regularly; verify isolated minority partitions can't commit writes. For repair: (1) Manual intervention: stop the minority partition, wipe its data, rejoin it to the cluster. (2) Quorum commit recovery: the majority partition is the one that retains data; minority-partition data is discarded. (3) Durability expectation: only quorum-acknowledged writes survive. The minority's falsely acknowledged writes are lost—in CAP terms, a correct system would have sacrificed availability on the minority side instead of accepting them.
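The missing check can be sketched in a few lines (hypothetical helper names; a real leader tracks acks per log index rather than a single global set):

```python
# Minimal sketch of the quorum rule (hypothetical names): a leader only
# treats a write as committed once a majority of the FULL configured
# cluster -- including itself -- has acknowledged it.

def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a quorum."""
    return cluster_size // 2 + 1

def can_commit(acks: set, cluster_size: int) -> bool:
    """True only if the acking nodes (leader included) form a quorum."""
    return len(acks) >= majority(cluster_size)

# Partitioned leader A can only reach B: 2 acks out of 5 -- must NOT commit.
assert can_commit({"A", "B"}, 5) is False
# Majority side: C, D, E -- 3 acks out of 5 -- safe to commit.
assert can_commit({"C", "D", "E"}, 5) is True
```

The key point is that the denominator is the full configured cluster size, not the number of currently reachable nodes—counting only reachable nodes is exactly the bug described above.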
Lesson: Raft's brilliance is the quorum requirement. Any implementation that doesn't strictly enforce quorum is broken.
Follow-up: During the partition, Node A is still a "leader" but can't reach quorum. Should it still send heartbeats? What should followers on its side do?
You're monitoring a 9-node Raft cluster. You notice: (1) leader is not elected, election timeout keeps resetting, (2) nodes have conflicting log entries, (3) replication lag is 10+ seconds. The cluster is in a livelock state. What's your debug process and how do you recover?
Livelock in Raft usually means elections are failing or logs are severely out of sync. Diagnostics: (1) Check if there's a leader: query each node's state. If no leader is elected, you're stuck in election mode. If multiple nodes think they're the leader for the same term, you have split-brain (which shouldn't happen if Raft is implemented correctly). (2) Check term numbers: Raft assigns each leader a term. Rapidly climbing terms mean elections keep failing and nodes keep retrying; a wide spread in terms across nodes suggests some nodes can't hear the others. (3) Check logs: look at each node's log. Are the entries at, say, index 1-100 identical across all nodes? If not, logs have diverged—replication failed or was interrupted. (4) Check for network issues: are nodes unable to communicate? Measure latency between nodes. If it exceeds ~100ms or packet loss exceeds ~1%, investigate the network. (5) Check for slow disks: Raft persists log entries and vote state to disk before responding. If disk I/O is slow (> 100ms per write), elections and replication can be slow enough to keep timing out. (6) Analyze logs from the last election attempts. Common failure modes: (a) Elections time out without a winner: no candidate collects a majority of votes before the next timeout fires. (b) Split votes: nodes vote for different candidates in the same term, so no one wins. (c) Candidates crash before completing an election. Root causes, in order of likelihood: (a) Misconfigured cluster (wrong node IDs, stale membership lists). (b) Clock-rate skew or timer bugs causing timeouts to behave unexpectedly. (c) Log divergence or corruption causing AppendEntries consistency checks to fail repeatedly. (d) Memory leak or performance issue: the Raft state machine is consuming too much memory, causing GC pauses long enough to miss heartbeats. For recovery: (1) Immediate: stop the cluster. Check which node has the most up-to-date log (highest last-entry term, then highest index). (2) Verify quorum: is a quorum of nodes healthy? If yes, proceed. If not, restore from backup.
(3) Reset cluster state: (a) Pick the most up-to-date node as the seed. (b) Wipe other nodes' logs. (c) Re-add nodes one by one to the cluster. (d) Use a forced reconfiguration command (available in most Raft implementations, e.g., etcd's "force-new-cluster"). (4) Gradual restart: don't start all nodes at once. Start the seed, wait for election timeout (usually 1 second), then start other nodes. This gives the seed time to become leader before nodes start requesting votes. (5) Verify recovery: after restarting, check: (a) single leader elected. (b) all nodes at same term. (c) all nodes have same log. (d) replication lag is 0. Testing: implement chaos tests that break Raft deliberately: kill leader mid-replication, corrupt logs on disk, introduce network latency, kill quorum. Verify cluster always recovers correctly.
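Step (a) of the reset—picking the most up-to-date node as the seed—uses Raft's own comparison rule: the higher last-log term wins, with ties broken by the longer log. A sketch with made-up node data:

```python
# Raft's "more up-to-date" rule, used here to pick the seed node.
# Each node is summarized as (last_log_term, last_log_index); Python's
# tuple comparison applies exactly the right ordering: term first,
# then index.

def more_up_to_date(a: tuple, b: tuple) -> tuple:
    """Return whichever (last_log_term, last_log_index) is more up to date."""
    return a if a >= b else b

# Hypothetical survey of the cluster after stopping it:
nodes = {
    "n1": (5, 120),
    "n2": (5, 118),
    "n3": (4, 300),  # longest log, but an older term -- NOT most up to date
}
seed = max(nodes, key=lambda n: nodes[n])
assert seed == "n1"
```

Note that n3's longer log does not win: an entry count means nothing against a newer term, which is why "pick the node with the biggest log" is the wrong heuristic.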
Follow-up: If you force a new cluster (wiping logs) to recover, won't you lose committed data?
You're implementing Raft replication. When the leader sends log entries to followers, some followers respond slowly or not at all (network issues). Should the leader wait for all followers, or should it commit once a quorum acknowledges? Describe the implications of each approach.
This is a fundamental design choice. Two approaches: (1) Wait for all followers (unanimous): the leader sends the log entry and waits for an ACK from every node. It commits only when all have confirmed. Pros: maximum redundancy (data on all nodes before commit). Cons: if one node is slow, the entire cluster stalls—one slow node caps throughput—and one dead node halts writes entirely. (2) Wait for a quorum (majority): the leader sends the log entry and commits as soon as a majority (5 of 9, counting itself) has acknowledged. Pros: tolerates slow and failed nodes; if 4 nodes are slow, the cluster keeps running at full speed on the 5 responsive ones. Cons: a minority of followers can lag behind and must catch up after a crash—this is safe, since the committed data is still on a majority. Raft uses approach (2): quorum. This is critical for liveness. Consider a 9-node cluster with one dead node: requiring unanimous acks means waiting on the dead node forever—the cluster is down. Requiring a quorum of 5 means the 8 live nodes can always make progress. Implications: (1) Durability: data is durable once confirmed by a quorum (each node persists the entry before acknowledging). If a minority of nodes crashes, the data survives on the majority. If a majority crashes and loses its disks simultaneously, data can be lost—but that is outside Raft's fault model (it tolerates f failures in a 2f+1 cluster). (2) Replication lag: quorum commitment means some followers lag behind. A crashed follower might be missing recent entries; on recovery, it syncs from the leader (catch-up replication). (3) Write latency: the quorum approach is faster because you wait only for the quorum-th fastest response (roughly the median node), not the slowest one. (4) Optimization: the leader sends entries to all followers in parallel (not waiting for one to respond before sending to the next), collects responses asynchronously, and commits the moment the quorum count is reached. Example (9-node cluster): (1) Leader appends the entry to its own log: "set key=value". (2) Leader sends it to all 8 followers in parallel.
(3) Followers respond: Node 1 (10ms), Node 2 (50ms), Node 3 (100ms), Node 4 (down), Nodes 5-9 (all respond within 30ms). (4) Well within 30ms the leader has a quorum: itself, Node 1, and enough of Nodes 5-9 to reach 5 acks. It commits immediately (by 30ms it holds 7 acks in total). (5) Nodes 3 and 4 still haven't acked; the leader keeps retrying them. (6) If Node 4 comes back online, the leader sends all the entries it missed (catch-up). For implementation: (1) the leader maintains: (a) commitIndex: the highest entry known to be committed. (b) matchIndex[i]: the highest entry replicated to follower i. (2) On each follower ack, update matchIndex[i]. (3) Scan the matchIndex values (counting the leader's own log), and find the highest index replicated on at least a quorum of nodes. That becomes the new commitIndex—with the caveat that Raft only advances commitIndex over entries from the leader's current term. (4) When commitIndex advances, apply the newly committed entries to the state machine.
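The commitIndex calculation in steps (1)-(4) reduces to: sort the replication indices (including the leader's own last index) in descending order and take the quorum-th entry. A minimal sketch with hypothetical node names (real Raft additionally checks that the entry at that index carries the leader's current term before advancing):

```python
def new_commit_index(match_index: dict, leader_last_index: int) -> int:
    """Highest log index replicated on a quorum (the leader's own log counts)."""
    indices = sorted(list(match_index.values()) + [leader_last_index],
                     reverse=True)
    quorum = len(indices) // 2 + 1
    # The quorum-th highest index is, by construction, present on at
    # least `quorum` nodes.
    return indices[quorum - 1]

# 9-node cluster: leader is at index 100, followers at varying points;
# n4 is down and has replicated nothing.
match = {"n2": 100, "n3": 100, "n4": 0, "n5": 99,
         "n6": 100, "n7": 100, "n8": 98, "n9": 100}
assert new_commit_index(match, 100) == 100  # 7 of 9 nodes at index 100
```

This is why one dead follower (n4) and two laggards (n5, n8) don't hold the commit point back: the sorted-index trick naturally ignores the slowest minority.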
Follow-up: If the leader crashes immediately after committing an entry to a quorum but before all followers have it, can the new leader undo that entry?
You're comparing Raft vs. Paxos for your distributed database. Both guarantee safety (consistency), but Paxos is more complex. What are the trade-offs? When would you choose Paxos over Raft?
Both Raft and Paxos solve the consensus problem: multiple nodes must agree on a sequence of values in the presence of failures. Trade-offs: (1) Complexity: Raft is simpler. It's designed to be understandable: leader-based, clear roles (leader, candidate, follower), straightforward election. Paxos is notoriously hard to implement correctly: proposer, acceptor, and learner roles, multiple phases, and many subtle corner cases—production implementations are famously easy to get subtly wrong. (2) Performance: (a) Raft: the leader batches writes, replicates to a quorum, commits. Latency: one network round-trip to a quorum plus disk I/O. (b) Basic Paxos: multi-phase (prepare/promise, then propose/accept)—more round-trips per commit, so higher latency. Multi-Paxos amortizes the prepare phase under a stable leader, which narrows the gap considerably. (c) Throughput: Raft and Multi-Paxos are comparable in principle; Raft tends to be easier to make fast because the common path is simple. (3) Fault tolerance: both tolerate f failures in a cluster of 2f+1 nodes. Identical. (4) Leader election: (a) Raft: leader-based; when the leader fails, a new election happens. Simple. (b) Basic Paxos: leaderless; any node can propose, so no election is needed. More flexible, but every proposal requires coordination, which is slower. (Note: practical Paxos deployments add a distinguished leader for efficiency—e.g., Google's Chubby—so this advantage largely disappears.) (5) Reconfiguration (adding/removing nodes): (a) Raft: membership changes are explicit log entries, with a well-specified protocol in the paper and dissertation. (b) Paxos: reconfiguration is less standardized and historically harder to get right. (6) Consistency guarantees: both achieve linearizability (strong consistency). Identical. When to choose Paxos over Raft: (1) Leaderless requirement: if you genuinely cannot have a leader (e.g., a peer-to-peer system with no central authority), Basic Paxos's leaderless nature is appealing. In practice you'd add a leader for efficiency, so this is usually moot. (2) Byzantine fault tolerance: neither Paxos nor Raft handles Byzantine faults (malicious nodes). PBFT—a separate protocol by Castro and Liskov, not a Paxos variant—addresses this; Raft has no standard Byzantine variant.
If you need Byzantine tolerance, PBFT is an option (but very complex). (3) Legacy systems: if your team has deep Paxos expertise (e.g., Google) and existing Paxos infrastructure, you'd stick with Paxos for continuity. (4) Research/academic: Paxos is the more heavily studied protocol theoretically; if you're doing research on consensus, Paxos is the standard reference point. In practice: almost no one chooses Paxos for new systems. Raft has won. Why? (1) Raft is understandable, so implementations have fewer bugs. (2) Performance is equivalent. (3) The extra complexity buys no marginal benefit. Examples: etcd (Raft), CockroachDB (Raft), Consul (Raft). Paxos is used in Google's Chubby and Spanner, and in Cassandra's lightweight transactions—though Cassandra is moving toward a newer consensus protocol (Accord) for transactions. Recommendation: use Raft. It's battle-tested, simpler, and just as fast.
Follow-up: If Raft is leader-based and the leader goes down, isn't there a period where no writes are accepted? How long is acceptable?
You've deployed a 5-node etcd cluster using Raft for service discovery. One node's disk fails, data is corrupted. It rejoin the cluster immediately. You notice the node is now replicating an infinite loop of log entries—consuming 100% CPU. What's happening and how do you fix it?
An infinite replication loop after disk corruption is a known failure mode. Root cause: (1) The node's disk corrupts some log entries (bit flip, incomplete write). (2) On rejoin, the leader detects the node is behind. (3) The leader sends log entries to catch it up. (4) The follower persists the entries, but the disk is still bad (or the WAL is corrupted). (5) The follower acknowledges receipt, but when it later tries to read the entries back, they're corrupted or missing. (6) The follower requests the entries again; the leader resends. (7) The loop repeats forever. Diagnostic signs: (1) 100% CPU on one node. (2) High I/O latency on that node (disk thrashing). (3) Log size exploding (entries duplicated). (4) The leader sees persistent replication lag for that follower and keeps retrying. Fix (immediate): (1) Stop the corrupted node. Take it completely offline. (2) The cluster continues with 4 nodes: a 5-node cluster needs 3 for quorum, and 4 healthy nodes clear that bar, so the cluster keeps running. (3) Verify cluster health: check the leader and replication lag across the remaining 4 nodes. (4) Once stable, rebuild the corrupted node: (a) Wipe its disk completely. (b) Restore a snapshot from a healthy node (or initialize empty). (c) Rejoin the cluster; the leader replicates everything it's missing. (d) Monitor to ensure the loop doesn't recur. Root cause analysis: (1) Disk health: run SMART diagnostics; if the disk is failing, replace it. (2) WAL robustness: does your Raft implementation validate checksums on log entries? If not, add them—corrupted entries are then detected on read instead of silently acknowledged. (3) Graceful degradation: if a node detects its own disk is bad, it should crash loudly and require manual intervention (safety) rather than rejoin anyway. Once the disk is replaced, rejoin. Prevention: (1) Use application-level checksums on log entries (CRC32 or a cryptographic hash). (2) On read, validate the checksum; if invalid, treat the entry as corrupted and don't use it. (3) Implement snapshots plus incremental backups.
If a node loses data, it can restore from a snapshot instead of replaying all entries. (4) Disk failure prediction: monitor SMART, alert on imminent failure, and pre-emptively replace disks before they fail. (5) Testing: chaos-test disk corruption—introduce random bit flips in the WAL and verify the cluster detects the damage and recovers. For etcd specifically: (1) etcd's "defragment" operation compacts the backend keyspace database, not the WAL; the WAL is truncated via snapshots. Run defragmentation periodically. (2) If corruption is suspected, don't try to repair the node in place: remove the member from the cluster, wipe its data directory, and re-add it as a fresh member so the leader sends it a snapshot and replays state. (3) Use persistent storage with better durability: NVMe drives, or cloud storage (EBS, Persistent Disks) with built-in redundancy.
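The checksum framing suggested in the prevention list might look like this sketch—a CRC32 prepended to each entry on write and validated on read (illustrative framing only, not etcd's actual WAL format):

```python
import zlib

def frame_entry(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload (4 bytes, big-endian) before writing."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def read_entry(frame: bytes) -> bytes:
    """Validate the checksum; raise if the entry was corrupted on disk."""
    stored, payload = int.from_bytes(frame[:4], "big"), frame[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("corrupt log entry")
    return payload

frame = frame_entry(b"set key=value")
assert read_entry(frame) == b"set key=value"

# Simulate on-disk corruption: flip bits in the last byte.
corrupted = frame[:-1] + bytes([frame[-1] ^ 0xFF])
try:
    read_entry(corrupted)
    assert False, "corruption went undetected"
except ValueError:
    pass  # detected, as intended
```

With this check in place, step (5) of the loop above fails loudly at read time instead of silently re-acknowledging bad data, which is what breaks the cycle.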
Follow-up: If you delete a node's WAL file to recover from corruption, how does the node catch up with the rest of the cluster?
Your Raft-based distributed KV store initially had 3 nodes. You added 2 new nodes (5-node cluster now). During the reconfiguration, the 3 old nodes can form quorum (majority of 3), so they could commit writes without the 2 new nodes. This temporarily violates your expectation. Walk through Raft's membership change protocol and how to handle this safely.
Membership changes in Raft are tricky. The naive approach (switching to the new config in one step) can cause split-brain. Here's why: (1) You have a 3-node cluster: A, B, C. Quorum = 2. (2) You add D and E to make it a 5-node cluster. New quorum = 3. (3) Until every node has the new config in its log, the cluster is inconsistent: some nodes believe quorum is 2, others believe it's 3. (4) Scenario: during the transition, a partition can leave a majority of the old config (say {A, B}) and a majority of the new config (say {C, D, E}) on opposite sides—two disjoint groups, each entitled to elect a leader under its own notion of quorum. Two leaders committing conflicting writes is split-brain. Raft's solution: "joint consensus". The membership change happens in TWO phases: (1) Phase 1 - Joint config: the cluster operates under both configs at once. Old = {A, B, C}, new = {A, B, C, D, E}. Quorum during joint consensus: BOTH a majority of the old config (any 2 of {A, B, C}) AND a majority of the new config (any 3 of {A, B, C, D, E}) must acknowledge—so {A, B, D} suffices, but {A, B} alone does not. This is redundant but safe. (2) The joint config is replicated as a log entry; once it commits under the joint quorum rule, the cluster is in the joint state. (3) Phase 2 - Final config: once the joint entry is committed, the leader appends a second config entry for the final config {A, B, C, D, E}. Quorum is now simply 3 (a majority of 5). Once that entry commits, the membership change is complete. Why this works: during the joint phase, any decision (election or commit) needs both an old-config majority and a new-config majority. The old subset can't elect a leader without acks from the new config, and vice versa, so two independent leaders are impossible and split-brain is prevented. Implementation: (1) Membership change is initiated via a special log entry (type: CONFIG_CHANGE). (2) The leader receives an AddServer(nodeD, nodeE) request. (3) The leader appends the joint config entry to its log.
(4) The leader replicates it to all nodes using the joint quorum rule. (5) Once committed, the leader appends the final config entry. (6) Replicates it using the final quorum rule. (7) Once committed, the change is done. Testing: simulate a membership change during a partition. Verify: (1) during the change, the joint quorum rules are enforced (both old and new). (2) after the change completes, only the new quorum is required. (3) no split-brain. Examples: etcd's raft library supports joint consensus; other implementations (e.g., HashiCorp's raft library used by Consul) take the alternative route from the Raft dissertation—adding or removing only one server at a time—which is also safe because consecutive quorums always overlap. Key insight: don't switch membership in one step. Use two-phase joint consensus (or single-server changes) to ensure safety.
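The joint quorum rule from Phase 1 is easy to express directly—an ack set is sufficient only if it contains a majority of the old config AND a majority of the new config (hypothetical helper names):

```python
def has_majority(acks: set, members: set) -> bool:
    """True if the acks include a strict majority of the given config."""
    return len(acks & members) >= len(members) // 2 + 1

def joint_quorum(acks: set, old: set, new: set) -> bool:
    """During joint consensus, BOTH configs must independently ack."""
    return has_majority(acks, old) and has_majority(acks, new)

old, new = {"A", "B", "C"}, {"A", "B", "C", "D", "E"}
# An old-config majority {A, B} alone is not enough: new needs 3 of 5.
assert joint_quorum({"A", "B"}, old, new) is False
# {A, B, D} satisfies both: 2 of 3 old, 3 of 5 new.
assert joint_quorum({"A", "B", "D"}, old, new) is True
```

The same predicate governs both commits and elections during the joint phase, which is exactly what rules out disjoint leaders in the old and new configs.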
Follow-up: During the joint consensus phase, if an old node crashes and can't ack, can the cluster still make progress?
You're running a 7-node Raft cluster. You observe: (1) high network latency between datacenters (200ms), (2) occasional network blips (packet loss, 1-second outages). (3) Raft election timeout is 1 second. You're seeing frequent leader elections (every 5-10 minutes), causing disruptions. How do you tune Raft parameters to reduce elections?
Frequent leader elections are typically caused by: (1) an election timeout too low relative to network latency and blip duration, or (2) a heartbeat interval too high. Raft's mechanism: (1) The leader sends heartbeats every heartbeat_interval (commonly ~150ms). (2) A follower waits for heartbeats; if none arrives within election_timeout, it becomes a candidate and calls an election. (3) If a candidate doesn't win within the election timeout, it increments the term and tries again. Problem: with 200ms cross-datacenter latency, a heartbeat takes 200ms to arrive; a dropped heartbeat means waiting another heartbeat_interval for the next one; and the observed 1-second network blips alone consume the entire 1-second election timeout. Any blip triggers an election. Fix (tuning parameters): (1) Increase election_timeout: it must comfortably exceed the worst-case heartbeat gap—latency plus blip duration plus margin. Here, raise it to 2-3 seconds. (2) Decrease heartbeat_interval relative to the timeout: a common rule of thumb is heartbeat_interval ≈ election_timeout / 10. With election_timeout = 2000ms, use a heartbeat_interval of 200ms or less. Frequent heartbeats mean a single lost one doesn't matter. (3) Keep randomization: Raft randomizes election_timeout (e.g., timeout = base + random(0, base)) to prevent synchronized elections where all followers time out simultaneously. (4) Exponential backoff on retries: if an election fails, wait longer before retrying (backoff = backoff * 1.5, capped at max_backoff). This prevents a thundering herd of concurrent elections. (5) Implement RPC batching: batch multiple log entries per message to reduce overhead. (6) Optimize serialization: slow RPC handling = delayed responses = the leader looks dead = spurious elections. Testing parameters: (1) Simulate 200ms latency, measure election frequency. Target: < 1 election per hour under normal load.
(2) Simulate packet loss (1%), measure election frequency. (3) Kill the leader deliberately, measure election time (should be on the order of the election timeout). Implementation example:

```toml
election_timeout_ms = 2000               # base election timeout: 2 seconds
heartbeat_interval_ms = 100              # send heartbeats every 100ms
election_timeout_randomization_ms = 2000 # randomize timeout by up to 2s
```

With these settings: (1) even if a heartbeat is lost and the next one takes 200ms to arrive, there are 2 full seconds before timeout. (2) heartbeats are frequent, so the next one very likely arrives in time. (3) the election timeout is randomized, preventing synchronized elections. (4) result: a stable leader, with elections only on actual leader crashes, not transient latency. This is far better than a 1-second timeout in a high-latency environment. Best practice: tune to your network characteristics. Use Raft diagnostic tools to measure heartbeat round-trip time (RTT), and set election_timeout ≈ 4 × the maximum observed RTT (plus expected blip duration) as a safe starting point.
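The randomization and backoff rules described above can be sketched as follows (hypothetical function names; the constants mirror the 2000ms base and 2000ms randomization window in the config):

```python
import random

def election_timeout_ms(base: int = 2000, spread: int = 2000) -> int:
    """Randomized election timeout: base + uniform(0, spread)."""
    return base + random.randint(0, spread)

def next_backoff_ms(current: float, factor: float = 1.5,
                    cap: float = 30_000) -> float:
    """Exponential backoff between failed election attempts, capped."""
    return min(current * factor, cap)

# Each follower draws a different timeout, so they rarely fire together.
t = election_timeout_ms()
assert 2000 <= t <= 4000

# After repeated failed elections, retry delay grows but never exceeds the cap.
backoff = 2000.0
for _ in range(20):
    backoff = next_backoff_ms(backoff)
assert backoff == 30_000
```

The spread matters more than the base: two followers with identical timeouts split the vote every round, which is precisely the livelock pattern from the earlier 9-node scenario.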
Follow-up: If you increase election_timeout to 2 seconds, what's the worst-case time before a crashed leader is detected and a new one is elected?
You've built a geo-replicated Raft cluster: 3 nodes in US East, 3 nodes in EU West, 3 nodes in AP Southeast. Your current design uses a single leader across all 9 nodes. During a network partition (US-EU connection breaks), you want the EU cluster to continue operating independently. But this requires promoting an EU node to leader—violating the "single leader" model. How do you redesign for geo-resilience?
Single leader across 9 geo-distributed nodes doesn't scale. You need a multi-leader architecture. Here's the design: (1) Partition Raft by region: instead of one 9-node cluster, you have three 3-node clusters: US, EU, AP. Each region has its own leader. (2) Within-region consensus: writes go to the regional leader. Replicated to other 2 nodes in region. Very fast (< 50ms). (3) Cross-region replication: once a write is committed in a region, asynchronously replicate to other regions. Use a replication log. (4) Conflict resolution: if two regions independently write to the same key, last-write-wins or use vector clocks for causality. (5) On partition heal: replication catches up. If write conflicts exist, policy determines winner. Typical: EU write wins if EU is "source of truth" for that key. Implementation: (1) Hierarchical Raft: (a) Intra-region: standard 3-node Raft cluster. Quorum = 2. (b) Inter-region: separate consensus for replication state. Leaders exchange state asynchronously. (2) Each region has: (a) Local state machine (KV store). (b) Local Raft cluster (for consistency within region). (c) Replication queue: changes to ship to other regions. (3) Leader election within region only. If US-EU partition: US has 3 nodes (quorum possible), EU has 3 nodes (quorum possible). Both can have leaders. (4) Conflict handling on merge: (a) Vector clocks: each write tagged with (region, timestamp, version). On conflict, use causal ordering. (b) Merge: reconstruct state by replaying writes in causal order. (5) Monitoring: per-region latency, replication lag, partition detection. Fallback: if region is isolated > 10 minutes, human operator must decide: merge or diverge? Example: (1) Normal: US writes "key=A", replicates to EU, AP. All regions see key=A. (2) US-EU partition: US can still write "key=B", replicate only to AP (if connected). EU is isolated. EU leader can elect independently but replication to US/AP is blocked. 
(3) Both regions write the same key simultaneously: US writes key=C, EU writes key=D. Both commit locally. On partition heal, the replication queues sync and the conflict (C vs. D) is detected. The configured merge policy decides the winner—say the EU write wins. (4) Final state: key=D (EU's write). Benefits: (1) each region continues operating independently. (2) low write latency (local leader, no cross-region consensus). (3) eventual consistency across regions. Drawbacks: (1) writes can conflict. (2) temporary inconsistency during partitions. (3) more complex operational logic. This is the multi-master, eventually consistent approach popularized by Amazon's Dynamo. Systems that need strong consistency across regions (Google Spanner, CockroachDB) take a different route: they shard the data and run a separate consensus group per shard spanning regions, accepting higher write latency in exchange for no conflicts.
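The vector-clock causality check from the conflict-handling design can be sketched as follows—two writes genuinely conflict only when neither clock dominates the other (hypothetical region-keyed clocks):

```python
def compare_vector_clocks(a: dict, b: dict) -> str:
    """Return 'before', 'after', or 'concurrent' for {region: counter} clocks."""
    regions = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and not b_le_a:
        return "before"      # a causally precedes b: b simply supersedes a
    if b_le_a and not a_le_b:
        return "after"       # b causally precedes a
    return "concurrent"      # true conflict: a merge policy must decide

# The partition scenario: US and EU each advanced their own counter.
us_write = {"us": 3, "eu": 1}
eu_write = {"us": 2, "eu": 2}
assert compare_vector_clocks(us_write, eu_write) == "concurrent"

# A write that had already replicated before being overwritten is NOT
# a conflict -- the newer clock dominates it.
assert compare_vector_clocks({"us": 1}, {"us": 2, "eu": 1}) == "before"
```

Only the "concurrent" case invokes the merge policy (e.g., "EU wins"); causally ordered writes resolve automatically, which is what prevents an older replicated value from clobbering a newer one.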
Follow-up: In your multi-region setup, if EU region has a write and US region later overwrites it, how do you ensure users don't see "time travel" (recent change overwritten by older change)?