Your 3-node production replica set has a PRIMARY and two SECONDARYs running MongoDB 6.0 across AWS us-east-1a, us-east-1b, and us-east-1c. At 14:32 UTC, the PRIMARY (running 500 ops/sec) suddenly loses network connectivity but is still running. Within 30 seconds, you notice application latency spikes to 5+ seconds. Explain what happens during the election process and how you'd investigate the stalled PRIMARY.
When the PRIMARY loses network connectivity but remains running, the two SECONDARYs stop receiving heartbeat responses. After going `settings.electionTimeoutMillis` (default 10 seconds) without contact with a PRIMARY, one of them calls an election: it increments the term and requests votes, and members vote only for a candidate whose oplog is at least as fresh as their own (compared by term, then lastAppliedOpTime), so the fresher SECONDARY wins PRIMARY. During this ~1-2 second window, clients see "not primary" (formerly "not master") errors or connection timeouts. Meanwhile the old PRIMARY can keep accepting writes to its oplog for a short time ("going dark") until it notices it cannot reach a majority and steps down, creating a divergence.
Investigation: ssh to a SECONDARY, run `mongosh`, then `rs.status()` to see who is PRIMARY and each member's state. Check `db.serverStatus().opcounters` on the old PRIMARY to confirm whether it is still accepting writes locally. Run `db.adminCommand({replSetGetStatus: 1})` for detailed election-related fields such as electionCandidateMetrics on the node that stood for election. Once the network recovers, the stalled PRIMARY rejoins as a SECONDARY; replication rolls back its divergent writes if necessary and then catches it up from the new PRIMARY.
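The rs.status() inspection above can be scripted. A minimal sketch, assuming an rs.status()-shaped document (the `status` sample below is invented for illustration, not real server output):

```javascript
// Find the current PRIMARY and any unhealthy members in an
// rs.status()-shaped document.
function summarizeMembers(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  const unhealthy = status.members.filter((m) => m.health === 0);
  return {
    primary: primary ? primary.name : null,
    unhealthy: unhealthy.map((m) => m.name),
  };
}

// Hand-built sample: node-b won the election, node-c (the stalled
// old PRIMARY) is unreachable from this member's point of view.
const status = {
  members: [
    { name: "node-a:27017", stateStr: "SECONDARY", health: 1 },
    { name: "node-b:27017", stateStr: "PRIMARY", health: 1 },
    { name: "node-c:27017", stateStr: "(not reachable/healthy)", health: 0 },
  ],
};

console.log(summarizeMembers(status));
// { primary: 'node-b:27017', unhealthy: [ 'node-c:27017' ] }
```

The same function works unchanged on the real output of `rs.status()` inside mongosh, since it only reads the `members` array.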
Follow-up: If the old PRIMARY wrote 50K documents during the network partition while the new PRIMARY wrote 30K different documents, what happens when network recovers? How do you prevent data loss in such scenarios?
You're running a 5-node replica set with priority distribution: node1 (priority 5, PRIMARY), node2/node3 (priority 3), node4/node5 (priority 1, in us-west-2 for DR). A critical bug in your application causes a cascading replica set membership change algorithm failure, and for 45 minutes, elections are taking 2+ minutes to complete. Your application can't determine which node is PRIMARY. Walk through how replica set elections work and explain why they might stall.
Elections use a Raft-based consensus protocol (replication protocol version 1). A node becomes a CANDIDATE, increments its term, and requests votes from all members. To win, it needs a majority of votes (3 of 5 here). Each member votes at most once per term, and only for a candidate whose oplog is at least as up-to-date as its own (compared by term, then lastAppliedOpTime). If no candidate achieves a majority, the election fails and another attempt follows; `settings.electionTimeoutMillis` (default 10 seconds) governs how long a member waits without hearing from a PRIMARY before calling an election in the first place.
Stalling occurs when: (1) network partitions prevent majority formation—if node1+2 are isolated from node3+4+5, only the three-node side can reach quorum, and a partition that splits votes further leaves no side with a majority; (2) priority thrashing—a higher-priority node that keeps falling behind repeatedly attempts takeover elections it cannot win; (3) too many voters down—with 4 data nodes plus an arbiter (5 voting members), majority is 3, so if the arbiter and two data nodes are unreachable, the remaining 2 nodes can never collect 3 votes.
Debug with: `rs.status()` shows lastHeartbeatRecv for each member; members whose timestamps have stopped advancing are unreachable. Check `db.serverStatus().electionMetrics` (4.2+) for counts of elections called, broken down by reason, and their outcomes. Fix partition issues, check network ACLs and security groups, verify DNS resolution from each node, and use `rs.reconfig()` to adjust priorities or remove unreliable members.
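The heartbeat check above is easy to automate. A sketch that flags members whose lastHeartbeatRecv is older than a threshold (the member list and timestamps are invented sample data):

```javascript
// Flag members whose last received heartbeat is older than maxAgeMs,
// mimicking the manual rs.status() inspection described above.
function staleMembers(members, now, maxAgeMs) {
  return members
    .filter((m) => m.lastHeartbeatRecv !== undefined)
    .filter((m) => now - m.lastHeartbeatRecv > maxAgeMs)
    .map((m) => m.name);
}

const now = Date.parse("2023-11-14T14:35:00Z");
const members = [
  { name: "node1:27017", lastHeartbeatRecv: now - 1500 },   // fresh
  { name: "node4:27017", lastHeartbeatRecv: now - 120000 }, // 2 min stale
];

console.log(staleMembers(members, now, 10000)); // [ 'node4:27017' ]
```

In mongosh, `rs.status().members[n].lastHeartbeatRecv` is a Date, so `m.lastHeartbeatRecv.getTime()` would replace the raw number used here.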
Follow-up: How would you design a replica set across 3 AWS regions to minimize election time while ensuring high availability? Should you use arbiters or tie-breakers?
You deploy a replica set with nodes in us-east-1 (PRIMARY, SECONDARY) and us-west-2 (SECONDARY, ARBITER). During a 10-minute AWS region outage in us-west-2, your PRIMARY remains up. Why don't your applications automatically failover to the us-east-1 SECONDARY? What's the correct architecture here?
The us-east-1 SECONDARY cannot become PRIMARY because the set has 4 voting members (PRIMARY, SECONDARY, SECONDARY, ARBITER) and an election needs a majority of 3 votes. With us-west-2 unreachable, us-east-1 holds only 2 votes, so no quorum can form—and the existing PRIMARY steps down for the same reason once it can no longer see a majority, even though its process stays up. This is a classic architecture mistake: an even split of voting members across two regions means either region's outage takes writes down entirely.
Correct architecture: Use 3 nodes minimum with majority in primary region (us-east-1 gets 2 nodes, us-west-2 gets 1). Or, if you want region-level HA, use 5 nodes: 2 in us-east-1 (PRIMARY + SECONDARY), 2 in us-west-2 (SECONDARY + hidden SECONDARY), and 1 arbiter in us-east-1 or neutral region. The key: majority of voting nodes must be in/accessible from the primary region to prevent accidental failover during regional outages.
Verify with: `rs.conf()` shows voting members and priorities. Calculate majority = floor(voting_members / 2) + 1. For this broken setup: 4 voting members need 3 votes. The us-west-2 outage leaves us-east-1 with only 2 votes, so the PRIMARY can't see a majority and steps down.
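The quorum arithmetic above can be checked in one small function (the vote counts are taken from the broken two-region layout in the question):

```javascript
// Majority needed for an election: floor(n / 2) + 1 voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

const votingMembers = 4;                // PRIMARY + SECONDARY + SECONDARY + ARBITER
const needed = majority(votingMembers); // 3
const survivingVotes = 2;               // only us-east-1 reachable during the outage

console.log(needed);                   // 3
console.log(survivingVotes >= needed); // false -> no node can be PRIMARY
```

Running the same math on the corrected 3-node layout (2 in us-east-1, 1 in us-west-2) gives majority 2, which us-east-1 alone satisfies during a us-west-2 outage.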
Follow-up: How do you transition from this broken setup to the correct one without downtime? Walk through the reconfiguration steps.
Your PRIMARY goes down unexpectedly. A SECONDARY becomes new PRIMARY via election. Your ops team, 5 minutes later, power-cycles what they thought was still the old PRIMARY, but it actually recovered and was starting to rejoin as a SECONDARY. When it comes back up fully, its oplog has diverged by ~200 writes compared to the new PRIMARY's oplog. How does MongoDB handle this, and what's the risk?
This is the critical scenario for understanding rollback. When the old PRIMARY recovers and rejoins the replica set as a SECONDARY, MongoDB detects that its oplog has diverged. It finds the common point (the last operation both timelines agree on), then performs a rollback: it writes the rolled-back documents to BSON files under the <dbPath>/rollback/ directory (inspectable with bsondump), truncates its oplog back to the common point, and resyncs from the new PRIMARY.
The 200 divergent writes are lost from the cluster's perspective but preserved in the rollback file for manual inspection. Risk: if those 200 writes included critical transactions (e.g., payment confirmations), the application doesn't know they rolled back. The data is still in the database as of the common point, but later operations expecting those 200 writes may be inconsistent.
Prevention/detection: (1) Monitor the <dbPath>/rollback/ directory on SECONDARYs—if rollback files appear there, alert; (2) use write concern "majority" so the driver only acknowledges writes after they reach a majority of members—the 200 writes would never have been ack'd to the app if only the old PRIMARY had them; (3) check `rs.status()` after a failover: a member in the ROLLBACK state is actively rolling back divergent operations.
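Detection item (3) can be turned into an alert check. A sketch over an rs.status()-shaped document (the sample members are invented to mirror this scenario):

```javascript
// Return the names of members currently in the ROLLBACK state,
// e.g. an old PRIMARY undoing its divergent writes after failover.
function membersInRollback(status) {
  return status.members
    .filter((m) => m.stateStr === "ROLLBACK")
    .map((m) => m.name);
}

const status = {
  members: [
    { name: "new-primary:27017", stateStr: "PRIMARY" },
    { name: "old-primary:27017", stateStr: "ROLLBACK" },
    { name: "secondary-2:27017", stateStr: "SECONDARY" },
  ],
};

console.log(membersInRollback(status)); // [ 'old-primary:27017' ]
```

Paging on a non-empty result gives you a chance to preserve the rollback files before anyone cleans the dbPath.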
Follow-up: How do you retrieve and analyze the rollback files to understand what data was lost? Write a mongosh command to parse and display the rollback operations.
You implement a custom election priority system where the node with the fastest disk I/O should always be PRIMARY. You modify rs.conf() to set priorities on nodes based on a background script that checks disk latency every minute. However, elections now oscillate wildly—each node thinks it's the best choice and triggers re-elections constantly. What's happening, and why is this a bad idea?
MongoDB does allow a higher-priority node to displace a PRIMARY: once a higher-priority SECONDARY has caught up to the PRIMARY's optime, it schedules a priority-takeover election. Your script calls rs.reconfig() every minute to update priorities, so each run can crown a different "highest-priority" node; every reconfig bumps the config version and propagates to all members, and any SECONDARY that now outranks the PRIMARY attempts a takeover as soon as it catches up.
The fundamental issue: priorities should reflect stable topology differences (regional preferences, hardware tiers), not dynamic metrics. Changing them frequently destabilizes the cluster because each reconfig increments the config version and can trigger fresh elections. Additionally, disk I/O latency varies minute-to-minute—during a backup, PRIMARY might have worse I/O, causing an election, which then destabilizes that node's I/O, causing another election.
Best practice: set priorities at deployment time to reflect stable topology (regional preference, hardware tier), not dynamic metrics. If you need load-aware behavior, use read preferences and connection pooling instead: direct reads at the fastest nodes and leave PRIMARY placement alone. Or use MongoDB Atlas, which manages failover internally. Verify cluster stability with `rs.status()` and `db.serverStatus().electionMetrics`: counters such as priorityTakeover.called should not have increased in the last hour.
Follow-up: If you absolutely needed dynamic failover, how would you implement it safely without modifying replica set config?
You operate a replica set with a hidden SECONDARY used only for backups. It's configured with priority 0 and hidden: true. One day, due to a misconfiguration in your automation tool, it gets priority 1 during a rolling restart. The PRIMARY crashes before the restart reaches it. The hidden node triggers an election, becomes PRIMARY, and receives application traffic—but it's on older hardware (slower disks, less RAM). Queries suddenly execute 10x slower. How do you prevent this?
This is a priority-and-tags design failure. A hidden backup node must have priority 0 in rs.conf()—MongoDB refuses a config where hidden: true is paired with a nonzero priority, so the automation that set priority 1 necessarily dropped hidden: true as well. Additionally, use replica set tags for application-level routing: tag the backup node {role: "backup"}, regular nodes {role: "general"}, and specify a readPreference with tags like {role: "general"} in your connection string so reads never land on the backup node.
In rs.conf(), the hidden node should look like:
{ _id: 3, host: "backup-node:27017", priority: 0, hidden: true, tags: {role: "backup"} }
And in your driver connection string (e.g., Node.js MongoDB driver):
mongodb://host1:27017,host2:27017,host3:27017/db?replicaSet=rs0&readPreference=secondary&readPreferenceTags=role:general
Prevent misconfiguration with: (1) Use immutable infrastructure—bake priority and hidden settings into node configs, don't dynamically update them; (2) add pre-restart validation: `rs.conf()` should always show priority 0 for backup nodes; (3) use MongoDB Enterprise Advanced's Ops Manager to enforce rs.conf templates; (4) in automation, run `rs.reconfig(cfg)` without the `{force: true}` option—non-force reconfigs are validated and must be majority-committed, which prevents bad configs from spreading.
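Validation item (2) above can be a one-function gate in the automation pipeline. A sketch over an rs.conf()-shaped document (the hosts and tags are invented sample data):

```javascript
// Return hosts of backup/hidden members whose priority is not 0 —
// i.e. members that could accidentally be elected PRIMARY.
function invalidBackupMembers(conf) {
  return conf.members
    .filter((m) => m.hidden || (m.tags && m.tags.role === "backup"))
    .filter((m) => m.priority !== 0)
    .map((m) => m.host);
}

const conf = {
  members: [
    { _id: 0, host: "app-1:27017", priority: 3 },
    // The misconfigured node from this scenario: tagged backup, priority 1.
    { _id: 3, host: "backup-node:27017", priority: 1, tags: { role: "backup" } },
  ],
};

console.log(invalidBackupMembers(conf)); // [ 'backup-node:27017' ]
```

Running this against `rs.conf()` before each rolling restart, and aborting on a non-empty result, blocks the failure mode in this scenario.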
Follow-up: How would you structure a replica set with 5 nodes across 3 regions where 1 is a backup node, 1 is an analytics SECONDARY, and 3 handle application traffic? Design the priority, tags, and failover topology.
You're diagnosing a prolonged 45-second election on your replica set. You run rs.status() and see that one SECONDARY has optime {ts: Timestamp(1700000000, 1), t: 5} while the other has {ts: Timestamp(1700000050, 10), t: 5}. The one with older ts is winning elections. Why, and what does this mean for your replication consistency?
Elections use the optime to determine which node has the freshest oplog. The optime is a pair {ts: <timestamp>, t: <term>}; the term is compared first. Both nodes here have term 5, so the timestamps are compared: 1700000050 is later than 1700000000, so the second SECONDARY has the fresher oplog and should win the election and become PRIMARY.
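The comparison rule above, as a sketch (optimes simplified to plain objects with the values from the question; real Timestamp values also carry an increment, compared last):

```javascript
// Compare two optimes as elections do: higher term wins; within the
// same term, the later timestamp wins; increments break exact ties.
function compareOptimes(a, b) {
  if (a.term !== b.term) return a.term - b.term;
  if (a.ts.t !== b.ts.t) return a.ts.t - b.ts.t;
  return a.ts.i - b.ts.i;
}

const nodeA = { ts: { t: 1700000000, i: 1 },  term: 5 };
const nodeB = { ts: { t: 1700000050, i: 10 }, term: 5 };

console.log(compareOptimes(nodeA, nodeB) < 0); // true -> nodeB is fresher
```

A negative result means the first argument is staler, so by this rule nodeB, not nodeA, should be winning elections.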
If the older SECONDARY appears to be winning, the reported numbers are likely misleading: you may be comparing different fields (lastAppliedOpTime is the last operation applied to the data; lastDurableOpTime is the last operation made durable in the journal), or the rs.status() snapshot is stale. Voters refuse to grant a vote to a candidate whose last oplog entry is behind their own, so a genuinely staler node can only win if the fresher node's extra operations exist solely on itself and it never stands—for example because it is unreachable or still catching up when the election fires.
In production, this indicates: (1) Replication is lagging—check `rs.printSecondaryReplicationInfo()`, or compare members' optimeDate values in `rs.status()`; (2) if lag persists above ~5 seconds you likely have network bottlenecks or slow oplog application (usually contention from slow queries on the SECONDARY). Inspect recent oplog entries with `db.getSiblingDB("local").oplog.rs.find({ns: "mydb.mycol"}).sort({$natural: -1}).limit(10)` and compare their ts values against wall-clock time.
Follow-up: If the 45-second election is repeating every few minutes, what diagnostic steps would you take to identify the root cause?
You run a 7-node replica set across two data centers: DC1 has nodes 0-4, DC2 has nodes 5-6, with node 0 as PRIMARY. Your network team performs maintenance and temporarily partitions DC1 from DC2. You want to ensure DC1 keeps PRIMARY without losing data from clients writing to the old PRIMARY. Walk through the quorum logic and explain when/why the old PRIMARY might step down.
With 7 nodes, majority = 4. DC1 has 5 nodes (a majority), DC2 has 2. During the partition, DC1 retains quorum and node 0 stays PRIMARY; DC2's members remain SECONDARY but cannot elect anything with only 2 of 7 votes. Writes with write concern "majority" still succeed on node 0 because 4 of the 5 DC1 nodes can acknowledge them. When the partition heals, DC2's members rejoin and catch up via the oplog.
Edge case: if network partition is partial (DC1 can reach node 5 but not 6), then DC1 has 6 nodes visible and can achieve majority (4). DC2 only sees 1 node (itself) or 2 if they're directly connected. This works fine.
Risk: if DC1 loses quorum (e.g., 3 nodes fail), PRIMARY steps down even though DC2 has nodes available. This is intentional—MongoDB prioritizes consistency over availability. A PRIMARY without majority cannot serve writes because it can't guarantee durability.
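The quorum logic across the three paragraphs above reduces to one predicate (vote counts taken from the 7-node scenario):

```javascript
// Can a side of the partition elect or sustain a PRIMARY?
// It can iff it sees a strict majority of all voting members.
function canSustainPrimary(visibleVotes, totalVotingMembers) {
  return visibleVotes >= Math.floor(totalVotingMembers / 2) + 1;
}

const total = 7; // nodes 0-6, all voting

console.log(canSustainPrimary(5, total)); // true  -> DC1 (nodes 0-4) keeps PRIMARY
console.log(canSustainPrimary(2, total)); // false -> DC2 (nodes 5-6) cannot elect
console.log(canSustainPrimary(4, total)); // true  -> partial partition: DC1 plus node 5
```

The risk case from the last paragraph also falls out: if 3 of DC1's nodes fail during the partition, DC1 sees only 2 votes and node 0 steps down.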
Verify partition handling: in DC1, run `rs.status()` and confirm members 5-6 show health: 0 with stateStr "(not reachable/healthy)"; their optimes stop advancing and their lastHeartbeatRecv timestamps grow stale. On node 0 (PRIMARY), `db.hello().hosts` still lists all members because it reflects the config, not liveness. If an election does occur, `rs.status().electionCandidateMetrics.lastElectionReason` on the new PRIMARY shows why it stood.
Follow-up: How would you design monitoring to alert if a replica set PRIMARY loses majority unexpectedly?