Your MongoDB replica set experiences sudden network partitions during peak traffic. The primary processes writes normally, but secondaries start accumulating a 45-second replication lag. You notice the oplog on the primary has grown to 2GB. When the partition heals, one secondary is completely unable to sync and throws an error: "cannot find oplog entry matching ts". Walk through the root cause and your recovery strategy.
This is a classic oplog window exhaustion scenario. Here's what happened: (1) During the partition, the primary kept accepting writes and appending new oplog entries. (2) The oplog is a capped collection with a fixed size—on this instance, ~2GB. (3) While the secondary was partitioned away, the oldest entries were overwritten as new ones arrived, so the oplog window "moved forward." (4) The secondary's last-applied timestamp fell behind the oldest entry still retained in the oplog (what rs.printReplicationInfo() reports as the oldest event). (5) When the partition healed, the secondary tried to resume from its saved sync point, but that ts no longer exists anywhere in the oplog—the node is too stale to catch up.
Recovery strategy: (A) Check the oplog window and current oldest entry: rs.printReplicationInfo(). (B) On the stalled secondary, check its sync state: db.getSiblingDB('admin').command({replSetGetStatus: 1}). (C) If the secondary is truly beyond the oplog window, the only fix is a full resync: stop the mongod, empty its data directory (dbPath), and restart it—on finding an empty data directory, the node performs an initial sync from its sync source. (D) Expand the oplog for future spikes with db.adminCommand({replSetResizeOplog: 1, size: <newSizeMB>}) (available since MongoDB 3.6, no restart needed), or set replication.oplogSizeMB before a node's oplog is first created.
Follow-up: How would you calculate the minimum safe oplog window size given your write throughput and acceptable replication lag? What metrics would you monitor to predict when you're approaching oplog exhaustion?
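The sizing question above comes down to back-of-the-envelope math. A minimal sketch in plain Node.js (the function names and the 3x safety factor are illustrative assumptions, not MongoDB APIs; the churn rate comes from watching oplog growth over time):

```javascript
// Sketch: estimate the minimum oplog size needed to tolerate a given
// outage, from observed oplog churn. All numbers are illustrative.
function minOplogSizeMB(oplogBytesPerSec, maxOutageSec, safetyFactor = 3) {
  // The oplog must retain at least maxOutageSec worth of writes; the safety
  // factor covers write bursts and post-partition catch-up time.
  return Math.ceil((oplogBytesPerSec * maxOutageSec * safetyFactor) / (1024 * 1024));
}

// Current window in seconds, from rs.printReplicationInfo()-style numbers:
function oplogWindowSec(oplogSizeMB, oplogBytesPerSec) {
  return (oplogSizeMB * 1024 * 1024) / oplogBytesPerSec;
}

// Example: 1 MB/s of oplog churn, tolerate a 1-hour partition:
const neededMB = minOplogSizeMB(1024 * 1024, 3600); // = 10800 MB
```

For monitoring, the same window calculation run continuously against the live oplog write rate gives an early-warning metric: alert when the projected window drops below your recovery-time objective.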
You've deployed a reporting application that reads from a secondary via a change stream for near-real-time analytics. The team reports that their change stream consumer keeps getting disconnected with "operation exceeded time limit" errors, and they're missing data updates. The secondary shows 120-second replication lag during business hours. Your oplog is configured at 5GB on a 1TB data volume. What's happening and how do you fix it?
This is a replication lag and change stream interaction problem. Root causes: (1) The secondary can't keep pace with the primary's write volume during business hours; new oplog entries are created faster than the secondary can apply them. (2) Change streams only return majority-committed events, so on a lagging secondary the consumer's getMore calls sit waiting; if the consumer sets maxAwaitTimeMS/maxTimeMS aggressively, those waits surface as "operation exceeded time limit" errors. (3) The missing data happens on reconnect: if the consumer restarts after its resume token has rolled off the oplog—and a 5GB oplog fills quickly under this write volume—the resume fails and the intervening updates are never delivered.
Fix strategy: (A) Scale vertically or horizontally: add read replicas to distribute query load, or upgrade hardware on the secondary for faster disk I/O and CPU. (B) Optimize the primary's write pattern: batch writes, use bulk operations, and reduce document size if possible. (C) Increase oplog size so resume tokens survive longer: db.adminCommand({replSetResizeOplog: 1, size: 10000}). (D) Point the change stream at the primary if completeness is critical, or implement retry logic with exponential backoff that resumes from the last persisted resume token. (E) Monitor replication lag and the oplog window: rs.printSecondaryReplicationInfo() for per-member lag, rs.printReplicationInfo() for the window.
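The retry logic in (D) might look like the following sketch. It is driver-agnostic plain JavaScript: openStream, saveToken, and handle are hypothetical hooks you would wire to the driver's watch(..., { resumeAfter: token }) call and a durable checkpoint store.

```javascript
// Sketch: consume a change stream with exponential-backoff reconnects,
// persisting the resume token only after each event is processed.
async function consumeWithRetry(openStream, saveToken, handle, maxRetries = 5) {
  let token = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      for await (const event of openStream(token)) {
        await handle(event);
        token = event._id;        // the change stream resume token
        await saveToken(token);   // checkpoint only after a successful apply
      }
      return token;               // stream drained cleanly
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise(r => setTimeout(r, 100 * 2 ** attempt));
    }
  }
}
```

Checkpointing after the apply (not before) means a crash can re-deliver the last event but never skip one—so the downstream handler should be idempotent.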
Follow-up: If moving the change stream to the primary isn't acceptable due to write load, how would you architect a dedicated replication reader that tails the oplog without blocking the secondary's replication thread?
Your team is building a cross-datacenter sync system using oplog tailing. You've written a custom application that reads the primary's oplog via a tailable cursor and applies the writes to a replica set in a different region. However, you notice that after 48 hours, the tailing application has skipped approximately 2% of operations. The oplog on the source primary is 20GB and cycles every 3 days. Your cursor was using { ts: lastKnownTs } but somehow jumped ahead. Why did this happen and how do you prevent it?
This is a subtle oplog tailing bug caused by oplog visibility, not timestamp collisions. Root causes: (1) On a primary, oplog ts values are unique, but entries do not become visible in strict ts order: concurrent writers reserve timestamps up front, so an entry with a smaller ts can commit—and become visible—after a neighbor with a larger ts. (2) A tailable cursor reading without an appropriate read concern can scan past such a "hole"; by the time the earlier entry commits, the cursor has moved beyond it and will never return it. (3) Resuming with {ts: {$gt: lastKnownTs}} compounds the problem: any entry that was invisible when you last read, but carries a ts at or below your checkpoint, is skipped forever. (4) Over 48 hours of restarts and resumptions, these skipped holes accumulate into the ~2% gap you observed.
Prevention strategy: (A) Stop tailing the oplog directly and use change streams: resume tokens exist for exactly this reason, and watch() with resumeAfter/startAfter handles visibility and resumption for you. (B) If you must tail, read with read concern "majority" so the cursor only sees committed entries and cannot scan past holes, and resume with {ts: {$gte: checkpoint}}, deduplicating entries you have already applied—oplog entries are uniquely identified by their (ts, t) pair, not by an _id field (they have none). (C) Implement checkpointing: persist the checkpoint only after each batch is successfully applied downstream. (D) Better yet, use MongoDB's built-in replication: deploy an actual secondary in the remote datacenter instead of custom tailing. (E) Monitor tailer lag continuously and alert well before it approaches the oplog window.
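The resume-plus-dedup idea in (B) can be sketched without a live server. Here an in-memory array stands in for a {ts: {$gte: checkpoint}} tailable query, and an entry is assumed to be uniquely identified by its (ts, t) pair:

```javascript
// Sketch: resume oplog tailing from a checkpoint without skipping or
// double-applying entries.
function resumeFromCheckpoint(oplog, checkpoint, appliedKeys) {
  const key = e => `${e.ts}:${e.t}`;
  return oplog
    .filter(e => e.ts >= checkpoint)          // $gte, not $gt: re-read the boundary
    .filter(e => !appliedKeys.has(key(e)));   // drop what we already applied
}
```

The $gte boundary re-reads the checkpoint slice on every restart; the applied-keys set turns those re-reads into no-ops instead of duplicates.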
Follow-up: How would you implement idempotent replay of oplog entries to handle exactly-once semantics in your cross-datacenter sync if the consumer crashes mid-batch?
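One answer to this follow-up: make replay idempotent by committing an applied-ops marker together with each downstream write. A minimal sketch—the Map is a stand-in for the target database, and in practice the marker and the write would share a transaction so they commit or fail together:

```javascript
// Sketch: idempotent batch replay. Each entry's (ts, t) key is recorded
// alongside the write it produced, so a crashed batch can be re-run safely
// even for non-idempotent operations like increments.
function applyBatch(store, appliedKeys, batch) {
  for (const entry of batch) {
    const key = `${entry.ts}:${entry.t}`;
    if (appliedKeys.has(key)) continue;   // already applied: skip on replay
    // A deliberately non-idempotent write (an increment) to show the point:
    store.set(entry.docId, (store.get(entry.docId) || 0) + entry.delta);
    appliedKeys.add(key);                 // committed "atomically" with the write
  }
}
```

Replaying the same batch after a crash is then harmless: the increments that already landed are skipped, giving effectively-once application.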
You're investigating a "replSetGetStatus" report showing that a secondary has been "RECOVERING" for 6 hours. The primary is healthy, network latency is normal (~5ms), and the secondary's disk I/O is not saturated. You notice the oplog on both primary and secondary appears normal in size. However, the secondary's "oplogTruncationPoint" (visible in internalState) is significantly behind the primary's. Secondary logs show repeated "[RS] retrying initial sync" messages. Why is initial sync stalling and what do you check?
This is a stalled initial sync problem—the classic failure mode is a clone phase that outlives the oplog window. Root causes: (1) Initial sync records a start point in the sync source's oplog, clones all data, then applies the oplog entries accumulated since that start point. If the source truncates its oplog faster than the clone completes, the start point falls out of the window and the sync must restart from scratch. (2) Each restart makes things worse: the data set hasn't shrunk and the oplog is still churning, so the retry hits the same wall—hence the repeated "[RS] retrying initial sync" messages. (3) The lagging truncation point on the secondary reflects this: it marks how far the in-progress sync has gotten, far behind the source's live oplog. (4) Concurrent large operations on the primary (bulk imports, re-indexes) accelerate oplog growth and truncation, exacerbating the problem.
Diagnostic and fix strategy: (A) Check the oplog gap: rs.printReplicationInfo() on the sync source, and compare the oplog window in seconds against how long a full initial sync actually takes. (B) Check the sync source's oplog size and write rate: db.getSiblingDB('local').oplog.rs.stats(). (C) Temporarily increase the oplog to widen the window: db.adminCommand({replSetResizeOplog: 1, size: <largerMB>}). (D) Pause large operations on the primary during initial sync if possible. (E) Put the syncing node on a faster network or closer to its sync source so the clone finishes sooner. (F) If it still stalls, reset cleanly: stop mongod, empty the dbPath, and restart—mongod will begin initial sync again from a fresh state.
Follow-up: How do you calculate the minimum required oplog size to guarantee initial sync never falls out of the oplog window for a given primary write throughput?
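A rough feasibility check for this follow-up: initial sync succeeds only if the clone finishes before its start point rolls off the source's oplog. A sketch of the comparison (all rates and names are illustrative; in practice you would measure the clone rate from a previous sync and the oplog rate from monitoring):

```javascript
// Sketch: can an initial sync complete inside the oplog window?
function initialSyncFits(dataSizeGB, copyRateGBperHour, oplogSizeMB, oplogMBperHour) {
  const cloneHours = dataSizeGB / copyRateGBperHour;   // time to copy all data
  const windowHours = oplogSizeMB / oplogMBperHour;    // time before entries roll off
  return windowHours > cloneHours;
}

// Minimum oplog size (MB) for a sync to fit, with a safety margin:
function minOplogForSyncMB(dataSizeGB, copyRateGBperHour, oplogMBperHour, margin = 2) {
  return Math.ceil((dataSizeGB / copyRateGBperHour) * oplogMBperHour * margin);
}
```

The margin matters because oplog churn usually spikes during a sync—the clone itself adds read load that slows application writes unevenly.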
Your production cluster is experiencing an issue where the primary's storage engine is holding 15GB of memory, but only 2GB is actual document data. The secondary shows stable memory usage at 4GB. Investigation reveals a cascade: the primary is experiencing high update churn (clients updating the same documents repeatedly), and the oplog is growing to 50GB. Your monitoring shows that cache hit ratio on the secondary is only 40% (should be 70%+), and the secondary is constantly evicting oplog entries from cache. Why is the primary's cache bloated and what's happening to the secondary's oplog tailing efficiency?
This is a memory fragmentation and oplog pressure scenario. Root causes: (1) High update churn on the primary means many entries accumulate in the oplog (each update = one oplog entry). (2) The storage engine's cache is filling with oplog data instead of working set (frequently accessed documents). (3) The oplog is a capped collection, but it's still in the cache until it's evicted naturally by LRU. With 50GB of oplog entries cycling through, the cache becomes dominated by stale oplog data. (4) The secondary tries to apply these oplog entries, but the high volume evicts the secondary's working set from cache, causing page faults. (5) Low cache hit ratio (40%) means the secondary is thrashing: constantly reading from disk to apply oplog entries instead of keeping hot data in memory.
Fix strategy: (A) Address the root cause on the primary: batch updates together, use bulk operations, and reduce the frequency of single-document updates. Consider restructuring the schema to reduce churn on hot documents. (B) Increase the WiredTiger cache (storage.wiredTigerCacheSizeGB) on the primary so it can hold both the working set and oplog traffic. (C) Skip hand-rolled oplog compression—WiredTiger already block-compresses the oplog; instead, right-size it with db.adminCommand({replSetResizeOplog: 1, size: <MB>}) so 50GB of stale entries stop competing for cache. (D) Monitor the update rate: db.serverStatus().opcounters.update, with alerts when it exceeds normal thresholds. (E) On the secondary, raise the cache as well if RAM allows—the default is roughly 50% of (RAM − 1GB)—while leaving headroom for the filesystem cache. (F) Consider a dedicated hidden secondary for analytics reads so query load is separated from replication apply.
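The batching in (A) can be pushed client-side: coalesce repeated updates to the same document before writing, so one oplog entry replaces many. A sketch (UpdateCoalescer is a hypothetical helper, not a driver API; flush() would feed a single bulkWrite in practice):

```javascript
// Sketch: collapse high-churn updates. Repeated updates to the same _id
// within a flush interval merge into one write, so one oplog entry
// replaces many.
class UpdateCoalescer {
  constructor() { this.pending = new Map(); }
  update(id, fields) {
    // Later updates to the same document merge over earlier ones.
    this.pending.set(id, { ...(this.pending.get(id) || {}), ...fields });
  }
  flush() {
    const batch = [...this.pending].map(([id, fields]) => ({ id, fields }));
    this.pending.clear();
    return batch;   // one bulkWrite => far fewer oplog entries
  }
}
```

The trade-off is a small durability window: updates buffered between flushes are lost if the client crashes, so this only suits data where a short replay is acceptable.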
Follow-up: If you can't change the application's update pattern, how would you use write concern levels and journal settings to reduce oplog churn while maintaining durability?
You've implemented a monitoring dashboard that alerts whenever a secondary's replication lag exceeds 30 seconds. During a 2-minute incident, this alert fires repeatedly for all three secondaries simultaneously. When you check rs.status(), the secondaries show lag of 45-60 seconds. However, the network is healthy, disk I/O is normal, and the primary shows no long-running operations (oplog entries are tiny, ~500 bytes each). You stop a background batch job on the primary (a 10-second cron task), and immediately all three secondaries catch up within 5 seconds. Why did this small background job cause global replication lag?
This is a replication apply-throughput contention issue. Root causes: (1) Secondaries apply oplog entries with a pool of writer threads (the replWriterThreadCount parameter; default 16). (2) The background job on the primary, while short, generates a burst of oplog entries in rapid succession—many small writes are worse for the applier than a few large ones. (3) Those entries queue in the secondaries' apply buffer faster than the writer pool can drain them. (4) The pool becomes the bottleneck: even though each 500-byte operation is cheap, per-operation overhead (index updates, journaling) dominates, and the backlog shows up as lag. (5) This is apply lag, distinct from network latency—which is why all three secondaries caught up within seconds once the burst stopped.
Investigation and fix: (A) Check the apply pipeline on a secondary: db.serverStatus().metrics.repl.buffer (queued entry count and bytes) and db.serverStatus().metrics.repl.apply (batches and ops applied). (B) Check per-member lag directly: rs.printSecondaryReplicationInfo(). (C) Optimize the background job: batch its writes on the primary with bulk operations so the same logical work produces far fewer oplog entries. (D) If feasible, run the job during a maintenance window to keep the burst away from peak traffic. (E) As a last resort, raise the secondary apply pool via the replWriterThreadCount server parameter (set at startup; restart required).
Follow-up: How would you design a fair backpressure mechanism so that the primary's write rate self-throttles if any secondary falls beyond a 60-second lag threshold?
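One shape for this backpressure mechanism: a pure throttle policy plus a writer loop that samples the worst secondary lag before each batch. A sketch—getMaxLagSec is a hypothetical probe you would build over rs.status() member optimes, and the 100ms-per-second slope and 5s cap are arbitrary tuning choices:

```javascript
// Pure policy: how long the writer should pause for a given worst-case lag.
function throttleDelayMs(maxLagSec, thresholdSec = 60) {
  if (maxLagSec <= thresholdSec) return 0;
  // Linear backoff: 100ms per second of excess lag, capped at 5s.
  return Math.min((maxLagSec - thresholdSec) * 100, 5000);
}

// Writer loop applying the policy before each batch.
async function throttledWrite(writeBatch, getMaxLagSec, thresholdSec = 60) {
  const delay = throttleDelayMs(await getMaxLagSec(), thresholdSec);
  if (delay > 0) await new Promise(r => setTimeout(r, delay));
  return writeBatch();
}
```

Keeping the policy pure makes it trivially testable and lets you swap the slope/cap without touching the writer loop. Note that write concern w:"majority" gives a blunter built-in form of the same idea: writes simply wait for a majority of members.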
You're scaling a sharded cluster from 2 shards to 4 shards. During the chunk migration process, you notice the oplog on each shard is growing at 5x its normal rate. The primary shards show replication lag creeping up to 90 seconds on some secondaries. The chunk moves are completing successfully, but replication is struggling. You examine the oplog and see massive entries (some 100KB+ documents). Why does chunk migration amplify oplog growth and replication lag?
This is a sharding migration and oplog amplification problem. Root causes: (1) During chunk migration, the destination shard must receive every document in the migrating chunk, and each one lands as a full-document insert in the destination's oplog—there is no "index of what moved." (2) For large documents, that means 100KB+ oplog entries, one per document. (3) Simultaneously, the source shard's range deleter logs a delete entry per document when it cleans up the moved chunk, plus metadata entries for the chunk state changes. (4) This creates an oplog write storm: a few minutes of migration can generate as much oplog as hours of normal traffic. (5) Secondaries can't keep up: the apply pipeline is flooded with large entries, and lag climbs.
Mitigation strategy: (A) Stagger migrations: constrain the balancer (sh.stopBalancer()/sh.startBalancer(), or set an activeWindow in the balancer settings) so only a chunk or two moves at a time, and enable _secondaryThrottle so each migration waits for secondary acknowledgment before sending the next batch of documents. (B) Increase oplog size temporarily before migration: db.adminCommand({replSetResizeOplog: 1, size: <largerMB>}). (C) Monitor oplog growth during migration: alert if the oplog write rate exceeds 2x baseline. (D) On secondaries, consider a larger apply thread pool via the replWriterThreadCount server parameter (restart required). (E) Schedule large rebalancing operations in a maintenance window to reduce impact on production workloads. (F) Prefer snapshot-based backups over logical backups during migrations so backup reads don't add to secondary load.
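The staggering in (A) can also be driven from a script: move one chunk, wait for replication to settle, repeat. A sketch—moveChunk and getMaxLagSec are hypothetical hooks you would wrap around sh.moveChunk() and an rs.status()-based lag probe, and the poll interval is shortened for illustration:

```javascript
// Sketch: migrate chunks one at a time, pausing whenever any secondary
// is over the lag budget.
async function staggeredMigration(chunks, moveChunk, getMaxLagSec, lagBudgetSec = 30) {
  const moved = [];
  for (const chunk of chunks) {
    // Wait until every secondary is back under the lag budget.
    while ((await getMaxLagSec()) > lagBudgetSec) {
      await new Promise(r => setTimeout(r, 50)); // short poll for the sketch
    }
    await moveChunk(chunk);
    moved.push(chunk);
  }
  return moved;
}
```

This turns replication lag itself into the pacing signal, so the migration automatically slows down exactly when the replica sets need it to.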
Follow-up: If you can't control the chunk migration rate, how would you architect a separate monitoring shard that consumes the oplog asynchronously without impacting production replica set replication?
Your MongoDB cluster experienced a hard power loss. After recovery, you restart the replica set. The primary comes up normally, but one secondary is stuck in "STARTUP2" state (secondary initialization) for over 30 minutes. The mongod logs show: "cannot apply oplog entry due to missing prerequisite entry" and "rollback in progress." When you check the oplog, the entries have timestamps that are ahead of the secondary's application point, but the entries themselves are corrupted or partially written (you see truncated BSON). How do you recover this secondary?
This is a crash recovery and oplog corruption scenario. Root causes: (1) The power loss caused an unclean shutdown. MongoDB's write-ahead logging (journal) should recover, but oplog entries at the tail of the oplog file may have been partially written (crash during a large write). (2) The secondary's in-memory state was lost. On restart, it tries to apply oplog entries from its saved checkpoint, but some of those entries are corrupted or truncated in the journal. (3) MongoDB's replication engine detects inconsistency and attempts a "rollback"—undoing local writes to reach a common state with the primary. (4) The corrupted entries prevent this rollback from completing, leaving the secondary stuck in STARTUP2.
Recovery strategy: (A) Try `mongod --repair` on the secondary in standalone mode; it discards data it cannot salvage. WARNING: repair may lose data, and a repaired replica set member should generally be resynced afterward anyway. (B) The reliable path is a clean initial sync: (1) Stop mongod on the secondary. (2) Empty the dbPath entirely (move the old files aside rather than deleting them if you want a forensic copy)—removing only the local.* files is not safe under WiredTiger, whose metadata spans the whole directory. (3) Restart mongod; with an empty dbPath it performs initial sync from the primary. (C) Verify the primary's oplog integrity: db.getSiblingDB('local').oplog.rs.find().sort({$natural: -1}).limit(1) to inspect the newest entry, and db.getSiblingDB('local').oplog.rs.validate() for a deeper structural check. (D) Going forward, prefer filesystem- or block-level snapshots on checksumming storage (e.g. ZFS) so corruption is caught closer to when it happens.
Follow-up: How would you design a system to detect and quarantine corrupted oplog entries in real-time so they don't cascade to secondaries or cause recovery issues?
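A starting point for this follow-up: validate each batch of entries before applying, and park anything malformed instead of crashing the applier. A sketch—the checks are illustrative (real oplog entries use BSON Timestamps rather than plain numbers, and a production validator would verify BSON structure, not just field presence):

```javascript
// Sketch: split a batch of oplog entries into applicable and quarantined.
// A truncated tail entry typically fails both checks: missing fields and
// a timestamp that breaks monotonic order.
function partitionOplogBatch(entries) {
  const good = [], quarantined = [];
  let lastTs = -Infinity;
  for (const e of entries) {
    const wellFormed =
      e && typeof e.ts === "number" && typeof e.op === "string" && "ns" in e;
    if (wellFormed && e.ts >= lastTs) {
      good.push(e);
      lastTs = e.ts;
    } else {
      quarantined.push(e); // park for inspection instead of halting replication
    }
  }
  return { good, quarantined };
}
```

Quarantined entries would go to a dead-letter collection with an alert; note that silently skipping them still diverges the secondary, so quarantine is a diagnostic aid, not a substitute for resync.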