MongoDB Interview Questions

WiredTiger Storage Engine Internals


Your MongoDB production database uses WiredTiger (the default storage engine since MongoDB 3.2) and stores 500GB of data. Disk space shows 450GB used. When you run a maintenance job deleting 100GB of old documents, disk usage drops to 350GB. But after restarting mongod, disk jumps back to 450GB. Explain what WiredTiger is doing with disk space and how to prevent this.

WiredTiger uses copy-on-write pages with MVCC snapshots. When you delete 100GB, WiredTiger marks the underlying pages as free in its block manager but doesn't return the space to the OS—the data files remain 450GB because deleted pages are simply flagged for reuse inside the file. Upon restart, WiredTiger reopens its files from the last checkpoint; the free blocks are still inside those files, so disk usage is reported at 450GB again.

This is normal behavior, not a bug. WiredTiger avoids compacting on deletion because compaction is expensive (it rewrites the collection file). Instead, disk space is reclaimed through: (1) WiredTiger's internal reuse pool—subsequent inserts and updates fill the free blocks first; (2) Running the compact command: `db.runCommand({ compact: "collection" })` rewrites the collection file and releases free space back to the OS (it blocked operations before MongoDB 4.4; since 4.4 it permits concurrent reads and writes); (3) Dropping and recreating the collection, which starts from a fresh file.

To verify fragmentation: `db.collection.stats()` shows storageSize (bytes on disk) vs size (uncompressed data size), and more precisely `stats().wiredTiger["block-manager"]["file bytes available for reuse"]` reports exactly how much of the file compact could reclaim. (Because of compression, storageSize is normally smaller than size, so a storageSize approaching or exceeding size is itself a fragmentation signal.) After deleting 100GB, run `db.runCommand({ compact: "collection" })` if you want immediate disk reclamation, but budget for minutes of heavy I/O depending on collection size—and on versions before 4.4 it blocks operations while it runs.
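The block-manager counters can be turned into a quick fragmentation check. A minimal sketch in Python, assuming a stats document already fetched from `db.collection.stats()`—the two WiredTiger stat names are real, but the sample numbers are illustrative:

```python
def reclaimable_fraction(stats):
    """Fraction of the on-disk file that compact could hand back to the OS
    (free blocks sitting inside the file)."""
    bm = stats["wiredTiger"]["block-manager"]
    file_size = bm["file size in bytes"]
    if file_size == 0:
        return 0.0
    return bm["file bytes available for reuse"] / file_size

# Illustrative numbers: a 450GB file with 100GB of freed blocks.
stats = {"wiredTiger": {"block-manager": {
    "file size in bytes": 450 * 1024**3,
    "file bytes available for reuse": 100 * 1024**3,
}}}
frac = reclaimable_fraction(stats)
assert 0.22 < frac < 0.23  # ~22% of the file is reclaimable
```

A threshold around 20-30% reclaimable is a common (judgment-call, not MongoDB-default) trigger for scheduling a compact.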

Follow-up: If compact() takes 20 minutes and locks the collection, how would you compact without downtime in production?

You configure WiredTiger cache_size to 40GB on a 256GB server running MongoDB. Your server has several other processes consuming 100GB. After a few hours, mongod crashes with OOM errors despite cache_size limit. Why doesn't WiredTiger respect the cache_size limit, and what's actually consuming memory?

cache_size limits only WiredTiger's internal cache—it doesn't include memory used by: (1) Operation scratch memory for sorting, grouping, and aggregation pipelines; (2) Index builds and in-memory sorter buffers; (3) Connection threads (each connection carries up to ~1MB of thread stack); (4) Replication buffers on secondaries (separate from the WiredTiger cache); (5) Allocator (tcmalloc) overhead and fragmentation, typically 10-20% of allocated memory.

In your case, mongod is consuming 40GB (cache) plus tens of GB of operation memory and allocator overhead. Combined with the other processes' 100GB and the OS page cache, total demand under load approaches the 256GB ceiling, and the kernel's OOM killer fires once it's crossed.

Fix: size the cache against what's actually available, not total RAM. Formula: cache_size = (total_memory - other_process_memory - OS/buffer reserve) * 0.5. In your case: (256 - 100 - 20) * 0.5 = 68GB, but set 60GB to be conservative. Configure it in mongod.conf: `storage: {wiredTiger: {engineConfig: {cacheSizeGB: 60}}}`. Monitor with `db.serverStatus().wiredTiger.cache` to watch actual cache usage and eviction rate.
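The sizing formula can be captured in a few lines. A sketch using the numbers from this scenario—the 20GB OS/buffer reserve and 0.5 fraction are the rule-of-thumb values from the text, not MongoDB defaults:

```python
def suggested_cache_gb(total_gb, other_processes_gb,
                       os_reserve_gb=20, fraction=0.5):
    """cache_size = (total - other processes - OS reserve) * fraction."""
    headroom = total_gb - other_processes_gb - os_reserve_gb
    return max(1.0, headroom * fraction)

cache = suggested_cache_gb(256, 100)
assert cache == 68.0  # then round down (e.g. to 60GB) to stay conservative
```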

Follow-up: How would you monitor WiredTiger memory to predict OOM before it happens?

You have a very write-heavy workload: 100K inserts/sec into a collection with 10 concurrent threads. WiredTiger reports high checkpoint time (30+ seconds between checkpoints) and your write latency is 200ms. When you increase the number of WT write threads from 4 to 12, latency drops to 50ms. But after a few hours, you notice disk I/O is maxed out. What's happening with WiredTiger's write threads?

WiredTiger keeps writes in its cache and uses background eviction threads to write modified pages out to disk. More threads drain the cache faster, but they also raise I/O pressure. At 100K inserts/sec of ~1KB documents (~100MB/sec of new data), the cache fills quickly. With 4 threads, the cache fills faster than it drains, so application threads stall on eviction and latency climbs; with 12 threads, draining keeps up and latency drops to 50ms. However, those 12 threads now write aggressively enough to saturate disk bandwidth.
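The arithmetic in that paragraph is worth checking. A quick back-of-envelope in Python—the 1KB average document size is the assumption stated in the question:

```python
inserts_per_sec = 100_000
avg_doc_bytes = 1024                       # ~1KB per document (assumed above)
ingest_mb_s = inserts_per_sec * avg_doc_bytes / 1024**2
assert round(ingest_mb_s) == 98            # ≈100MB/sec of new data

# Worst-case time for that stream to push a 64GB cache to its 95%
# eviction trigger if nothing were evicted at all:
cache_mb, trigger = 64 * 1024, 0.95
seconds_to_trigger = cache_mb * trigger / ingest_mb_s
assert 600 < seconds_to_trigger < 700      # ~10 minutes of headroom
```

In practice eviction runs continuously, so the real question is whether eviction throughput can sustain ~100MB/sec, not whether the cache fills.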

This is a tradeoff: more threads reduce latency but increase I/O. The optimal setting depends on: (1) Your storage (SSD vs HDD)—SSDs tolerate the extra random I/O; (2) The checkpoint interval and the eviction thresholds (WiredTiger's eviction_trigger and eviction_target); (3) Working set vs cache size—if the working set exceeds the cache, you'll thrash regardless of thread count.

To optimize: (1) Lower eviction_trigger (default 95) toward 80 so background eviction starts before application threads are forced to evict pages themselves; (2) Use SSD storage (better random I/O); (3) Increase cache_size if memory allows, to absorb spikes; (4) Identify which collections drive the writes with `mongotop` and watch dirty bytes in `db.serverStatus().wiredTiger.cache`, and check secondary replication lag to rule out replication as the bottleneck; (5) Consider sharding to distribute the write load across multiple nodes.

Follow-up: Given fixed storage (HDD), how would you design the application to handle 100K inserts/sec without maxing I/O?

You're investigating slow queries in production and use db.currentOp() to find a query that's been running for 5 minutes. The query's planSummary shows COLLSCAN on a 500MB collection. You check WiredTiger stats: cache pages evicted = 1M, evict_untracked_bytes = 500MB. The collection should fit entirely in your 64GB cache. Why is WiredTiger evicting pages when there's plenty of free cache?

WiredTiger's eviction is triggered when cache usage exceeds evict_target (default 80% of cache_size), not when cache is full. At 80% of 64GB (51.2GB), WiredTiger starts evicting pages to free space for new operations. However, evict_untracked_bytes = 500MB means you have 500MB of memory that WiredTiger can't track—this is usually pinned pages (live cursors, active transactions) that can't be evicted.

The slow COLLSCAN is evicting cache pages as it reads—each page of 500MB collection is loaded, then evicted to make room for the next page, defeating the cache entirely. This is typical for sequential scans larger than available cache.

Investigation: (1) Check whether the query uses an index: `explain()` should show IXSCAN, not COLLSCAN; if COLLSCAN, add an index matching the query predicate; (2) Inspect `db.serverStatus().wiredTiger.cache`—a rapidly climbing eviction-pass counter (exact stat names vary by version) indicates thrashing; (3) Watch dirty bytes in the cache—if they sit near WiredTiger's dirty limit (20% of cache by default), checkpoints aren't keeping up with incoming writes.
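Step (1) can be automated by walking the explain() plan. A simplified sketch that assumes the standard `queryPlanner.winningPlan` shape of explain output and follows only single `inputStage` chains (real plans can branch, e.g. for $or); the sample plans are illustrative:

```python
def uses_index(explain_doc):
    """Return True if any stage in the winning plan is an IXSCAN."""
    stage = explain_doc["queryPlanner"]["winningPlan"]
    while stage is not None:
        if stage.get("stage") == "IXSCAN":
            return True
        stage = stage.get("inputStage")   # descend the plan chain
    return False

collscan_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
ixscan_plan = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}}}
assert not uses_index(collscan_plan)
assert uses_index(ixscan_plan)
```

Running this over the slow-query log (or profiler output) surfaces unindexed queries before they become incidents.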

Fix: (1) Create an index to avoid the COLLSCAN; (2) If a full scan is genuinely necessary, increase cache_size or run it against a secondary with isolated resources; (3) Use cursor batching to bound memory per round trip: `cursor.batchSize(n)` instead of materializing the entire result at once.

Follow-up: If a legitimate query must do a COLLSCAN on a 500MB collection but you only have 32GB cache, how would you optimize WiredTiger settings?

You notice MongoDB's checkpoint time is increasing over time: starts at 5 seconds, grows to 20 seconds after a week. Meanwhile, db.serverStatus() shows cache dirty pages aren't increasing, but checkpoint_time_total is linearly growing. Your disk has 200GB free. What's causing checkpoint slowdown?

Checkpoint slowdown despite constant dirty pages suggests: (1) File fragmentation—checkpoints are written into each collection and index file, and repeated overwrites scatter blocks across the file, making checkpoint I/O increasingly seek-heavy; (2) A growing number of collections and indexes—each checkpoint must visit every dirty table and rewrite its metadata, so checkpoint work grows with table count even when data size is stable.

WiredTiger keeps the previous checkpoint until a new one completes, then frees its blocks for reuse inside the same file; over months of churn this free-block recycling fragments the files. Document-level concurrency doesn't help here—checkpoints operate at file granularity—and the fragmentation penalty is worst on HDDs.

Investigation: (1) Check checkpoint counters under `db.serverStatus().wiredTiger.transaction` ("transaction checkpoints" and "transaction checkpoint most recent time (msecs)"); (2) Trend those durations over time rather than reading a single sample; (3) Run `df -h` to confirm disk space is plentiful; (4) Check whether collection count is growing: `db.getCollectionNames().length`—if yes, checkpoint metadata work is growing too.

Fix: (1) Restart mongod—this clears in-memory state but not on-disk fragmentation, so treat it as a stopgap; (2) Run compact on the busiest collections: `db.runCommand({ compact: "collection" })` rewrites the file, removing fragmentation and shrinking what each checkpoint must walk; (3) Long term, monitor checkpoint duration and trigger off-peak compaction when it exceeds a threshold; (4) Use SSD storage to blunt the fragmentation penalty (random I/O is cheap).
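Point (3)—alerting on a checkpoint-duration threshold—can be sketched as a small check over recent samples of the WiredTiger stat "transaction checkpoint most recent time (msecs)" (found under `db.serverStatus().wiredTiger.transaction`); the baseline, factor, and window values below are illustrative:

```python
def checkpoint_degraded(recent_ms, baseline_ms=5000, factor=3, window=5):
    """Alert when the average of the last `window` checkpoint durations
    exceeds factor x the healthy baseline (5s in this scenario)."""
    tail = recent_ms[-window:]
    return sum(tail) / len(tail) > factor * baseline_ms

# Healthy: durations hovering around the 5s baseline.
assert not checkpoint_degraded([5000, 5200, 4900, 5100, 5300])
# Degraded: the week-old server, checkpoints approaching 20s.
assert checkpoint_degraded([5000, 18000, 19000, 21000, 20000, 22000])
```

Averaging over a window instead of alerting on one sample avoids paging on a single slow checkpoint caused by a transient I/O burst.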

Follow-up: Design a monitoring script to automatically detect and alert on checkpoint slowdown before it impacts queries.

Your application writes documents with embedded arrays that grow over time (e.g., user events array appended every second). After 6 months, a user document has 15MB array and queries on that user are slow. WiredTiger is spending time reading the entire 15MB document even for simple updates. How does WiredTiger handle document modifications, and how would you fix this?

WiredTiger stores each document as a single key/value pair in a B-tree; leaf pages default to a 32KB maximum for MongoDB collections, and values too large for a page spill into overflow pages. When you update a document, WiredTiger must: (1) Read the full document value into cache (for a 15MB document, spanning many overflow pages); (2) Apply the modification; (3) On reconciliation, rewrite the entire 15MB value back to disk. So even a one-field update on a 15MB document costs roughly 15MB of read and write I/O, causing high latency.

This is a schema design issue, not a WiredTiger limitation. Solutions: (1) Bucket the events into a separate collection: instead of one document with a 15MB array, store {userId, seq, events: [...]} documents holding ~1000 events each—appends and reads then touch small documents; (2) Use a capped collection for events: old events auto-expire, keeping size bounded; (3) Store events in a time-series collection (MongoDB 5.0+): designed for append-only workloads, with automatic bucketing and compression; (4) If you must keep the embedded array, append with `{$push: ...}` rather than rewriting the array client-side—the document is still rewritten on disk, but you avoid shipping 15MB over the wire.
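Option (1), the bucketing pattern, can be sketched as a pure function that splits a flat event list into fixed-size bucket documents—the field names `userId`, `seq`, and `events` are illustrative, not a prescribed schema:

```python
def bucket_events(user_id, events, bucket_size=1000):
    """Split one user's events into documents of at most bucket_size each,
    so appends and reads touch a small document instead of a 15MB one."""
    return [
        {"userId": user_id, "seq": i // bucket_size,
         "events": events[i:i + bucket_size]}
        for i in range(0, len(events), bucket_size)
    ]

docs = bucket_events("u1", list(range(2500)))
assert len(docs) == 3
assert [len(d["events"]) for d in docs] == [1000, 1000, 500]
```

At write time the application appends to the highest-`seq` bucket and opens a new one when it reaches `bucket_size`, so each update rewrites at most one small document.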

Migration: write new events into a separate `userEvents` collection and stop embedding arrays in user documents. Backfill the existing 15MB arrays into that collection with a batch job. The query then becomes `db.userEvents.find({userId}).limit(100)` instead of `db.users.find({_id}, {events: {$slice: 100}})`.

Follow-up: How would you migrate from embedded arrays to separate collection without downtime or data loss?

You're running MongoDB 4.4 with WiredTiger and see high memory usage spikes during bulk insert operations. Each bulk insert operation briefly peaks at 5GB memory despite cache_size = 8GB. The bulk insert processes 1M documents (1KB each ~1GB data). After the bulk insert completes, memory drops back to normal. Explain the memory spike and how to prevent it.

During bulk inserts, WiredTiger maintains: (1) The incoming documents as dirty pages in cache (~1GB of data here); (2) Transaction state for the in-flight operations (cursors, snapshots, and per-update structures retained until commit); (3) Index update buffers—each insert updates every index, generating additional dirty pages; (4) Oplog writes for replication. Combined, these push peak memory well above the steady-state footprint.

This is expected for bulk operations. The ~5GB peak from ~1GB of input is plausible once you account for index maintenance (often a few multiples of the raw data size, depending on index count and key size) plus transaction state and allocator overhead.

Prevention: (1) Use unordered bulk operations: `collection.insertMany(docs, {ordered: false})` lets the server parallelize and fail independently per document; (2) Batch smaller chunks: instead of 1M docs at once, insert 100K at a time with pauses—this bounds each spike; (3) For initial loads, create secondary indexes after the load instead of before: drop them with `db.collection.dropIndex("indexName")`, bulk insert, then `db.collection.createIndex(...)`—index maintenance during the load is what multiplies memory; (4) Throttle the bulk inserts, or temporarily raise cacheSizeGB if memory allows.

Monitoring: Watch `db.serverStatus().memory` during bulk inserts. If resident memory approaches available system memory, you're at OOM risk. Implement application-level backpressure: if memory spike detected, pause bulk insert and retry.
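Option (2) plus the backpressure idea combine naturally in an application-side loop. A sketch where `insert_batch` and `resident_memory_gb` are hypothetical stand-ins for pymongo's `collection.insert_many` and a check of `db.serverStatus()['mem']['resident']`:

```python
import itertools
import time

def chunked(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def bulk_load(docs, insert_batch, resident_memory_gb,
              batch_size=100_000, memory_limit_gb=200, pause_s=1.0):
    """Insert in bounded batches, pausing while resident memory is spiking."""
    for batch in chunked(docs, batch_size):
        while resident_memory_gb() > memory_limit_gb:
            time.sleep(pause_s)       # backpressure: let eviction/checkpoint drain
        insert_batch(batch)

# Offline check of the chunking logic:
sizes = [len(c) for c in chunked(range(250_000), 100_000)]
assert sizes == [100_000, 100_000, 50_000]
```

The batch size bounds the per-operation transaction state, and the memory check turns the OOM risk into a pause instead of a crash.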

Follow-up: Design a bulk insert strategy for loading 1TB dataset that minimizes memory and disk I/O overhead.

You diagnose a production incident: a runaway query scanned 2B documents in WiredTiger cache, causing cache thrashing and 30-minute latency spike. db.serverStatus().wiredTiger shows cache_bytes_dirty = 30GB (not flushing), cache_eviction_fails = 50K, cache_evict_pass = 100K (high). The query was killed but damage was done. How do you prevent this, and what does each metric tell you?

Each metric indicates: (1) cache_bytes_dirty = 30GB: 30GB of unflushed changes in cache—checkpoint is stalled or blocked; (2) cache_eviction_fails = 50K: eviction attempted 50K times but failed (no evictable pages because they're all dirty or pinned); (3) cache_evict_pass = 100K: 100K full eviction passes through cache—excessive recycling indicating cache thrashing.

The query scanning 2B documents loads pages into cache faster than they can be evicted, causing eviction to fail repeatedly. This creates a positive feedback loop: eviction fails, query continues, more data loads, eviction fails again. Meanwhile, cache_bytes_dirty grows because checkpoint can't complete (blocked by query cursors).

Prevention: (1) Kill runaways: find operations via `db.currentOp()` whose secs_running exceeds your limit (say 600) and terminate them with `db.killOp(opid)`; (2) Set maxTimeMS on queries: `db.collection.find().maxTimeMS(30000)` aborts anything over 30s server-side; (3) Detect and alert on COLLSCAN queries: anything unindexed should be logged and investigated before it reaches production scale; (4) Rely on WiredTiger's concurrency tickets, which cap concurrent read/write operations (tunable, and sized dynamically in recent versions) to bound thrash; (5) Monitor cache eviction metrics and alert if eviction failures exceed ~1000 in a minute.
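Point (5) can be expressed as a delta check over two samples of the cumulative counters. The metric names below follow the shorthand used in this question; the exact serverStatus field names vary by version:

```python
def eviction_alert(prev, curr, interval_s, fails_per_min_limit=1000):
    """Alert if eviction failures per minute exceed the limit between two
    cumulative counter samples taken interval_s apart."""
    delta = curr["cache_eviction_fails"] - prev["cache_eviction_fails"]
    per_min = delta * 60.0 / interval_s
    return per_min > fails_per_min_limit

prev = {"cache_eviction_fails": 10_000}
curr = {"cache_eviction_fails": 12_000}   # 2,000 failures in the last 60s
assert eviction_alert(prev, curr, interval_s=60)
assert not eviction_alert(prev, prev, interval_s=60)
```

Because the counters are cumulative since startup, always alert on the rate between samples, never the absolute value.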

To recover from thrash after incident: (1) Restart mongod (clears cache); (2) Analyze query that caused it: check explain() plan and add appropriate indexes; (3) Implement query allowlist to prevent similar queries; (4) Schedule off-peak time to compact collections and rebuild indexes.

Follow-up: Design a monitoring dashboard to track WiredTiger health and alert on eviction failure patterns before they cause outages.
