Your aggregation pipeline on a 2TB orders collection runs in 45 seconds but should complete in under 5 seconds. The pipeline: db.orders.aggregate([{$match: {status: "completed", createdAt: {$gte: ISODate("2024-01-01")}}}, {$group: {_id: "$userId", total: {$sum: "$amount"}}}, {$sort: {total: -1}}]). You have an index on {status: 1, createdAt: 1}. Explain what's happening and how to optimize.
The pipeline performs $match (filtering 50M documents down to 10M), then $group (aggregating all 10M), then $sort (sorting ~2M unique userIds). At 45 seconds, the bottleneck is likely the $group stage: the {status: 1, createdAt: 1} index satisfies the $match, but all 10M matching documents must still be fetched in full and fed through a hash-based $group, because no index supplies userId and amount.
Optimization: Extend the index to {status: 1, createdAt: 1, userId: 1, amount: 1}. The pipeline depends only on status, createdAt, userId, and amount, so the $match and $group can then be served entirely from index keys (a covered plan: IXSCAN with totalDocsExamined: 0) instead of fetching 10M full documents, which is usually the dominant cost. Note that $group itself is hash-based and never uses an index directly; the win comes from avoiding the document fetches. Verify with `db.orders.aggregate([...], {explain: true})` and check for IXSCAN vs COLLSCAN.
Run the explain command: `db.orders.aggregate([{$match: {status: "completed", createdAt: {$gte: ISODate("2024-01-01")}}}, {$group: {_id: "$userId", total: {$sum: "$amount"}}}, {$sort: {total: -1}}], {explain: true})`. Look for "stage": "COLLSCAN"—if present, the $match didn't use an index. Check "nReturned" vs "totalDocsExamined"—if they're vastly different, you're scanning too much.
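When explain output is captured programmatically, the two checks above can be automated. A minimal Node.js sketch; the explain-document shape here is simplified and assumed (real output nests stages under winningPlan/executionStages and varies by version):

```javascript
// Walk a (simplified) explain tree and flag the two problems above:
// a COLLSCAN stage, or examining far more documents than are returned.
// NOTE: this assumes a flattened {stage, inputStage} shape for illustration.
function findStage(node, name) {
  if (!node) return false;
  if (node.stage === name) return true;
  return findStage(node.inputStage, name);
}

function diagnose(executionStats) {
  const { nReturned, totalDocsExamined, executionStages } = executionStats;
  const issues = [];
  if (findStage(executionStages, "COLLSCAN")) {
    issues.push("COLLSCAN: $match is not using an index");
  }
  // Examining >10x the documents returned suggests a poorly selective plan.
  if (totalDocsExamined > 10 * Math.max(nReturned, 1)) {
    issues.push(`scan ratio ${totalDocsExamined}/${nReturned}: scanning too much`);
  }
  return issues;
}

// Example: stats resembling an unindexed, unselective scan.
const stats = {
  nReturned: 1_000_000,
  totalDocsExamined: 50_000_000,
  executionStages: { stage: "COLLSCAN", inputStage: null },
};
console.log(diagnose(stats)); // both issues flagged
```

The same walker can be pointed at each stage of an aggregation explain to find which stage dominates.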
Follow-up: If adding an index isn't possible due to storage constraints, what other optimization techniques would you use? How would you use $limit to short-circuit the pipeline?
You're building a real-time dashboard aggregation that needs to complete in <100ms across a 500M document collection. The query groups by region, calculates 10 aggregations (sum, avg, min, max, etc.) per region, and returns top 5 regions. Current implementation uses a single $group stage with all 10 accumulators. Your MongoDB instance has 64GB RAM. How would you optimize for latency?
A single $group stage scanning 500M documents will never hit 100ms. Optimization strategy: (1) allowDiskUse: true enables spilling for large groups, but it adds latency, the opposite of what you need here; (2) Pre-aggregate into a summary collection (or a time-series collection) refreshed every minute, then query the pre-aggregated data, cutting the scan from 500M documents to a few thousand; (3) Push $match and $limit as early as possible to shrink what flows through the pipeline, keeping in mind that a $limit before $group changes the results and is only valid if a sampled answer is acceptable; (4) Route the aggregation to a secondary (read preference secondaryPreferred) to keep load off the primary.
Best approach for a real-time dashboard: create a summary collection updated by a scheduled aggregation every 5 minutes, and have the dashboard query the summary instead. If you need near-real-time granularity, use time-series collections (MongoDB 5.0+) with $setWindowFields (5.0+) and $densify (5.1+), which are optimized for time-series analytics.
Example pre-aggregation: note that `db.createCollection` with `viewOn` defines a standard (non-materialized) view, which re-runs its pipeline on every read and won't help here. Instead, materialize with $merge on a schedule: `db.events.aggregate([{$match: {createdAt: {$gte: new Date(Date.now() - 3600000)}}}, {$group: {_id: "$region", total: {$sum: "$amount"}, ...}}, {$merge: {into: "region_summary", whenMatched: "replace"}}])`. The dashboard then reads `db.region_summary.find().sort({total: -1}).limit(5)`, which takes a few milliseconds against a tiny collection.
Follow-up: How would you invalidate the materialized view when new data arrives? Would you use change streams to trigger re-aggregation?
Your aggregation pipeline uses $lookup to join users collection (50K docs) with orders collection (10M docs) using {localField: "userId", foreignField: "_id"}. The $lookup happens after $match filters to 5M orders. Pipeline: [{$match: {...}}, {$lookup: {from: "users", ...}}, {$project: {...}}]. This runs in 30 seconds and uses 8GB memory. How do you optimize?
The $lookup executes after $match has filtered to 5M documents, then performs a per-document join: for each of 5M orders, look up the user. Even with an index on users._id, 5M index probes are expensive, and the 8GB memory usage indicates MongoDB is embedding the full user document into every joined result.
Optimizations: (1) Trim the joined payload with a $lookup sub-pipeline: {$lookup: {from: "users", let: {userId: "$userId"}, pipeline: [{$match: {$expr: {$eq: ["$_id", "$$userId"]}}}, {$project: {name: 1, email: 1}}], as: "user"}}; the inner $project keeps only name and email, so far less data is buffered per order, directly attacking the 8GB; (2) Use $limit before $lookup to reduce cardinality: if you only need the top 100 orders, add {$limit: 100} before the $lookup so only 100 joins run; (3) Pre-join in the application: fetch the orders, collect the unique userIds, fetch those users in one batched $in query, and merge in application code, trading pipeline simplicity for explicit control over batching and memory.
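Option (3), the application-side pre-join, replaces millions of per-document lookups with one batched $in fetch per page of orders. A sketch of the merge logic; plain arrays stand in for the hypothetical orders/users collections so it is self-contained:

```javascript
// Application-side pre-join: collect unique userIds from the order batch,
// fetch all matching users in one query ({_id: {$in: [...]}}), then merge.
const orders = [
  { _id: 1, userId: "u1", amount: 50 },
  { _id: 2, userId: "u2", amount: 75 },
  { _id: 3, userId: "u1", amount: 20 },
];
const users = [
  { _id: "u1", name: "Ada" },
  { _id: "u2", name: "Lin" },
];

function preJoin(orderBatch, fetchUsersByIds) {
  // 1. Deduplicate userIds so each user is fetched exactly once.
  const ids = [...new Set(orderBatch.map((o) => o.userId))];
  // 2. One batched fetch; with a real driver this would be
  //    usersColl.find({_id: {$in: ids}}).toArray().
  const userDocs = fetchUsersByIds(ids);
  const byId = new Map(userDocs.map((u) => [u._id, u]));
  // 3. Merge in memory: O(orders + users) instead of one lookup per order.
  return orderBatch.map((o) => ({ ...o, user: byId.get(o.userId) ?? null }));
}

const joined = preJoin(orders, (ids) => users.filter((u) => ids.includes(u._id)));
console.log(joined[0].user.name); // prints "Ada"
```

In production the batch would be one page of results, so memory stays bounded by the page size rather than the full 5M-document join.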
Best practice: denormalize user data into the orders collection at write time if you join them frequently. Example: store {userId: ..., userName: ...} in orders to avoid $lookup entirely. In the explain output, the slot-based engine reports the join as an EQ_LOOKUP stage with a strategy field: HashJoin builds an in-memory hash table (memory-intensive), while IndexedLoopJoin probes the foreign collection's index per document and is usually far lighter on memory.
Follow-up: If you denormalize, how do you keep the denormalized userName in sync when users update their names?
You're debugging a pipeline that processes monthly reporting: [{$facet: {summary: [...], details: [...]}}]. You expected $facet to run both sub-pipelines in parallel, but one takes 60 seconds (details) while the other takes 2 seconds (summary). Your application waits 60 seconds for both results. You have 16 cores available. Explain the bottleneck and how to optimize.
$facet does not actually parallelize its sub-pipelines: both run inside a single aggregation, which is essentially single-threaded, so the 16 cores don't help. Both results are returned together as a single document, so the pipeline cannot respond until the slowest sub-pipeline finishes: with details at 60s and summary at 2s, the application waits roughly 60s for the entire response.
Optimization: Split into two separate aggregations if you don't need atomic consistency: execute summary and details independently and merge the results in application code. This lets the frontend display summary immediately while details loads asynchronously, and the two commands can genuinely run in parallel on separate connections. Alternatively, tighten the $match or add a $limit inside the details sub-pipeline to reduce its cardinality (a $skip does not reduce the work done upstream).
For reporting queries, consider background pre-computation: run both pipelines on a secondary every hour and cache the results, then serve from cache. This trades freshness for latency, which is usually acceptable for monthly reporting.
Diagnostic command: `db.collection.aggregate([...], {explain: true})` to see stages for each $facet sub-pipeline. Check "executionStages" for each and identify which is consuming most time (look for COLLSCAN or SORT stages). If details is doing a full collection scan, add an index to enable IXSCAN.
Follow-up: If details actually needs all 60 seconds (legitimate complex analysis), how would you refactor to show partial results to users faster?
Your aggregation uses $graphLookup to find all ancestors of a document in a hierarchical tree (org chart with 1M employees, max depth 20 levels). The pipeline: [{$match: {_id: empId}}, {$graphLookup: {from: "employees", startWith: "$reportingManagerId", connectFromField: "reportingManagerId", connectToField: "_id", as: "ancestors", restrictSearchWithMatch: {active: true}}}]. This returns 18 ancestors and takes 8 seconds. How would you optimize?
$graphLookup performs a breadth-first search (BFS): start at one node, find connected nodes, recurse. The levels are strictly sequential, so 20 levels means up to 20 dependent rounds of lookups, and each round can touch many documents when the connectToField match combined with the restrictSearchWithMatch filter isn't served by a suitable index. That sequential fan-out explains the 8-second latency.
Optimization approaches: (1) Denormalize the ancestor chain at write time: store an ancestors array directly in the employee doc, which eliminates the BFS at read time but adds write complexity; (2) Use a materialized path pattern: store the manager path as "/ceo/manager1/manager2/" for fast prefix queries without recursion; (3) Cap recursion with maxDepth if 20 levels is overkill: add maxDepth: 5 if you only need the immediate chain; (4) Cache ancestor relationships in Redis: build the cache during off-peak hours, query it in real time.
For org charts specifically, use the nested set model or a closure table: pre-compute all ancestor-descendant relationships in a separate collection, and a simple index lookup then returns all ancestors in a few milliseconds. Example closure table row: {ancestorId: A, descendantId: B, depth: 3}, one row per ancestor-descendant pair.
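The closure-table pattern can be exercised without MongoDB: every (ancestorId, descendantId, depth) pair is written once when an employee is added, and reading all ancestors becomes a single filtered lookup. A self-contained sketch with an in-memory array standing in for the closure collection:

```javascript
// Closure table: one row per ancestor/descendant pair, written at hire time.
// Reading all ancestors is then a single indexed query
// (db.closure.find({descendantId: emp}) with an index on descendantId)
// instead of a level-by-level $graphLookup BFS.
const closure = [];

// Insert an employee under a manager: copy the manager's ancestor rows
// (with depth + 1) and add the direct manager/report row.
function addEmployee(empId, managerId) {
  if (managerId == null) return; // root of the org chart
  for (const row of closure.filter((r) => r.descendantId === managerId)) {
    closure.push({ ancestorId: row.ancestorId, descendantId: empId, depth: row.depth + 1 });
  }
  closure.push({ ancestorId: managerId, descendantId: empId, depth: 1 });
}

function ancestors(empId) {
  return closure
    .filter((r) => r.descendantId === empId)
    .sort((a, b) => a.depth - b.depth)
    .map((r) => r.ancestorId);
}

addEmployee("vp", "ceo");
addEmployee("mgr", "vp");
addEmployee("eng", "mgr");
console.log(ancestors("eng")); // [ 'mgr', 'vp', 'ceo' ]
```

The tradeoff vs $graphLookup: reads shrink to one indexed query, but every insert fans out O(depth) rows, and moving a subtree means rewriting closure rows for all of its descendants.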
Diagnostic: Check explain output for $graphLookup stage—if it shows high "totalDocsExamined" relative to "totalKeysExamined", the restrictSearchWithMatch filter is too loose (scanning all active employees per level). Tighten the filter or add more indexes.
Follow-up: Design a schema to store org hierarchy for O(1) ancestor lookup. What are the tradeoffs vs $graphLookup?
Your analytics pipeline needs to calculate a percentile (95th) for latency metrics across 100M requests. You're using $group with a $push accumulator: {_id: "$service", latencies: {$push: "$latencyMs"}} followed by $project with $percentile (MongoDB 7.0). This runs in 25 seconds and uses 12GB RAM. All 100M documents fit in RAM (server has 256GB). Why is it slow?
$push doesn't sort; it appends every latency into one array per service. With 100M values spread across those arrays, each group blows far past the 100MB per-stage memory limit, forcing repeated spills to disk (given allowDiskUse) plus heavy allocation and copying, which explains the 25-second latency even though the raw data fits comfortably in RAM.
Optimization: (1) Drop the $push entirely: in MongoDB 7.0, $percentile is itself a $group accumulator, so {$group: {_id: "$service", p95: {$percentile: {input: "$latencyMs", p: [0.95], method: "approximate"}}}} computes a t-digest-based approximation in bounded memory with no per-group array; (2) Pre-bucket latencies into ranges with {$bucketAuto: {groupBy: "$latencyMs", buckets: 100}}, then estimate the percentile from bucket counts; (3) Run the aggregation on a replica-set secondary (read preference secondaryPreferred) to keep load off the primary; (4) Keep allowDiskUse: true only as a safety valve; spilling is precisely what you're trying to avoid.
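The bucketing idea in (2) trades exactness for bounded memory: a fixed number of counters replaces 100M raw values, and the percentile is read off cumulative counts. A sketch with fixed-width buckets (the 10ms width and 1s upper bound are arbitrary assumptions):

```javascript
// Approximate percentile from fixed-width histogram buckets.
// Memory is O(buckets) no matter how many samples stream through,
// unlike $push, which buffers every raw value per group.
function makeHistogram(bucketWidthMs, maxMs) {
  const counts = new Array(Math.ceil(maxMs / bucketWidthMs)).fill(0);
  let total = 0;
  return {
    add(latencyMs) {
      const i = Math.min(counts.length - 1, Math.floor(latencyMs / bucketWidthMs));
      counts[i] += 1;
      total += 1;
    },
    // Walk cumulative counts; report the upper edge of the bucket where
    // the requested quantile falls (error is at most bucketWidthMs).
    percentile(p) {
      const target = p * total;
      let seen = 0;
      for (let i = 0; i < counts.length; i++) {
        seen += counts[i];
        if (seen >= target) return (i + 1) * bucketWidthMs;
      }
      return maxMs;
    },
  };
}

const hist = makeHistogram(10, 1000); // 10ms buckets up to 1s (assumed bounds)
for (let v = 1; v <= 100; v++) hist.add(v); // latencies 1..100ms
console.log(hist.percentile(0.95)); // 100 (true p95 is 95; error <= 10ms)
```

MongoDB's approximate $percentile uses a t-digest internally rather than fixed buckets, but the memory-for-accuracy tradeoff is the same.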
For time-series data like latencies, consider MongoDB time-series collections (5.0+): they store measurements in compressed, column-oriented buckets, which shrinks the working set and can substantially speed up scan-heavy aggregations like this one.
Diagnostic: Run with `explain: true` and check the $group stage for "usedDisk": true, which confirms spilling; a blocking SORT stage is another red flag, since a percentile doesn't require materializing and sorting all values (a streaming sketch suffices). Check resident memory with `db.serverStatus().mem.resident` to confirm whether you're thrashing disk.
Follow-up: Implement a streaming percentile algorithm that doesn't require storing all values. How would you adapt it for MongoDB?
You're implementing real-time anomaly detection: for each incoming event, calculate whether it's an outlier based on the distribution of the last 10,000 similar events. Your pipeline uses $facet to compute mean and stddev, then a $project to flag outliers. This needs to complete in <500ms per event. You're currently running per-event aggregations (10,000 scans each), causing high CPU and slow response. How would you redesign?
Per-event aggregations are inefficient—you're recalculating mean/stddev from scratch for each event. Better approach: pre-compute rolling statistics in a summary collection, updated incrementally.
Design: Maintain a stats collection with {_id: eventType, count: 10000, sum: S, sumSq: SS, mean: M, stddev: SD} updated via change streams. When new event arrives, check if abs((value - M) / SD) > 3 (3-sigma rule for outliers). This is O(1) lookup vs O(10000) scan.
Implementation: (1) Create change stream listener on events collection; (2) For each INSERT, atomically update stats: db.stats.findOneAndUpdate({_id: eventType}, {$inc: {count: 1, sum: value, sumSq: value*value}, $set: {lastUpdated: now}}); (3) Trigger recalculation of mean/stddev in background every 1000 inserts; (4) Query stats in real-time to detect outliers.
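Steps (2) and (4) work because mean and stddev are derivable from the three counters alone, so detection never scans documents. A sketch of that math (3-sigma threshold assumed; note that sumSq/count - mean² can lose precision for very large values, where Welford's algorithm is the robust alternative):

```javascript
// Rolling statistics from the {count, sum, sumSq} counters maintained by
// the $inc updates in step (2). Detection is O(1): derive mean/stddev and
// apply the 3-sigma rule, with no 10,000-document scan per event.
function deriveStats({ count, sum, sumSq }) {
  const mean = sum / count;
  // Population variance; clamp at 0 to absorb floating-point error.
  const variance = Math.max(0, sumSq / count - mean * mean);
  return { mean, stddev: Math.sqrt(variance) };
}

function isOutlier(value, counters, sigmas = 3) {
  const { mean, stddev } = deriveStats(counters);
  if (stddev === 0) return value !== mean; // degenerate: all samples equal
  return Math.abs(value - mean) / stddev > sigmas;
}

// Counters equivalent to 1000 samples of 10 and 1000 samples of 20:
// mean = 15, stddev = 5, so 100 is an outlier and 18 is not.
const counters = { count: 2000, sum: 30000, sumSq: 500000 };
console.log(isOutlier(100, counters), isOutlier(18, counters)); // true false
```

Because the counters are plain numbers, the same derivation can run inside a change-stream handler or directly in the request path.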
If you need fresh statistics, run async recalculation: `db.events.aggregate([{$match: {type: eventType, createdAt: {$gte: new Date(Date.now() - 86400000)}}}, {$group: {_id: null, count: {$sum: 1}, mean: {$avg: "$value"}, stddev: {$stdDevSamp: "$value"}}}], {allowDiskUse: true})` in the background, then write the results to a cache.
Follow-up: How would you handle the case where incoming events have a bursty distribution (e.g., 1000 events then idle for 10 minutes)? Would rolling statistics bias the anomaly detection?
Your aggregation includes a $redact stage to mask sensitive fields based on user role: {$redact: {$cond: [{$eq: [user.role, "admin"]}, "$$KEEP", {$cond: [{$in: ["$field", user.allowedFields]}, "$$KEEP", "$$PRUNE"]}]}}. This runs on 5M documents but the $in operator is expensive. When you run it, the pipeline takes 45 seconds instead of 2 seconds without $redact. How would you optimize data masking?
The $redact with $in is O(n*m) where n is documents and m is allowedFields array size. With 5M docs and checking each field against allowedFields, this is a performance killer. The 45-second latency confirms this.
Optimization: (1) Pre-filter at query time: instead of $redact, choose what to query based on user role, with different roles hitting different collections or views; (2) Use MongoDB views with role-based projections: create separate views for admin vs. user, each exposing a different field set; (3) Move masking to the application layer after the cursor fetch: $redact is powerful but slow, so retrieve the documents and filter/mask in code, which is often faster in practice and far easier to test; (4) If you must stay in the pipeline, expand the $in into an explicit $or of $eq comparisons ({$or: [{$eq: ["$field", "allowedField1"]}, {$eq: ["$field", "allowedField2"]}, ...]}), or better, inline each role's allowed fields as constants when building the pipeline so nothing is evaluated generically per document.
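Option (3), application-layer masking, reduces the per-document cost to O(fields) with O(1) Set membership tests, while the aggregation itself stays at its 2-second no-$redact baseline. A sketch (the role-to-field mapping is hypothetical):

```javascript
// Application-layer masking: fetch documents unmodified, then strip any
// field the role may not see. Set.has() is O(1) per field, so the whole
// pass is O(docs x fields) with a tiny constant factor.
const ROLE_FIELDS = {
  // Hypothetical role -> visible-field mapping; admins bypass masking.
  analyst: new Set(["_id", "region", "amount"]),
  support: new Set(["_id", "region"]),
};

function maskForRole(doc, role) {
  if (role === "admin") return doc; // admins see everything
  const allowed = ROLE_FIELDS[role] ?? new Set(["_id"]); // deny by default
  return Object.fromEntries(
    Object.entries(doc).filter(([field]) => allowed.has(field))
  );
}

const doc = { _id: 1, region: "emea", amount: 99, ssn: "000-00-0000" };
console.log(maskForRole(doc, "analyst")); // { _id: 1, region: 'emea', amount: 99 }
console.log(maskForRole(doc, "admin").ssn); // 000-00-0000
```

Centralizing ROLE_FIELDS in one module also makes the masking rules unit-testable, which a $redact expression buried in a pipeline is not.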
Best practice: security/masking should happen at the application or query level, not in the aggregation pipeline. A plain projection, db.collection.find({...}, {sensitiveField: 0}), is faster than $redact for simple cases. For protection stronger than masking, use Client-Side Field Level Encryption (MongoDB 4.2+) or Queryable Encryption (MongoDB 7.0+), where the driver encrypts sensitive fields so they are never readable without the keys, regardless of who queries the server.
Follow-up: Design a schema and query strategy to support role-based field masking for 50 different user roles without creating 50 separate collections.