MongoDB Interview Questions

Indexing Strategies: Compound, Covered, Partial


You have a 100M document orders collection queried frequently with multiple patterns: (1) Find orders by userId and date range; (2) Find by status and amount range; (3) Find by region and last-updated timestamp. You create separate indexes for each, but MongoDB reports 5 indexes on the collection. Your insert latency increased 40% due to index maintenance. How would you design compound indexes to support all three query patterns efficiently with fewer indexes?

Each index adds write overhead—every insert must update every index. However, no single compound index can support all three query patterns, because each pattern filters on a different leading field and a compound index is only usable via a prefix. The ESR rule (Equality, Sort, Range) guides the design: (1) {userId: 1, createdAt: 1} supports query 1 (equality on userId, range on createdAt); (2) {status: 1, amount: 1} supports query 2; (3) {region: 1, updatedAt: 1} supports query 3. Three indexes is the minimum for these patterns; of the 5 reported, one is the mandatory _id index, and any remaining extra is a candidate for auditing and removal.

If query patterns overlapped more, you could try to merge—but prefixes matter. An index like {userId: 1, region: 1, createdAt: 1} serves query 1 through its {userId, ...} prefix, yet it cannot serve query 3 efficiently: a query filtering only on region and updatedAt skips the leading userId field, and a compound index must be traversed in key order from the left.

Better approach: (1) Keep the 3 focused indexes as planned; (2) Optimize write performance elsewhere: batch writes with `insertMany()` or `bulkWrite()` so index maintenance is amortized over fewer round trips; (3) Audit index usage with `db.collection.aggregate([{$indexStats: {}}])`, which reports how often each index has served an operation, and drop unused ones with `db.collection.dropIndex()`. Avoid `db.collection.reIndex()` for this: it blocks the collection and, since MongoDB 4.2, can only be run on standalone instances, so it is not a production maintenance tool.
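
The usage audit can be sketched in mongosh; the collection and index names here are hypothetical:

```javascript
// Sketch (mongosh): find indexes that never serve queries, then drop them.
db.orders.aggregate([{ $indexStats: {} }]).forEach(s => {
  // accesses.ops counts operations served by this index since server startup
  print(`${s.name}: ${s.accesses.ops} ops since ${s.accesses.since}`);
});

// Drop a confirmed-unused index by name (hypothetical name):
db.orders.dropIndex("status_1_amount_1");
```

Check the counters across all replica-set members before dropping—an index unused on the primary may still serve secondary reads.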

Verify compound index usage: run `db.orders.find({userId: id, createdAt: {$gte: t1}}).explain("executionStats")` and check that the winning plan's stage is IXSCAN (not COLLSCAN). If it is, confirm "totalKeysExamined" is close to "nReturned"—a ratio near 1 means the index itself is doing the filtering.

Follow-up: If you had to support all 3 query patterns with ONLY 1 index, how would you redesign your queries or schema?

You create a compound index {status: 1, createdAt: 1, amount: -1} for the query: db.orders.find({status: "completed", amount: {$gt: 100}}).sort({amount: -1}). You expect MongoDB to use the index for both filtering and sorting, but explain shows an IXSCAN for the filter followed by an in-memory SORT stage. Why isn't the index sorting?

The issue is index field ordering. For a compound index to supply a sort, the sort field must immediately follow the fields filtered by equality, in matching (or exactly reversed) direction. In your index {status: 1, createdAt: 1, amount: -1}:

The query filters on status (equality) and amount (range), and sorts by amount descending. Within each status value the index entries are ordered by createdAt, not by amount, because createdAt sits between the equality field and the sort field. MongoDB can still use the status prefix for an IXSCAN, but the entries come back out of amount order, forcing an in-memory SORT.

Correct index for this query: {status: 1, amount: -1}. The status equality pins a contiguous run of entries already ordered by amount descending; the $gt: 100 range merely bounds that same ordered run, so filtering and sorting happen in one index pass. If other queries need createdAt, give them their own index, e.g. {status: 1, createdAt: 1}.

Verify: run explain again with {status: 1, amount: -1} and check that "executionStats.executionStages" contains no SORT stage. No SORT stage means the index supplied the ordering; a SORT stage means an in-memory sort, which is slower and subject to the sort memory limit.

Rule (ESR): equality fields first, then sort fields, then range fields. A range-filtered field placed before the sort field blocks index-supplied sorting; a range on the sort field itself does not.
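
A minimal mongosh sketch of the fix, assuming the `orders` collection from the question:

```javascript
// ESR-ordered index: equality field first, then the sort/range field.
db.orders.createIndex({ status: 1, amount: -1 })

db.orders.find({ status: "completed", amount: { $gt: 100 } })
  .sort({ amount: -1 })
  .explain("executionStats")
// Expect: the winning plan is an IXSCAN with no SORT stage in executionStages.
```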

Follow-up: Design an index for queries that filter on status, range on amount, sort by amount, and also support a separate query filtering by region and sorting by createdAt. Can one index serve both?

You optimize a query by creating a covered index {userId: 1, status: 1, amount: 1} for: db.orders.find({userId: id, status: "completed"}, {_id: 0, status: 1, amount: 1}). Explain shows "executionStages": {stage: "IXSCAN"} with "totalDocsExamined": 0. This means MongoDB never touched the collection, reading entirely from the index—perfect! However, you notice this only works if you explicitly project {_id: 0}. Why?

By default, query projections return the _id field, and a covered query requires every returned field to be present in the index. Your index {userId: 1, status: 1, amount: 1} does not store _id—contrary to a common assumption, secondary indexes do not implicitly include it—so if the projection returns _id, MongoDB must fetch each document from the collection just to supply it, defeating coverage.

By explicitly projecting {_id: 0}, you're telling MongoDB "I don't need _id", so it can satisfy the query entirely from index. If you project {_id: 1, status: 1, amount: 1}, MongoDB must visit the collection to fetch _id, making the query uncovered.

To cover queries that include _id: either (1) project {_id: 0} as you did, or (2) create a compound index that explicitly appends _id: {userId: 1, status: 1, amount: 1, _id: 1}. The second option is not redundant—_id is not implicitly in secondary indexes—and makes _id-returning projections coverable at the cost of a slightly larger index.
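
Both options sketched in mongosh (the `id` variable is a placeholder for a real userId):

```javascript
// Option 1: exclude _id so the existing index covers the query.
db.orders.createIndex({ userId: 1, status: 1, amount: 1 })
db.orders.find(
  { userId: id, status: "completed" },
  { _id: 0, status: 1, amount: 1 }
)

// Option 2: append _id to the index so _id-returning projections are covered too.
db.orders.createIndex({ userId: 1, status: 1, amount: 1, _id: 1 })
db.orders.find(
  { userId: id, status: "completed" },
  { _id: 1, status: 1, amount: 1 }
)
```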

In practice, covered queries are useful for: (1) Queries that don't need _id (analytics, aggregations); (2) Queries that need only indexed fields for filtering+sorting. For typical application queries that need all fields or _id, coverage is hard to achieve—just optimize to use index for filtering (IXSCAN vs COLLSCAN) instead.

Verify coverage: run explain("executionStats") and check "totalDocsExamined". A covered query reports totalDocsExamined: 0 (there is no FETCH stage); recent versions also label the plan with a PROJECTION_COVERED stage above the IXSCAN.

Follow-up: Design a covered index strategy for an analytics dashboard that needs to compute aggregations without touching collection data.

Your orders collection has 50M documents: 90% are status="completed", 10% are other statuses. You run frequent queries for incomplete orders: db.orders.find({status: {$ne: "completed"}}), which must scan past 45M completed documents to find the 5M incomplete ones. You try a partial index on {status: 1} with partial filter expression {status: {$ne: "completed"}}—but the query still runs a COLLSCAN, and depending on the server version createIndex may reject the filter outright. Why?

Two problems. First, partialFilterExpression supports only a restricted set of operators—equality, $exists: true, $gt/$gte/$lt/$lte, $type, $and ($in added in newer versions)—and $ne is not among them, so this filter cannot be expressed directly. Second, even with a valid filter, MongoDB uses a partial index only when the query predicate provably matches a subset of the documents the filter expression covers; the planner does not reason about logical equivalence of inequalities.

Better approach: invert the schema logic. Instead of testing status with $ne, maintain a boolean flag: {_id, status, isCompleted: true/false}. Create a partial index with db.orders.createIndex({isCompleted: 1}, {partialFilterExpression: {isCompleted: false}}). The query becomes db.orders.find({isCompleted: false}), which provably matches the filter expression, so the planner uses the small partial index—an IXSCAN over ~5M entries instead of a 50M-document COLLSCAN.

Alternatively, if you can't change the schema, enumerate the incomplete statuses explicitly (the status values here are hypothetical): db.orders.find({status: {$in: ["pending", "processing", "cancelled"]}}) against a regular index on {status: 1}—an $in of equalities is a set of point lookups in the index. On versions whose partialFilterExpression supports $in, the index itself can be made partial with that same filter to keep it small.

Best practice: use partial indexes when you have a clear, expressible subset of documents (e.g., active users only), and design queries whose predicates provably fall within the filter expression. Verify with explain: the winning plan should show IXSCAN for partial-index queries.
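
A mongosh sketch of the boolean-flag approach, indexing the flag itself (assumes `isCompleted` can be backfilled on existing documents):

```javascript
// Small partial index: only the ~5M incomplete orders get index entries.
db.orders.createIndex(
  { isCompleted: 1 },
  { partialFilterExpression: { isCompleted: false } }
)

// This predicate provably matches the filter expression, so the index is eligible.
db.orders.find({ isCompleted: false }).explain("executionStats")
// Check: winning plan is an IXSCAN, totalKeysExamined close to nReturned.
```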

Follow-up: Design a multi-tenant schema using partial indexes where each tenant's queries only touch their own documents.

You need to support searches on a large text field (product descriptions, ~1KB each) across 10M products. Keyword matching with an unanchored regex forces a COLLSCAN. You create a text index: `db.products.createIndex({description: "text"})`. Query: `db.products.find({$text: {$search: "laptop gaming RTX"}})` returns results in 1 second. However, the results come back in no useful order, and the text index consumes 5GB of disk (50% of collection size). Is text indexing the right approach?

MongoDB text indexes are fine for basic full-text search but have limits: (1) Results are not relevance-ordered by default—the server does compute a score, but you must project it with {$meta: "textScore"} and sort on it explicitly, and the scoring is crude next to a dedicated engine; (2) The index is large (often 50%+ of collection size, being an inverted index over every stemmed term); (3) Language and stemming support is limited compared to dedicated search engines; (4) No fuzzy matching or typo tolerance, and only basic quoted-phrase support.
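
The built-in score can be surfaced and sorted on explicitly; a mongosh sketch:

```javascript
// Project the computed relevance score (it is not stored in documents)
// and sort by it to get relevance-ordered results from the text index.
db.products.find(
  { $text: { $search: "laptop gaming RTX" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })
 .limit(20)
```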

For production search with ranking, use MongoDB Atlas Search (Lucene-based) or Elasticsearch. On self-hosted MongoDB, the alternatives: (1) Sort by the built-in score—project {score: {$meta: "textScore"}} and sort on it—then re-rank in application code if you need popularity or recency blended in; (2) Run a dedicated search service (Elasticsearch or Solr) alongside MongoDB and keep it in sync via change streams; (3) Denormalize search tokens: split the description into tokens at write time, store them as an array, query with $in, and rank by match count in the application.

Example denormalization: `{_id, description, search_tokens: ["laptop", "gaming", "rtx", ...]}` (top ~50 normalized tokens, extracted at write time). Query with `db.products.find({search_tokens: {$in: ["gaming", "rtx"]}})` against a multikey index on search_tokens, then rank in the application by how many query tokens each document matched (optionally blended with a precomputed popularity score). This is cheap and flexible, though you give up stemming and the text index's scoring.
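
The application-side ranking step can be sketched in plain JavaScript (a hypothetical helper, not part of any driver API):

```javascript
// Rank candidate documents (as returned by the $in query) by how many
// query tokens each one matches; drop non-matches, best matches first.
function rankByTokenMatches(docs, queryTokens) {
  const query = new Set(queryTokens.map(t => t.toLowerCase()));
  return docs
    .map(doc => ({
      ...doc,
      matches: doc.search_tokens.filter(t => query.has(t.toLowerCase())).length,
    }))
    .filter(d => d.matches > 0)
    .sort((a, b) => b.matches - a.matches);
}

// Hypothetical candidates from db.products.find({search_tokens: {$in: [...]}})
const docs = [
  { _id: 1, search_tokens: ["laptop", "gaming", "rtx"] },
  { _id: 2, search_tokens: ["laptop", "office"] },
  { _id: 3, search_tokens: ["gaming", "rtx", "cpu"] },
];
const ranked = rankByTokenMatches(docs, ["gaming", "RTX"]);
// docs 1 and 3 match both tokens; doc 2 matches none and is dropped.
```

A precomputed popularity score could be added as a tiebreaker in the `sort` comparator.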

For your case: if full-text search is core to product, migrate to Atlas Search or Elasticsearch. If it's secondary feature, use denormalization approach with pre-computed tokens and relevance scores.

Follow-up: Design a search system that ranks products by relevance, popularity, and recency without using external search engines.

You create indexes on a production collection and notice index size grows over time even though collection size is stable. After 2 months, indexes are 3x larger than when created. Explain why, and how would you optimize?

Index fragmentation: as documents are inserted and deleted, the B-tree structure of the index becomes fragmented—empty blocks accumulate, increasing disk size without increasing functional index size. This is similar to filesystem fragmentation.

Solutions: (1) Compact: `db.runCommand({compact: "collectionName"})` rewrites the collection and its indexes in place, reclaiming WiredTiger space—run it off-peak or on secondaries, since its impact varies by version; (2) Drop and recreate: `db.collection.dropIndex(indexName)` then `db.collection.createIndex(...)` rebuilds the B-tree densely; (3) Monitor fragmentation: watch `db.collection.stats().indexSizes`—if an index's on-disk size keeps growing while the collection size and entry count stay flat, fragmentation is accumulating; (4) Periodic maintenance: schedule the above during off-peak windows to keep indexes healthy. Avoid `db.collection.reIndex()`: it blocks, and since MongoDB 4.2 it can only run on standalone instances.

For production, avoid anything that blocks the primary: compact one SECONDARY at a time (rolling maintenance), wait for it to catch up, then step down the PRIMARY and compact the former primary. Note there is no background reIndex—`reIndex()` takes no background option—and since MongoDB 4.2, `createIndex` builds are already non-blocking by design.
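
A mongosh sketch of the per-node maintenance pass, assuming a collection named `orders` (compact's locking behavior varies by version, so test on a secondary first):

```javascript
// Compare on-disk index sizes against expectations before deciding to act.
const s = db.orders.stats();
printjson(s.indexSizes);   // on-disk bytes per index

// Reclaim space on this node only; repeat per node, secondaries first.
db.runCommand({ compact: "orders" });
```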

Prevention: Avoid creating/dropping indexes frequently. Plan index strategy upfront and maintain relatively stable indexes. Monitor index growth rate and alert if growth > data growth (indicates fragmentation).

Follow-up: Design an automated maintenance job to detect and fix index fragmentation without impacting production queries.

You're designing indexes for a time-series collection with 1B documents (1 year of metrics at 1-second resolution). Queries typically ask for a specific metric over a time range: `db.metrics.find({metric: "cpu_usage", timestamp: {$gte: start, $lt: end}})`. You create compound index {metric: 1, timestamp: 1}. However, as time passes and you keep writing new data, the index becomes increasingly fragmented due to insertions at the "end" of the range. Each query takes longer. How would you optimize this time-series workload?

The issue: inserts are append-only in time but fan out across metrics. In the {metric: 1, timestamp: 1} B-tree, every metric's subtree gains entries at its own timestamp "end" simultaneously, so leaf pages split throughout the tree rather than only at the rightmost edge. At 1B entries the index is far larger than RAM, so those scattered page touches become disk reads, and both inserts and range queries slow down over time.

Better approach: use MongoDB 5.0+ time-series collections, which are built for exactly this workload. Internally they group measurements into compressed buckets per metaField and time window, shrinking both storage and the effective index dramatically; time-range queries typically get an order of magnitude faster. Example: `db.createCollection("metrics", {timeseries: {timeField: "timestamp", metaField: "metric", granularity: "seconds"}})`. Then query as normal—the bucketing is transparent.
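
A fuller sketch (mongosh, MongoDB 5.0+), including retention via `expireAfterSeconds`; the one-year value and field names are illustrative:

```javascript
// Time-series collection: buckets measurements per metric and time window.
db.createCollection("metrics", {
  timeseries: {
    timeField: "timestamp",   // required: the measurement time
    metaField: "metric",      // measurements are bucketed per metric value
    granularity: "seconds",
  },
  expireAfterSeconds: 60 * 60 * 24 * 365,  // auto-delete data older than ~1 year
});

db.metrics.insertOne({ metric: "cpu_usage", timestamp: new Date(), value: 0.42 });
db.metrics.find({
  metric: "cpu_usage",
  timestamp: { $gte: ISODate("2024-01-01") },
});
```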

For non-time-series collections: (1) Partition by time: instead of 1B docs in one collection, create monthly collections (metrics_2024_01, metrics_2024_02, ...)—each month's index is small, and fragmentation stays local to the current month; (2) Shard by metric: `sh.shardCollection("mydb.metrics", {metric: 1})` (the full db.collection namespace is required) distributes metrics across shards, so each shard's index is smaller and less fragmented; (3) Use descending timestamp: {metric: 1, timestamp: -1} can help when queries usually want the most recent values first.

Recommended: migrate to a time-series collection on modern MongoDB versions, then confirm the improvement with `explain("executionStats")`—compare executionTimeMillis and documents examined before and after, and check IXSCAN vs COLLSCAN.

Follow-up: Design a metrics system on MongoDB time-series collections. How would you handle retention (delete data older than 1 year)?

You create an index on {email: 1} to enforce uniqueness: `db.users.createIndex({email: 1}, {unique: true})`. A user with an unset email field inserts successfully. Later, another user with an unset email fails with an E11000 duplicate key error—even though neither document contains an email at all. Explain why, and how would you fix it?

MongoDB indexes a missing field as null. With a unique index on {email: 1}, both documents get the index key email: null, and the unique constraint permits at most one null key—so the second missing-email insert collides with the first.

To fix: (1) Create a sparse unique index: `db.users.createIndex({email: 1}, {unique: true, sparse: true})`. A sparse index skips documents that lack the field, so any number of users may omit email while provided values stay unique; (2) Prefer the partial-index equivalent on MongoDB 3.2+: `db.users.createIndex({email: 1}, {unique: true, partialFilterExpression: {email: {$exists: true}}})`—same behavior, stated explicitly; (3) If email should be mandatory, keep the plain unique index and enforce presence with schema validation: `db.createCollection("users", {validator: {$jsonSchema: {required: ["email"]}}})`.

A sparse (or partial) unique index is the right fix if emails are genuinely optional: many documents may omit email, but a provided value can appear only once. Verify: insert several users with no email (all succeed), then two users with the same email (the second fails with E11000).

For strict uniqueness where every user must have an email: don't use sparse or partial—keep the plain unique index and add schema validation requiring the email field.
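
A mongosh sketch of the partial-index variant (equivalent in effect to sparse for missing fields; the email value is hypothetical):

```javascript
// Unique only where email exists: missing-email documents are not indexed.
db.users.createIndex(
  { email: 1 },
  { unique: true, partialFilterExpression: { email: { $exists: true } } }
)

db.users.insertOne({ name: "a" })                  // ok: no email, not indexed
db.users.insertOne({ name: "b" })                  // ok: still no conflict
db.users.insertOne({ name: "c", email: "x@y.z" })  // ok: first use of this email
db.users.insertOne({ name: "d", email: "x@y.z" })  // fails: E11000 duplicate key
```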

Follow-up: Design a schema that handles optional email for some documents (e.g., social login users who don't provide email) but still enforces uniqueness on provided emails.
