MongoDB Interview Questions

Atlas Search and Full-Text Capabilities


You implement product search on MongoDB Atlas Search: index products collection on name and description fields. Query: `db.products.aggregate([{$search: {text: {query: "wireless laptop RTX", path: ["name", "description"]}}}])`. This returns results, but relevance ranking seems off: a product with "laptop" in the title appears below a product with "wireless laptop RTX" in the description. You expected title matches to rank higher. Design the ranking strategy.

Atlas Search uses Lucene's BM25 relevance scoring. By default all fields are weighted equally, so a document containing all three terms in its description can outscore one with only a partial match in the title, depending on corpus statistics (term frequency, inverse document frequency, field length).

To prioritize title matches: (1) Boost per field at query time with a compound query — each clause can carry its own `score: {boost: {value: N}}`: `{compound: {should: [{text: {query: "wireless laptop RTX", path: "name", score: {boost: {value: 10}}}}, {text: {query: "wireless laptop RTX", path: "description"}}]}}` weights name matches 10x; (2) Use a function score to fold document signals into relevance, e.g. multiply the base relevance score by a stored popularity field: `score: {function: {multiply: [{score: "relevance"}, {path: {value: "popularity", undefined: 1}}]}}`; (3) Separate searches: query the name field and the description field independently, boost the name results, and merge client-side (or with `$unionWith`). Example title-only query: `db.products.aggregate([{$search: {text: {query: "...", path: "name"}}}, {$limit: 100}])`.

Recommended: apply the boost at query time with a compound query — Atlas Search field boosts live in the query's `score` options, not in the index mapping. Keep the index mapping simple: `{"analyzer": "lucene.standard", "mappings": {"dynamic": false, "fields": {"name": {"type": "string"}, "description": {"type": "string"}}}}`
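The query-time boost can be sketched as a small helper that builds the `$search` stage. The 10x boost value and the field names are illustrative assumptions:

```javascript
// Build a $search stage that ranks title (name) matches above description
// matches using compound/should with a per-clause boost.
// The boost value and field names are illustrative, not prescribed.
function buildBoostedSearch(query, titleBoost = 10) {
  return {
    $search: {
      compound: {
        should: [
          // Title clause: same text query, boosted
          { text: { query, path: "name", score: { boost: { value: titleBoost } } } },
          // Description clause: default weight
          { text: { query, path: "description" } },
        ],
      },
    },
  };
}

const stage = buildBoostedSearch("wireless laptop RTX");
// Usage: db.products.aggregate([stage, { $limit: 20 }])
```

Because both clauses are `should`, a document matching only the description still returns — it just scores far below any title match.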

Follow-up: Design a multi-factor ranking system that considers relevance score, product popularity, and recency without hardcoding weights.

Your Atlas Search index has autocomplete feature: as user types "lap", suggest "laptop", "laptop bag", "laptop stand". You implement with a prefix search query. However, with 1000 products and autocomplete queries hitting MongoDB frequently (every keystroke), queries are slow (500ms+) and Atlas Search index is hot. How would you optimize autocomplete?

Autocomplete against the full search index is expensive: every keystroke issues a search query that must match and rank every product containing the prefix. With "lap", Atlas Search finds all products matching "lap" and scores them — costly when repeated on each keystroke.

Optimization strategies: (1) Use the autocomplete field type in Atlas Search: map the field as `"type": "autocomplete"` (edgeGram tokenization). This pre-tokenizes words into prefixes at index time, so the query `{autocomplete: {query: "lap", path: "name"}}` hits pre-indexed prefixes — much faster; (2) Cache autocomplete results: maintain a separate pre-computed collection `autocomplete_suggestions: [{term: "laptop", count: 150}, ...]`. As the user types "lap", query the suggestions collection with a simple filter, not the full search index. Refresh suggestions nightly; (3) Client-side filtering: return the whole catalog once (cached), and filter locally as the user types. Requires downloading the product catalog (~1MB for 1000 products), acceptable for a small retail catalog; (4) Trie structure: maintain a trie on the server or embedded in MongoDB. Prefix lookup is O(length of prefix), very fast. Example lookup: `db.autocomplete_trie.findOne({_id: "l"}).children["a"].children["p"]` returns suggestions; (5) Separate lightweight index: create a second Atlas Search index on just {name} with the autocomplete type and fewer fields. Smaller index, faster queries.

Recommended: use the Atlas Search autocomplete field type. In the index config: `{"mappings": {"dynamic": false, "fields": {"name": {"type": "autocomplete", "tokenization": "edgeGram", "minGrams": 2, "maxGrams": 15}}}}`
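The trie option (4) above can be sketched in plain JavaScript. Here the trie lives in application memory rather than in a MongoDB collection, and the product terms are illustrative:

```javascript
// In-memory prefix trie for autocomplete. Insert terms once at startup,
// then each keystroke is an O(prefix length) walk instead of a
// search-index query.
class Trie {
  constructor() { this.root = { children: {}, terms: [] }; }
  insert(term) {
    let node = this.root;
    for (const ch of term.toLowerCase()) {
      node = node.children[ch] ??= { children: {}, terms: [] };
      // Keep a capped suggestion list at every prefix node
      if (node.terms.length < 10) node.terms.push(term);
    }
  }
  suggest(prefix) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children[ch];
      if (!node) return [];
    }
    return node.terms;
  }
}

const trie = new Trie();
["laptop", "laptop bag", "laptop stand", "lamp"].forEach((t) => trie.insert(t));
trie.suggest("lap"); // ["laptop", "laptop bag", "laptop stand"]
```

Capping the per-node suggestion list keeps memory bounded; in production you would order those lists by popularity when rebuilding the trie.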

Follow-up: If you have 10M products, how would you implement autocomplete with <100ms response time?

You build a support ticket search system using Atlas Search. Search index includes ticket title, description, and resolution. Query: `db.tickets.aggregate([{$search: {text: {query: "ssl certificate error", path: ["title", "description", "resolution"]}}}])`. Results include tickets with "SSL", "certificate", and "error" scattered in different fields. However, you want phrase search: only return tickets mentioning "ssl certificate error" as a phrase (consecutive words). Current results include false positives. Design a phrase search.

The Atlas Search text operator tokenizes the query and matches each term independently. The query "ssl certificate error" matches {title: "ssl issue", description: "certificate revoked", resolution: "error in logs"} — a false positive, since the words are scattered across fields.

Phrase search solutions: (1) Use the phrase operator: `{phrase: {query: "ssl certificate error", path: ["title", "description"]}}` matches the terms as consecutive tokens; path accepts a single field or an array; (2) Use phrase with slop for proximity: `{phrase: {query: "ssl certificate error", path: "description", slop: 2}}` tolerates up to 2 positions of movement between the terms (note the near operator is not text proximity — it applies to number, date, and geo fields); (3) Use a compound query: combine must (all terms) with should (phrase boost): `{compound: {must: [{text: {query: "ssl certificate error", path: ["title", "description"]}}], should: [{phrase: {query: "ssl certificate error", path: ["title", "description"]}}]}}` returns all docs containing the terms but ranks exact-phrase matches first; (4) Regex fallback (slower): `{regex: {query: "ssl.*certificate.*error", path: "description", allowAnalyzedField: true}}`.

For support tickets, recommended: use the phrase operator across both fields: `db.tickets.aggregate([{$search: {phrase: {query: "ssl certificate error", path: ["title", "description"]}}}])`. The phrase must appear as consecutive tokens in at least one of the fields.

Verify: test query on ticket "ssl certificate error in renewal" (should match) vs "ssl renewal certificate error" (phrase broken, shouldn't match if strict phrase needed).
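A minimal helper for building the phrase stage, with an optional slop (slop 0, the operator default, means strict consecutive match; the helper name is illustrative):

```javascript
// Build a $search phrase stage. slop 0 (the default) requires strictly
// consecutive tokens; slop n tolerates up to n positions of movement
// between the terms.
function buildPhraseSearch(phrase, paths, slop = 0) {
  const clause = { query: phrase, path: paths };
  if (slop > 0) clause.slop = slop;
  return { $search: { phrase: clause } };
}

const strictStage = buildPhraseSearch("ssl certificate error", ["title", "description"]);
// Usage: db.tickets.aggregate([strictStage, { $limit: 20 }])
```

With slop 0 this matches "ssl certificate error in renewal" but not "ssl renewal certificate error", matching the verification cases above.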

Follow-up: Design a semantic search system that finds conceptually similar tickets even if exact phrase doesn't match.

Your MongoDB collection has documents with language diversity: English, Spanish, German, French. You create a single Atlas Search index with `"analyzer": "lucene.standard"`. When searching Spanish documents for "teléfono", it doesn't match "telefono" (with accent). When searching German for "München", it doesn't match "Munchen" (umlaut removed). Your searches are language-specific but index doesn't handle accents/diacritics. How would you support multilingual search?

The lucene.standard analyzer lowercases tokens but does not fold diacritics: "teléfono" and "telefono" are indexed as different tokens. To support accent-insensitive search, use an analyzer or token filter that normalizes diacritics.

Solutions: (1) Use language-specific analyzers: Atlas Search ships analyzers for major languages (lucene.english, lucene.spanish, lucene.german, lucene.french). Map each language's fields with the matching analyzer for language-appropriate stemming and normalization; (2) Use a custom analyzer with token filters: define a custom analyzer with an icuFolding (or asciiFolding) token filter that removes diacritics: `{"analyzers": [{"name": "folding", "tokenizer": {"type": "standard"}, "tokenFilters": [{"type": "lowercase"}, {"type": "icuFolding"}]}]}`; (3) Store both normalized and original: at write time, store an accent-stripped copy: `{name: "teléfono", name_normalized: "telefono"}`. Index both fields, query the normalized one; (4) Post-process queries: strip accents from the user query before searching: `const normalized = query.normalize('NFD').replace(/[\u0300-\u036f]/g, '')` (JavaScript example) — this only helps if the indexed text is normalized the same way.

Recommended: use per-field mappings with language-specific analyzers if documents carry language tags. In the Atlas Search index: `{mappings: {dynamic: true, fields: {name_en: {type: "string", analyzer: "lucene.english"}, name_es: {type: "string", analyzer: "lucene.spanish"}}}}`. Queries target the field for the user's language.
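The query-side normalization in option (4) is worth pinning down, since it must mirror whatever the index does:

```javascript
// Strip combining diacritical marks: decompose to NFD, drop the
// combining-mark range U+0300..U+036F. This approximates what an
// asciiFolding/icuFolding token filter does at index time.
// Note: it does NOT handle ligatures such as German ß -> ss.
function stripAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

stripAccents("teléfono"); // "telefono"
stripAccents("München");  // "Munchen"
```

Apply the same function when writing `name_normalized` and when preprocessing queries, so the two sides always agree.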

Follow-up: Design a global search system for documents in 10+ languages that supports typos, accents, and synonyms.

You implement Atlas Search with faceted navigation: search for "laptop" and display facets showing {category: {Electronics: 150, Computers: 200}, brand: {Dell: 100, HP: 50}}. Your aggregation pipeline: `{$search: {...}}, {$facet: {results: [{$limit: 20}], categories: [{$group: {_id: "$category", count: {$sum: 1}}}]}}`. This works but after running for weeks, you notice facet counts are inconsistent with total results. Some facets show 500 products but total results show 300. Why?

$facet runs its sub-pipelines independently over the same $search output: the results sub-pipeline truncates to 20 documents, while the categories sub-pipeline groups every matched document. The numbers drift apart when the data changes between reads, or when the displayed "total results" comes from a different source than the facet $group (for example, the lowerBound count metadata from $search).

There is also a UX mismatch even when nothing is technically wrong: if the search matches 10K docs but you display 20, a facet can legitimately show 500 for a category that appears only 3 times on the visible page — accurate, but confusing to users.

Fix: (1) Apply a limit before faceting: `{$search: {...}}, {$limit: 1000}, {$facet: {...}}` computes facets over the same bounded set users can page through; (2) Use the native facet collector: `$searchMeta` with `facet` computes counts inside the search index in one pass, e.g. `db.products.aggregate([{$searchMeta: {facet: {operator: {text: {query: "laptop", path: "name"}}, facets: {categories: {type: "string", path: "category"}}}}}])` (the faceted field must be indexed as stringFacet); (3) Separate queries: run facet counts and search results as two queries with the same filter — more round trips but a clearer separation; (4) Derive the displayed total and the facet counts from the same response so they cannot disagree.

Recommended: use the $searchMeta facet collector so facet counts come from the search index itself, and derive the displayed total from the same response; otherwise apply the same limit before $facet so counts match what users can actually page through.
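A sketch of building the native facet query; it assumes the faceted fields are mapped as stringFacet in the search index, and the helper name is illustrative:

```javascript
// Build a $searchMeta stage that computes facet counts inside the search
// index, using the same matching logic as the results query. Assumes the
// faceted fields are indexed with type "stringFacet".
function buildFacetMeta(query, facetFields) {
  const facets = {};
  for (const field of facetFields) {
    facets[field] = { type: "string", path: field };
  }
  return {
    $searchMeta: {
      facet: {
        operator: { text: { query, path: ["name", "description"] } },
        facets,
      },
    },
  };
}

const metaStage = buildFacetMeta("laptop", ["category", "brand"]);
// Usage: db.products.aggregate([metaStage]) returns facet buckets with counts
```

Run this alongside the normal `$search` results query with an identical operator, so buckets and results always reflect the same match set.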

Follow-up: Design a faceted search system that scales to 100M documents with sub-second response times and accurate facet counts.

Your product search uses vector search (embedding-based search) alongside Atlas Search. You index product descriptions as embeddings from a machine learning model (OpenAI embeddings). Query: `db.products.aggregate([{$vectorSearch: {index: "vector_index", path: "embedding", queryVector: queryEmbedding, numCandidates: 100, limit: 10}}])` returns the 10 most similar products. However, relevance is sometimes poor: searching "laptop" returns products like "monitor" or "keyboard" that have similar embeddings but aren't laptops. How would you improve semantic search?

Embedding-based search is semantic (meaning-based) but can be too loose. "Monitor" and "keyboard" are computer peripherals with similar embeddings to "laptop". Pure embedding search lacks keyword specificity.

Hybrid search strategy: (1) Combine keyword + semantic search: run a keyword `$search` query and a `$vectorSearch` query separately and merge the ranked lists (e.g. with reciprocal rank fusion, via `$unionWith` or in the application) — `$vectorSearch` must be the first stage of its pipeline and cannot be nested inside a compound operator. The keyword leg eliminates irrelevant peripherals; (2) Filter inside the vector search: `$vectorSearch` accepts a `filter` option on indexed filter fields, e.g. `{$vectorSearch: {..., filter: {category: {$eq: "Electronics"}}}}`, narrowing the semantic search to a relevant category (a `$match` cannot precede `$vectorSearch`); (3) Multi-stage ranking: retrieve candidates by keyword relevance first, then re-rank them by embedding similarity in the application; (4) Custom scoring: combine the keyword BM25 score and embedding similarity with weights: `score = 0.7 * keyword_score + 0.3 * embedding_similarity` (normalize both to [0, 1] first). Adjust weights per domain.

For product search: use hybrid search. Query "laptop" should have keyword match enforced (name/category), embeddings used for ranking. This avoids peripherals in results while maintaining semantic flexibility.
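Merging the keyword and vector result lists — the hybrid option (1) — is commonly done with reciprocal rank fusion. A sketch, where k = 60 is the conventional default and the product IDs are illustrative:

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per
// document (rank is 1-based here via rank + 1). Documents ranked highly
// in BOTH the keyword and vector lists float to the top.
function reciprocalRankFusion(resultLists, k = 60) {
  const scores = new Map();
  for (const list of resultLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const keywordHits = ["laptop-a", "laptop-b", "laptop-c"];
const vectorHits = ["monitor-x", "laptop-a", "laptop-c"];
reciprocalRankFusion([keywordHits, vectorHits]);
// "laptop-a" ranks first: top of the keyword list, second in the vector list
```

RRF needs no score normalization, which is its main appeal: BM25 scores and cosine similarities live on incompatible scales, but ranks are always comparable.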

Follow-up: Design a search system that handles both exact-match queries ("RTX 4090 laptop") and semantic queries ("laptop for gaming") with appropriate ranking.

You deploy Atlas Search with 100 production indexes across dozens of applications. Index build time varies: some rebuild in 5 minutes, others take 1-2 hours. During a critical bug fix requiring an index rebuild across all 100, you push updated index definitions to all of them, triggering rebuilds. After 2 hours, 70 indexes are rebuilt but 30 are stuck at 30% for 4+ hours. You need to roll back but don't know which indexes are stable. Design index build monitoring and rollback strategy.

Index rebuild bottlenecks: large collections (1-10B docs), complex analyzers, network latency to Atlas. Some collections rebuild quickly (1-10M docs), others slowly (1-10B docs). Without monitoring, you can't distinguish stuck (errored) from slow (progressing).

Monitoring strategy: (1) Track build status: Atlas Search indexes are built by the mongot process, so `db.currentOp()` will not show them. Poll `db.collection.getSearchIndexes()` (or the `$listSearchIndexes` aggregation stage, or the Atlas Administration API), which reports a status such as BUILDING, READY, or FAILED. Poll every minute: if an index stays BUILDING with no visible progress well past its historical build time (say, >15 minutes beyond it), flag it as likely stuck; (2) Log rebuild events: record rebuild start and completion times per index so you have per-index baselines; (3) Implement a health check: after a rebuild, run a known query and verify result counts and index size are close to pre-rebuild values. A mismatch indicates a silent failure.

Rollback strategy: (1) Keep the old index serving: when you update an Atlas Search index definition, Atlas builds the new version in the background while the old one keeps serving queries, swapping only when the build completes — a failed build leaves the old index active. For a rename-style migration, drop the old index only after validating the new one: `db.collection.dropSearchIndex("old_index")`; (2) Stage the rollout: rebuild 10 indexes, validate, then the next 10. If a batch fails, you roll back one batch, not the fleet; (3) Shadow index: build the new definition under a suffixed name such as `products_search_v2`, verify it, then switch application queries to it. If issues appear, switch back to `_v1`.

For your case: check build status on the stuck indexes via `getSearchIndexes()` or the Atlas API. A mongot build cannot be killed with `db.killOp()`; if a build is genuinely wedged, drop and recreate the search index (or delete and recreate it via the Atlas UI/API). Investigate the root cause (collection size, mongot CPU or memory saturation, disk pressure) before retrying.
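The stuck-vs-slow classification can be a pure function over status snapshots polled from `getSearchIndexes()` or the Atlas API. The snapshot field names (`timestampMs`, `status`, `progressPct`) are illustrative assumptions — adapt them to whatever your status source actually returns:

```javascript
// Classify an index build from a series of polled snapshots, each shaped
// as { timestampMs, status, progressPct }. A build is "stuck" if it is
// still BUILDING with unchanged progress for longer than stallThresholdMs.
function classifyBuild(snapshots, stallThresholdMs = 15 * 60 * 1000) {
  const last = snapshots[snapshots.length - 1];
  if (last.status === "FAILED") return "failed";
  if (last.status === "READY") return "done";
  // Find the most recent snapshot where progress actually changed
  let lastChange = snapshots[0];
  for (const s of snapshots) {
    if (s.progressPct !== lastChange.progressPct) lastChange = s;
  }
  const stalledForMs = last.timestampMs - lastChange.timestampMs;
  return stalledForMs > stallThresholdMs ? "stuck" : "building";
}
```

Running this per index on every poll cycle gives the triage list directly: "stuck" indexes get dropped and recreated, "building" indexes are left alone.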

Follow-up: Design an automated index management system that detects stalled builds, retries failed indexes, and maintains a rollback log.
