MongoDB Interview Questions

Schema Design Patterns and Anti-Patterns


You design a schema for a social media platform: users have profiles, posts, comments, and likes. Initial design embeds everything: {_id, name, posts: [{content, likes: [userId1, userId2, ...], comments: [{text, author, likes: [userId1, ...]}]}]}. After 6 months, a power user has 50K posts, each with 10K likes and 5K comments. Querying that user's profile now takes 30 seconds. Explain what's wrong and redesign.

The fundamental issue: you've embedded unbounded arrays. MongoDB has 16MB max document size, and even before hitting that, loading 50K posts with all likes/comments into memory is slow. Each query to fetch the user also fetches all posts, all likes, and all comments—a massive payload for what should be a quick profile fetch.

Redesign using collection separation: (1) Users collection: `{_id, name, bio}`; (2) Posts collection: `{_id, userId, content, createdAt}`; (3) Likes collection: `{_id, postId, userId, createdAt}`; (4) Comments collection: `{_id, postId, userId, text, createdAt}`. Now fetching user profile is O(1) by _id. Fetching user's posts: `db.posts.find({userId}).limit(20)` (paginated). Fetching post details with likes/comments: use $lookup in aggregation or application joins.

Schema with bounded arrays: if you must embed, use a "hybrid" pattern: embed only frequently-accessed, bounded data. Example: `{_id, name, recentPosts: [{_id, title}], postCount, likeCount}`. Store the 5 most recent posts inline and the full posts in a separate collection. Update the recentPosts array whenever the user publishes (bounded to 5), without touching the other 50K post documents.
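The bounded-array update in that hybrid pattern can be sketched in plain JavaScript (field names are illustrative; the shell equivalent uses `$push` with `$each`/`$slice` and is shown in the comment):

```javascript
// Sketch of the hybrid pattern's bounded-array update (field names are
// illustrative). The shell equivalent would be:
//   db.users.updateOne({_id: userId}, {
//     $push: {recentPosts: {$each: [{_id: postId, title: title}], $slice: -5}},
//     $inc: {postCount: 1}
//   })
// The same $push + $slice semantics in plain JavaScript:
function pushBounded(arr, item, limit) {
  // $slice: -limit keeps only the last `limit` elements after the push
  return arr.concat([item]).slice(-limit);
}

let recentPosts = [{ _id: 1 }, { _id: 2 }, { _id: 3 }, { _id: 4 }, { _id: 5 }];
recentPosts = pushBounded(recentPosts, { _id: 6 }, 5);
// recentPosts now holds posts 2..6; the oldest entry was dropped
```

Because the array can never grow past 5 entries, the user document stays small no matter how many posts exist in the posts collection.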

Verification: After migration, confirm: (1) User profile query takes <100ms; (2) Listing posts uses index on {userId, createdAt}; (3) No document exceeds 5MB; (4) Aggregation joins use $lookup efficiently (check explain for IXSCAN on foreign key).

Follow-up: How would you migrate from embedded design to separated collections without downtime? What happens during migration?

You design a product catalog schema: {_id, name, categories: ["electronics", "computers"], tags: ["laptop", "gaming", "RTX"]}. Your application frequently filters by category and tag combinations. You create a single compound index {categories: 1, tags: 1} hoping queries like db.products.find({categories: "electronics", tags: "gaming"}) use it. However, MongoDB reports the query uses a COLLSCAN, not IXSCAN. Why?

The trap here is MongoDB's parallel-array restriction: a compound multikey index may contain at most one array field. With categories = ["electronics", "computers"] and tags = ["laptop", "gaming", "RTX"], the index would need an entry for every combination of elements (2 x 3 = 6 here), and MongoDB refuses to build that. `createIndex({categories: 1, tags: 1})` fails with "cannot index parallel arrays" if any existing document has arrays in both fields, and even if the index were built while one field was scalar, later writes with both fields as arrays are rejected. So the most likely reason for the COLLSCAN is that the index was never actually created.

The first thing to verify is whether the index exists at all: (1) Run `db.products.getIndexes()`; if creation failed, the compound index will simply be missing; (2) Run `db.products.find({categories: "electronics", tags: "gaming"}).explain("executionStats")` and check "rejectedPlans" to see whether MongoDB considered an index but rejected it.

Other reasons an index goes unused: $or branches that are not individually indexed, or inconsistent data types: if categories is a string in some documents and an array in others, the multikey index behaves inconsistently across them.

Fix: (1) Ensure each field is consistently an array type; (2) Create separate single-field indexes `{categories: 1}` and `{tags: 1}`; each is a valid multikey index on its own, and MongoDB will pick the more selective one (or occasionally use index intersection); (3) If filtering complexity grows, use text search: create a text index `{categories: "text", tags: "text"}` and query with `$text: {$search: "electronics gaming"}`.
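The restriction is worth seeing concretely: MongoDB rejects a compound multikey index when any single document has arrays in more than one indexed field ("cannot index parallel arrays"). A plain-JavaScript sketch of that check (hypothetical helper, not a driver API):

```javascript
// Hypothetical illustration of MongoDB's "cannot index parallel arrays" rule:
// a compound index is invalid when one document holds arrays in more than one
// of the indexed fields.
function violatesParallelArrays(doc, indexedFields) {
  const arrayFields = indexedFields.filter((f) => Array.isArray(doc[f]));
  return arrayFields.length > 1;
}

const product = {
  name: "GPU Laptop",
  categories: ["electronics", "computers"],
  tags: ["laptop", "gaming", "RTX"],
};

// {categories: 1, tags: 1}: both fields are arrays in this doc, so rejected
const rejected = violatesParallelArrays(product, ["categories", "tags"]);
// {categories: 1, name: 1}: only one array field, so allowed
const allowed = !violatesParallelArrays(product, ["categories", "name"]);
```

This is why the two single-field multikey indexes above are the safe default for array/array filter combinations.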

Follow-up: How would you design the schema and indexes for faceted search (filtering by multiple categories simultaneously)?

You build a hierarchy schema for a company org chart. Employees have managers: {_id: "emp1", name: "Alice", managerId: "emp2"}. To find all employees under a manager recursively, you use $graphLookup which takes 5 seconds for a 100-person department. Your VP of Engineering complains that loading the org chart on the web UI is too slow. What schema patterns would speed this up?

$graphLookup performs a breadth-first traversal, one lookup pass per level of depth, so deep hierarchies get slow. Three patterns to optimize: (1) Materialized path: store the path as a string, e.g., "/CEO/VP-Eng/Manager/". The query becomes `db.employees.find({path: {$regex: "^/CEO/VP-Eng/Manager/"}})`, an anchored prefix regex that turns into an efficient range scan on an index over path; (2) Nested set model: store left and right boundaries, e.g., {_id, left: 10, right: 25}. Descendants have left > 10 and right < 25, so the query is a range scan: `db.employees.find({left: {$gt: 10, $lt: 25}})` with IXSCAN; (3) Closure table: maintain a separate collection {ancestorId, descendantId, depth}. Each insert adds O(depth) rows, but queries become trivial: `db.closure.find({ancestorId: managerId})` for all descendants at any depth, or `db.closure.find({ancestorId: managerId, depth: 1})` for direct reports only.

For a 100-person org, materialized path is simplest: store a path like "/ceo/engineering/alice" and query by index prefix. When an employee moves, you rewrite the path of that employee and everyone in their subtree (for a leaf employee, that's a single document). Finding descendants is `db.employees.find({path: {$regex: "^/ceo/engineering/"}})`, which hits the index.
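The materialized-path lookup can be sketched in plain JavaScript (illustrative data; in MongoDB the same filter is an anchored `$regex` that can use an index on `{path: 1}`):

```javascript
// Materialized-path lookup sketch. In MongoDB the equivalent query is
//   db.employees.find({path: {$regex: "^/ceo/engineering/"}})
// which uses an index on {path: 1} because the regex is an anchored prefix.
const employees = [
  { _id: "ceo", path: "/ceo" },
  { _id: "vpEng", path: "/ceo/engineering" },
  { _id: "alice", path: "/ceo/engineering/alice" },
  { _id: "vpSales", path: "/ceo/sales" },
];

function descendantsOf(prefix) {
  // Trailing "/" excludes the node itself and siblings with a shared prefix
  return employees.filter((e) => e.path.startsWith(prefix + "/"));
}

const engTeam = descendantsOf("/ceo/engineering");
// engTeam contains only alice
```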

Trade-offs: materialized path denormalizes the hierarchy (subtree paths must be rewritten on a move); nested set is complex to maintain under churn; closure table adds write overhead on every insert and move. For the VP's org chart, where reads dominate and moves are rare, either materialized path or a closure table works well.
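Deriving the closure table from plain managerId pointers can be sketched as follows (illustrative; the output rows would be bulk-inserted into a `db.closure` collection):

```javascript
// Rebuild a closure table from parent pointers (illustrative sketch).
// Each output row {ancestorId, descendantId, depth} records that
// ancestorId is `depth` levels above descendantId.
function buildClosure(employees) {
  const parentOf = new Map(employees.map((e) => [e._id, e.managerId]));
  const rows = [];
  for (const e of employees) {
    // Walk up the management chain, emitting one row per ancestor
    let ancestor = parentOf.get(e._id);
    let depth = 1;
    while (ancestor) {
      rows.push({ ancestorId: ancestor, descendantId: e._id, depth });
      ancestor = parentOf.get(ancestor);
      depth += 1;
    }
  }
  return rows;
}

const org = [
  { _id: "ceo", managerId: null },
  { _id: "vp", managerId: "ceo" },
  { _id: "alice", managerId: "vp" },
];
const closure = buildClosure(org);
// closure rows: (ceo→vp, 1), (vp→alice, 1), (ceo→alice, 2)
```

Rebuilding from scratch like this is also a reasonable strategy after a bulk reorganization: drop and repopulate the closure collection in one pass rather than patching rows incrementally.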

Follow-up: Implement the closure table pattern. How would you efficiently rebuild it after a bulk reorganization?

You design a versioning schema for documents: {_id, data, version: 1}. When data changes, you increment version and store: {_id, data, version: 2}. After 1 year with daily updates, your document has version: 365. You now run a query to find all completed payments: db.payments.find({status: "completed"}). This query is slow, even though the version field has nothing to do with the filter. Should you use a separate history collection?

Two separate problems are tangled here. The slow query has nothing to do with versioning: `{status: "completed"}` needs an index on status, so start with `db.payments.createIndex({status: 1})`. The versioning design itself has real issues, though: (1) incrementing version in place overwrites the document, so old versions are lost unless stored elsewhere; (2) each update rewrites the entire document (and a growing document invites storage fragmentation); (3) there is no way to query across historical versions at all.

Better schema: (1) Use a separate history collection: `db.payments` stores current state, `db.payment_history` stores all versions. When a payment changes, insert a history record and update the current document. Querying current state stays fast; history is retrieved only when needed: `db.payments.findOne({_id})` for the current state vs `db.payment_history.find({paymentId})` for all versions; (2) If version history is small (<10 entries), store it in a bounded embedded array: `{_id, status, versions: [{v: 1, status: "pending", at: ...}, {v: 2, status: "completed", at: ...}]}`, capped with `{$push: {versions: {$each: [newVersion], $slice: -10}}}` to keep only the last 10 versions.

The current design (storing the full document per version) is "full history" and is storage-inefficient. Better: store only deltas (field-level changes). Example: `{_id, status: "completed", changes: [{field: "status", from: "pending", to: "completed", at: ts}]}`. Queries against current state work the same, and storage can be an order of magnitude smaller when most updates touch a single field.
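Reconstructing state from deltas, which also gives you point-in-time reads, can be sketched as follows (the change-log shape here is illustrative):

```javascript
// Delta-based history sketch: rebuild the document state as of a given
// version by replaying field-level changes in order.
function stateAtVersion(initial, changes, version) {
  const state = { ...initial };
  for (const c of changes) {
    if (c.v > version) break; // changes are ordered by version
    state[c.field] = c.to;
  }
  return state;
}

const initial = { status: "pending", amount: 100 };
const changes = [
  { v: 2, field: "status", to: "authorized" },
  { v: 3, field: "status", to: "completed" },
];

const atV2 = stateAtVersion(initial, changes, 2); // status: "authorized"
const atV3 = stateAtVersion(initial, changes, 3); // status: "completed"
```

The same replay works with timestamps instead of version numbers, which is the basis for "show me status as of Jan 1" style queries.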

Follow-up: Design a schema that supports efficient point-in-time queries ("show me status as of Jan 1") without storing full history per version.

You design a multi-tenant SaaS platform schema: {_id, tenantId, data}. You store all users from 1000 tenants in a single collection. To ensure isolation, you add a unique compound index {tenantId: 1, email: 1} so each tenant can have only one user per email. However, you discover that queries are still slow even though you're filtering by both tenantId and email. Analysis shows that queries use COLLSCAN instead of index. Why might this happen?

With a compound index on {tenantId: 1, email: 1} and a filter on both fields, MongoDB should use an IXSCAN, so something in the premise doesn't hold. Likely culprits: (1) The query filters on email without tenantId: `db.users.find({email: "test@example.com"})` cannot use the compound index at all, because email is not a prefix of it; that needs a separate `{email: 1}` index; (2) Data type mismatch: if tenantId is stored as a string in some documents and an ObjectId in others, an equality filter of one type silently misses the other; (3) The index was never actually created, or was dropped. Note that a filter like `db.users.find({tenantId, someOtherField})` can still use the index's tenantId prefix (IXSCAN plus a residual filter), so that alone doesn't explain a COLLSCAN.

Best practice for multi-tenant: create multiple indexes strategically. Index {tenantId: 1, email: 1} for tenant+email queries, index {tenantId: 1, createdAt: -1} for listing users by tenant, maybe index {email: 1} for cross-tenant lookups if needed. Don't over-index—each index adds write overhead.

Verify: Run `db.users.find({tenantId: tid, email: email}).explain("executionStats")`. If the winning plan's stage is "COLLSCAN", the index isn't being used. Check `db.users.getIndexes()` to confirm the compound index exists; if it does, examine "rejectedPlans" to see whether MongoDB considered it but chose a different plan.

Alternative: give each tenant its own collection or database, or shard the collection on tenantId. That buys natural isolation at the cost of operational complexity (index management across many namespaces, routing). For around 1000 tenants, a single collection with proper indexing is simpler.
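One way to enforce tenant scoping at the application layer is to route all query construction through a guard (hypothetical helper, not a driver feature):

```javascript
// Hypothetical guard: every filter built through this helper is forced to
// carry the caller's tenantId, so cross-tenant reads can't be expressed.
function tenantScoped(tenantId, filter) {
  if (!tenantId) throw new Error("tenantId is required");
  if ("tenantId" in filter && filter.tenantId !== tenantId) {
    throw new Error("cross-tenant filter rejected");
  }
  // tenantId first, matching the {tenantId: 1, ...} compound index prefixes
  return { tenantId, ...filter };
}

const q = tenantScoped("t42", { email: "test@example.com" });
// q is {tenantId: "t42", email: "test@example.com"}
```

Pairing a guard like this with code review rules (no raw `find` calls outside the data layer) closes most accidental-leakage paths.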

Follow-up: How would you design a multi-tenant schema to prevent tenant data leakage? What query patterns should you enforce?

You're storing user profiles: {_id, name, email, settings: {theme: "dark", notifications: true, language: "en"}}. Some users have detailed settings (10 fields), others minimal. You query for users with theme="dark": db.profiles.find({"settings.theme": "dark"}). This uses an index on {"settings.theme": 1}. However, the query returns 500K matches out of 10M users. Your aggregation pipeline then filters these 500K users further, taking 30 seconds total. How would you optimize?

The query uses index correctly, but filtering 500K documents through aggregation is slow. Root cause: your predicates aren't selective enough. If 5% of all users have theme="dark", you're retrieving too much data.

Optimization: (1) Add more specific index: instead of just {"settings.theme": 1}, use compound index {"settings.theme": 1, "settings.language": 1, createdAt: -1} if your aggregation also filters by language and date. This reduces result set before aggregation; (2) Use $match early in aggregation to filter before expensive $group/$lookup stages. Move $match to the beginning of pipeline to reduce documents flowing through; (3) Add a "bucket" or "hash" field for query optimization: add {userSegment: hash(theme, language) % 100} to pre-filter into buckets, query only relevant buckets; (4) Denormalize common queries: maintain {_id, theme, language, ...} at top level, aggregate settings in separate nested object only if needed.

For this case: add index {"settings.theme": 1, "settings.language": 1} and include language in your match: `db.profiles.find({"settings.theme": "dark", "settings.language": "en"})` should return much fewer docs. Verify with explain—if nReturned is now 50K instead of 500K, aggregation will be 10x faster.
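A minimal in-memory model of why the extra predicate helps (the data and percentages here are synthetic, not from the scenario):

```javascript
// Synthetic population: ~5% of users have theme "dark", and half of those
// also have language "en". Filtering on both predicates up front shrinks
// the set flowing into later pipeline stages.
const users = [];
for (let i = 0; i < 1000; i++) {
  users.push({
    theme: i % 20 === 0 ? "dark" : "light", // 50 of 1000 are "dark"
    language: i % 40 === 0 ? "en" : "fr",   // 25 of 1000 are "en"
  });
}

// Equivalent of find({"settings.theme": "dark"}): 50 matches
const byTheme = users.filter((u) => u.theme === "dark");

// Adding the language predicate up front halves the working set: 25 matches
const byThemeAndLang = users.filter(
  (u) => u.theme === "dark" && u.language === "en"
);
```

With a compound index covering both fields, that reduction happens inside the IXSCAN instead of in a later pipeline stage, which is where the real speedup comes from.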

Follow-up: Design a schema for user settings that scales to 100 user preferences without creating a bloated document or dozens of indexes.

You implement a denormalization pattern: store user.name in both the users collection and in comments collection (for display efficiency). After 6 months, a user changes their name from "Alice" to "Alison". You update the users collection, but forget to update comment records. Now when you display comments, old comments show "Alice" but the user profile shows "Alison"—inconsistency. How do you maintain denormalized data without this divergence?

Denormalization creates consistency problems. Solutions: (1) Use change streams to sync denormalized copies: listen on the users collection for name changes, then update all comments with the matching userId. For update events you must request the full document explicitly: `db.users.watch([{$match: {operationType: "update"}}], {fullDocument: "updateLookup"}).on("change", (change) => { db.comments.updateMany({userId: change.fullDocument._id}, {$set: {userName: change.fullDocument.name}}) })`; (2) Accept eventual consistency: denormalized data may be temporarily stale. Add a TTL or version to denormalized fields and refresh from the primary source after, say, 24h; (3) Denormalize only immutable or rarely-changing data: store the immutable userId in each comment and fetch user details at read time (slower but always consistent); (4) Use events/messaging: when a user updates their name, publish an event to a message queue, and a separate worker updates all denormalized copies asynchronously.

Best practice: denormalize only for performance-critical paths and accept eventual consistency. For your comment case, store userId in comment (immutable), fetch user.name separately or embed user details at write time and refresh on read (using cache). Avoid storing mutable data in denormalized form unless you have automated sync.
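The read-time join for the comment case can be sketched as follows (an in-memory Map stands in for the users collection lookup; names and the "[deleted]" fallback are illustrative):

```javascript
// Read-time join sketch: comments store only the immutable userId; display
// names always come from the current users data, so a rename never leaves
// stale names behind.
const usersById = new Map([
  ["u1", { name: "Alison" }], // renamed from "Alice"; comments stay correct
]);

const comments = [
  { userId: "u1", text: "First!" },
  { userId: "u1", text: "Nice post" },
];

function withAuthorNames(list) {
  return list.map((c) => ({
    ...c,
    userName: usersById.get(c.userId)?.name ?? "[deleted]",
  }));
}

const display = withAuthorNames(comments);
// Every comment shows the current name "Alison"
```

In MongoDB this is a `$lookup` stage or a batched `find` on the collected userIds, typically fronted by a cache to keep read latency low.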

Follow-up: Design a change stream listener that maintains denormalization across 5 related collections without creating circular updates.

You design a schema for storing hierarchical configuration: {_id, app: "myapp", version: 1, config: {db: {host: "localhost", port: 27017}, cache: {ttl: 3600}}}. Your application loads this at startup. When you scale to 1000 microservices all loading config at the same time, MongoDB sees a thundering herd: 1000 concurrent queries for config. You add caching on the application side, but cache invalidation on update becomes complex. Design a schema that scales better.

Thundering herd problem: all instances query MongoDB simultaneously on startup, causing a load spike. Solutions: (1) Use a publish-subscribe pattern: microservices connect to a config server (which can be backed by MongoDB change streams), and the server pushes config updates to subscribers instead of each instance polling; (2) Implement hierarchical defaults: each microservice stores only its overrides from a parent/default config, queries a small overrides document, and applies it to the defaults in-app; (3) Put a caching layer in front: the config server caches computed config in Redis, microservices hit the cache first (99%+ hit rate), and MongoDB sees only occasional misses; (4) Version-based cache busting: store a version number with the config; microservices cache locally keyed by version, check the version with a single tiny metadata query, and invalidate only when it differs.

Schema redesign: Split config into stable (rarely changes) and dynamic (changes frequently). Store separately: `db.config_base` (immutable, updated weekly) and `db.config_overrides` (per-service, updated frequently). Microservice loads base once at startup (cached), queries overrides for its service ID (small document, cached). On override change, only that service's cache invalidates—no thundering herd.

Example: the base config (say ~1MB) is loaded once per instance and cached for the process lifetime; each service's override document is ~1KB, fetched by serviceId and cached locally. A startup wave of 1000 instances then reads only small, indexed documents, and subsequent config checks are a single tiny version query, so there is no load spike.
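The version-check flow can be sketched in plain JavaScript (the `server` object stands in for MongoDB here; in practice the version check is a tiny indexed query against a metadata document):

```javascript
// Version-based cache busting sketch: the instance caches config locally and
// performs a full fetch only when the stored version number changes.
let fetches = 0;
const server = { version: 7, config: { db: { host: "localhost", port: 27017 } } };

const cache = { version: null, config: null };

function getConfig() {
  // Cheap check: compare cached version against the server's version field
  if (cache.version !== server.version) {
    fetches += 1; // full config fetch happens only on a version change
    cache.version = server.version;
    cache.config = server.config;
  }
  return cache.config;
}

getConfig();        // cold cache: one full fetch
getConfig();        // version unchanged: served from local cache
server.version = 8; // config updated on the server
getConfig();        // version differs: one more full fetch
// fetches is 2
```

The design choice worth calling out: the expensive payload is fetched at most once per version per instance, so 1000 instances polling a one-field version document is a far lighter load than 1000 full config reads.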

Follow-up: Implement a config server system on MongoDB with push-based invalidation to 1000 microservices. How would you handle network failures?
