MongoDB Interview Questions

Document Size Limits and Workarounds


Your MongoDB document stores a list of customer orders: {_id, customerId, orders: [{orderId, items: [{...}, {...}, ...]}, ...]}, where a high-value customer has 100K orders with thousands of line items each. The document size reaches 25MB, exceeding MongoDB's 16MB document size limit. Inserts fail with "document exceeds maximum size". How would you refactor the schema?

The 16MB limit applies per document in MongoDB. Embedding 100K orders in one customer document violates it. Refactoring options:

1. Collection per entity type: split into separate collections: `customers {_id, customerId, name, ...}` and `orders {customerId, orderId, items: [...]}`. Query becomes: find the customer, then find orders where `customerId = X`.
2. Normalize deeply: split orders into `{_id, customerId, orderId}` and `order_items {_id, orderId, items: [...]}`. Further reduces document size.
3. Pagination/chunking: if you must embed orders, split them into pages of, say, 1000 orders each: `{_id: "customer_1_page_1", customerId, orders: [...]}`. Multiple documents then store chunks of the order list.
4. Lazy loading: store only recent orders inline, `{_id, customerId, recentOrders: [last 10 orders], totalOrderCount}`, and fetch the full order history on demand from a separate collection.
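The pagination approach in option 3 can be sketched as follows. This is a minimal illustration, not a driver API; the function name, id format, and `pageSize` default are assumptions.

```javascript
// Split a large orders array into fixed-size "page" documents, each of
// which stays comfortably under the 16MB limit.
function paginateOrders(customerId, orders, pageSize = 1000) {
  const pages = [];
  for (let i = 0; i < orders.length; i += pageSize) {
    pages.push({
      _id: `customer_${customerId}_page_${pages.length}`, // deterministic page id
      customerId,
      page: pages.length,
      orders: orders.slice(i, i + pageSize), // at most pageSize orders per doc
    });
  }
  return pages;
}
```

Each page document can then be inserted individually, and a customer's full history is the result of querying all of that customer's page documents in order.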

Recommended: option 1 (separate collections). Orders are naturally queried independently (all orders by status, by date range, etc.), so a separate collection makes sense. Schema: `customers: {_id, name, ...}`, `orders: {_id, customerId, orderId, items, status, createdAt}`. Add an index on `customerId` for fast per-customer lookups.

Follow-up: If your application frequently accesses customer + recent orders together (50% of queries), how would you optimize for this access pattern without violating 16MB limit?

You store large media files: `{_id, userId, fileName, metadata: {...}, fileData: <binary>}`. MongoDB's 16MB limit is hit when storing the raw binary `fileData`. Your current workaround splits each file into chunks `{_id, uploadId, chunkIndex, data: <binary>}`; to retrieve a file, you fetch all chunks and concatenate them in memory. This works, but a 50MB file forces 50MB into application memory, and with 100 concurrent downloads you risk OOM (5GB for 100 files). Design a better approach.

Storing large blobs directly in MongoDB isn't ideal. Solutions:

1. GridFS: MongoDB's GridFS automatically splits large files into chunks (255KB by default), stored in two collections (`fs.files` for metadata, `fs.chunks` for data). Retrieval streams chunks, avoiding loading the entire file into memory. GridFS handles chunking transparently: `bucket.openUploadStream("file.bin").end(largeBinary)` to write, `bucket.openDownloadStream(fileId).pipe(res)` to read (streaming, not buffering).
2. External storage (recommended): store files in S3/GCS and keep only a reference in MongoDB: `{_id, userId, fileName, s3Key: "uploads/user123/file123", metadata: {...}}`. Retrieval: query MongoDB for the `s3Key`, then fetch directly from S3. Decouples blob storage from MongoDB and scales better.
3. Compression: gzip the file before storing; a text-heavy 50MB file may shrink to 5-10MB. Store compressed chunks, gunzip on retrieval.
4. Streaming: even with chunks in MongoDB, stream instead of buffering: fetch one chunk at a time and pipe it to the output stream.

Recommended for production: S3 + MongoDB metadata. Upload: the browser uploads directly to S3 via a presigned URL (bypassing the application server), then the S3 reference is stored in MongoDB. Download: query MongoDB for the `s3Key` and redirect the browser to a presigned S3 URL. MongoDB carries zero blob-storage burden.

Follow-up: Design a system that stores 1 billion media files (totaling 100TB) with MongoDB metadata and S3 storage, supporting fast retrieval and cost optimization.

You have a document with nested objects reaching 15.9MB (just under the 16MB limit). An application bug tries to add extra data, which would push the document to 16.1MB; MongoDB rejects the write with "document exceeds maximum BSON size". The document is now effectively stuck: any update that grows it fails, and it is one bad write away from blocking legitimate changes. How do you recover?

MongoDB never lets a stored document exceed 16MB; instead, every write that would push it over the limit is rejected, leaving the document frozen just under the cap. Recovery options:

1. Shrink in place: updates that reduce size still succeed. `$unset` or trim the bloated fields (`updateOne({_id}, {$unset: {hugeField: ""}})`), or `replaceOne` with a sanitized version.
2. Export and refactor: export the oversized document, split it externally into smaller documents, and re-import: `mongoexport --query '{"_id": {"$oid": "..."}}' ... > export.json`, refactor `export.json` into multiple docs, then `mongoimport`. This recovers the data without MongoDB limiting you.
3. Delete and recreate: if you can afford a rollback, delete the document and re-insert a sanitized version without the extra data that caused the overflow.
4. Rebuild the collection: dump the collection, filter or split the oversized document, re-import into a new collection, and drop the old one.

Prevention: monitor document size before writes. Estimate the new size before inserting/updating, e.g. `estimatedSize = bson.calculateObjectSize(doc)`. If the estimate exceeds ~14MB, don't proceed (leaving a ~2MB buffer for growth). Implement the size check in the application: `if (docSize + newDataSize > 14_000_000) throw new Error("Document too large")`.
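A minimal sketch of this guard, using JSON length as a rough stand-in for BSON size (a real implementation would use the driver's BSON serializer, e.g. `bson.calculateObjectSize`; the function name and threshold here are assumptions):

```javascript
const SOFT_LIMIT = 14_000_000; // headroom below the 16MB hard limit

// Throws if the document's approximate serialized size exceeds the soft
// limit; otherwise returns the estimate.
function assertDocumentFits(doc) {
  const approxSize = Buffer.byteLength(JSON.stringify(doc));
  if (approxSize > SOFT_LIMIT) {
    throw new Error(`Document too large: ~${approxSize} bytes`);
  }
  return approxSize;
}
```

Calling this before every insert/update turns a hard, unrecoverable write failure into an application-level error you can handle gracefully.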

Follow-up: Design a system that automatically splits oversized documents before they hit the 16MB limit, without requiring application changes.

Your schema embeds user addresses: {_id, userId, addresses: [{street, city, state, zip, country, metadata: {...}}, ...]}, where power users have 1000+ addresses. Each address is ~200 bytes, so 1000 addresses = 200KB. With other user data, document is 300KB (safe under 16MB). However, your queries are slow when fetching user with large address list: `find({_id}).project({addresses: 1})` loads entire 200KB address array, then application filters to find a specific address. Why is this slow, and how would you optimize?

Loading the entire 200KB address array into memory and then filtering in the application is inefficient. Optimizations:

1. Server-side projection: `find({_id}, {addresses: {$elemMatch: {city: "New York"}}})` filters on the server. Note that an `$elemMatch` projection returns only the first matching element; to return all matches, use `$filter` in an aggregation (option 4).
2. Multikey index: create `db.users.createIndex({"addresses.city": 1})`, so `find({"addresses.city": "New York"}, ...)` uses the index to locate matching documents quickly.
3. Separate collection: if address queries are common and complex, move addresses to their own collection: `{_id, userId, city, state, ...}`. Query: `find({userId, city: "New York"})`. Simpler queries, and addresses can be paginated easily.
4. Aggregation: `aggregate([{$match: {_id}}, {$unwind: "$addresses"}, {$match: {"addresses.city": "New York"}}, {$group: {_id: "$_id", matchedAddresses: {$push: "$addresses"}}}])` (or a single `$project` with `$filter`) returns all matching addresses, computed entirely on the server.

Recommended: `$elemMatch` projection + an index on `addresses.city` when one matching address is enough; use `$filter` in an aggregation when you need every match. Code: `db.users.createIndex({"addresses.city": 1}); db.users.find({_id}, {addresses: {$elemMatch: {city: "New York"}}})`
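When all matching addresses are needed (not just the first), the aggregation pipeline can be built as a plain object and passed to `collection.aggregate()`. This is a sketch; the function name and `_id` shape are assumptions.

```javascript
// Build a pipeline that keeps ONLY the addresses matching `city`,
// filtered server-side with $filter (unlike an $elemMatch projection,
// which returns just the first match).
function addressFilterPipeline(userId, city) {
  return [
    { $match: { _id: userId } },
    {
      $project: {
        addresses: {
          $filter: {
            input: "$addresses",
            as: "addr",
            cond: { $eq: ["$$addr.city", city] }, // keep matching city only
          },
        },
      },
    },
  ];
}
```

Usage would be `db.users.aggregate(addressFilterPipeline(userId, "New York"))`; only the filtered array crosses the network.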

Follow-up: If users can have millions of addresses (exceeding 16MB with just addresses), how would you redesign for scalability?

Your MongoDB document, approaching 16MB, contains objects nested 5 levels deep: `{level1: {level2: {level3: {level4: {level5: {data: [...]}}}}}}`. When you query with the projection `{"level1.level2.level3.level4.level5.data": 1}`, MongoDB still reads the entire 16MB document, parses the nested structure, and returns only the deep field (100KB). The I/O and parsing overhead makes queries slow. How would you optimize deeply nested access?

Projecting a deeply nested field doesn't save server-side work: even though only a small field is returned, MongoDB must deserialize the entire document to navigate the 5 levels. Optimizations:

1. Flatten the schema: restructure to `{_id, level5_data: [...]}` instead of the 5-level nesting. Access becomes direct, with no deep traversal.
2. Aggregation `$project`: `aggregate([{$match: {_id}}, {$project: {data: "$level1.level2.level3.level4.level5.data"}}])` plucks the nested field and is sometimes more efficient than `find()` with a projection.
3. Denormalize hot fields: copy the deep field to the top level: `{_id, level5_data: [...], level1: {level2: {...}}}`. Index `level5_data` if queried, and read it directly.
4. Separate collection: if level-5 data is queried independently, move it out with a reference: parent `{_id, userId, level5_id}` and `level5_collection: {_id, data: [...]}`. Queries hit the small dedicated collection, which is faster.
5. Pre-serialized blob: if the nested structure is read-only, store it pre-serialized as binary: `{_id, nested_binary: bsonEncode({level1: ...})}` and deserialize in the application only when needed.

Recommended: flatten top 2-3 levels, keep nesting for truly hierarchical data only. If level5 is queried 90% of the time, move to top-level or separate collection.
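The denormalization step (option 3) amounts to lifting a deep field to the top level during a migration pass. A minimal sketch; the function name, path format, and field names are illustrative:

```javascript
// Copy the value at a dotted path (e.g. "level1.level2.level3") to a new
// top-level field, leaving the original nesting intact.
function promoteDeepField(doc, path, topLevelName) {
  const value = path
    .split(".")
    .reduce((obj, key) => (obj == null ? undefined : obj[key]), doc);
  return { ...doc, [topLevelName]: value }; // shallow copy with the promoted field
}
```

A migration would run this over each document and persist the result, after which queries read `level5_data` directly instead of traversing five levels.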

Follow-up: Design a schema that supports 10 levels of nesting with efficient queries on any level without violating 16MB limit.

Your documents store arrays with hundreds of thousands of elements: `{_id, sensorId, readings: [{timestamp, value}, ...]}`, putting the document right at the 16MB limit. Updating a single reading in the middle of the array with `updateOne({_id}, {$set: {"readings.500000.value": 99}})` fails: the rewritten document would exceed 16MB, even though the change is small. How do you handle updates on huge arrays?

When updating an array element, MongoDB must rewrite the entire document. If rewritten doc > 16MB, update is rejected. This is problematic for append-only workloads (e.g., sensor readings that grow indefinitely).

Solutions:

1. Array chunking: split the large array across multiple documents: `{_id: "sensor_1_chunk_0", readings: [elements 0-999]}`, `{_id: "sensor_1_chunk_1", readings: [elements 1000-1999]}`, etc. Updates target one small chunk document (well under 16MB); reads over the full series query all chunks.
2. Capped collection with one document per reading: `db.createCollection("readings", {capped: true, size: 100000000, max: 1000000})` auto-deletes the oldest readings when capacity is reached, bounding total collection size.
3. Time-series collection: for timestamp/value data on MongoDB 5.0+, use `db.createCollection("readings", {timeseries: {timeField: "timestamp", metaField: "sensorId"}})`. MongoDB buckets the readings internally, so no application-level document grows without bound.
4. Per-period collections: instead of one array, create per-day collections: `readings_2024_01_15`, `readings_2024_01_16`. Writes target the current day's small collection; historical access queries multiple day collections.
5. Append-only model: insert a new document per reading instead of updating an array: `insertOne({sensorId, timestamp, value})`. Query the latest N: `find({sensorId}).sort({timestamp: -1}).limit(N)`.
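The chunk-addressing arithmetic behind option 1 can be sketched as follows. The id format and function name are illustrative assumptions:

```javascript
// Map a global reading index to the chunk document that holds it and the
// position inside that chunk's readings array.
function locateReading(sensorId, index, chunkSize = 1000) {
  const chunk = Math.floor(index / chunkSize);
  return {
    chunkId: `${sensorId}_chunk_${chunk}`, // _id of the chunk document
    offset: index % chunkSize,             // index within that chunk's array
  };
}
```

With this, the failing update becomes a small write against one chunk, e.g. `updateOne({_id: loc.chunkId}, {$set: {[`readings.${loc.offset}.value`]: 99}})`, and the rewritten document stays far below 16MB.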

Recommended: use time-series collection (#3) or append-only model (#5). Avoid updating arrays within docs that grow indefinitely.

Follow-up: Design a sensor data storage system that supports 1M sensors, each sending 100 readings/sec, with efficient storage and query performance.

You store documents with a large binary field: `{_id, fileId, binaryData: <10MB binary>}`. Queries on this collection are slow: `find({fileId})` takes 500ms even though `fileId` is indexed. Analysis shows MongoDB is reading and returning the entire 10MB binary field even though the query doesn't need it. Projecting it out, `find({fileId}, {binaryData: 0})`, drops latency to 10ms. However, the application sometimes needs `binaryData`. How would you design for both scenarios?

Including large binary fields in query results forces MongoDB to read and transfer them even when they aren't needed, hurting latency for every query that doesn't explicitly exclude the field.

Design patterns:

1. Separate collection for binary: split into `files_metadata: {_id, fileId, filename, size}` and `files_data: {fileId, binaryData}`. Metadata queries never touch the binary (the metadata collection is small, so queries are fast); query `files_data` only when the binary is needed.
2. External storage: keep the binary in S3 and a reference in MongoDB: `{_id, fileId, s3Key: "..."}`; fetch from S3 on demand.
3. Compression: `compressed = zlib.compress(binaryData)` (10MB -> 2-3MB) shrinks the BSON; decompress on retrieval.
4. Chunking with an index: split the binary into `{_id, fileId, chunkIndex, data: <binary>}` documents; a query on `fileId` returns chunk references only, and you fetch just the chunks you need.
5. Default projection: make the application's query layer apply `{binaryData: 0}` by default, fetching `binaryData` only via an explicit opt-in query.
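The default-projection wrapper from option 5 can be sketched as a thin helper over a driver-style collection. `findFile` and `includeBinary` are made-up names for illustration; `collection` is anything exposing a `find(filter, options)` method:

```javascript
// Query files excluding the heavy binary field unless the caller
// explicitly opts in, so the fast path is the default.
function findFile(collection, filter, { includeBinary = false } = {}) {
  const projection = includeBinary ? {} : { binaryData: 0 }; // exclude by default
  return collection.find(filter, { projection });
}
```

Callers that only need metadata pay ~10ms; the rare caller that needs the blob writes `findFile(files, {fileId}, {includeBinary: true})` and accepts the cost knowingly.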

Recommended: separate collections (#1). Metadata collection is tiny (1KB docs), queries are instant. When binary needed, explicit second query to files_data collection (same design as option #2, but using MongoDB internal storage instead of S3).

Follow-up: Design a document storage system that handles 100M files (100TB total) with fast metadata queries and efficient binary retrieval on-demand.
