MongoDB Interview Questions

Time Series Collections


You store 1 year of server metrics (CPU, memory, disk, latency) at 1-second granularity: 31.5M documents per server, and 1000 servers means 31.5B documents in total. Your current schema: {serverId, timestamp, cpu, memory, disk, latency}. As a regular collection, this takes 500GB of disk. You migrate to a time-series collection and disk usage drops to 80GB (a 6x reduction). However, your queries are now slower: `db.metrics.find({serverId: "server-1", timestamp: {$gte: yesterday}}).sort({timestamp: -1}).limit(100)` took 100ms before the migration and takes 800ms after. Why?

Time-series collections use internal columnar compression: measurements are stored by field (all CPU values together, all timestamps together), not by document. This reduces storage dramatically but changes the access pattern your queries pay for.

When you query on `{serverId, timestamp}`, the data on disk is laid out as per-field arrays inside compressed buckets: a timestamp array, a cpu array, a memory array, and so on. To return 100 documents, MongoDB must decompress every relevant column, reconstruct the documents from the columns, then sort them. That decompression and reconstruction is expensive, especially when a sort is involved.
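A toy sketch of that reconstruction cost (this is not MongoDB's real bucket format, just an illustration of columnar storage, where rows must be re-zipped from per-field arrays before they can be sorted or returned):

```javascript
// Toy model of columnar bucket storage (NOT MongoDB's internal format):
// each field lives in its own array inside a bucket.
const bucket = {
  timestamp: [1000, 1001, 1002],
  cpu: [42.0, 43.5, 41.2],
  memory: [70.1, 70.3, 70.2],
};

// Returning documents means materializing every column and re-zipping the
// rows; only then can a sort or limit be applied.
function reconstruct(b) {
  return b.timestamp.map((t, i) => ({
    timestamp: t,
    cpu: b.cpu[i],
    memory: b.memory[i],
  }));
}

const docs = reconstruct(bucket);
```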

Query optimization for time-series collections:

1. Query by time range first: constrain on `timestamp` and let MongoDB use the built-in time-series (bucket) index; filtering on `serverId` is cheapest when it is the collection's metaField.
2. Avoid sorts: time-series data is naturally ordered by time, so reversing with `.sort({timestamp: -1})` forces decompression and re-sorting.
3. Use the aggregation pipeline: `aggregate([{$match: {serverId, timestamp: {...}}}, {$limit: 100}])` can be faster than `find()` for large result sets.
4. Pre-aggregate: instead of querying raw 1-second data, store 1-minute aggregates (e.g., average CPU per minute) in a separate time-series collection. Queries on the 1-minute aggregates return 60x fewer documents and are correspondingly faster.

Recommended: (1) Accept slower find() on individual metrics (or optimize queries); (2) Use aggregation for complex queries; (3) Maintain downsampled (1-minute) collection for historical queries.
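The pre-aggregation step can be sketched in plain JavaScript; the sample shape `{timestamp (epoch seconds), cpu}` and the bucketing helper are illustrative, not a MongoDB API:

```javascript
// Sketch: roll raw 1-second samples up into 1-minute averages before
// writing them to a separate aggregate collection.
function downsampleToMinutes(samples) {
  const buckets = new Map(); // minute-aligned epoch -> {sum, count}
  for (const { timestamp, cpu } of samples) {
    const minute = Math.floor(timestamp / 60) * 60;
    const b = buckets.get(minute) ?? { sum: 0, count: 0 };
    b.sum += cpu;
    b.count += 1;
    buckets.set(minute, b);
  }
  return [...buckets].map(([minute, b]) => ({
    timestamp: minute,
    avgCpu: b.sum / b.count,
  }));
}
```

Each output document would then be inserted into the 1-minute aggregate collection, so a 24-hour query touches 1,440 docs per server instead of 86,400.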

Follow-up: Design a metrics storage system that maintains 1-second resolution for 7 days, 1-minute resolution for 30 days, 1-hour resolution for 1 year, with efficient querying across all resolutions.

You create a time-series collection for IoT sensor data: `db.createCollection("sensors", {timeseries: {timeField: "timestamp", metaField: "sensorId", granularity: "seconds"}})`. Data arrives from 10K sensors at roughly 12 messages/sec in aggregate, about 1M docs/day. After 1 month the collection has 30M docs and query performance is good (sub-second for time-range queries). However, raw disk size is still large (50GB), and MongoDB isn't compressing aggressively. Investigation shows that data from different sensors has very different granularity: some sensors send every second, others every 10 seconds. How would you optimize storage?

Time-series compression works best with consistent, fine-grained data. If some sensors send every 1 second and others every 10 seconds, compression effectiveness is reduced (missing values require more bookkeeping). Additionally, mixing high-frequency and low-frequency data in same collection reduces compression ratio.

Storage optimization:

1. Separate collections per frequency: create `sensors_high_freq` (granularity: "seconds") for frequent sensors and `sensors_low_freq` (granularity: "minutes") for infrequent ones. Each collection then holds data with a consistent cadence and compresses better.
2. Keep seconds granularity but pre-process: forward-fill the low-frequency sensors so every second has a value, e.g. `{timestamp: 1704067200, sensorId: "s1", value: 50.0}, {timestamp: 1704067201, sensorId: "s1", value: 50.0}, ...`. This multiplies the document count for those sensors (10x for a 10-second sensor) but makes the columns uniform and highly compressible.
3. Post-collection downsampling: store raw data at fine granularity and downsample to coarser granularity via a background job. Keep raw data for 7 days, 1-minute aggregates indefinitely.
4. Adjust granularity: the `granularity` parameter (seconds, minutes, hours) is a hint for bucket sizing. If a sensor really reports every 10 seconds, `granularity: "minutes"` lets each bucket hold more samples and can compress better.

Recommended: separate high/low frequency sensors. High-freq in one collection, low-freq in another. This provides 2-3x better compression.
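A minimal sketch of the routing decision, assuming the application knows each sensor's reporting interval (the intervals, threshold, and collection names here are illustrative):

```javascript
// Sketch of option (1): pick the target collection from each sensor's
// known reporting interval.
const sensorIntervalSec = { s1: 1, s2: 10 };

function targetCollection(sensorId) {
  const interval = sensorIntervalSec[sensorId] ?? 60; // default: infrequent
  return interval <= 1 ? "sensors_high_freq" : "sensors_low_freq";
}
```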

Follow-up: How would you handle mixed-frequency sensors that vary over time (e.g., sensor starts at 1/sec, then switches to 1/min)?

You delete old data from a time-series collection with `db.metrics.deleteMany({timestamp: {$lt: new Date(Date.now() - 30*24*60*60*1000)}})` to remove everything older than 30 days. The delete completes quickly, and `db.metrics.stats()` shows the data size decreased. However, disk usage on the storage system is unchanged: 30GB is still used instead of dropping to 20GB. Why doesn't deleting from a time-series collection reclaim disk space?

Time-series collections use compressed buckets (internal MongoDB structure). Each bucket contains data from a time window (e.g., 1 hour of data from one sensor). When you delete documents, MongoDB marks space as free internally but doesn't re-compress buckets or reclaim disk space immediately. This is similar to regular collections with WiredTiger (lazy space reclamation).

To reclaim disk:

1. Run compact: `db.runCommand({compact: "metrics"})` rewrites the collection and its buckets, eliminating fragmentation and returning space to the OS. It can take a long time and may impact traffic on the collection while it runs.
2. Use the collection's `expireAfterSeconds` option instead of manual deletes: for time-series collections, retention is configured on the collection itself, e.g. `db.runCommand({collMod: "metrics", expireAfterSeconds: 2592000})` for a 30-day window. A background task removes expired buckets, which is far more efficient than a large `deleteMany`.
3. Use multi-tier storage: keep raw time-series data for 30 days, then auto-migrate to a downsampled collection at 1-hour granularity. Detail for 30 days, aggregates indefinitely.
4. Accept the fragmented space if query performance is acceptable; WiredTiger reuses the freed space for new writes, so reclamation isn't always critical.

Recommended: configure automatic expiration instead of manual deletes. For time-series collections this is the `expireAfterSeconds` collection option, set at creation time: `db.createCollection("metrics", {timeseries: {timeField: "timestamp", metaField: "serverId"}, expireAfterSeconds: 2592000})` (or later via `collMod`).
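A small sketch of how the retention option might be derived; the collection name, fields, and 30-day window are assumptions taken from the scenario:

```javascript
// Sketch: express retention in days and derive the expireAfterSeconds
// value passed when creating the time-series collection.
const daysToSeconds = (days) => days * 24 * 60 * 60;

const createOptions = {
  timeseries: { timeField: "timestamp", metaField: "serverId" },
  expireAfterSeconds: daysToSeconds(30), // 30-day retention window
};
// usage in mongosh: db.createCollection("metrics", createOptions)
```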

Follow-up: Design a data retention policy that keeps raw metrics for 30 days, 1-hour aggregates for 1 year, 1-day summaries for 5 years, all automatically managed.

You query a time-series collection for the latest value from each sensor: `db.metrics.aggregate([{$group: {_id: "$sensorId", latestValue: {$last: "$value"}, latestTime: {$last: "$timestamp"}}}])`. This query scans billions of documents and takes 5 minutes. A time-series collection should be optimized for time-range queries, not group-by-sensor. How would you optimize finding latest values?

The issue: `$group` over all documents requires a full collection scan. Even though the time-series collection stores data in time buckets, grouping by sensorId must examine every bucket; with billions of docs this is slow. Note also that `$last` without a preceding `$sort` depends on document order, so the "latest" value returned isn't even guaranteed to be correct.

Optimization:

1. `$match` before `$group` to bound the time range: `aggregate([{$match: {timestamp: {$gte: now - 1h}}}, {$group: {_id: "$sensorId", ...}}])` cuts the scope from billions of docs to millions.
2. Maintain a "latest" collection: a separate collection `latest_metrics` with one doc per sensor (`{_id: "sensor-1", timestamp: ..., value: ...}`), upserted on every sensor write via application logic, change streams, or triggers. "Get latest" then becomes a point read by `_id`.
3. `$sort` + `$limit` trick (if the data allows it): `aggregate([{$sort: {timestamp: -1}}, {$limit: 10000}, {$group: {...}}])` sorts by timestamp (ideally index-backed), keeps the most recent 10K docs, then groups. This only works if every sensor's latest value falls within those 10K docs.
4. A view is not a shortcut: `db.createView("latest_by_sensor_view", "metrics", [{$sort: {timestamp: -1}}, {$group: {_id: "$sensorId", latestValue: {$first: "$value"}}}])` is a standard, non-materialized view, so every read re-runs the aggregation; to make it fast you would have to materialize the result with a scheduled job.

Recommended: maintain a separate "latest" collection. Write path: `db.metrics.insertOne(doc); db.latest_metrics.updateOne({_id: sensorId}, {$set: doc}, {upsert: true})`. Read latest: `db.latest_metrics.find({_id: sensorId})` (effectively instant). Trade-off: extra write overhead, but the latest-value lookup becomes orders of magnitude (~1000x) faster.
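The write path can be sketched with an in-memory Map standing in for the `latest_metrics` collection (names are illustrative); the timestamp guard keeps late arrivals from overwriting newer values:

```javascript
// Sketch of the "latest collection" pattern.
const latest = new Map(); // sensorId -> most recent document

function recordMetric(doc) {
  // Alongside the raw time-series insert, upsert the per-sensor latest
  // doc; the timestamp guard stops late arrivals overwriting newer data.
  const prev = latest.get(doc.sensorId);
  if (!prev || doc.timestamp >= prev.timestamp) latest.set(doc.sensorId, doc);
}

function latestFor(sensorId) {
  return latest.get(sensorId); // point lookup, no collection scan
}
```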

Follow-up: Design a real-time monitoring dashboard that displays latest metrics from 100K sensors with sub-second response time.

Your time-series collection stores metrics with varying fields: some documents have {cpu, memory, disk}, others have {network, bandwidth}. Queries mix fields: sometimes query cpu, sometimes query network, etc. MongoDB's time-series compression assumes consistent schema. With inconsistent fields, is compression effective? How would you organize this?

Time-series columnar compression works best with consistent fields across all documents. If documents have different fields, compression treats them as sparse data (missing values add overhead). Example: document 1 has {cpu, memory} but no network, document 2 has {network} but no cpu. Compression must track which fields are present per document—less effective.

Schema organization:

1. Separate collections per metric type: `metrics_system` for {cpu, memory, disk}, `metrics_network` for {network, bandwidth}. Each collection has a consistent schema and compresses better; queries simply target the right collection.
2. Nest related fields: `{timestamp, sensorId, system: {cpu, memory, disk}, network: {bandwidth, latency}}` keeps related fields together, so sparsity is confined to whole sub-documents rather than scattered top-level fields.
3. Fill sparse fields explicitly: if a sensor has no network data, insert `{network: null}` to keep the schema consistent. Explicit nulls cost some space but can compress better than per-document field-presence bookkeeping.
4. Put the sensor type in the metaField: `{timestamp, metadata: {sensorId, type: "system"}, system: {...}}`. Buckets are created per unique metaField value, so documents of different types never share a bucket, and queries can filter by type first.

Recommended: separate collections per metric type (system metrics vs network metrics). This provides 2-3x better compression and simpler queries. `db.metrics_system.find()` vs `db.metrics_network.find()` instead of filtering mixed collection.
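A minimal sketch of that per-type routing, deciding by which fields a document carries (the collection names match the ones suggested above; the catch-all is an assumption):

```javascript
// Sketch of option (1): route each document to a collection based on the
// fields it carries, so every collection keeps a consistent schema.
function collectionFor(doc) {
  if ("cpu" in doc || "memory" in doc || "disk" in doc) return "metrics_system";
  if ("network" in doc || "bandwidth" in doc) return "metrics_network";
  return "metrics_other"; // catch-all for unrecognized metric types
}
```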

Follow-up: Design a schema that supports flexible sensor types (CPU, network, disk, temperature, humidity) while maintaining compression efficiency and allowing ad-hoc field additions.

You use a time-series collection for stock price data. Data arrives out of order: the price for 10:30 AM arrives at 10:35, then the 10:28 price arrives at 10:40. Your time-series collection is configured with `granularity: "seconds"`, but MongoDB is rejecting inserts of old timestamps with "Timestamp too far in the past"; the collection expects inserts to be mostly in order. How would you handle out-of-order data?

Time-series collections group measurements into internal buckets, each covering a bounded time window. When the 10:30 price arrives, a bucket is opened for the window starting at 10:30; the later-arriving 10:28 price cannot be appended to that bucket, so out-of-order data forces extra bucket churn and, depending on server version and configuration, may be rejected outright.

Solutions:

1. Use a staging collection: insert out-of-order data into a regular (non-time-series) staging collection first, then periodically sort the batch by timestamp and insert it into the time-series collection in order.
2. Use `ordered: false` on bulk inserts: `insertMany(docs, {ordered: false})` continues past documents that fail (e.g., on timestamp constraints) and inserts the valid ones.
3. Design for expected latency: if data can arrive up to 10 minutes late, account for that in application logic, and always insert with the event timestamp (when the tick occurred), never the arrival time.
4. Keep a separate "late arrivals" collection: the time-series collection for in-order data, a regular collection for stragglers, merged back into the time-series collection by a periodic reprocessing job.
5. Check your server version's behavior: recent MongoDB releases accept out-of-order inserts into time-series collections at the cost of extra buckets and worse compression, so a hard rejection may only apply to older versions or extreme skew.

For financial data, recommended: staging collection. Insert all data (in-order and late) into staging, sort by timestamp daily/hourly, bulk insert into time-series. This ensures time-series is always in-order and optimized.
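The staging approach can be sketched as a buffer that is flushed in timestamp order; in production the flush would be a bulk insert into the time-series collection (the doc shape is illustrative):

```javascript
// Sketch of the staging approach: buffer all arrivals (on time or late),
// then flush them sorted by event timestamp so the time-series collection
// always receives ordered batches.
const staging = [];

function stage(doc) {
  staging.push(doc); // cheap insert into the staging area
}

function flushOrdered() {
  // Drain the buffer and sort by event time; in production this batch
  // would go to insertMany on the time-series collection.
  return staging.splice(0).sort((a, b) => a.timestamp - b.timestamp);
}
```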

Follow-up: Design a multi-tier ingestion system (real-time staging, batch processing, late arrival handling) for streaming financial data with sub-second query requirements.

Your time-series collection has 1 year of data (365B docs). You need to query: "total CPU usage for all servers over the last 24 hours". This aggregation scans billions of docs and takes 30 minutes. You can't afford 30-minute queries in production. Your current aggregation: `aggregate([{$match: {timestamp: {$gte: yesterday}}}, {$group: {_id: null, totalCPU: {$sum: "$cpu"}}}])`. Design a faster solution.

Aggregating 24 hours from 365 days of data requires scanning billions of docs (even with time-series compression). The $group stage is expensive when grouping over huge dataset.

Solutions:

1. Pre-aggregate by hour: maintain an hourly summary collection via a background job, e.g. `hourly_summaries: {_id: "2024-01-15T10:00Z", totalCPU: 5000, count: 1000}`. The query becomes `aggregate([{$match: {_id: {$gte: yesterdayStart, $lt: todayStart}}}, {$group: {_id: null, totalCPU: {$sum: "$totalCPU"}}}])`, scanning 24 docs instead of billions.
2. Maintain running aggregates with `$merge` (available since MongoDB 4.2): run `aggregate([{$match: {...}}, {$group: {...}}, {$merge: {into: "summary_collection", whenMatched: "replace"}}])` on a schedule to keep the summary collection up to date.
3. Time-based partitioning: split data into daily collections (`metrics_2024-01-15`, `metrics_2024-01-16`, ...) and query only yesterday's partition via `db.getCollection("metrics_2024-01-15").aggregate([{$group: {...}}])`, scanning roughly 1B docs instead of 365B.
4. Index the time field: an index on `{timestamp: 1}` (recent server versions also allow secondary indexes that include measurement fields) lets MongoDB jump straight to `timestamp >= yesterday` and scan only the relevant pages.
5. Use an analytics database: offload heavy aggregations to a purpose-built analytics store (BigQuery, Redshift). Stream the time-series data there nightly and run the expensive queries where columnar scans are cheap.

Recommended: pre-aggregate hourly summaries. Run a background job every hour: `aggregate([{$match: {timestamp: {$gte: lastHour}}}, {$group: {...}}, {$merge: {into: "hourly_summaries", whenMatched: "replace"}}])` (note that `$merge` must be the final stage of the pipeline). Query time drops from 30 minutes to under 100ms.
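The core of that hourly rollup job can be sketched in plain JavaScript; in production the summaries would be upserted via the `$merge` stage rather than returned (doc shape is illustrative):

```javascript
// Sketch of the hourly rollup: fold raw docs into one summary per hour.
function rollupHourly(rawDocs) {
  const byHour = new Map(); // hour-aligned epoch -> {totalCPU, count}
  for (const { timestamp, cpu } of rawDocs) {
    const hour = Math.floor(timestamp / 3600) * 3600;
    const s = byHour.get(hour) ?? { totalCPU: 0, count: 0 };
    s.totalCPU += cpu;
    s.count += 1;
    byHour.set(hour, s);
  }
  return [...byHour].map(([hour, s]) => ({ _id: hour, ...s }));
}
```

A "total CPU over the last 24 hours" query then sums `totalCPU` across at most 24 summary docs.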

Follow-up: Design a multi-tier aggregation system (5-min, 1-hour, 1-day summaries) with automated rollup that answers any time-range query in <100ms.
