MongoDB Interview Questions

Sharding Architecture and Chunk Balancing

A sharded MongoDB cluster has 3 shards. Data distribution is unbalanced: shard-0 has 1TB, shard-1 has 500GB, shard-2 has 100GB. The balancer is enabled but hasn't rebalanced anything in 24 hours. Why?

Balancing is complex: MongoDB moves chunks (64MB by default before 6.0, 128MB in 6.0+) between shards based on distribution. Slow balancing happens when: (1) chunk migration is expensive (network or I/O bottleneck), (2) the balancer is restricted to a balancing window, so migrations run only during off-peak hours, (3) the shard key doesn't split well (e.g., jumbo chunks that cannot be moved, or queries still hot on one shard). Solutions: (1) verify the balancer is actually running: `sh.getBalancerState()` and `sh.isBalancerRunning()`; restart it with `sh.stopBalancer()` then `sh.startBalancer()`. (2) increase the chunk size (`db.getSiblingDB("config").settings.updateOne({_id: "chunksize"}, {$set: {value: 128}}, {upsert: true})`): larger chunks mean fewer moves, but less granular distribution. (3) optimize the shard key: if the key doesn't distribute evenly (e.g., country code with 90% US), rebalancing is limited by the data itself. (4) add more shards to spread load. Measure: monitor `db.currentOp()` for active migrations and `sh.status()` for chunk distribution. With a 1TB vs 100GB gap the balancer should be moving chunks; if it isn't, it is likely window-restricted, blocked by jumbo chunks, or migrations are slow. Test: create a new collection with a better shard key and compare distribution.
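The diagnosis steps above can be sketched in mongosh (a sketch against a hypothetical `shop.orders` collection; requires a live sharded cluster):

```javascript
// Is the balancer enabled, and is a balancing round in progress?
sh.getBalancerState();   // true = enabled
sh.isBalancerRunning();  // whether a round is currently running
sh.status();             // per-shard chunk counts, jumbo-chunk flags

// Per-shard data and document counts for one collection:
db.getSiblingDB("shop").orders.getShardDistribution();

// Is balancing restricted to an activeWindow?
db.getSiblingDB("config").settings.find({ _id: "balancer" });
```

If `sh.status()` shows chunks marked `jumbo`, the balancer cannot move them regardless of the imbalance.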

Follow-up: How do you choose an optimal shard key for even data distribution?

After choosing a shard key and sharding a collection, queries are still slow. Most queries hit one shard (90% of traffic goes to shard-0). Shard-0's CPU is maxed while the others sit idle. Is this a shard key problem?

Yes, the shard key likely doesn't distribute queries evenly. Solutions: (1) analyze query patterns: which field is most commonly filtered on? If queries filter by country but the shard key is user_id, every query must scatter to all shards. (2) choose a shard key that matches query patterns: if 90% of queries are `{country: "US", ...}`, sharding by country gives targeted queries (but leads to uneven data distribution). (3) use a compound shard key: `{country: 1, timestamp: 1}` distributes both queries and data better. (4) use a hashed shard key: `{_id: "hashed"}` distributes load randomly, good for even writes, but forces scatter-gather on range queries. Measure: use `db.coll.aggregate([{$group: {_id: "$shard_key_field", count: {$sum: 1}}}])` to see the value distribution. If shard-0 takes 90% of traffic: either queries can be targeted to skip shards, or the shard key needs to change. Resharding is expensive; test a shard key change in staging first.
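Given the output of the `$group` aggregation above (one `{_id, count}` document per shard-key value), a small helper can quantify the skew. This is a sketch; the function name and sample values are our own, not a MongoDB API:

```javascript
// Fraction of total traffic/data attributed to the hottest key value.
// Input: array of { _id, count } documents from the $group stage.
function hotKeyFraction(groups) {
  const total = groups.reduce((sum, g) => sum + g.count, 0);
  if (total === 0) return 0;
  const hottest = Math.max(...groups.map((g) => g.count));
  return hottest / total;
}

// Example: "US" dominates the key space.
const byCountry = [
  { _id: "US", count: 900 },
  { _id: "DE", count: 60 },
  { _id: "JP", count: 40 },
];
console.log(hotKeyFraction(byCountry)); // 0.9
```

A value near `1 / numShards` suggests an even key; a value near 1 means one value dominates and range-sharding on that field alone will not balance.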

Follow-up: How do you reshard a collection without downtime?

A chunk on shard-0 failed to migrate to shard-1 and is now stuck "in-flight". Other chunks can't migrate and the balancer is blocked. How do you recover?

Stuck migrations block balancing. Solutions: (1) identify the stuck migration: `db.currentOp()` on the donor shard shows the active `moveChunk`; `sh.status()` shows which chunk is affected. (2) interrupt it: if a hung resharding operation is the culprit, `sh.abortReshardCollection("db.coll")`; for a hung chunk migration, step down or restart the primary of the stuck shard, which aborts the migration and lets the recipient clean up. (3) force-clear the migration lock in `db.getSiblingDB("config").locks` (risky: only if the migration is truly dead, and mainly relevant on older versions). (4) restart the balancer: `sh.stopBalancer()` then `sh.startBalancer()` may clear state. (5) check the network: migrations often fail due to network timeouts between shards; verify connectivity. Measure: check the balancer logs on the config server primary and the shard mongod logs. For production: a migration stuck >5min is critical; alert and investigate immediately. Prevent: constrain balancing to off-peak hours with a balancing window, e.g. `db.getSiblingDB("config").settings.updateOne({_id: "balancer"}, {$set: {activeWindow: {start: "01:00", stop: "05:00"}}}, {upsert: true})`, so hung migrations don't collide with peak traffic.
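A monitoring check can be built from `db.currentOp()` output. This is a sketch: it assumes you collect the `inprog` array from each shard, and the `desc`/`secs_running` field names match currentOp output on recent versions (verify against your version's output):

```javascript
// Return currentOp entries that look like chunk migrations running
// longer than maxSecs. Threshold and function name are our own.
function findStuckMigrations(inprog, maxSecs) {
  return inprog.filter(
    (op) =>
      typeof op.desc === "string" &&
      op.desc.includes("MoveChunk") &&
      op.secs_running > maxSecs
  );
}

// Example currentOp-style documents.
const ops = [
  { desc: "MoveChunk", secs_running: 600, ns: "shop.orders" },
  { desc: "conn123", secs_running: 2 },
];
console.log(findStuckMigrations(ops, 300).length); // 1
```

Run this on a schedule (e.g., every minute) and page when it returns a non-empty list.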

Follow-up: How do you implement monitoring for stuck migrations?

After resharding a collection from a single shard to 3 shards, query performance degrades by 50%. Queries that were fast on the single shard are now slower. Did resharding break performance?

Resharding can degrade performance if: (1) queries now scatter-gather across all shards instead of targeting one (network and merge overhead), (2) the new shard key forces mongos to merge and sort results in memory, (3) the shard key has poor selectivity (many duplicates), causing uneven distribution. Solutions: (1) compare query plans: run `explain()` before and after; if the plan shows the query hitting multiple shards (scatter-gather), reconsider the shard key. (2) add compound indexes on shard key + query fields: `{shard_key: 1, query_field: 1}` speeds up per-shard execution. (3) ask what the original problem was: if the single shard was the bottleneck, resharding to 3 shards should help under load; if it wasn't, you paid the scatter-gather cost for nothing. Resharding is a tradeoff: even distribution (good for scale) vs query locality (good for latency). Measure: before/after query latency percentiles. If p50 is the same but p99 is better, resharding is working (load variance reduced). If everything is worse, the shard key is suboptimal.
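The before/after comparison comes down to percentiles. A minimal helper (a sketch using the nearest-rank method; in production you would pull these from monitoring tooling rather than hand-rolled math):

```javascript
// Nearest-rank percentile over an array of latency samples (ms).
function percentile(samplesMs, p) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1
  );
  return sorted[idx];
}

// Hypothetical samples: resharding hurt p50 slightly but fixed the tail.
const before = [10, 11, 12, 12, 13, 14, 15, 40, 90, 250];
const after  = [12, 12, 13, 13, 14, 14, 15, 18, 22, 30];
console.log(percentile(before, 50), percentile(before, 99)); // 13 250
console.log(percentile(after, 50), percentile(after, 99));   // 14 30
```

In this (made-up) example, p50 got marginally worse while p99 improved 8x, which is the signature of even load distribution working as intended.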

Follow-up: How do you optimize queries in a sharded environment?

A chunk is being migrated from shard-0 to shard-1. During migration, both shards temporarily hold a copy of the chunk. If a write happens during migration, which shard wins?

During chunk migration: shard-0 (the donor) still owns the chunk; shard-1 (the recipient) is receiving a copy. Writes are routed to shard-0 until the migration commits. The protocol: (1) the recipient copies the chunk's documents, (2) the donor forwards writes that happened during the copy (the transfer-mods phase), (3) the donor enters a brief critical section that blocks writes to the chunk while ownership is committed on the config servers, (4) after commit, shard-1 owns the chunk and shard-0 deletes its copy; a mongos with stale routing info gets a StaleConfig error and transparently refreshes and retries. If the donor crashes mid-migration, the migration aborts, the chunk stays on the donor's replica set, and the recipient cleans up its partial copy. Measure: monitor migration progress with `db.currentOp()`. For safety: run each shard as a 3+ member replica set so the chunk survives node failure, and use `w: "majority"` writes so acknowledged writes survive failover. Best: migrations should be fast; if they are slow, optimize network/disk I/O.
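The ownership rule can be shown with a toy model (illustrative only, not MongoDB's implementation: writes target the donor until the config servers commit the ownership change):

```javascript
// Toy routing: before commit, every write for the chunk goes to the
// donor; after commit, the recipient is the owner.
function routeWrite(migration) {
  return migration.committed ? migration.recipient : migration.donor;
}

const migration = { donor: "shard-0", recipient: "shard-1", committed: false };
console.log(routeWrite(migration)); // shard-0
migration.committed = true;
console.log(routeWrite(migration)); // shard-1
```

The real system adds the transfer-mods phase so writes accepted by the donor during the copy are replayed on the recipient before commit.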

Follow-up: How do you implement safe chunk migrations with write guarantees?

Shard key chosen: `{user_id: 1}`. After sharding, data distribution is: shard-0: 500GB, shard-1: 100GB, shard-2: 50GB. Root cause: the heaviest user has 10k documents while other users average 10. How do you fix it?

The shard key is skewed: a few values dominate. Note that all documents with the same shard key value must live in the same chunk, so a single heavy user creates an unsplittable jumbo chunk. Solutions: (1) use a compound shard key: `{user_id: 1, document_id: 1}` is more granular and lets MongoDB split the heavy user's data across chunks. (2) use a hashed shard key: `{user_id: "hashed"}` spreads distinct users uniformly (but loses sort order, and all documents for the heavy user still hash to the same value, so it does not fix a single dominant user). (3) accept the imbalance: if that user is critical, they effectively get a dedicated shard. (4) pre-split chunks: with a compound key you can split inside the heavy user's range to create more, smaller chunks. (5) analyze access patterns: if that user drives 80% of queries, isolating them on a dedicated shard (e.g., via zone sharding) avoids scatter. Measure: after resharding, verify distribution with `db.collection.aggregate([{$group: {_id: "$user_id", count: {$sum: 1}}}])` and `db.collection.getShardDistribution()`. For even distribution, each shard should hold a similar data size.
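The compound-key plus pre-split approach looks like this in mongosh (a sketch with hypothetical collection, user, and `document_id` values; requires a live sharded cluster):

```javascript
// Shard on a compound key so one user's data is splittable.
sh.shardCollection("app.documents", { user_id: 1, document_id: 1 });

// Pre-split inside the heavy user's range so the balancer has
// multiple chunks it can distribute across shards.
sh.splitAt("app.documents", { user_id: "heavy-user", document_id: 2500 });
sh.splitAt("app.documents", { user_id: "heavy-user", document_id: 5000 });
sh.splitAt("app.documents", { user_id: "heavy-user", document_id: 7500 });
```

With `{user_id: 1}` alone these split points are impossible: every document for `heavy-user` has the identical key value.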

Follow-up: When is it acceptable to have uneven shard distribution?

A collection is sharded on `{region: 1}`. Queries like `{region: "US", status: "active"}` must filter millions of documents on a single shard and are slow. An index on `{region: 1, status: 1}` doesn't help much. Why?

An index helps, but if millions of documents match `{region: "US", status: "active"}`, fetching them still takes time. Solutions: (1) add a more selective field to the shard key: `{region: 1, status: 1}` distributes by status too, but resharding is expensive. (2) use covered queries: if the query only needs indexed fields, mongod can answer from the index without fetching documents. (3) pre-filter: if most queries need status="active", pre-aggregate or cache the results. (4) accept single-shard latency: if region="US" always hits one shard, optimize that shard (more RAM, faster disk). (5) denormalize: if the query is common, pre-compute the result into a separate collection. Measure: check the query plan with `explain()`. A COLLSCAN means the index isn't being used: fix the index. An IXSCAN on `{region: 1, status: 1}` that still examines millions of documents is I/O-bound; and since every region="US" document shares one shard key value, the load can't be spread across shards without a more granular (e.g., compound) key.
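A covered-query sketch in mongosh (collection and field names are assumed): if the index contains every field the query filters on and returns, the fetch stage disappears entirely.

```javascript
// Index covers the filter fields plus the returned field.
db.events.createIndex({ region: 1, status: 1, created_at: 1 });

// Project only indexed fields and exclude _id so the query is covered.
db.events
  .find(
    { region: "US", status: "active" },
    { _id: 0, region: 1, status: 1, created_at: 1 }
  )
  .explain("executionStats");
// In a covered plan, executionStats.totalDocsExamined is 0 and the
// winning plan has no FETCH stage.
```

Note the `_id: 0` projection: `_id` is returned by default, and unless it is part of the index, including it forces a document fetch and breaks coverage.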

Follow-up: How do you optimize single-shard queries that must filter millions of documents?

The balancer is enabled and chunks are moving. During balancing, a write fails with "notPrimary" or "NamespaceNotFound"; after a retry, it succeeds. What's happening?

During chunk migration the chunk's routing metadata changes. A mongos with stale metadata normally gets a StaleConfig error from the shard and retries transparently, but around migrations and failovers the client can still see transient errors like "notPrimary" (an election happened) or "NamespaceNotFound". Solutions: (1) enable retryable writes: `retryWrites=true` in the connection string (the default in modern drivers) retries a failed write once on transient errors. (2) client-side retries: for errors that escape the driver, the app should retry with backoff; these errors are expected and transient. (3) connection pooling: ensure the pool refreshes connections after a failover. (4) writes must go to the primary; reads can use a `secondaryPreferred` read preference to reduce primary load, though secondaries can briefly return orphaned documents during migrations. Measure: monitor balancer logs for migration duration. If migrations take tens of seconds, investigate network/shard health. Transient write failures during balancing are normal; the app should retry automatically.
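A minimal retry-with-backoff helper, sketched as plain JavaScript (the error codes checked and the delays are illustrative; real drivers surface typed errors and already retry many of these for you):

```javascript
// Retry an async operation on transient errors, with exponential backoff.
async function withRetry(op, { attempts = 3, baseDelayMs = 50 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      const transient = ["NotWritablePrimary", "NamespaceNotFound"].includes(err.code);
      if (!transient || i === attempts - 1) throw err; // give up
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}

// Usage: a fake write that fails twice with a transient error, then succeeds.
let calls = 0;
const flakyWrite = async () => {
  calls++;
  if (calls < 3) throw Object.assign(new Error("transient"), { code: "NotWritablePrimary" });
  return "ok";
};
withRetry(flakyWrite).then((res) => console.log(res, calls)); // prints: ok 3
```

Cap total retry time (attempts x backoff) below your request timeout, and never blindly retry non-idempotent writes without retryable-write support, since the operation may have succeeded before the error reached the client.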

Follow-up: How do you implement client-side retry logic for transient write errors during balancing?
