MongoDB Interview Questions

Security and Field-Level Encryption


Your healthcare application stores patient SSNs in MongoDB. HIPAA requires encryption both in transit and at rest. You implement TLS between the application and MongoDB (transit covered), but the SSN values stored in MongoDB are still plaintext at rest: a database admin or a compromised server can read them. Design field-level encryption to meet compliance.

At-rest protection requires encrypting sensitive fields on the client or the server. Solutions: (1) Client-Side Field Level Encryption (CSFLE), available since MongoDB 4.2 (MongoDB 7.0 adds the newer Queryable Encryption): the application encrypts the SSN before sending it—`encryptedSSN = clientEncryption.encrypt(ssn, {algorithm: "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic", keyId})`—and MongoDB stores only the encrypted blob; only a client holding the data key can decrypt. Advantage: the server never sees plaintext. Disadvantage: limited queryability (deterministic encryption supports equality matches, not range queries); (2) Application-level encryption: encrypt the SSN before insert (e.g., `ssn_encrypted = AES_encrypt(ssn, app_key)`), store the ciphertext, decrypt on read. Same idea as CSFLE, but the application owns key management and the crypto code; (3) Server-side encryption at rest: MongoDB Enterprise (and Atlas) encrypt all data on disk with a server-managed key. This protects against disk theft, but not against database admins, who can still query plaintext through the server.

For HIPAA: use CSFLE with deterministic encryption for the SSN field. Code example: `const encrypted = await clientEncryption.encrypt(ssn, {algorithm: "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic", keyId})`. Store the ciphertext; equality queries work directly—`find({encryptedSSN: encryptedValue})`. For any other operation, decrypt in the application.

Follow-up: Design a key management strategy for field-level encryption that rotates keys quarterly without re-encrypting all historical data.

You implement CSFLE with deterministic encryption for SSN to support equality queries. However, every user with SSN "123-45-6789" produces the same encrypted value. An analyst looking at raw MongoDB data can count occurrences of an encrypted blob and infer that a particular SSN appears 500 times (pattern leakage). Worse, an attacker who can feed known SSNs through the application can build a table mapping plaintexts to their ciphertexts and match it against the database. How would you prevent this inference attack?

Deterministic encryption (same plaintext = same ciphertext) enables equality queries but leaks frequency information (ciphertext frequency == plaintext frequency). This is acceptable for some data but not for highly sensitive data like SSN where frequency patterns matter.

Mitigation strategies: (1) Randomized encryption: use `AEAD_AES_256_CBC_HMAC_SHA_512-Random` instead of the deterministic algorithm. Each SSN encrypts to a different ciphertext even for the same value. Trade-off: the server can no longer match on the encrypted field (no `find` by SSN); you must decrypt to compare; (2) Add noise: store decoy encrypted values alongside real ones so ciphertext counts no longer mirror plaintext counts, making frequency analysis unreliable; (3) Order-preserving encryption (OPE): preserves ordering so range queries work, but it leaks that ordering (and deterministic variants still leak frequency), and published attacks have recovered plaintext from OPE ciphertexts—use with caution; (4) Application-level querying: never query on the encrypted SSN in MongoDB; instead decrypt candidate records in application memory and filter there. Slower, but the database leaks no pattern; (5) Structured/searchable encryption: MongoDB 7.0 Queryable Encryption supports equality queries over randomized ciphertexts, removing exactly this leak (fully homomorphic encryption, which computes on arbitrary ciphertexts, remains complex and slow).

For healthcare: use randomized encryption (no server-side queryability on SSN) plus application-level decryption for searches. SSN is rarely a query key (lookups usually go by patient ID). If you must query by SSN, use deterministic encryption and accept the frequency leakage, treating it as low risk in a tightly controlled database environment.

Follow-up: Design an encryption strategy that supports searching on SSN while minimizing frequency leakage in a high-security environment.

Your MongoDB serves two applications: AppA (internal audit tool) and AppB (customer-facing portal). Both query the users collection. AppB should never see user phone numbers (sensitive); AppA should. You implement field-level encryption where phone is encrypted with key_A, and AppB holds only key_app_b (no access to key_A). However, AppB can still query and retrieve the encrypted phone field; decryption fails (wrong key), but AppB now knows the field exists and can observe the ciphertext. How would you prevent unauthorized field access at the database level?

Field-level encryption doesn't provide field-level access control—it only encrypts; AppB can still retrieve the encrypted phone field. To add access control: (1) MongoDB role-based access control (RBAC): built-in privileges are scoped to databases and collections, not individual fields, so RBAC alone cannot hide `phone`—but it controls which collections (or views, below) a role may read at all; (2) Application-level masking: retrieve all fields, then mask phone in the service layer before returning data to AppB: `doc.phone = "***-***-****"`; (3) MongoDB views: create a per-application view—`db.createView("users_for_app_b", "users", [{$project: {phone: 0}}])`—grant AppB's role `find` on the view only (not on the underlying collection), and have AppB query the view; the phone field is excluded server-side; (4) Separate collections: partition users into `users_public` (no phone) and `users_private` (with phone). AppB accesses `users_public` only.

Recommended: views (option 3) combined with RBAC (option 1). Exclude `phone` via `$project` in a view and grant AppB a role with `find` on the view only. MongoDB enforces the projection server-side, so AppB never observes even the ciphertext.
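Option 2 (application-level masking) is the fastest to deploy when you cannot change database grants. A minimal sketch, where the role-to-restricted-fields mapping is an application convention (not a MongoDB feature):

```javascript
// Mask restricted fields in the service layer before a document leaves it.
// Which fields each role may see is app-defined, illustrative data.
const RESTRICTED_FIELDS = { app_b: ["phone"], audit_admin: [] };

function maskForRole(doc, role) {
  const masked = { ...doc }; // shallow copy; never mutate the source document
  for (const field of RESTRICTED_FIELDS[role] ?? []) {
    if (field in masked) masked[field] = "***-***-****";
  }
  return masked;
}

const user = { _id: 1, name: "Ada", phone: "555-123-4567" };
console.log(maskForRole(user, "app_b").phone);       // "***-***-****"
console.log(maskForRole(user, "audit_admin").phone); // "555-123-4567"
```

The weakness, compared to views, is that the unmasked data still travels from MongoDB to the application, so a compromised AppB process sees it; views keep the field from ever leaving the server.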

Follow-up: Design a multi-tenancy schema where each tenant's data is encrypted such that other tenants can't access it, even if they compromise a connection.

Your compliance requirement: an audit trail of who accessed what data. Your MongoDB has encryption at rest. However, there is no built-in audit trail showing which user queried which document. When you run `db.users.find({ssn: "123-45-6789"})`, a default deployment doesn't log the query. A malicious database admin can run extraction queries against SSNs and no audit log records it. Design an audit trail for data access.

MongoDB auditing (Enterprise/Atlas only) can record operations, but a default deployment logs nothing at this level. Solutions: (1) Enable MongoDB Enterprise auditing: configure `auditLog` and set `auditAuthorizationSuccess: true` so successful operations—including queries—are recorded. Ship logs to a centralized service (never only the database host) and review them for anomalies; (2) Application-level logging: wrap every MongoDB call with a log entry—`logger.info({user: userId, collection: "users", query: {...}, timestamp: now})`—before executing it, and send logs to a centralized audit system; (3) Change streams / Atlas Triggers: these fire on writes (insert, update, delete, replace) but not on reads, so they can audit data modification yet cannot record who ran a `find`—don't rely on them for read auditing; (4) KMS access logs for CSFLE: decrypting an encrypted field requires fetching a data key through a KMS, and KMS audit trails (e.g., AWS CloudTrail) record which principal requested which key—a coarse record of who could decrypt sensitive fields; (5) Query proxy: put MongoDB behind middleware that logs every operation before forwarding it (there is no widely used MongoDB equivalent of MySQL's ProxySQL, so this is typically custom).

For production compliance: use application-level logging (#2) for immediate deployment. Application logs all queries with user context before executing. Send logs to centralized audit system (Elasticsearch, Splunk, DataDog). For long-term, migrate to MongoDB Enterprise with audit logs (#1).
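The application-level logging wrapper can be sketched as below. The `runQuery` executor and the in-memory `auditLog` array are stand-ins for a real driver call and a real log shipper (Elasticsearch, Splunk, etc.):

```javascript
// Option 2: every query passes through a wrapper that records who ran what
// BEFORE executing it, so even a failed or killed query leaves a trace.
const auditLog = [];

function auditedFind(userId, collection, filter, runQuery) {
  auditLog.push({
    user: userId,
    collection,
    filter: JSON.stringify(filter), // log the query shape, never decrypted values
    timestamp: new Date().toISOString(),
  });
  return runQuery(collection, filter);
}

// Stub executor standing in for a real MongoDB driver call:
const fakeDriver = (coll, filter) => [{ _id: 1 }];
const docs = auditedFind("analyst-7", "users", { ssn: "<ciphertext>" }, fakeDriver);
console.log(auditLog[0].user); // "analyst-7"
```

The key limitation the scenario already notes: this only covers access *through the application*. A database admin connecting directly bypasses it, which is why Enterprise auditing (#1) is the long-term answer.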

Follow-up: Design an audit system that detects anomalous data access patterns (e.g., admin querying sensitive records outside work hours).

Your financial application stores credit card numbers encrypted with MongoDB FLE. Your payment processor (external API) requires card numbers unencrypted to process charges. Your current flow: (1) retrieve encrypted card, (2) decrypt in application, (3) send to processor (plaintext over HTTPS). This works but exposes plaintext in application memory and network. A compromised application server or network sniffer can capture plaintext card numbers. Design a secure payment flow.

Exposing plaintext card numbers in application memory is a PCI DSS problem (cardholder data should stay encrypted except at the point of use, and every system that touches plaintext falls into PCI scope). Solutions: (1) Tokenization: instead of storing real card numbers, store tokens. When the user adds a card, the processor returns a token (e.g., "tok_1234567890"); store the token in MongoDB (non-sensitive, no encryption needed). To charge, send the token—never the card number—to the processor, which resolves it internally. This removes plaintext cards from the application entirely; (2) Encrypt-to-processor: encrypt the card to the processor's public key at capture time; the application stores and forwards only ciphertext that the processor alone can decrypt (you never decrypt); (3) HSM (Hardware Security Module): keep decryption keys in an HSM; the application never holds keys and only requests operations ("decrypt and charge this card"), which the HSM performs internally—keys never leave the device; (4) End-to-end encryption: encrypt the card in the browser before it reaches the application or MongoDB, with keys the application never holds; to charge, forward the ciphertext to a processor that holds the corresponding decryption key.

PCI DSS compliant: use tokenization (option 1). Store token in MongoDB, never store or decrypt real card. When charging, send token to processor. Card number never touches application memory after initial tokenization.

Follow-up: Design a PCI DSS compliant system for storing and processing credit cards at scale (1000 charges/sec).

You enable MongoDB encryption at rest using AWS KMS (Key Management Service). Keys are managed by AWS, rotated automatically. However, after a security audit, you discover that a database admin with MongoDB root privileges can still bypass encryption: they can call `db.serverStatus()` and see the current encryption key ID (not the key itself, but metadata). Additionally, a rogue admin could restore a backup to a different AWS account (if they had backup access) and AWS KMS in new account would refuse to decrypt (key doesn't exist there). How would you tighten security?

Encryption at rest protects data on disk but doesn't prevent privileged users (database admins with root) from seeing encryption metadata or restoring backups. Solutions: (1) Separate encryption key management from MongoDB: don't let database admins manage or see encryption keys. Use AWS KMS with IAM policies granting only MongoDB process (running as specific service account) access to keys. Database admins can't retrieve keys even if they have root; (2) Use customer-managed keys: generate encryption key yourself, store in HSM or AWS KMS Customer Managed Key (CMK) with strict access policies. Grant only MongoDB service account access, not database admins; (3) Encrypt backups separately: when backing up, use separate encryption key (not the database encryption key). Backups are encrypted twice: once at database level, once during backup. Even if someone restores backup in different environment, backup encryption key (different from database key) is needed; (4) Implement key-per-tenant (for multi-tenant): each tenant has unique encryption key. A compromised backup for tenant A can't be decrypted in tenant B's environment (different key); (5) Audit key access: log all KMS key usage (AWS CloudTrail). If database admin tries to access KMS key directly, audit trail records it.

Recommended: customer-managed KMS keys with restrictive policies. Example KMS key policy statement: `{Effect: "Allow", Principal: {AWS: "arn:aws:iam::<account-id>:role/mongodb-service-role"}, Action: "kms:Decrypt", Resource: "*"}` (in a key policy, `Resource: "*"` refers to the key itself). Only the MongoDB service role can decrypt; database admins cannot. Administer the key from a separate AWS account, isolated from the main deployment account, so no single account compromise exposes both data and keys.
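Spelled out as a full statement inside the key policy's `Statement` array, the idea might look like the following (the account ID and role name are placeholders; adapt the action list to what your encryption path actually calls):

```json
{
  "Sid": "AllowMongoDBServiceRoleOnly",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::111122223333:role/mongodb-service-role" },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
  "Resource": "*"
}
```

Because KMS evaluates the key policy itself, admins who hold broad IAM permissions in the deployment account still cannot use the key unless this policy (or one like it) names their principal, and every use is recorded in CloudTrail.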

Follow-up: Design a security architecture where no single person/account can access unencrypted data, even in emergency scenarios.

Your MongoDB uses field-level encryption for sensitive fields (SSN, email, payment method). Performance impact: each encrypted field costs ~5 ms to encrypt/decrypt. Documents have 100 fields, 20 of them encrypted. Insert latency: 10 ms plaintext, plus 100 ms of encryption (20 × 5 ms). At 100K inserts/sec, that is 100,000 × 0.1 s = 10,000 CPU-seconds of encryption work per wall-clock second—on the order of 10,000 cores doing nothing but crypto. Your CTO asks: is there a way to reduce encryption overhead? Should we encrypt fewer fields?

Encryption overhead is real (cryptographic operations are CPU-bound). At 100K inserts/sec with 100 ms of crypto each, you are burning enormous CPU. Solutions: (1) Encrypt only truly sensitive fields: SSN, payment method, government IDs; skip email if your threat model allows it. Cutting from 20 to 5 fields drops the overhead to 25 ms per insert; (2) Hardware acceleration: modern CPUs provide AES-NI (hardware-accelerated AES); OpenSSL, which MongoDB and most drivers use, exploits it automatically—verify your instance type exposes it. This can cut AES latency several-fold; (3) Batch encryption: encrypt multiple fields per call where your library supports it, amortizing per-operation overhead; (4) Algorithm choice: CSFLE's algorithm is fixed (AES-256-CBC with HMAC-SHA-512 authentication), so you cannot swap it inside FLE—but fields you encrypt at the application level and never query can use AES-256-GCM, which is fast on AES-NI hardware; (5) Scale out: spread the 100K inserts/sec across shards; total CPU is unchanged but each server stays within budget; (6) Accept the latency: 100 ms per insert may be acceptable if inserts are pipelined in parallel—throughput stays high even though per-operation latency rises.

Recommended: encrypt only critical fields (SSN, payment). Reduce encrypted fields from 20 to 5-10. Verify AES-NI is enabled. This should reduce overhead to 25-50ms, acceptable for most systems.
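The back-of-envelope behind this recommendation, under the scenario's stated assumptions (100K inserts/sec, 5 ms of crypto per encrypted field; the 4x AES-NI factor is an assumed, not measured, speedup):

```javascript
// CPU-seconds of encryption work per wall-clock second ~= cores dedicated
// to crypto, as a function of how many fields per document are encrypted.
const insertsPerSec = 100_000;
const msPerField = 5;

function cryptoCoresNeeded(encryptedFields) {
  return (insertsPerSec * encryptedFields * msPerField) / 1000;
}

console.log(cryptoCoresNeeded(20));     // 10000 cores at 20 encrypted fields
console.log(cryptoCoresNeeded(5));      // 2500 cores at 5 fields
console.log(cryptoCoresNeeded(5) / 4);  // 625 with an assumed 4x AES-NI speedup
```

The arithmetic makes the CTO's question concrete: field count is a linear multiplier on crypto CPU, so trimming the encrypted-field list is the single highest-leverage change before any tuning.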

Follow-up: If you must encrypt 20+ fields and maintain <50ms insert latency, how would you redesign the system?
