Your chat system (WhatsApp-like) serves 100M users with 50M online at peak. Messages are delivered via WebSocket to online clients. But users report that 2-3% of messages don't arrive on first send—they have to resend manually. Investigation shows the issue affects users on mobile networks (LTE) more than WiFi. What's your diagnostics plan and fix?
Message loss on mobile networks is typically caused by: (1) connection drops (LTE disconnects mid-message), (2) server-side buffering issues, (3) missing ACKs from client, or (4) browser/OS killing the WebSocket. Diagnostics: (1) Pull message IDs from failed sends in logs. Cross-reference with server delivery logs—did the server receive the message? Did it send an ACK? (2) Check WebSocket connection metrics: monitor connection drops by network type (WiFi vs cellular). If mobile has 10x more drops, it's a network issue. (3) Analyze message lifecycle: Client sends -> Server receives -> Server stores -> Server broadcasts to recipient's WebSocket -> Client receives ACK. At which step do 2-3% drop? Use request IDs and trace all steps. (4) Check for missing retry logic: if the client sends a message and the WebSocket connection drops before the ACK, does the client retry? If not, the message appears lost. Root causes, in order of likelihood: (a) Client doesn't retry on connection drop. (b) Server doesn't ACK reliably (server crashes before ACKing). (c) Recipient's WebSocket drops just before receiving the message. For fixes (immediate): (1) Client-side: implement exponential backoff retry. Client sends message, waits for server ACK (message_id confirmation). If timeout (3s) or connection drop, retry up to 3 times. Store pending messages locally (IndexedDB). When connection restores, flush pending messages. (2) Server-side: durability. Before sending to recipient's WebSocket, persist message to database. Send WebSocket delivery. If delivery fails (recipient offline), message waits in queue. When recipient connects, deliver from queue. (3) Add explicit ACKs: client must send ACK after receiving message. Server retries delivery until ACK received. (4) Mobile-specific: implement background re-sync. When app wakes up (user opens chat again), sync all messages from last 30 minutes. Fills in any gaps. Testing: (1) Chaos test: kill WebSocket mid-message, verify retry works. 
(2) Network simulation: throttle to 3G speeds, introduce 10% packet loss, verify no message loss. (3) Scale test: simulate 10M concurrent WebSockets, verify server handles connection churn. Typical result: with client-side retry + server durability + explicit ACKs, message delivery reaches 99.99%. Mobile vs WiFi gap closes to < 0.1%.
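The client-side retry from fix (1) can be sketched as follows. This is a minimal sketch: `transport.send` is a hypothetical stand-in for the real send-and-wait-for-ACK call over the WebSocket, and the 3 s timeout, 3 retries, and exponential backoff come from the fix above. The key detail is that the idempotency token is generated once, before the first attempt, so the server can deduplicate retries.

```python
import time
import uuid

MAX_RETRIES = 3
ACK_TIMEOUT_S = 3.0

def send_with_retry(message_body, transport, sleep=time.sleep):
    """Send a message, retrying with exponential backoff until ACKed.

    The idempotency token is generated once, before the first attempt,
    so server-side dedup sees the same token on every retry.
    """
    token = str(uuid.uuid4())
    delay = 1.0
    for attempt in range(1 + MAX_RETRIES):
        try:
            ack = transport.send(message_body, token, timeout=ACK_TIMEOUT_S)
            if ack is not None:   # server confirmed with a message_id
                return ack
        except ConnectionError:
            pass                  # connection dropped mid-send; retry
        sleep(delay)
        delay *= 2                # 1s, 2s, 4s backoff
    # Out of retries: caller should persist to local storage (IndexedDB
    # on web clients) and flush when the connection is restored.
    raise TimeoutError("message not ACKed; queue locally and flush on reconnect")
```

Because the token is stable across retries, a duplicate delivery caused by "server received but ACK was lost" is collapsed server-side rather than shown twice to the recipient.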
Follow-up: If the recipient is offline, how long should you queue messages? At what point do you drop or archive them?
Your group chat feature allows users to create chats with up to 1000 members. When a user sends a message to a 1000-person group, the broadcast takes 45 seconds to reach all members. Users complain that some members get the message minutes later. What's the bottleneck, and how do you redesign for fast group delivery?
Broadcasting to 1000 concurrent WebSocket connections is expensive—you're doing 1000 writes in sequence or parallel. If sequential (waiting for each to complete), that's 45ms per delivery * 1000 = 45 seconds. If parallel, it's bottlenecked by your server's I/O capacity. Architecture redesign: (1) Message broker (Kafka/Redis): instead of directly pushing to 1000 WebSockets, publish message to a pub/sub topic. Each connected client subscribes to the group topic. Broker handles fan-out. This is near-instant for the publisher. (2) Shard connections by group: assign each group to a connection shard (server A handles group 1, server B handles group 2). When a message arrives for group 1, only shard A is involved. Reduces contention. (3) Async delivery queue: push message to a queue for each member. A background worker pool pulls from queues and delivers to WebSocket connections. This decouples message ingestion from delivery. (4) Tiered delivery: deliver first to members connected to the current shard (< 100ms). Offline members are queued for later. Bandwidth-limited members have the message placed in their offline queue rather than pushed immediately. (5) Read replicas for presence: maintain a presence index (who's currently connected to group X). Store in Redis. On message, look up presence, only push to online members. Offline members are automatically queued. Implementation: (1) Use Redis pub/sub for intra-group messaging. When a message is sent to a group, PUBLISH to a Redis channel. All connected servers listening to that channel receive it immediately. They then deliver to their local WebSocket clients. (2) Use a persistent queue (Kafka) for messages to offline members. Background consumer: fetch messages for offline users, wait for them to come online, deliver. (3) Optimize delivery: batch pushes (send 50 messages in one HTTP/WebSocket frame instead of 50 frames). (4) Scale WebSocket servers: if you're at capacity, spin up new servers. Use a load balancer to route new connections. 
Test: with Redis pub/sub + batch delivery, you can broadcast to 1000 members in < 500ms (95th percentile). The key insight: use message brokers and async patterns. Never do synchronous 1000-way fan-out directly from the API server.
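The pub/sub fan-out from implementation step (1) can be modeled with an in-memory broker. A sketch, not real Redis: `Broker` and `WsServer` are hypothetical stand-ins whose point is that the publisher does a single PUBLISH regardless of group size, while each WebSocket server delivers only to its own local connections.

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis pub/sub channels."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # channel -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        # One PUBLISH; the broker handles fan-out to subscribed servers.
        for cb in self.subscribers[channel]:
            cb(message)
        return len(self.subscribers[channel])

class WsServer:
    """Stand-in for a WebSocket server holding local connections."""
    def __init__(self, name):
        self.name = name
        self.local_connections = defaultdict(list)  # group_id -> member ids
        self.delivered = []                          # (member, message) pairs

    def attach(self, broker, group_id):
        # On a channel message, deliver to this server's local members only.
        broker.subscribe(
            f"group:{group_id}",
            lambda msg: self.delivered.extend(
                (m, msg) for m in self.local_connections[group_id]))
```

The publisher's cost is one broker write per message; the 1000-way fan-out is absorbed by the broker and the per-server delivery loops, which run in parallel across servers.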
Follow-up: Redis pub/sub messages are lost when no one is listening (they aren't persisted). How do you ensure offline members get the group message when they reconnect?
Your chat app reports presence status (online, away, offline, typing). A user sends a "typing..." indicator for 30 seconds, but the receiver's app crashes before receiving it. While the sender is still typing, the receiver's app recovers. Should the receiver see the "typing" status? Design a presence system that handles app crashes gracefully.
Presence is soft real-time state that's OK to be slightly stale, but should be consistent with reality. The issue: if a receiver crashes, they miss the "typing" status. When they reconnect, should they see it? Logic: (1) Typing indicators should have a timeout (e.g., 3 seconds). Sender must refresh every 3 seconds or it auto-clears. (2) On receiver reconnect, don't replay old typing status. Instead, query current state: is the sender still typing? (3) Architecture: store presence in a distributed cache (Redis) keyed by (user_id, chat_id, presence_type). E.g., "user:123:typing:chat:456" = timestamp. (4) Presence heartbeat: every 10 seconds, connected clients send a heartbeat. Server updates Redis. If no heartbeat for 30 seconds, assume user is offline. (5) On crash/reconnect: (a) Client notifies server "I'm back". Server queries Redis for all active presence states in their chats. Only return recent ones (< 10 seconds old). (b) Receiver's app receives "typing" status if and only if the timestamp is fresh. (6) Sending typing indicator: Client sends "typing" event. Server stores in Redis with current timestamp. Broadcasts to all recipients in the chat. On receiver's app: if you receive "typing" for user X at time T, display it for 10 seconds. After 10 seconds, clear it (auto-expire). (7) Graceful degradation: if Redis is slow/unreachable, fall back to best-effort delivery. Some typing indicators might be lost, but core chat functionality continues. Implementation: (1) Presence service: separate microservice that manages Redis state. API: POST /presence/{user_id}/typing sets the typing state with a 10-second TTL. (2) Presence cache: roughly 10 MB of Redis per 1M users (store only the last presence update, not history). (3) Garbage collection: hourly job cleans up stale presence entries (> 1 hour old). Testing: crash receiver's app, sender sends typing indicator, verify receiver doesn't see it. App recovers, sender is still typing, verify receiver sees current state. 
This design ensures presence is always consistent with reality, never stale beyond 10 seconds.
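The freshness rule (report a typing indicator only while its timestamp is under 10 seconds old) reduces to a few lines. A sketch: `PresenceStore` is a stand-in for the Redis keys described above, with an injectable clock so the expiry logic can be tested without waiting.

```python
import time

TYPING_TTL_S = 10.0

class PresenceStore:
    """In-memory stand-in for the Redis presence cache.

    Each (user_id, chat_id) typing entry carries a timestamp; queries
    report typing only while the entry is fresh, which is exactly the
    behavior a reconnecting client sees when it asks for current state
    instead of replaying old events.
    """
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}   # (user_id, chat_id) -> timestamp

    def set_typing(self, user_id, chat_id):
        # Sender refreshes this periodically; stale entries auto-expire.
        self.entries[(user_id, chat_id)] = self.clock()

    def is_typing(self, user_id, chat_id):
        ts = self.entries.get((user_id, chat_id))
        return ts is not None and self.clock() - ts < TYPING_TTL_S
```

In real Redis the same effect comes from writing the key with a TTL (e.g. SET with EX), so the store expires entries itself instead of filtering on read.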
Follow-up: How would you implement "read receipt" (where the receiver explicitly marks the message as read)? What are the data consistency challenges?
Your chat system uses end-to-end encryption (E2E). Messages are encrypted client-side, server stores ciphertext, client decrypts on receive. But your monitoring system can't see message content for abuse detection (spam, harmful content). Design a system that supports E2E encryption while enabling server-side content moderation.
E2E encryption vs. moderation is a classic tradeoff. True E2E means server can't see content, so you can't moderate. Several approaches, each with tradeoffs: (1) Client-side moderation: device downloads moderation rules (ML model, keyword list), runs moderation locally before sending. Upside: privacy preserved. Downside: user can bypass rules (decompile app). (2) Homomorphic encryption: messages encrypted such that server can perform moderation computations without decryption. Upside: full E2E + moderation. Downside: computationally expensive, not production-ready for chat scale. (3) Escrow keys: user encrypts with a key, but escrows the key with server (encrypted with server's public key). On report of abuse, server decrypts the key, accesses plaintext. Upside: balance of privacy + accountability. Downside: complex key management. (4) Trusted execution environment (TEE): messages decrypted inside a secure enclave (e.g., AWS Nitro, Intel SGX) on the server. Moderation runs inside the enclave. Server can't see plaintext. Upside: moderation works, privacy preserved. Downside: infrastructure-heavy, not portable. (5) Hybrid: E2E encryption for peer-to-peer, but group chats are encrypted to a server key (server can see). Moderation runs on group chats only. Upside: simpler, covers most abusive content. Downside: peer-to-peer is private, group chats are not. Recommended approach for most products: (1) Use option (5): hybrid E2E. Most abuse happens in group chats anyway. (2) For peer-to-peer: respect E2E fully. No server-side moderation. (3) For group chats: server holds group encryption key. Messages encrypted with group key (not user key). Server can decrypt and moderate. (4) Implement in layers: (a) TLS encryption in transit (between client and server). (b) Client-side encryption with E2E for peer chats, group key for group chats. (c) On server: decrypt group messages, run ML model (TensorFlow, spam classifier, toxicity detector). Flag suspicious messages. 
(d) Human review: flagged messages go to a moderation queue. Human reviewer decides action (delete, warn user, ban). (5) Privacy: never store plaintext long-term. After moderation decision (1-2 days), delete plaintext and keep only ciphertext. (6) Transparency: inform users: "Group chats are moderated for safety. Peer chats are E2E encrypted." Implementation: (1) Modify encryption layer. Peer chats use user's asymmetric key. Group chats use a symmetric group key (stored on server, each member has their copy). (2) Moderation service: fetches encrypted group messages, decrypts with group key, runs ML models. (3) Testing: send known spam/toxic content, verify it's flagged. Send benign content, verify it passes. This balances user privacy with platform safety.
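The hybrid routing decision (peer chats stay opaque to the server; group chats are decryptable with the server-held group key and run through moderation) can be sketched as follows. This is pure illustration: `toy_decrypt` is NOT real cryptography (production would use an AEAD cipher such as AES-GCM with the group key), and `classify` stands in for the ML spam/toxicity model.

```python
def toy_decrypt(ciphertext, key):
    """Stand-in for real AEAD decryption with the group key."""
    key_id, plaintext = ciphertext   # toy representation of a ciphertext
    if key_id != key:
        raise ValueError("wrong key")
    return plaintext

def route_for_moderation(message, group_keys, classify):
    """Decide whether the server may inspect a message.

    - peer chats: true E2E, the server sees only ciphertext -> deliver as-is
    - group chats: server holds the group key -> decrypt, classify, flag
    """
    if message["chat_type"] == "peer":
        return {"moderated": False, "action": "deliver"}
    key = group_keys[message["group_id"]]
    plaintext = toy_decrypt(message["ciphertext"], key)
    verdict = classify(plaintext)   # ML model stand-in
    action = "flag_for_review" if verdict == "suspicious" else "deliver"
    return {"moderated": True, "action": action}
```

The important property is structural: the peer-chat branch never touches a key, so the moderation path physically cannot be applied to E2E traffic by accident.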
Follow-up: If users have group encryption keys, how do you prevent a user from sharing it with others outside the group?
Your chat system stores messages in a primary database (MongoDB) for fast reads but also needs to support full-text search across billions of messages. Search latency is currently 5 seconds. Users expect results in < 500ms. How do you redesign?
Full-text search on a primary database doesn't scale. You need a dedicated search index. Architecture: (1) Use Elasticsearch or similar (Solr, Meilisearch). Elasticsearch excels at full-text search and returns results in < 100ms for billions of documents. (2) Dual-write pattern: when a message is stored in MongoDB, also index it in Elasticsearch. Use a message queue (Kafka) to decouple: MongoDB write -> Kafka -> Elasticsearch consumer. This ensures eventual consistency and prevents Elasticsearch slowness from blocking message insertion. (3) Indexing strategy: index searchable fields (message body, sender name, timestamp). Don't index media (videos, images) or metadata (internal IDs). (4) Shard by user: use Elasticsearch custom routing on a user/conversation ID so a user's messages land on a single shard (a literal shard per user would not scale to millions of users). Query space is dramatically reduced (search only your own messages + group chats you're in). (5) Retention: search index only keeps 1 year of messages (configurable). Older messages archived to cold storage (S3). For historical search, query cold storage (slower, but acceptable). (6) Real-time indexing: use Kafka + Elasticsearch consumer. On message insertion, enqueue. Consumer batches 100 messages, indexes to Elasticsearch. Latency: < 1 second from message to searchable. (7) Caching: popular searches (e.g., "hello") are cached. Results served from Redis cache within 10ms. (8) Testing: insert 100M messages, run 1000 concurrent searches, measure P99 latency. Should be < 500ms. Implementation steps: (1) Deploy Elasticsearch cluster (3 nodes for HA). (2) Modify message insertion logic: after MongoDB write, enqueue to Kafka. (3) Deploy Elasticsearch consumer: listens to Kafka, batches, indexes. (4) Update search API: query Elasticsearch instead of MongoDB. (5) Migrate existing messages: bulk index all messages from MongoDB to Elasticsearch (background job, doesn't block live messages). (6) Monitor: ensure Elasticsearch index is healthy, indexing lag is < 1 second, and search latency is < 500ms. 
Result: with Elasticsearch, search latency drops from 5 seconds to < 200ms (P99).
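The batching consumer from step (6) — buffer messages off the queue and hand them to the index 100 at a time — reduces to a few lines. A sketch: `bulk_index` is a stand-in for an Elasticsearch bulk request, and the Kafka consumer loop that calls `on_message` is elided.

```python
BATCH_SIZE = 100

class BatchIndexer:
    """Buffer documents and flush them to the search index in bulk.

    Batching amortizes the per-request overhead of the index: one bulk
    call of 100 documents instead of 100 single-document requests.
    """
    def __init__(self, bulk_index, batch_size=BATCH_SIZE):
        self.bulk_index = bulk_index   # e.g. an Elasticsearch bulk call
        self.batch_size = batch_size
        self.buffer = []

    def on_message(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Also called on a timer in practice, so a slow trickle of
        # messages still becomes searchable within the ~1 second target.
        if self.buffer:
            self.bulk_index(self.buffer)
            self.buffer = []
```

A production version would add the time-based flush mentioned in the comment plus retry on bulk failures; the size-triggered flush shown here is the core of the pattern.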
Follow-up: If Elasticsearch index becomes corrupted, how do you recover? Can you rebuild from MongoDB?
Your chat app allows users to share media (photos, videos). A user uploads a 500 MB video to a group chat. The message service processes it but CDN delivery is slow for some regions (P99 latency 8 seconds). Design a media handling system with fast global delivery, virus scanning, and storage efficiency.
Media at scale requires: (1) efficient storage, (2) global fast delivery, (3) virus scanning, (4) transcoding/optimization. Architecture: (1) Storage: use object storage (S3, GCS). Store video in a regional bucket (closest to uploader). Replicate to other regions asynchronously. (2) CDN: CloudFlare or CloudFront. Configure CDN to cache media globally. Cache miss: CDN fetches from origin (S3). Cache hit: serve from edge location (nearest to user). Set TTL to 30 days (videos are mostly immutable). (3) Virus scanning: before storing, scan with ClamAV (or cloud service). Upload -> Async scan -> Quarantine until cleared -> Release to CDN. Infected files marked as unsafe, not delivered. (4) Transcoding: large videos should be transcoded to multiple bitrates (480p, 720p, 1080p). Clients choose quality based on connection. Use a job queue (ffmpeg workers) to transcode asynchronously. (5) Progress tracking: client needs to see progress. Use resumable uploads (S3 multipart). Client breaks video into chunks, uploads in parallel. Server tracks progress, updates database. (6) Optimization: compress images (JPEG -> WebP). For videos, use efficient codec (H.265 instead of H.264). (7) Cleanup: if user deletes message, mark media for deletion. After 7 days, delete from S3 and CDN. Implementation: (1) Upload service: accepts media, runs preliminary checks (size, MIME type), starts async pipeline. Returns upload URL. (2) Async pipeline: (a) Scan for viruses. (b) Transcode videos (if needed). (c) Upload to S3 with replication. (d) Preload to CDN. (3) Delivery: when recipient opens message, serve media URL from CDN. CDN handles edge caching. (4) Monitoring: track upload success rate, transcode queue depth, virus detection rate, CDN hit rate. For 500 MB video: (a) User uploads in chunks (multipart). Entire upload takes 30-60 seconds. (b) Async scan (5 min). (c) Transcode to 3 bitrates (10 min per bitrate = 30 min total). (d) CDN preloading (5 min). 
(e) Total: roughly 40 minutes (about 1 + 5 + 30 + 5) before the video is fully available in all regions. Meanwhile, clients can start watching immediately with an available bitrate. Result: P99 delivery latency < 1 second globally (CDN edge), virus-scanned, optimized for bandwidth.
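The resumable multipart upload from step (5) hinges on tracking which parts have completed, so a reconnecting client re-sends only the missing ones. A sketch, assuming 8 MB parts (S3 multipart parts must be at least 5 MB); part numbering starts at 1, as in the S3 API.

```python
CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB per part

class MultipartUpload:
    """Track a resumable chunked upload.

    Completed part numbers are recorded server-side; on reconnect the
    client asks for missing_parts() and uploads only those, instead of
    restarting a 500 MB transfer from zero.
    """
    def __init__(self, total_bytes, chunk_size=CHUNK_SIZE):
        self.num_parts = -(-total_bytes // chunk_size)   # ceiling division
        self.done = set()

    def complete_part(self, part_number):
        self.done.add(part_number)

    def missing_parts(self):
        return [p for p in range(1, self.num_parts + 1) if p not in self.done]

    def progress(self):
        # Fraction of parts uploaded, for the client's progress bar.
        return len(self.done) / self.num_parts
```

For a 500 MB video this yields 63 parts; the client can upload several in parallel and the progress fraction drives the UI described in step (5).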
Follow-up: If a user attempts to upload a video larger than your limit, how do you handle rejection gracefully without losing data?
Your chat system's database (MongoDB) is reaching capacity. Sharding by user_id is problematic because group chats span multiple shards. Queries like "get all messages in group X" require a scatter-gather across shards and are slow. Design a sharding strategy that keeps group chats co-located while maintaining write balance.
Sharding is hard when data doesn't partition cleanly. Your options: (1) Shard by group_id: all messages for group X go to the same shard. Group chats are fast (single shard query). But user-specific queries (all messages for user Y) require scatter-gather across all shards. Acceptable tradeoff if most queries are group-centric. (2) Shard by user_id: all messages from user Y go to shard Y. But group chats scatter across multiple shards. Queries become slow. (3) Denormalization: maintain both views. One collection sharded by group_id (fast group queries), one by user_id (fast user queries). On message insertion, write to both. Complexity: double the writes. (4) Consistent hashing: shard key = hash(group_id) if message is in a group, else hash(user_id) for peer chats. Mixed approach. (5) Hierarchical sharding: primary shard by group_id. Within each shard, secondary index by user_id. Queries first filter by group, then by user. Fastest for group + user filters. Recommended for most chat systems: (1) Shard by group_id. Accept that user-specific queries are scatter-gather. (2) Cache user queries: use Redis. When user opens chat app, fetch their last 10 groups (sorted by recent message). Cache these group IDs. Queries now only hit those shards. (3) Partition group chats: assign groups to shards using consistent hashing. Example: shard_id = hash(group_id) % 256. This ensures even distribution of groups. (4) Handle hot shards: if one group is very popular (1000 concurrent users, millions of messages), it becomes a hotspot on its shard. Mitigation: (a) Read replicas: add read replicas of the hot shard to distribute reads. (b) Caching: cache recent messages for the hot group in Redis. (c) Splitting: split the hot group across multiple shards if it grows too large. (5) Write optimization: bulk insert messages in batches (100 messages per operation) instead of 1 message per insert. This can cut per-message write overhead by an order of magnitude. Implementation: (1) Define sharding key: group_id. 
(2) Deploy MongoDB replica sets, each representing a shard. (3) Routing layer: application code or MongoDB router determines shard based on group_id. (4) Monitoring: ensure shards have balanced message count. If one shard is 10x larger than others, investigate. (5) Scale: add new shards by splitting an existing shard (move half its data to new shard). Results: group queries are O(1) shard lookups, user scatter-gather queries are O(number of groups) but cached in Redis so mostly O(1). This scales to billions of messages across hundreds of shards.
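The routing rule from step (3), shard_id = hash(group_id) % 256, needs a hash that is stable across processes and deployments. A sketch: `md5` here is just a convenient stable digest, not a security choice; note that Python's built-in `hash()` is randomized per process and must not be used as a shard key.

```python
import hashlib

NUM_SHARDS = 256

def shard_for(group_id: str) -> int:
    """Stable shard assignment for a group.

    All messages for a group hash to the same shard, so a group-chat
    read is a single-shard query. The digest-based hash gives the same
    answer on every server, unlike Python's randomized built-in hash().
    """
    digest = hashlib.md5(group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The routing layer (application code or the MongoDB router) calls this on every read and write; because the mapping is pure and deterministic, no lookup table is needed until you start splitting shards.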
Follow-up: When you split a hot shard by moving data to a new shard, how do you prevent messages from being lost or duplicated during the migration?
Your chat system crashed yesterday and lost 30 minutes of messages (database corruption). You've restored from a backup from 1 hour ago. Now you have a time gap: messages from 30-60 minutes ago are on users' clients but not in your database. Users see duplicates or gaps when they reopen the app. How do you handle this gracefully without re-sending lost messages or creating confusion?
This is a disaster recovery scenario where you have client-side state that doesn't match the database. Key principles: (1) accept data loss honestly, don't pretend recovery worked perfectly, (2) prevent duplicates, (3) minimize user confusion. Architecture: (1) Versioning: messages have server-assigned sequence numbers (message_version). When restoring from backup, the latest message_version is lower than what clients may have. (2) Client sync on app open: when user opens chat, client sends: "I have messages up to version X. What's the latest version on server?" Server responds with: version Y. (3) Gap handling: if Y < X, there's a gap (messages lost in crash). Client options: (a) discard local messages beyond version Y (safest—prevents duplicates, but users lose unsent messages), (b) show warning: "Some messages may have been lost. Your recent messages are queued for re-send, but may be duplicates. Send again?" (c) mark locally-cached messages as "unconfirmed" visually (gray out, add "may not have sent" label). (4) Deduplication on re-send: if user re-sends a message they think was lost, server deduplicates using idempotency tokens. Client generates token before first send attempt. If client re-sends, uses same token. Server sees same token, returns already-committed message instead of duplicating. (5) Message reconciliation: run a background job: for each group chat, query clients (push a "sync request"), collect their message versions, find gaps. Log discrepancies for ops review. (6) Transparency: send in-app notification: "We experienced an outage and recovered from backup. Messages between X-Y UTC may have been lost. We apologize. Check your recent chats to verify." (7) Graceful degradation: temporarily disable message history search (old messages might be incomplete). Re-enable once confirmed all messages are present. Implementation: (1) Add message_version field to messages table. On insert, auto-increment. On backup restore, preserve version numbers. 
(2) Client: on app open, query server for latest_version, compare with local max_version. If gap, trigger sync flow. (3) Idempotency: store idempotency tokens in a dedup table. Key = (user_id, idempotency_token). Value = message_id. On re-send, check dedup table first. (4) Testing: simulate database crash, restore from backup 1 hour old, verify: (a) clients sync correctly, (b) no duplicates on re-send, (c) gaps are visible to users, (d) chat remains functional. This approach trades some message loss (unavoidable in a crash) for consistency. Users see that something happened but the system recovers gracefully without corruption.
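The sync-and-dedup flow (gap detection via message_version, re-send dedup via idempotency tokens keyed by (user_id, idempotency_token)) can be sketched server-side as follows. `Server` is a toy stand-in that keeps the dedup table in memory rather than in a database table.

```python
class Server:
    """Sketch of post-restore sync and idempotent re-send."""
    def __init__(self, latest_version):
        # Highest message_version that survived the backup restore.
        self.latest_version = latest_version
        self.dedup = {}     # (user_id, idempotency_token) -> message_id
        self.next_id = 0

    def sync_gap(self, client_max_version):
        """How many messages the client holds that the server lost."""
        return max(0, client_max_version - self.latest_version)

    def send(self, user_id, token, body):
        """Commit a message; a re-send with the same token is a no-op."""
        key = (user_id, token)
        if key in self.dedup:
            return self.dedup[key]   # already committed: return same id
        self.next_id += 1
        self.dedup[key] = f"m{self.next_id}"
        return self.dedup[key]
```

A positive `sync_gap` tells the client to enter the warn/re-send flow from step (3); because the token is reused on re-send, a user tapping "send again" can never create a duplicate row.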
Follow-up: If users are upset about message loss, should you manually reconstruct messages from app client backups or just accept the loss?