System Design Interview Questions

Real-Time Feed and Notification Systems

Your notification system serves 50M daily active users with push, email, and in-app notifications. At 2 PM on a Thursday, you observe 500M undelivered notifications stuck in the queue. Your team begins investigating. P1 incident. Walk us through your root cause analysis and recovery strategy within the next 2 hours.

First, check queue depth metrics and compare against SLA (typically 99th percentile delivery within 60 seconds). Immediately pull logs from all notification workers: check CPU/memory/disk usage, network I/O, and error rates. Common culprits: (1) database connection pool exhaustion, with too many workers competing for connections; (2) downstream provider rate limits (FCM, APNS, SendGrid) that have started rejecting payloads; (3) storage write amplification, with Redis/Kafka filling up. Pull the last 30 minutes of traces. If it's a worker crash loop, review recent deployments. If queues are backing up but workers are healthy, the bottleneck is downstream (provider timeout). Immediately implement backpressure: reduce the ingest rate at the edge via a token bucket (accept only 80% of normal load), batch requests more aggressively, and fail over to secondary providers. In parallel, page on-call for each provider to check their status dashboards. If their infrastructure is degraded, implement exponential backoff and queue to disk. For recovery: prioritize high-value notifications (account security, payments) over promotional. Gradually ramp ingest back up in 10% increments every 5 minutes while monitoring queue depth. Collect a timeline for the post-incident RCA; the root cause is almost always a combination of under-provisioned workers and no circuit breaker for provider timeouts.
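The edge backpressure step can be sketched as a token bucket. This is a minimal in-process sketch; the class, its parameters, and `NORMAL_RPS` are illustrative assumptions, not production values (real deployments would enforce this at the gateway, shared across instances).

```python
import time

class TokenBucket:
    """Edge rate limiter used to shed load during the incident.
    `rate` is tokens refilled per second; `capacity` caps bursts."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject (or queue) this request at the edge

# During the incident: admit only 80% of the normal ingest rate.
NORMAL_RPS = 10_000  # hypothetical baseline
bucket = TokenBucket(rate=NORMAL_RPS * 0.8, capacity=NORMAL_RPS * 0.8)
```

Requests that fail `allow()` can be rejected with a retryable status so clients back off rather than pile onto the queue.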

Follow-up: How would you redesign the system to prevent this exact scenario? Specifically, what circuit breaker logic, queue isolation, and provider failover would you implement?

Engineers report that certain users receive notifications 4-8 times—exact duplicates with same ID and timestamp. It's affecting ~0.2% of user base but trending worse (0.5% by next day). Your ops team is getting reports of users disabling push entirely. What's your investigation plan, and how do you prevent recurrence?

Duplicates at this scale almost always point to retry logic gone wrong. Check whether notification workers use idempotency keys (unique fingerprints of user + notification type + timestamp). If they're only checking `notification_id` but your retry mechanism regenerates IDs, you'll get dupes. Pull the duplicate notifications and look at their UUIDs, timestamps, and metadata. If they share the same `notification_id` but hit 4-8 times, it's likely Kafka consumer group rebalancing or worker crashes triggering redelivery of not-yet-committed offsets. If the UUIDs are different, it's your application logic creating duplicates (the notification service is being called multiple times upstream). Check dead-letter queues (DLQs) for failed deliveries in the affected window. Then verify: (1) Kafka offsets: are you committing after confirmed delivery to the provider, or before? (2) At-least-once vs. exactly-once semantics: if you claim exactly-once, is it actually guaranteed end to end? (3) Deduplication window: do you dedupe over 24 hours or just in-flight? For prevention: implement a deduplication service (Redis with a 24-hour TTL keyed on user_id + notification_content_hash). Atomically check and set this before enqueuing. Use idempotency identifiers on provider calls where the API supports them (APNs, for example, offers `apns-collapse-id` to collapse duplicate pushes on the device). Make Kafka commits happen only after the provider ACK. Test chaos scenarios: kill workers mid-commit, simulate provider timeouts, force rebalances.
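The check-and-set dedupe described above might look like the following sketch. It uses an in-memory dict as a stand-in for Redis; in production the same logic would be a single atomic Redis `SET key 1 NX EX 86400`. The class and method names are hypothetical.

```python
import hashlib
import time

class Deduper:
    """In-memory sketch of the dedupe window. Production would replace
    _seen with Redis SET NX EX so check-and-set is atomic across workers."""

    def __init__(self, ttl_seconds: int = 86_400):
        self.ttl = ttl_seconds
        self._seen: dict[str, float] = {}  # key -> expiry timestamp

    def _key(self, user_id: str, content: str) -> str:
        # Keyed on user_id + content hash, per the prevention plan above.
        digest = hashlib.sha256(content.encode()).hexdigest()
        return f"dedupe:{user_id}:{digest}"

    def should_send(self, user_id: str, content: str) -> bool:
        key = self._key(user_id, content)
        now = time.monotonic()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the TTL window: drop it
        self._seen[key] = now + self.ttl  # claim the key before enqueuing
        return True
```

Because the claim happens before enqueuing, a retried upstream call that regenerates the notification ID still hits the same content hash and is suppressed.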

Follow-up: If you're at exactly-once semantics but still seeing duplicates, what would be your next debugging step? Could it be a data layer issue?

Product wants to ship in-app notification badges that live in a sidebar—they should disappear when a user dismisses them. Your current architecture publishes to a topic, workers consume, and send to providers. Now you need bidirectional state: app sends 'dismissal' events back, and workers need to know not to send duplicates. How do you add this without breaking existing flow?

This is a shift from fire-and-forget to stateful notifications. You need a notification state store. Here's the architecture: (1) Notifications table with UUID, user_id, content, created_at, status (pending/delivered/dismissed/expired). (2) Publish notifications to Kafka as before, but workers write initial state to the store before sending to providers. (3) Add a second Kafka topic for user dismissal events. A separate consumer group listens, looks up notification in store, and sets status to dismissed. (4) Before sending provider requests, workers check state store—if already dismissed, skip. (5) Provider webhooks (delivery receipts) also write to state store. Use Redis for low-latency reads (cache with 10-minute TTL), backed by PostgreSQL for durability. For exactly-once: write to state store atomically with provider call (use database transaction). For dismissals, use an idempotent update—don't re-process if status is already dismissed. Handle race conditions: if dismissal arrives before delivery, mark as dismissed but still deliver (user sees it, can dismiss again). Test edge case: user dismisses on mobile, web client still shows it—eventually consistent is acceptable here, but propagate dismissal to all devices via broadcast channel. This adds ~15-20ms latency per notification (state store lookup) but dramatically improves user experience.

Follow-up: What consistency model did you choose, and why? Could you use eventual consistency differently here to reduce latency?

Your analytics team reports that 8% of emails are delivered to spam. Product wants to reduce this to under 1%. You're using SendGrid. Walk through how you'd diagnose and fix this—including sender reputation, content, and infrastructure changes.

Email spam is a multi-faceted problem touching authentication, content, and sender reputation. Start with diagnostics: (1) Pull SendGrid bounce reports: what percentage are soft vs. hard bounces? Hard bounces mean bad addresses or blocks. (2) Check SPF, DKIM, and DMARC records: are they correctly configured? Many companies have misconfigured DKIM keys or an incorrect SPF include list. Verify with tools like MXToolbox. (3) Content analysis: SendGrid provides a spam score in headers. Pull samples of emails flagged as spam vs. inbox and look for patterns: too many links? Certain keywords? Suspicious redirects? (4) Sender reputation: check your IP's reputation on SendGrid's intelligence dashboard and external tools (RBLs, Spamhaus). If reputation is poor, you may need dedicated IPs from SendGrid and a warm-up of the new sending IP (ramp volume gradually over weeks). (5) Segmentation: separate transactional emails (receipts, password resets) from marketing. Transactional should be 95%+ inbox. If marketing is dragging down overall numbers, split the infrastructure. For fixes: (1) Fix auth records immediately. (2) Content: remove suspicious links, shorten URLs (use your own domain redirect, not a URL shortener), personalize with the user's name, avoid all-caps subject lines. (3) List hygiene: suppress bounced addresses, scrub against complaint lists. (4) Volume ramping: if you've never sent to a recipient before, send test emails first and monitor feedback loops (SMTP 4xx soft rejects such as 421). (5) Timing: stagger sends across hours to avoid spam filter triggers on volume spikes. Measure success within 1-2 weeks. If you're stuck above 3%, you may need a third-party ESP like Klaviyo or Sailthru with pre-warmed sending infrastructure.
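The bounce-report triage in step (1) can be sketched as a small aggregation over webhook events. The event shape here is an assumption for illustration: each event dict carries an `event` field, and bounce events carry a `type` field where `"bounce"` is treated as a hard bounce and `"blocked"` as a soft block; check your ESP's actual webhook schema before relying on these names.

```python
from collections import Counter

def bounce_breakdown(events: list[dict]) -> dict:
    """Summarize delivery webhook events into the hard/soft split
    you'd pull first when diagnosing spam placement."""
    total = len(events)
    counts = Counter(e.get("event") for e in events)
    hard = sum(
        1 for e in events
        if e.get("event") == "bounce" and e.get("type") == "bounce"
    )
    return {
        "delivered_pct": 100.0 * counts["delivered"] / total,
        "hard_bounces": hard,                      # bad addresses / blocks
        "soft_bounces": counts["bounce"] - hard,   # transient rejections
    }
```

A rising hard-bounce count points at list hygiene; rising soft bounces with steady hard bounces points at reputation or volume-spike throttling.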

Follow-up: If your sender IP reputation is tanked, what's the recovery timeline, and can you mitigate while repairing it?

Your system uses separate queues for push (FCM/APNS), email (SendGrid), and in-app notifications. During a game launch with 5M simultaneous users, all three queues spike but at different rates. Your workers process them sequentially in a single pool. Latency for in-app notifications jumps from 200ms to 45 seconds. Users see stale feeds. How would you fix this in production right now (no weeks of refactoring)?

This is a classic resource contention problem. Quick wins: (1) Split worker pools: allocate separate worker processes/containers for each queue. Push and in-app are fastest (< 500ms provider latency), so they get priority. Email is slowest (5-10s provider latency), so isolate it. (2) Use different queue depths for backpressure: push queues accept up to 100K messages, in-app up to 50K (in-app is fastest and most latency-sensitive), email up to 10K (slower provider). (3) Add priority queueing within each pool: security alerts > system messages > promotional. (4) Implement dynamic worker scaling: if queue depth exceeds a threshold, spawn 2x more workers (cloud-native: scale the k8s deployment). For in-app specifically, bypass the heavy queue if possible: use Redis Pub/Sub to push notifications directly to connected clients (WebSockets). Only fall back to the queue if the client is offline. This drops latency to <50ms for online users. (5) Add a circuit breaker for providers: if FCM is slow, don't let it block in-app workers; fail fast and queue for retry. Measure per-queue SLAs: push 90th percentile < 2s, in-app 90th percentile < 500ms. Implement canary deployments: test new logic on 5% of traffic first. For the long term: event sourcing + materialized views. Write all events to a log, and have separate services consume and optimize for each notification type. That is 2-3 sprints to build properly, but the 1-day fix (worker pool splitting + queue prioritization) gets you 80% of the way there.
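Steps (2) and (3) combine into a bounded per-channel priority queue. This is a single-process sketch using the stdlib heap; the priority ordering and depth limits come from the answer above, while the class and the channel map are illustrative.

```python
import heapq
import itertools

# Priority within a pool: security > system > promotional (lower = sooner).
PRIORITY = {"security": 0, "system": 1, "promotional": 2}

class PriorityNotificationQueue:
    """Bounded priority queue for one channel's worker pool."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth          # backpressure threshold
        self._heap: list[tuple] = []
        self._seq = itertools.count()       # FIFO tiebreak within a priority

    def offer(self, kind: str, payload: str) -> bool:
        if len(self._heap) >= self.max_depth:
            return False  # reject at the edge instead of growing unboundedly
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), payload))
        return True

    def poll(self):
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload

# One isolated queue per channel, sized per the depths above.
queues = {
    "push": PriorityNotificationQueue(100_000),
    "in_app": PriorityNotificationQueue(50_000),
    "email": PriorityNotificationQueue(10_000),
}
```

Because each channel owns its queue and pool, a slow email provider can only fill the 10K email queue; it can no longer starve in-app workers.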

Follow-up: If you split worker pools, how do you ensure you don't over-provision one pool and under-provision another? What metrics would drive scaling decisions?

Your company expands to a new country with strict data residency laws (all personal data must stay in-country). Your current notification system stores user preferences and delivery logs in a central database. Design the multi-region strategy with minimal latency impact and zero compliance violations.

Data residency is a hard constraint; you must architect for it from day one. Here's the approach: (1) Segment your database schema into geo-bound and global. Global: user_id, account tier (replicated everywhere with strong consistency). Geo-bound: user_preferences (notification frequency, language, timezone), delivery_logs (for compliance audits). (2) Use multi-master replication for user tables across regions but configure replication filters: EU data never replicates to the US. PostgreSQL with logical replication handles this; Spanner does it natively. (3) For notification workers: deploy regional clusters. A US user's notification request hits the US region exclusively; a European user hits the EU region. Use a geo-routing layer at the edge (your API gateway or DNS) to route requests to the right region. (4) For providers (FCM, APNS): these are global, but you control the request origin. Ensure requests originating from the EU region use EU endpoints if available, or accept the provider's terms on cross-border data transfer. (5) Preferences store: a local cache in each region (Redis) with eventual consistency for reads, strong consistency for writes (write to the local DB first; replicate asynchronously to other regions only if explicitly shared, never by default). (6) Compliance: audit logs stay local. Every delivery attempt (timestamp, user_id, provider response) is written to the regional store. Annual audits read local logs only. (7) Data deletion: GDPR right-to-be-forgotten means you must delete all traces in one region without affecting others. Use a per-region soft-delete flag, then batch deletion after the retention period. Test failover: if the EU region fails, users' data doesn't leak to the US. Design for this. Typical latency impact: +50ms for cross-region lookups, but local caching mitigates it.
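The routing and replication-filter rules in steps (2) and (3) can be expressed as two small pure functions. The country-to-region map, region names, and table names are hypothetical placeholders; in production the routing lives at the API gateway or DNS layer and the filters in the replication config, not in application code.

```python
# Hypothetical residency map: which region owns a user's data.
RESIDENCY_REGION = {"DE": "eu-central", "FR": "eu-central", "US": "us-east"}
ALL_REGIONS = ["us-east", "eu-central"]
GLOBAL_TABLES = {"accounts"}  # strongly consistent, replicated everywhere

def route_request(user_country: str, default_region: str = "us-east") -> str:
    """Pin a user's notification traffic to their residency region."""
    return RESIDENCY_REGION.get(user_country, default_region)

def replication_targets(source_region: str, table: str) -> list[str]:
    """Replication filter: global tables fan out to every other region;
    geo-bound tables (preferences, delivery_logs) never leave home."""
    if table in GLOBAL_TABLES:
        return [r for r in ALL_REGIONS if r != source_region]
    return []
```

Encoding the filter as a function of the table name is what makes the compliance property testable: an audit can assert that `replication_targets` returns an empty list for every geo-bound table.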

Follow-up: What happens if a user moves from the US to the EU? How do you migrate their notification history and preferences?

During the holiday season, notification volume triples but third-party providers (FCM, APNS, SendGrid) begin rate-limiting your requests. You can't upgrade their tiers mid-incident. What's your strategy to maintain SLAs without throwing away money on upgrade fees?

This is about intelligent degradation and queue management. Providers typically rate-limit around 50K-200K requests per second per customer (it varies). When you hit limits, you get 429 responses. Quick mitigation: (1) Implement a token bucket with provider limits baked in. If FCM allows 100K RPS, set your limit to 90K to leave headroom. (2) Queue requests when hitting limits: don't reject them; batch and retry with exponential backoff (1s, 2s, 4s, 8s, max 60s). (3) Prioritize: security alerts get through 100% of the time, marketing gets 50%. Use priority queues for this. (4) Implement fair queuing across customers (if you're a platform serving multiple apps): don't let one app hog the entire provider quota. (5) Deduplication: if a user received 5 notifications about the same event in the past hour, coalesce them into one. This can reduce total volume by 20-30% with zero user impact. (6) Batch optimization: where the provider supports batched or multicast sends, send many notifications per HTTP request instead of one; this cuts request count dramatically while delivering the same message volume under the RPS limit. (7) Async retry: use Kafka or similar to queue failed requests and process them during off-peak hours (2 AM) when traffic is lower. Many notifications are time-sensitive (< 1 hour), but others aren't. (8) Load shifting: use time-zone awareness. If it's 2 AM in Asia, send those users' notifications during their morning (6 AM). Shift some volume to later in the day. With these tactics, you can typically handle 3x volume on the same provider tier; delays are acceptable for non-critical notifications. Test this: simulate a 3x surge in staging and measure queue depth and delivery latency SLOs. If you're still exceeding SLOs, then upgrade tiers, but usually this strategy gets you through.
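Steps (2) and (5) can be sketched directly: a generator producing the capped exponential backoff schedule, and a coalescing pass that collapses repeated notifications for the same (user, event) pair. The notification dict shape (`user_id`, `event_id`) is an assumption for illustration.

```python
def backoff_schedule(max_delay: float = 60.0):
    """Yield retry delays: 1s, 2s, 4s, 8s, ... capped at max_delay."""
    delay = 1.0
    while True:
        yield min(delay, max_delay)
        delay *= 2

def coalesce(pending: list[dict]) -> list[dict]:
    """Collapse multiple notifications about the same event for the
    same user into one (keeping the most recent), before they ever
    count against provider rate limits."""
    latest: dict[tuple, dict] = {}
    for n in pending:  # later entries overwrite earlier ones
        latest[(n["user_id"], n["event_id"])] = n
    return list(latest.values())
```

Running `coalesce` just before the provider send means the dedupe savings apply exactly where the 429s occur, at the rate-limited boundary.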

Follow-up: How would you measure the impact of deprioritizing marketing notifications? What's the acceptable tradeoff?

Your security team discovers a bug: notification delivery logs are being stored unencrypted in PostgreSQL and are readable by anyone with database access. These logs contain user emails, phone numbers, and activity patterns. This is a compliance nightmare (PII exposure). You have 48 hours to fix. Walk through your remediation plan, including retroactive cleanup and forward-looking architecture.

This is a critical incident with legal/compliance implications. Immediate actions (next 4 hours): (1) Assess the blast radius: query the logs to understand what's exposed, how many users, and the date range. (2) Notify security and legal immediately; they'll guide disclosure requirements. (3) Stop the bleeding: halt new logging of sensitive fields. Update your logging code to drop PII (never log a full email, only a hash or redacted version). Deploy to production. (4) Encrypt in transit: ensure all database connections use SSL/TLS (for PostgreSQL, set `sslmode=require` in client connection strings and enable SSL on the server). For retroactive cleanup (next 24 hours): (1) Identify sensitive fields in logs (email, phone, IP). (2) Write a batch job that reads old logs, redacts PII (replacing it with hashes or placeholders), and writes back to the same table. This is safer than deletion in case you need to retain logs for compliance. (3) Verify redaction: spot-check 100 rows to ensure no readable PII remains. (4) Archive the old, pre-redaction logs to cold storage (S3 with encryption) and delete them from the hot DB; this reduces the attack surface. Forward-looking architecture (next 2 weeks): (1) Encrypt sensitive fields at rest using AES-256, via PostgreSQL pgcrypto or a KMS (AWS KMS, HashiCorp Vault). (2) Use role-based access control (RBAC): most engineers can't query logs; only data access teams can, and their queries are audited. (3) Implement field-level encryption: notification_id (not PII) stays plaintext for indexing; email is encrypted, requiring KMS decryption to read. (4) Data retention policy: logs expire after 30 days (the minimum your legal hold requires), then are auto-deleted. (5) Audit trail: log all database queries to a separate immutable store (CloudTrail or similar). Measure: time to detection (in this case, you should have detected this via automated scanning), time to remediation (you're aiming for 24 hours), and compliance verification (legal signs off).
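The core of the retroactive-cleanup batch job is the redaction transform. This sketch replaces emails and phone numbers with short, non-reversible fingerprints so old rows stay correlatable without being readable; the regexes are deliberately simple assumptions and would need tuning against your actual log formats (in particular, unsalted hashes of low-entropy PII like phone numbers remain guessable, so a keyed hash is preferable in production).

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace raw PII with a sha256-based fingerprint placeholder."""
    def fingerprint(m: re.Match) -> str:
        # Truncated digest: enough to join rows by user, not to read PII.
        return "pii:" + hashlib.sha256(m.group().encode()).hexdigest()[:12]
    # Redact emails first, then phone numbers, in a single pass each.
    return PHONE_RE.sub(fingerprint, EMAIL_RE.sub(fingerprint, text))
```

The batch job would stream log rows, apply `redact` to the message column, and write back in the same transaction, then run the 100-row spot check described above against the updated table.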

Follow-up: How would you design an automated detection system to catch this class of PII exposure in the future?
