System Design Interview Questions

Design a Rate Limiter Service


You've built a standalone rate-limiting service used by 200 microservices. Each service needs different rate limits (per-API-key, per-user, etc.). At peak traffic, the rate limiter receives 1M requests/second for limit checks, and the Redis cluster backing it is at 90% CPU. You can't just add more servers: the limiter is already distributed. What do you do?

Redis at 90% CPU means your limiting logic is too expensive or your sharding is inefficient. Diagnose first: (1) Profile Redis commands: is it being hammered by INCR? HMGET? Use redis-cli MONITOR (briefly, since it adds its own overhead) to see command patterns. (2) Check key distribution: if most of the 1M requests/sec target a handful of hot keys (e.g., a few very active users), one Redis node is overloaded while the rest idle. Check the cardinality of your keys. (3) Measure latency: what are the P50 and P99 latencies of limit checks? If latency is still acceptable despite the high CPU, throughput may be fine and you have room to optimize deliberately.

Solutions in priority order: (1) Connection pooling: if each microservice opens a new connection to Redis for every check, that alone is significant overhead. Use connection pooling and multiplexing. (2) Batch requests: instead of checking the rate limit per request, batch 100 requests and check the limit for the batch. Reduces Redis traffic by 100x. Downside: slightly less granular enforcement. (3) Sharding key optimization: the current shard key might be user_id. If one user is hammering the system, one shard gets overloaded. Shard key options: (a) hash(user_id + request_type): spreads the same user across multiple shards if they have different limit types. (b) hash(user_id + time_bucket): changes shard every minute. More complex, but handles bursty traffic better. (4) Local caching: each microservice keeps a local in-memory cache of limit state (rate, window, reset time) and only queries Redis on cache miss or window expiry. This makes the counter probabilistic: eventual consistency in exchange for dramatic performance gains. Example: track requests locally for 100ms, then sync to Redis. Latency for local checks: under a microsecond. Accuracy loss: negligible. (5) Pipeline requests: use Redis pipelining to batch multiple INCR commands into one round-trip. Instead of 1M individual round-trips/sec to Redis, send ~10K pipelined batches. (6) Upgrade Redis: if the hardware is maxed out, move to a managed Redis with better specs (e.g., AWS ElastiCache with more CPU cores), or Memcached if your counters are simple enough for its leaner model.
(7) Algorithm change: sliding window (precise but expensive) -> fixed window (less precise but fast). Example: a limit of 1000 requests per minute. Instead of tracking every request timestamp, use a counter that resets every minute. Much faster.

Recommended approach: (1) Implement local caching (option 4) plus pipelining (option 5). This alone can cut Redis load by two orders of magnitude. (2) Verify with a load test: 1M locally cached checks/sec should be achievable on a single machine at under 10% CPU. (3) If still needed, add more Redis nodes and re-shard. Result: with local caching and pipelining, Redis CPU drops from 90% to under 10%, and check latency drops from ~50ms to under 10ms.
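A minimal sketch of the local-caching idea (option 4) combined with batched syncing: each instance counts requests in memory and pushes accumulated deltas upstream only once per sync interval. All names here are illustrative, and the `flush` callback stands in for whatever pipelined Redis call you would actually use.

```python
import time
from collections import defaultdict

class LocalCountingLimiter:
    """Approximate limiter: counts locally, periodically syncs deltas upstream.

    Window reset/expiry is omitted for brevity; a real version would also
    clear counts when the rate-limit window rolls over.
    """

    def __init__(self, limit, flush, sync_interval=0.1):
        self.limit = limit
        self.flush = flush              # callable(deltas: dict) -> global counts dict
        self.sync_interval = sync_interval
        self.local = defaultdict(int)   # per-key deltas not yet synced
        self.global_counts = {}         # last known global counts per key
        self.last_sync = time.monotonic()

    def allow(self, key):
        now = time.monotonic()
        if self.local and now - self.last_sync >= self.sync_interval:
            # one upstream round-trip for all keys instead of one per request
            self.global_counts = self.flush(dict(self.local))
            self.local.clear()
            self.last_sync = now
        if self.global_counts.get(key, 0) + self.local[key] >= self.limit:
            return False
        self.local[key] += 1
        return True
```

With a 100ms interval, 1M requests/sec collapses into a handful of flushes per second per instance; the trade-off is that an instance can briefly over-admit until the next sync.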

Follow-up: If you use local caching in each microservice, how do you update limits globally when an admin changes a rate limit? Do you have to redeploy?

Your rate limiter uses a sliding window algorithm: for each user, track all request timestamps in the last N seconds. But with 1B users and 100K requests/sec, storing every timestamp consumes 500 GB of memory, and latency is climbing. Redesign for efficiency.

Sliding window with full timestamp history is memory-intensive: 1B users with ~100 tracked timestamps each is 100B timestamps, and at 8 bytes per timestamp that's 800 GB before any overhead. The 500 GB you're seeing is in that ballpark, and it doesn't scale.

Better algorithms: (1) Fixed window: instead of tracking timestamps, use a counter that resets every minute. At 00:00-00:59, count requests; at 01:00, reset. Pros: O(1) memory per user, O(1) work per check. Cons: an edge case at window boundaries (a user can get 2x requests by bursting at 00:59:59 and again at 01:00:01). (2) Sliding window with buckets: divide the window into buckets (e.g., a 60-second window -> 6 buckets of 10 seconds each) and store only the count per bucket. Memory: 6 integers per user, about 48 bytes. Query: sum the counts from the relevant buckets. Accuracy: ~10 seconds (depends on bucket size). (3) Token bucket: maintain a "bucket" of tokens per user. Tokens are added at a fixed rate (e.g., 100 tokens/second); each request consumes one token; if the bucket is empty, the request is rate-limited. Memory: O(1) per user (just the token count and last refill timestamp). Pros: handles bursts gracefully (a user can spend accumulated tokens). Cons: requires a refill computation on each check. (4) Leaky bucket: similar to token bucket, but requests enter a queue that drains at a fixed rate. Memory: O(queue_length). Complexity: higher.

Recommended for most systems: token bucket. It's elegant, has O(1) memory, and handles bursts. Architecture: (1) For each (user_id, api_key, limit_type), store (tokens_available, last_refill_time). Example: user 123, get-api, 100 requests/minute -> (tokens_available=50, last_refill_time=14:30:10). (2) On request: compute elapsed_time since the last refill; new_tokens = elapsed_time * refill_rate; tokens_available += new_tokens, capped at max_tokens. (3) If tokens_available >= 1, allow the request and decrement by 1; else reject. (4) Storage: use a Redis hash (HSET) to store the tuple; HINCRBY/HINCRBYFLOAT update the fields in place.
Example Redis key: "rate_limit:{user_id}:{api_key}" -> {"tokens": 50, "last_refill": 1680000010}. (5) Implementation: use a Lua script in Redis for atomicity: refill, deduct a token, check if allowed, and return the result in one atomic operation. Memory savings: O(1) per rate-limited entity instead of O(window_size). At roughly 16 bytes of state per user (token count plus last refill time), 1B users is on the order of 16 GB, versus ~800 GB with full timestamp history. Query latency: < 5ms. Test: simulate 1B users at 100K requests/sec and measure memory and latency; expect tens of GB at most and < 5ms at P99.
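The refill-on-read arithmetic from steps (2)-(3) fits in a few lines. This is an illustrative in-process version (the Redis/Lua version applies the same arithmetic atomically), and the `cost` parameter is an extension for the follow-up about operations with different costs, not part of the base design.

```python
import time

class TokenBucket:
    """O(1)-memory token bucket: state is (tokens, last_refill) per entity."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # credit tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Injecting the clock keeps the refill arithmetic testable without real sleeps, which is also how you'd unit-test the Lua version against a fake time source.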

Follow-up: Token bucket handles requests, but how do you handle different "costs" for different operations? For example, a read request costs 1 token but a delete request costs 10 tokens.

Your rate limiter serves 200 microservices with different limit strategies: some need per-user limits, some per-IP, some per-API-key, and some per-account. The rate limiter has 100 different config combinations. When you update a config (e.g., increase limit for premium users), it takes 30 minutes to propagate through the system. This is too slow. Design a dynamic config system.

Config propagation latency is a deployment and caching issue. The current flow is probably: update DB -> restart services -> new limits take effect; 30 minutes is typical for a canary rollout. For near-instant propagation:

(1) Event-driven config updates: instead of restarting, subscribe to config changes via a message bus (Kafka, SNS) or pub/sub (Redis pub/sub, etcd watches). When a config is updated: (a) the admin updates the config in the database; (b) an event is published ("rate_limit_config_updated"); (c) all rate limiter instances receive the event and update their in-memory config; (d) end to end this takes under a second. (2) In-memory config cache: each rate limiter instance maintains a local copy of all configs (indexed by service, limit type, etc.). On cache miss, check Redis; on a config event, refresh the cache. (3) Versioning: configs carry versions (v1, v2, v3). When a config is updated, increment the version. Instances compare versions on a periodic sync (every 10 seconds) and re-fetch if the version changed. (4) Rollback: if a config causes issues, roll back by publishing the previous version. No deploy needed. (5) Gradual rollout: when updating a limit, canary to 10% of services first, monitor error rates, and expand to 100% if healthy. This is managed via feature flags in the config.

Implementation: (1) Config service: a REST API to manage rate limit configs, backed by PostgreSQL. (2) Event publisher: on config update, publish to a Kafka topic such as "rate-limit-config-updates". (3) Rate limiter: subscribe to the topic; on each event, update the local config cache. (4) Health check: if an instance hasn't seen a config event or heartbeat within 60 seconds, it re-fetches the full config directly; this guards against missed events. (5) Testing: change a config and verify all instances update within 5 seconds; simulate an instance outage and verify it catches up gracefully.

Example: (a) Config: "user_123" limit is 100 req/min (version 42). (b) Admin updates it to 200 req/min (version 43). (c) The event is published; rate limiter instances receive it within 500ms and apply the new limit.
(d) Instances still on version 42? The periodic version check detects and updates them within 60 seconds. Result: config changes propagate in under a second instead of 30 minutes, which enables fast iteration on limits based on traffic patterns.
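A sketch of the versioned config cache from steps (2)-(3), showing both the event-driven push path and the periodic version-check catch-up path. The shape of `fetch_all` and the event fields are assumptions for illustration, not a prescribed API.

```python
class ConfigCache:
    """In-memory rate-limit config with versioned, event-driven updates."""

    def __init__(self, fetch_all):
        self.fetch_all = fetch_all  # callable() -> (version, {key: limit})
        self.version, self.configs = fetch_all()

    def on_event(self, version, key, limit):
        """Apply a single config-updated event pushed from the bus."""
        if version > self.version:
            self.configs[key] = limit
            self.version = version

    def periodic_sync(self, remote_version):
        """Every N seconds: if events were missed, re-fetch everything."""
        if remote_version > self.version:
            self.version, self.configs = self.fetch_all()

    def limit_for(self, key, default=100):
        return self.configs.get(key, default)
```

The periodic sync is also the answer to an instance that was offline during an update: on its next version check it notices the gap and does a full re-fetch rather than replaying events.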

Follow-up: If a microservice is offline and misses the config update event, how do you catch it up when it comes back online?

A malicious user is trying to DDoS your API by bombarding it with requests. Your rate limiter correctly rejects their requests, but the overhead of checking the rate limit (Redis calls, computations) is consuming 30% of your API server's CPU. They're still disrupting the system. How do you mitigate?

DDoS attacks that stress the rate-limiting path itself are tricky: the limiter is supposed to protect you, but it becomes the bottleneck. Mitigations:

(1) Multi-layer defense: (a) Network layer: use a CDN or WAF (Web Application Firewall) to drop obvious DDoS traffic before it reaches your servers. Example: an IP sending hundreds of requests within 100ms gets dropped at the edge. (b) Coarse-grained limiting: run cheap checks before expensive ones. Example: rate limit by IP first (a fast lookup); if the IP is already limited, skip the user-level checks. (c) Early rejection: if a request fails basic validation (malformed headers, missing auth), reject it before rate limiting. (2) Efficient rate limiter: (a) Use local in-memory checks first (no Redis): track per-IP counters locally and only sync to Redis for persistence. (b) Probabilistic counters: Count-Min Sketch for fast approximate per-key frequency counts (HyperLogLog is for distinct-element cardinality, not frequencies). (3) Circuit breaker: if rate limiter latency exceeds a threshold (> 100ms), stop calling it and apply a default policy. Whether that default is fail-open (allow) or fail-closed (return 429) depends on whether availability or backend protection matters more; either way you fail fast and prevent cascading overload. (4) Bypass for known safe traffic: if a request comes from a trusted IP or user on an allowlist, skip rate limiting. (5) Graceful degradation: if the rate limiter is overwhelmed, degrade to coarser limits: per-IP limits only instead of per-user, so you process fewer checks per second. (6) Separate infrastructure: deploy the rate limiter on dedicated hardware, isolated from your main API servers, so a DDoS against the limiter doesn't take API throughput down with it. (7) Adaptive limiting: if a DDoS is detected (a spike in requests from a single IP), dynamically lower that IP's limit. Example: a client making 10K req/sec gets throttled to 10 req/sec. (8) Rate limit the rate limiter: at the meta level, if the limiter receives > 100K limit-check requests/sec for a single IP, it stops evaluating and simply rejects them.

Implementation: (1) Add network-layer defense: CloudFront plus WAF blocking rules. (2) Optimize the limiter: local counting with periodic sync.
(3) Implement circuit breaker: if Redis latency > 100ms, fail-open (allow request). (4) Separate infrastructure: move rate limiter to its own service with dedicated resources. (5) Test DDoS scenario: simulate 100K req/sec from attacker IP, verify (a) legitimate users aren't affected, (b) API server CPU doesn't spike, (c) attacker requests are rejected quickly. Result: with multi-layer defense, a DDoS attack is mitigated at the network level before it reaches your rate limiter. Even if it reaches the limiter, the limiter fails gracefully (circuit breaker) and your API servers stay healthy.
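The circuit breaker from step (3) might look like this in-process sketch: after a run of slow backend calls it trips and skips the backend entirely for a cooldown period, applying a configurable fail-open or fail-closed default. All thresholds and names are illustrative.

```python
import time

class LimiterCircuitBreaker:
    """Fail fast when the backing limiter is slow; default policy is configurable."""

    def __init__(self, check, latency_budget=0.1, trip_after=5, cooldown=10.0,
                 fail_open=True, clock=time.monotonic):
        self.check = check                  # callable(key) -> bool (True = allowed)
        self.latency_budget = latency_budget
        self.trip_after = trip_after        # consecutive slow calls before tripping
        self.cooldown = cooldown
        self.fail_open = fail_open
        self.clock = clock
        self.slow_calls = 0
        self.tripped_until = 0.0

    def allow(self, key):
        if self.clock() < self.tripped_until:
            return self.fail_open           # skip the slow backend entirely
        start = self.clock()
        result = self.check(key)
        if self.clock() - start > self.latency_budget:
            self.slow_calls += 1
            if self.slow_calls >= self.trip_after:
                self.tripped_until = self.clock() + self.cooldown
        else:
            self.slow_calls = 0             # a fast call resets the streak
        return result
```

Note that with fail_open=True an attacker's requests also pass during the cooldown, which is exactly the tension the follow-up question probes.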

Follow-up: If you have a circuit breaker on the rate limiter (fail-open under load), don't you risk letting through the DDoS attack?

Your rate limiter is being used by an internal service that's buggy—it's calling the rate limiter 10x more than expected, causing limits to be consumed incorrectly. Other services are now rate-limited unfairly. How do you debug and isolate the buggy service without disrupting the rate limiter?

Service isolation and debugging without a restart:

(1) Identify the culprit: (a) Enable detailed logging on the rate limiter; log every check: service_id, user_id, limit_type, tokens_consumed, remaining. (b) Aggregate logs by service and find the service making 10x its expected requests. (c) Alert on anomalies: if a service jumps from 1K requests/hour to 10K requests/hour, page on-call.

(2) Isolate the service: (a) Add a per-service quota on top of user limits: Service X is limited to 100K requests/hour; if exceeded, reject all requests from that service (even for different users). This prevents one service from starving the others. (b) Implement a per-service token bucket: allocate 1M tokens/minute to Service X; when exhausted, requests wait (backpressure) or fail.

(3) Debug without restarting: (a) Deploy a shadow service: clone the buggy service and route 1% of traffic to it, monitoring closely. If the shadow makes the correct number of rate-limit calls, the issue is in the main service. (b) Add tracing: instrument the service with distributed tracing (Jaeger, Datadog) and trace the request flow: does one request call rate-limit-check 10 times, and why? (c) Check logs: the service might be retrying on transient failures; check error rates.

(4) Temporary mitigation: (a) Reduce Service X's limits while debugging to prevent cascading impact on other services. (b) Increase global rate limiter capacity: add Redis replicas to absorb the extra load. (c) Alert: notify on-call to investigate Service X.

(5) Fix verification: (a) Once the bug is fixed and deployed, monitor rate limiter requests from Service X; they should drop back to normal. (b) Compare metrics before and after: request count, tokens consumed, error rates.

(6) RCA: after resolution, run a post-mortem: why wasn't this detected sooner? Implement automated anomaly detection for service-level behavior changes. Implementation: (1) Detailed metrics: emit per-service request counts to a time-series DB (Prometheus).
Set alert thresholds (e.g., > 5x normal for 5 minutes). (2) Quota enforcement: rate limiter tracks per-service tokens. On exceeding quota, return different error code (e.g., 429 "Service quota exceeded"). Client sees distinct error. (3) Debugging interface: provide a rate-limiter debug dashboard showing per-service metrics, current tokens, history. Ops can see Service X is hogging 80% of tokens. Testing: simulate a buggy service hammering rate limiter, verify (1) it's isolated, (2) other services aren't affected, (3) system stays healthy, (4) on-call is alerted within 5 minutes.
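The per-service quota from step (2), with the distinct error code from the implementation notes, can be sketched as a fixed-window layer in front of the user-level check. Window expiry is omitted and all names are illustrative.

```python
from collections import defaultdict

class ServiceQuota:
    """Per-service quota layered on top of per-user limits (fixed window)."""

    SERVICE_QUOTA_EXCEEDED = "service_quota_exceeded"  # maps to a distinct 429 body
    USER_LIMITED = "user_limited"
    ALLOWED = "allowed"

    def __init__(self, service_quotas, user_limit):
        self.service_quotas = service_quotas  # {service_id: max checks per window}
        self.user_limit = user_limit
        self.service_counts = defaultdict(int)
        self.user_counts = defaultdict(int)

    def check(self, service_id, user_id):
        # service-level check first: a buggy service can't starve the others
        self.service_counts[service_id] += 1
        if self.service_counts[service_id] > self.service_quotas.get(service_id, 0):
            return self.SERVICE_QUOTA_EXCEEDED
        self.user_counts[user_id] += 1
        if self.user_counts[user_id] > self.user_limit:
            return self.USER_LIMITED
        return self.ALLOWED
```

Returning a distinct status for a blown service quota is what lets the client (and the dashboard) tell "you are the problem" apart from "your user hit their limit".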

Follow-up: If Service X is broken and consuming all rate limiter tokens, should you hard-reset its quota immediately or gradually decay it?

Your rate limiter must support three different limiting modes: (1) per-user limits, (2) per-API-key limits, and (3) per-account limits (an account has multiple users and multiple API keys). A request from User A using Key 1 under Account X must be checked against all three limits; if any limit is exceeded, reject. But checking all three adds latency. Design for accuracy without a performance hit.

Hierarchical rate limiting where a request must pass multiple checks. The challenge: do you check sequentially (slow) or in parallel (complex)? Design: (1) Sequential checks with fast-fail: check in order of likelihood to be hit: (a) Per-user limit (most granular, hits first). (b) Per-API-key limit (second). (c) Per-account limit (least likely to hit). If any check fails, return "rate limited" immediately. Don't check the rest. (2) Parallel checks using Redis pipeline: execute all three limit checks in a single Redis round-trip. Example: Redis EVALSHA script (Lua): for each check, INCR counter and compare with limit. Return status for all three. Latency: still ~5ms (one network round-trip) but three checks done. (3) Caching hierarchy: cache results of expensive checks. Example: account limit is checked per-minute (not per-request). Once checked, cache result in local memory for 30 seconds. Result: most requests only check user + API key (fast), account check only happens every 30 seconds. (4) Optimistic concurrency: assume most checks pass. Execute all three in parallel. Return "allowed" if all pass, "denied" if any fails. Retry logic if state changes between checks (rare race condition). (5) Short-circuit optimization: pre-compute "will this user hit any limit?" at request start. If user has consumed 99% of user limit already, only check that one (it will likely fail). This reduces unnecessary checks. 
Implementation: (1) Rate limiter API: `check_limits(user_id, api_key, account_id, limit_name)` (2) Redis schema: (a) Counter for user: `limit:user:{user_id}:{limit_name}` (b) Counter for API key: `limit:api_key:{api_key}:{limit_name}` (c) Counter for account: `limit:account:{account_id}:{limit_name}` (3) Lua script in Redis (KEYS holds the three counter keys, ARGV the three limits plus the window TTL; note that counters incremented before a later check fails stay charged):

```lua
-- KEYS[1..3]: user, api_key, account counter keys
-- ARGV[1..3]: corresponding limits; ARGV[4]: window TTL in seconds
for i = 1, 3 do
  local current = redis.call('INCR', KEYS[i])
  if current == 1 then
    redis.call('EXPIRE', KEYS[i], tonumber(ARGV[4]))
  end
  if current > tonumber(ARGV[i]) then
    redis.call('DECR', KEYS[i])  -- undo this counter only
    return 'denied'
  end
end
return 'allowed'
```

(4) Expiry: all keys expire after the window (e.g., 1 minute); Redis handles cleanup automatically. (5) Monitoring: track per-limit rejection rates. If the user limit rejects 50% of requests while the API-key limit rejects 0%, the user limit is probably set too low. Example flow: User A with API Key 1 under Account X makes a request. Checks: (a) User A: 45/50 tokens, OK. (b) API Key 1: 100/100, OK. (c) Account X: 1000/1500, OK. All three pass -> allow. Next request: (a) User A: 46/50, OK. (b) API Key 1: 101/100, DENIED. Return "rate limited" without checking the account. Result: hierarchical checking in < 10ms using a single Redis round-trip with a Lua script. Accuracy: the three checks execute atomically. Flexibility: supports arbitrary nesting of limits.
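The same sequential fast-fail flow can be sketched in-process with fixed-window counters. This variant rolls back counters that were already charged when a later scope denies, which is one possible answer to the follow-up about whether a denied request should still deduct; names and limits are illustrative.

```python
class HierarchicalLimiter:
    """Sequential fast-fail over user -> api_key -> account limits (fixed window)."""

    def __init__(self, limits):
        self.limits = limits  # {"user": n, "api_key": n, "account": n}
        self.counts = {}      # (scope, id) -> count in current window

    def check(self, user_id, api_key, account_id):
        scopes = [("user", user_id), ("api_key", api_key), ("account", account_id)]
        charged = []
        for scope, ident in scopes:
            key = (scope, ident)
            current = self.counts.get(key, 0) + 1
            if current > self.limits[scope]:
                # roll back scopes already charged, so a denied request
                # consumes no quota anywhere
                for k in charged:
                    self.counts[k] -= 1
                return ("denied", scope)
            self.counts[key] = current
            charged.append(key)
        return ("allowed", None)
```

Returning the scope that denied mirrors the per-limit rejection-rate monitoring above: it tells you which level of the hierarchy is doing the rejecting.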

Follow-up: If you're doing hierarchical checks and the API key limit is hit, should you still deduct from the user limit counter?

Your rate limiter uses Redis, which occasionally has latency spikes (garbage collection pauses, network blips). When Redis is slow, all client requests pile up waiting for rate limit responses. How do you make the system resilient to Redis slowness?

Redis slowness cascading to clients is a common failure mode. Solutions:

(1) Timeouts: set an aggressive timeout on Redis calls (e.g., 50ms). If Redis doesn't respond in time, return immediately with a default behavior.

(2) Default behavior on timeout: (a) Optimistic: allow the request; better user experience, but over-limit requests may slip through during Redis outages. (b) Pessimistic: deny the request; safer, but legitimate requests may be rejected. (c) Probabilistic: allow a fixed fraction (say 80%) and deny the rest, balancing both.

(3) Fallback to local cache: (a) Each rate limiter instance keeps a local in-memory cache of recent limit verdicts. (b) On a Redis timeout, check the local cache first: if the user was rate-limited in the last 10 seconds, deny; otherwise allow with reasonable confidence. (c) The local cache is approximate but fast (sub-millisecond).

(4) Bulkheading: separate the rate-limit check from the request handler using a queue. (a) The request handler enqueues "check_limit(user_id, api_key)". (b) A background worker drains the queue, hits Redis, and returns results. (c) The handler waits for the result with a timeout (e.g., 100ms); on timeout, fail open or use the cached verdict. (d) This decouples slow Redis from slow API responses.

(5) Redis replication and failover: (a) A primary with read replicas: if the primary is slow, read from a replica (possibly slightly stale, but acceptable here). (b) Redis Cluster with multiple primaries: different services hit different primaries, reducing contention. (c) On primary failure, automatic failover to a replica (< 1 second).

(6) Degraded mode: (a) If Redis latency exceeds 1 second for more than 10 seconds, switch to degraded mode. (b) In degraded mode, skip exact rate limiting: use a simple local token bucket and sync to Redis infrequently (every 60 seconds). (c) Limits may be slightly violated, but the system stays responsive.

(7) Monitoring and alerting: (a) Track Redis latency percentiles (P50, P95, P99).
(b) Alert if P99 > 100ms. Trigger investigation. (c) Track timeout rate: if > 5% of rate limit checks timeout, page on-call. Implementation: (1) Wrapper around Redis client with timeout + fallback logic. (2) Local cache (LRU map): store last 100K limit checks. TTL: 10 seconds. (3) Degraded mode flag: set by monitoring. When set, use local-only checking. (4) Testing: simulate Redis latency (add 200ms delay to all commands), verify (a) client requests still complete in < 100ms (via timeout + fallback), (b) degraded mode activates, (c) after Redis recovers, system returns to normal. Result: with timeout + fallback + degraded mode, rate limiter is resilient to Redis outages. Latency remains < 50ms even during Redis slowness.
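A sketch combining the timeout default from (1)-(2) with the local verdict cache from (3). The backing-store client is assumed to raise TimeoutError when its (say 50ms) timeout fires; recent verdicts are honored for a short TTL, after which the limiter fails open. All names are illustrative.

```python
import time

class ResilientLimiter:
    """Rate-limit check that tolerates a slow backing store."""

    def __init__(self, remote_check, cache_ttl=10.0, clock=time.monotonic):
        self.remote_check = remote_check  # callable(key) -> bool; raises TimeoutError
        self.cache_ttl = cache_ttl
        self.clock = clock
        self.cache = {}                   # key -> (verdict, expires_at)

    def allow(self, key):
        now = self.clock()
        try:
            allowed = self.remote_check(key)
        except TimeoutError:
            # backend too slow: honor a recent local verdict, else fail open
            verdict, expires_at = self.cache.get(key, (True, 0.0))
            return verdict if now < expires_at else True
        self.cache[key] = (allowed, now + self.cache_ttl)
        return allowed
```

The TTL bounds how long a stale denial can keep rejecting a user after Redis recovers, which is one of the gaps the follow-up asks about.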

Follow-up: If you're in degraded mode using local-only limits, and Redis recovers, how do you synchronize the state? Are there gaps?

Your rate limiter is used by a mix of internal services and third-party API consumers (enterprise customers). Internal services are trusted; third-party customers are not. A customer has found a bug in your API and is exploiting it to bypass rate limits. Other customers are now complaining they're being unfairly rate-limited. How do you isolate and remediate?

This is a security incident involving active abuse. Immediate response:

(1) Identify the exploited bug: review rate limiter logs for the attacking customer. Look for patterns: are they bypassing the limit with multiple API keys? Timing tricks? Crafted headers? (2) Isolate the customer: implement an emergency per-customer quota (separate from per-API-key quotas) and clamp this customer to their contracted rate immediately. This stops the bleeding. (3) Restore fairness: for other customers experiencing degraded service, manually raise their limits for the day as a one-time compensation. (4) Root cause analysis: (a) Is the bug in your rate limiter code (e.g., an off-by-one in the token calculation)? (b) In the API gateway (e.g., a header-parsing bug that allows identity spoofing)? (c) In the client library (e.g., the customer reuses connections in a way that bypasses tracking)? (5) Fix: (a) patch the bug; (b) canary deploy to 5% of traffic first; (c) monitor for false positives (legitimate traffic being blocked).

Implementation: (1) Emergency isolation: add customer_id to the rate limit check; if the customer_id is on a blocklist, return 429 immediately. Keep the list in Redis for fast lookups. (2) Quota adjustment: an API such as `POST /rate-limit/customer/{customer_id}/temporary-override` with new_limit and duration, callable by ops on the spot. (3) Audit log: record every rate-limit policy change (increases, emergency quotas) with timestamp, operator, and reason; required for compliance. (4) Customer communication: notify affected customers: "We detected an exploit affecting rate limits. We've temporarily adjusted limits for all customers to restore fairness. We're fixing the bug and will restore normal operation within 2 hours. Sorry for the disruption." (5) Bug prevention: add tests specifically for bypasses: fuzzing with malformed headers, multiple API keys, parallel connections. The regression suite should catch this class of bug.
(6) Post-incident: review the bug with the customer (with security review). Do they want a bug bounty? Consider offering one (especially if they disclosed responsibly). Testing: (1) Simulate the exploit: confirm you can replicate it. (2) Deploy fix, verify exploit no longer works. (3) Test emergency isolation: add customer to blocklist, verify they get 429. (4) Verify other customers unaffected. This incident highlights the importance of: (1) per-customer isolation (multi-tenancy), (2) audit logging, (3) manual overrides for emergencies, (4) security testing.
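The emergency-isolation pieces (blocklist, temporary override, audit log) fit in a small sketch. Endpoint wiring and Redis storage are left out, and every name here is hypothetical.

```python
import time

class EmergencyControls:
    """Blocklist and temporary limit overrides, checked before normal limiting."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.blocklist = set()
        self.overrides = {}  # customer_id -> (limit, expires_at)
        self.audit_log = []  # (timestamp, operator, action, reason)

    def block(self, customer_id, operator, reason):
        self.blocklist.add(customer_id)
        self.audit_log.append((self.clock(), operator, f"block {customer_id}", reason))

    def set_override(self, customer_id, limit, duration, operator, reason):
        self.overrides[customer_id] = (limit, self.clock() + duration)
        self.audit_log.append((self.clock(), operator,
                               f"override {customer_id}={limit}", reason))

    def effective_limit(self, customer_id, contracted_limit):
        if customer_id in self.blocklist:
            return 0  # reject everything: 429 immediately
        limit, expires_at = self.overrides.get(customer_id, (None, 0.0))
        if limit is not None and self.clock() < expires_at:
            return limit
        return contracted_limit
```

Making overrides expire automatically avoids the classic incident-response failure of forgetting to remove a "temporary" adjustment.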

Follow-up: If you discover the bug was already being exploited by multiple customers for days, how do you audit usage and charge fairly?
