System Design Interview Questions

Rate Limiting at Scale


Your API is getting abuse traffic: 100K RPS normal, suddenly spiking to 500K RPS (a 5x surge). Requests come from thousands of different IPs (a distributed attack). You need to rate limit at 100K RPS per user (by API key), but your current in-process rate limiter (local memory) doesn't coordinate across 20 API gateway instances. Some users push 200K RPS through by hitting different gateway instances. How do you design a distributed rate limiter?

In-process rate limiting doesn't coordinate—each instance enforces the quota independently. For distributed rate limiting, use an external coordinator (Redis cluster or DynamoDB). Recommended: Redis-backed counters. Architecture: (1) Redis cluster (3 nodes, multi-AZ) stores rate limit counters: `bucket:user_123:minute → 50000 (tokens remaining)`. (2) Each API gateway instance uses a Lua script (atomic on Redis): `curr = INCR bucket:user_123:minute; if curr > 100000 then return 429 (too many requests); else return 200`. The Lua script ensures atomicity—no race condition where two instances read a stale count and both allow the request. (3) Key structure: `bucket:user_id:minute` with a TTL of 61 seconds (auto-expires). Window rollover (a fixed window, not a true sliding window): when a new minute starts, the old bucket expires and a new bucket is created. (4) For 100K RPS sustained + 500K RPS spikes: a Redis cluster (3 nodes × 8GB = 24GB) can handle ~1M ops/sec (~300K ops/sec per node). Cost: ~$2K/month (managed AWS). Single-region latency: 1-5ms per rate limit check. (5) Fallback: if Redis is down, fall back to the in-process rate limiter (best-effort—some users might exceed quota, but the API stays up). Alternative architecture (lower latency): local in-process token buckets + async sync. Each API gateway maintains a local token bucket and periodically syncs to central Redis. Pro: sub-millisecond latency. Con: temporary inconsistency—if instance A allows 50K requests and crashes before syncing, those decrements are lost. For public APIs where some over-limit traffic is acceptable, this works. Expected outcome: rate limiting coordinates across all 20 instances. Attack traffic at 500K RPS is throttled back to 100K RPS (rate-limited users get 429s). Legitimate users below 100K RPS are unaffected.
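The atomic counter check above can be sketched end to end. The Lua script is what a gateway would `EVAL` against Redis; the `FixedWindowLimiter` class is a self-contained in-memory stand-in with the same semantics (class name, `LIMIT` constant, and key layout are illustrative, not from a specific library):

```python
import time

# Per-key request limit per minute (illustrative; matches the 100K figure above).
LIMIT = 100_000

# The Lua script a gateway would EVAL against Redis — INCR and EXPIRE run
# atomically server-side, so no two gateways can race past the limit.
LUA_CHECK = """
local curr = redis.call('INCR', KEYS[1])
if curr == 1 then redis.call('EXPIRE', KEYS[1], 61) end
if curr > tonumber(ARGV[1]) then return 429 else return 200 end
"""

class FixedWindowLimiter:
    """In-memory stand-in with the same semantics as the Lua script."""
    def __init__(self, limit=LIMIT):
        self.limit = limit
        self.counters = {}  # user_id -> (window_minute, count)

    def check(self, user_id, now=None):
        now = time.time() if now is None else now
        minute = int(now // 60)
        window, count = self.counters.get(user_id, (minute, 0))
        if window != minute:            # old bucket "expired": fresh window
            window, count = minute, 0
        count += 1
        self.counters[user_id] = (window, count)
        return 429 if count > self.limit else 200
```

For example, `FixedWindowLimiter(limit=3)` returns 200 for the first three calls in a minute and 429 for the fourth, then resets when the next minute's bucket starts.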

Follow-up: Your rate limiter works, but legitimate users in certain regions see high 429 rates. Diagnosis: a single misbehaving API key is within its allocated quota, but a client bug causes it to make 10K requests/sec (of its 100K limit). How do you identify which user is the culprit?

You implement per-user rate limiting: 1000 requests/hour for the free tier, 10K requests/hour for the pro tier. Your metrics show 40% of requests are being rate-limited. But you're also rejecting legitimate traffic bursts (a user's batch job makes 500 requests in 1 second, then hits the rate limit for the rest of the hour). You need to allow bursty traffic without losing protection against abuse. How do you design this?

A strict per-hour quota doesn't account for burstiness. Better: a token bucket with a refill rate plus a burst allowance. Token bucket algorithm: (1) Bucket size = burst allowance (e.g., 1000 tokens for free tier). (2) Refill rate = hourly quota / 3600 seconds (1000 tokens/hour ≈ 0.28 tokens/sec, ~1 token every 3.6 seconds). (3) Request cost = 1 token per request. (4) If the bucket has tokens, the request succeeds and consumes 1 token. If empty, reject. (5) Tokens refill continuously at the refill rate. Result: free tier can burst 1000 requests, then is rate-limited to 1 request per 3.6 seconds. After an idle hour the bucket refills to 1000 and the user can burst again. Configuration example: free tier (1000 tokens, 0.28 refill/sec), pro tier (10K tokens, 2.8 refill/sec). Cost: Redis to track per-user bucket state. Scale: 1M users, each with one bucket state ≈ 1GB of Redis memory (manageable). Alternative (simpler, less fair): a flat per-second quota. Free tier = max 100 requests/second (allows bursts of 100, then throttles to 100/sec). Pro tier = 1000 requests/second. This is easier to explain to users and to implement. Trade: coarser control—a user can sustain the per-second cap indefinitely, so it bounds burst rate but not total hourly volume. Token bucket is better for SaaS (users like steady access plus a burst allowance). Expected outcome: 40% reject rate down to <5%. Bursty users can do 500 req/sec for 2 seconds (uses the 1000-token burst allowance), then are throttled to the steady rate. Abuse (10K req/sec) is still rejected almost immediately (tokens depleted).
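The algorithm above can be sketched in a few lines. This is an in-memory illustration (class and parameter names are mine; production state would live in Redis per user):

```python
import time

class TokenBucket:
    """Minimal token bucket: burst up to `capacity`, refill at `refill_rate`
    tokens per second. A sketch — per-user state would live in Redis."""
    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity            # burst allowance (max tokens)
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)       # bucket starts full
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Continuous refill since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1                # each request costs one token
            return True
        return False

# Free tier from the text: 1000-token burst, 1000 tokens/hour ≈ 0.28 tokens/sec.
free_tier = TokenBucket(capacity=1000, refill_rate=1000 / 3600)
```

With a tiny bucket (`capacity=3, refill_rate=1.0`), three back-to-back requests succeed, the fourth is rejected, and one more is allowed a second later once a token has refilled.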

Follow-up: Your pro-tier user has a burst allowance of 10K tokens, but their monitoring dashboard accidentally sends 50K requests in 5 seconds (client bug). You reject 40K requests with 429 errors. The user demands refund, saying your rate limit is unfair. How do you handle this?

You operate a social media platform. Celebrities with 10M followers receive sudden traffic spikes: tweet goes viral, 100K requests/sec from fans retweeting/liking. Normal users get 1K requests/sec. Your rate limiter limits all users to 10K requests/sec fairly. During viral moments, celebrity's API hits the wall and returns 429 (rate limited). Your growth team complains: "Why are we rejecting celebrity traffic? That's our core business!" But if you exempt celebrities, attackers will impersonate them. How do you design this?

This is a tradeoff between fairness and business priority. Three approaches: (1) Static exemption (simplest, but a security risk): whitelist celebrity API keys, set their quota to 1M req/sec. Pro: celebrities never hit the rate limit. Con: if attackers compromise a celebrity key, they bypass rate limiting entirely. Mitigation: add IP-based rate limiting on top (if the celebrity's traffic comes from known datacenter IPs, allow the higher quota only from those). (2) Priority-based rate limiting: instead of a hard limit, use weighted fair queuing. Assign priority levels: celebrities = priority 1 (p99 latency target), power users = priority 2 (p95), normal users = priority 3 (p75). When overloaded, prioritize requests: priority 1 gets served first, priority 3 gets dropped first. Implementation: the API gateway maintains a queue per priority level. During normal traffic, all users are served equally. During a spike, celebrity traffic gets queue precedence. (3) Cost-based model: charge per request (not a fixed quota). A celebrity pays $0.001 per request and can make 100K requests for $100. An attacker making 100K requests also pays $100—expensive enough to deter (DDoS economics: attackers want free abuse). Pro: economically disincentivizes abuse. Con: requires billing infrastructure. My recommendation: approach (2) (priority-based rate limiting) plus approach (1) (IP whitelist for known celebrity locations). Implementation: (a) Assign API keys to priority tiers based on follower count (automated, updated daily). (b) Rate limiter: maintain separate queues per tier. If system load >80%, drop lowest-priority requests first. (c) Metrics: track queue depth per tier, alert if celebrity queue wait exceeds 5 seconds (indicates under-provisioning). Expected outcome: during a viral spike, celebrity requests are served at a 99.9% success rate (1 dropped per 1000), normal users at 95% (50 dropped per 1000). Fairness is compromised, but business priorities are met.
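The drop-lowest-priority-first behavior in approach (2) can be sketched as a simple dispatcher. This is a deliberately simplified model (one scheduling tick, a fixed serving capacity); real weighted fair queuing is more involved:

```python
class PriorityShedder:
    """Sketch of priority-based load shedding: under overload, serve
    higher-priority (lower tier number) requests first and drop the rest.
    Tier numbers are illustrative (1 = celebrity, 2 = power, 3 = normal)."""
    def __init__(self, capacity):
        self.capacity = capacity  # requests the system can serve this tick

    def dispatch(self, requests):
        # requests: list of (priority, request_id) pairs for this tick
        ordered = sorted(requests, key=lambda r: r[0])    # priority 1 first
        served = [rid for _, rid in ordered[:self.capacity]]
        dropped = [rid for _, rid in ordered[self.capacity:]]  # shed lowest tiers
        return served, dropped
```

With capacity for 2 of 3 requests, a tier-3 request is the one shed while tier-1 and tier-2 traffic goes through.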

Follow-up: Your priority-based rate limiter works, but now attackers create fake celebrity accounts to get priority treatment. How do you verify that a celebrity account is legit?

You're rate limiting by IP address to protect against bot traffic: 100 requests/minute per IP. During a corporate event at a tech company, 500 employees share the same office IP (behind NAT). They all access your API, hit 100 req/minute limit, get rate-limited. The office can't access your API. How do you fix this without breaking the rate limit for actual attackers?

IP-based rate limiting fails when many users share the same IP (NAT, corporate proxies, CDNs). Solution: a hybrid approach—rate limit by API key when one is provided, fall back to IP when not. Architecture: (1) Authenticated requests (API key provided): rate limit by API key (e.g., 10K req/min per key), not by IP. The API key is stable across multiple IPs (a user moving between networks keeps the same quota). (2) Unauthenticated requests (no API key): rate limit by IP (100 req/min per IP). This protects against bot scraping without credentials. (3) For the corporate IP behind NAT: users get API keys, so rate limiting is per-key instead of per-IP. The office can make 500 users × 100 req/min = 50K req/min total (if each user has a key). (4) If issuing keys to every corporate user isn't feasible: implement an IP bypass mechanism. A corporate admin registers their IP range with you: `corporate_ip_range: 203.0.113.0/24 → high_quota: 10K req/min`. Verify domain ownership (corporate email + domain verification). Cost: API key management infrastructure (or use OAuth/SSO, which most platforms already have). Implementation: on each API gateway request, check the `Authorization` header. If present, extract the API key and rate limit per-key. If absent, rate limit per-IP. Expected outcome: authenticated users never hit NAT rate-limit issues. Attackers without credentials are still limited to 100 req/min/IP (a deterrent). Caveat: API keys can leak—if a hacker gets a corporate API key, they can abuse the 10K req/min quota. Mitigation: per-user rate limits (each corporate user gets a 500 req/min limit) rather than one shared corporate quota. Track API key usage and alert on anomalies (10K req/min at 3am when the company is closed).
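The gateway-side key selection is a small function. A minimal sketch, assuming bearer-token auth and the quota numbers from the text (function name and return shape are mine):

```python
def rate_limit_key(headers, client_ip):
    """Pick the rate-limit bucket for a request: per-API-key when the request
    is authenticated, per-IP otherwise. Header parsing is deliberately
    simplified; quotas mirror the numbers in the text."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        api_key = auth[len("Bearer "):]
        return f"key:{api_key}", 10_000     # authenticated: 10K req/min per key
    return f"ip:{client_ip}", 100           # anonymous: 100 req/min per IP
```

Two users behind the same NAT with different API keys land in different buckets, so the shared office IP no longer matters; an unauthenticated scraper from that IP still shares the 100 req/min IP bucket.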

Follow-up: An attacker steals a corporate API key (10K req/min quota). They make 10K req/min for 6 hours before you detect it. You've burned 360K requests of that key's quota. How do you compensate the corporate customer?

Your payment processing API needs aggressive rate limiting: 10 requests/minute per API key (fraud prevention). But during a traffic incident, your rate limiter became a bottleneck: checking the rate limit bucket in Redis took 50ms per request (network latency under high load). At 100K RPS, requests queued behind these slow checks, pushing payment API latency to 5 seconds (unacceptable). How do you optimize rate limiting latency?

Redis network latency is the bottleneck; you need sub-millisecond rate limit checks. Solutions: (1) Local in-memory cache (best latency). Each API gateway maintains a local token bucket cache per key (LRU cache, 100K most-active keys). Check the local cache first (~1μs lookup). If the key isn't cached, fetch from Redis (~50ms) and populate the cache. Downside: eventual consistency (if instance A updates its local bucket and crashes before syncing, Redis doesn't reflect the change). For fraud prevention, use an aggressive cache TTL (10 seconds): every 10 seconds, refresh from Redis. Cost: memory per gateway (100K keys × 100 bytes = 10MB per instance, 20 instances = 200MB total). (2) Colocated rate limiter (medium latency). Deploy a Redis replica on the same machine as each API gateway (via Unix socket instead of the network). Latency: ~1ms (local call, no network hop). Cost: operational complexity (manage a Redis replica per gateway). (3) Batched rate limit checks (latency amortization). Instead of checking per request, accumulate requests in a local buffer (10ms window, batches of up to 1000), then send one batched check to Redis. Reduces network calls up to 1000x. Trade: up to 10ms of stale information per request. For payment fraud detection this is acceptable (fraud patterns emerge over minutes, not milliseconds). Recommendation for the payment API: approach (1) (local cache + 10-second refresh). Latency: <1ms per request. Cost: minimal (10MB memory). Fall back to approach (3) (batched checks) if the local cache causes synchronization issues. Expected result: rate limit checks drop from 50ms to <1ms; payment API latency drops from 5s to ~100ms (sub-millisecond rate check plus other processing). Risk: the local cache may diverge from Redis. Mitigation: metrics—track cache hit rate, alert if <95%. On alert, flush the cache and refresh from Redis immediately.
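Approach (1) can be sketched with a TTL-bounded LRU in front of the remote store. `fetch_remote` stands in for the Redis round trip; the class name, TTL, and capacity are illustrative:

```python
import time
from collections import OrderedDict

class CachedLimiter:
    """Sketch of a local token cache in front of Redis. Entries older than
    `ttl` seconds are re-fetched, so local state is at most `ttl` stale."""
    def __init__(self, fetch_remote, ttl=10.0, max_keys=100_000):
        self.fetch_remote = fetch_remote  # slow path: stands in for Redis
        self.ttl = ttl
        self.max_keys = max_keys
        self.cache = OrderedDict()        # key -> (tokens, fetched_at)

    def check(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(key)
        if entry is None or now - entry[1] > self.ttl:
            entry = (self.fetch_remote(key), now)  # one remote call per TTL
        tokens, fetched = entry
        allowed = tokens > 0
        self.cache[key] = (tokens - 1 if allowed else tokens, fetched)
        self.cache.move_to_end(key)                # mark as most recently used
        if len(self.cache) > self.max_keys:        # LRU eviction
            self.cache.popitem(last=False)
        return allowed
```

Within one TTL window, repeated checks for the same key decrement the local copy and never touch the remote store again, which is exactly where the 50ms-to-microseconds win comes from.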

Follow-up: Your local cache shows user_123 has 10 tokens remaining. User makes 1 request (down to 9 tokens). Meanwhile, their token bucket in Redis expires and resets to 10 tokens. Cache now shows 9 tokens (stale), but Redis shows 10 tokens. User bypasses rate limiting. How do you prevent this state divergence?

You have two datacenters: DC1 (primary, 80K RPS) and DC2 (backup, 20K RPS). Your rate limiter uses a single Redis cluster in DC1. During network partition (DC1 and DC2 split), DC2 can't reach DC1 Redis. Rate limit checks in DC2 fail (Redis timeout = 5 seconds), causing 5-second request latency. Your SLA is 500ms. How do you handle this?

A single Redis cluster in DC1 creates an availability problem: when DC2 is partitioned from DC1, rate limiting breaks. Options: (1) Multi-region Redis (active-active replication). Redis clusters span DC1 and DC2, synchronized in ~100ms. Rate limit writes go to both clusters, and both must acknowledge. If DC2's Redis is unreachable, the operation fails (fail closed: reject the request to be safe). (2) Fallback to a per-datacenter rate limiter. When a partition is detected (Redis request timeout >1s), switch to a fallback: a local in-memory rate limiter in DC2 (no network call, <1ms latency). Downside: eventual consistency—DC2's rate limit state diverges from DC1's. When the partition heals, reconcile: prefer DC1 state (the primary is the source of truth). (3) Quorum reads (consensus-based). Rate limit checks need responses from 2 of 3 Redis nodes; if 2 of 3 respond, allow the request. This maintains consistency as long as the partition doesn't isolate more than one node. Cost: 3-node Redis cluster = $3K/month (vs. $1K for a single node). Availability benefit: survives a single node failure. (4) For your case (DC1 primary, DC2 backup), I'd recommend: (a) Redis in DC1 (primary). (b) A local in-memory fallback cache in DC2: on network partition, DC2 uses the local cache (degraded but available). (c) Metrics: track fallback activations (should be rare, <5 min/month). (d) Monitoring: alert on partition (sustained >5s latency to DC1 Redis). The failover team decides manually how to reconcile after the partition heals. Expected outcome: during a partition, DC2 stays available (local fallback), with no 5-second timeouts. After the partition heals, state is reconciled. Trade: during the partition, DC2's rate limits may be stale (a user exhausts quota in DC1, but DC2's cache doesn't know). An acceptable risk given the sub-500ms latency requirement.
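The recommended fallback path (4b) is a small wrapper: try the central limiter, and on timeout degrade to a local best-effort one while counting activations for the metric in (4c). A minimal sketch assuming both limiters expose a `check(key) -> bool` interface (names are mine):

```python
class FallbackLimiter:
    """Sketch of partition handling: try the central (cross-DC) limiter,
    and on timeout fall back to a local best-effort limiter so requests
    keep their latency SLA instead of blocking on a dead link."""
    def __init__(self, central, local):
        self.central = central
        self.local = local
        self.fallback_activations = 0   # metric: should stay near zero

    def check(self, key):
        try:
            return self.central.check(key)
        except TimeoutError:
            self.fallback_activations += 1
            return self.local.check(key)   # degraded but available
```

During a partition every check lands on the local limiter, so the activation counter is exactly the signal to alert on and to drive post-heal reconciliation.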

Follow-up: During partition healing, DC2 reconciles with DC1 Redis. But some users in DC2 made extra requests (beyond quota) during partition, using stale cache. How do you handle quota overages after reconciliation?

You're implementing global rate limiting across 50 API endpoints. Each endpoint has different quota: `/api/users` = 1000 req/min, `/api/products` = 10K req/min, `/api/checkout` = 100 req/min (critical path). Your current approach: separate Redis key per endpoint per user (`user_123:endpoint_users:minute`). This works, but becomes hard to reason about. A user asks: "What's my total request quota across all endpoints?" You don't have a good answer. How do you redesign this to be more flexible?

The current design (a separate bucket per endpoint) doesn't let you set global quotas. Redesign: hierarchical rate limiting. (1) Global quota tier: user_123 → tier "pro" (10K requests/hour globally). (2) Endpoint-specific quota: tier "pro" + endpoint "/api/checkout" → 100 req/min (more restrictive than the global quota). (3) Rate limiter logic: for each request, check (a) the global quota (`user_123:global:hour` → 10K remaining) and (b) the endpoint quota (`user_123:checkout:minute` → 100 remaining). If either is exceeded, reject; otherwise, decrement both. (4) Key structure: (a) Global bucket: `rate_limit:global:user_123:hour → token_count`. (b) Endpoint bucket: `rate_limit:endpoint:checkout:user_123:minute → token_count`. (5) Answer to the user's query: read the global bucket in Redis and return tokens remaining. Implementation: an API endpoint `/quota/status` returns: `{ global: { requests_remaining: 8500, period: "hour" }, endpoints: { checkout: { requests_remaining: 87, period: "minute" }, users: { requests_remaining: 987, period: "minute" } } }`. This lets users understand their quota usage across all endpoints. Cost: more Redis keys (one global + one per endpoint per user). For 1M users and 50 endpoints, that's 51M Redis keys (~5GB at ~100 bytes per key). Expected outcome: transparent quota tracking—users know their limits, and the ops team can reason about rate limiting policy hierarchically. Trade: rate limit checks are slightly more complex (two buckets instead of one); latency increase <1ms (both buckets updated in a single Redis call via a Lua script).
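The two-level check in (3) can be sketched directly. Here a dict stands in for the Redis keys; in production both decrements would run inside one Lua script so the pair stays atomic:

```python
def hierarchical_check(buckets, user, endpoint):
    """Sketch of the two-level check: a request passes only if both the
    global and the endpoint bucket have tokens, then decrements both.
    `buckets` is an in-memory stand-in for the Redis keys."""
    global_key = f"rate_limit:global:{user}:hour"
    endpoint_key = f"rate_limit:endpoint:{endpoint}:{user}:minute"
    # Reject if either level is exhausted — and decrement nothing,
    # so a rejected request doesn't burn the other bucket's quota.
    if buckets.get(global_key, 0) <= 0 or buckets.get(endpoint_key, 0) <= 0:
        return 429
    buckets[global_key] -= 1
    buckets[endpoint_key] -= 1
    return 200
```

Note the design choice that a rejection decrements neither bucket: hitting the tight `/api/checkout` per-minute cap shouldn't also drain the user's global hourly quota.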

Follow-up: Your user churns and leaves your service. You need to clean up their rate limit buckets in Redis. But Redis has 1M user buckets across 50 endpoints. How do you efficiently garbage-collect expired user buckets without manual intervention?

You're rolling out rate limiting to an existing API with 1M active users who have never been rate-limited before. You set quota at 1000 requests/hour (reasonable baseline). On day 1, 40% of users hit the limit during normal usage. They're angry. You analyze: top 5% of users consume 95% of API traffic (power users who legitimately need high quota). Top 1% of users are abusers (bot traffic, malicious users). How do you design a fair and sustainable rate limiting policy?

This is a tiering plus anomaly-detection problem—blanket rate limits hurt legitimate power users. Solution: (1) Usage-based tiers (freemium model). Observe user traffic patterns over 7 days before enforcing limits. (a) Tier 1 (free): 1000 requests/hour (baseline). (b) Tier 2 (pro): 10K requests/hour (+$10/month). (c) Tier 3 (enterprise): unlimited (contact sales). Automatically flag users for a Tier 2 upgrade if they consistently exceed the Tier 1 quota (e.g., >80% of quota for 3 consecutive days). Send an email: "You've hit the rate limit 5 times this week. Upgrade to Pro for a 10x higher quota." (2) Anomaly detection for abuse. Train an ML model to identify bot traffic: (a) Features: request patterns (time-of-day, IP geolocation, User-Agent). (b) Labels: manual review of the top 1% of users (identify bots vs. legitimate power users). (c) Model: classify traffic as bot vs. human. Rate-limit bots aggressively (10 req/sec), humans normally (1000 req/hour). (3) Gradual rollout. Day 1: rate-limit only detected bots. Day 7: analyze results, adjust quotas if needed. Day 14: enable full rate limiting for all users. (4) Metrics: track rate-limit rejections per tier. If any tier has a >5% rejection rate, alert (too strict). If <1%, the quota may be too lenient. (5) Communication: notify users before enforcing limits. Email: "We're introducing rate limits on April 15. Your current usage: 2500 req/hour. Tier 1 limit: 1000 req/hour. Upgrade to Tier 2 or reduce usage." This gives users 2 weeks to adjust. Expected outcome: the share of legitimate users hitting limits drops from 40% to under 5%, abuse down 95%, fair quotas per tier. Trade: a tier-based model requires payment infrastructure (a Stripe integration, ~2 weeks). Simpler alternative: raise quotas for the top 5% of users (observed >5000 req/hour → auto-promote). Pro: fast. Con: doesn't address abuse (bots also get the high quota).
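The auto-promotion rule in (1) is simple enough to sketch. A minimal version of the ">80% of quota for 3 consecutive days" check (function name, parameters, and input shape are assumptions):

```python
def should_promote(daily_usage, quota, threshold=0.8, streak=3):
    """Sketch of the upgrade-flag rule from the text: flag a user for the
    next tier once they exceed `threshold` of their quota for `streak`
    consecutive days. `daily_usage` is a list of request counts, oldest first."""
    run = 0
    for used in daily_usage:
        run = run + 1 if used > threshold * quota else 0  # streak resets on a quiet day
        if run >= streak:
            return True
    return False
```

A user at 900/850/820 requests against a 1000/hour quota trips the flag; a quiet day in the middle resets the streak and no email goes out.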

Follow-up: After implementing usage-based tiers, a Pro user ($10/month) maxes out their 10K req/hour quota. They demand upgrade to Enterprise (unlimited). You have 1000 Pro users requesting this. How do you handle upgrade surge while maintaining revenue model?
