System Design Interview Questions

Load Balancing Algorithms and Health Checks


Your load balancer uses round-robin across 4 backend servers. Suddenly, Server C has 3x latency (50ms → 150ms) but is still responding to health checks. The LB keeps sending traffic equally (25% each). P99 latency spikes from 80ms to 200ms. How do you detect this and fix it without redeploying the LB?

Round-robin is blind to latency. Server C passes the health check (responds 200 OK), so the LB routes 25% of traffic to it. Those requests hit 150ms latency, inflating P99. Detection: (1) Application metrics—monitor latency per backend via backend-attribution headers. Server C's spike is detected in 30-60 seconds. (2) LB access logs—analyze response times per upstream server. (3) Distributed tracing—correlate slow traces to Server C. Fix options: (1) Least-connections LB algorithm: route to servers with the fewest active connections. Server C's connection queue grows, so new requests route elsewhere. Takes ~30 seconds to converge. (2) Least-response-time: the LB tracks latency per server and routes to the fastest. As Server C's latency increases, its share drops. (3) Weighted round-robin: dynamically adjust weights. Use metrics from (1) and compute: weight_C = (avg_latency_others) / (avg_latency_C). If C is 3x slower, weight_C = 1/3. Manually push config or use auto-weighting (requires an LB plugin). (4) Circuit breaker: if Server C's latency exceeds a threshold for N consecutive requests, exclude it from the pool. Auto-add back after a cooldown. Fastest fix: switch the LB algorithm to least-connections or least-response-time (HAProxy and Nginx both support this via config reload, no restart—Nginx reloads on SIGHUP; HAProxy supports seamless reloads). Cost: ~2 seconds to apply the config change. Net: P99 latency recovers to ~100ms in <2 minutes. Monitor: if Server C's latency persists, investigate the root cause (GC pause, slow disk, noisy neighbor) and drain/restart it.
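The weight formula in option (3) is easy to automate. A minimal sketch—the `latency_weights` helper and the latency numbers are illustrative, not part of any LB API:

```python
def latency_weights(latencies_ms, base=100):
    """Weight each backend inversely to its average latency.

    The fastest backend gets `base`; slower backends get proportionally
    less, with a floor of 1 so no server is silently excluded.
    """
    fastest = min(latencies_ms.values())
    return {srv: max(1, round(base * fastest / lat))
            for srv, lat in latencies_ms.items()}

# Illustrative numbers from the scenario above: C is 3x slower.
print(latency_weights({"A": 50, "B": 50, "C": 150, "D": 50}))
# Server C ends up with roughly a third of the others' weight
```

The resulting weights would be pushed to the LB config (e.g., per-server weight directives) on each metrics interval.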

Follow-up: Least-response-time uses feedback from completed requests, which arrive ~100ms later. How does this amplify if Server C's latency is actually just first-request-slow?

A health check endpoint (GET /health) on your backend returns 200 OK. But the real service (payments API) is deadlocked and won't respond to requests. Payments requests to this server hang for 30 seconds (client timeout). You have 10 backends total. How do you make health checks more intelligent?

Simple health check (HTTP 200 OK) doesn't validate actual service readiness. False positive: backend reports healthy but is broken. Solutions: (1) Active dependency check: /health endpoint internally calls a canary payment API request (dummy $0.01 transaction). If payment service hangs, /health times out and returns 503. Cost: adds latency to health check (~100-500ms), must cache results. (2) Liveness + Readiness probes (Kubernetes style): /health/live checks if process is running (fast, <10ms). /health/ready checks if service is accepting requests (slower, calls payment service). LB uses /health/ready for routing decisions. (3) Deeper metrics: /health returns JSON with checks: {"database": "OK", "cache": "OK", "payment_service": "TIMEOUT"}. LB only excludes server if critical checks fail. (4) Gradual degradation: if payment_service is TIMEOUT but other checks pass, reduce weight (route 10% instead of 25%) rather than exclude. (5) Request-level fallback: add timeout wrapper to health check. If check takes >1 second, return 503 immediately rather than hang. Config: health check calls payment service with 500ms timeout; if timeout, return 503. LB removes server from pool after 2 consecutive 503s. Recommended combination: **(2) + (5)**. /health/ready calls payment service with 500ms timeout. LB excludes after 2 failures. This catches payment deadlock in <10 seconds and routes around it. Cost: adds ~500ms to health check interval, but prevents 30-second hangs for clients.
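Options (2) and (5) combine into a readiness handler that hard-caps the dependency probe. A sketch, assuming a hypothetical `check_payment_service` canary call; the thread-pool timeout stands in for whatever timeout mechanism your framework provides:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def check_payment_service():
    # Placeholder: real code would issue a canary request to the payments API.
    return True

def ready(check=check_payment_service, timeout_s=0.5):
    """Readiness probe: returns (http_status, body), never blocking past timeout_s."""
    future = _pool.submit(check)
    try:
        ok = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Dependency hung past the 500ms budget: fail fast with 503.
        return 503, {"payment_service": "TIMEOUT"}
    except Exception:
        return 503, {"payment_service": "ERROR"}
    return (200, {"payment_service": "OK"}) if ok else (503, {"payment_service": "FAIL"})
```

Note that on timeout the worker thread keeps running in the background; the probe only stops *waiting* for it, which is exactly the behavior you want for a deadlocked dependency.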

Follow-up: Your health check now makes external API calls. What if the API service is up but experiencing 100K request queue (not deadlock)? How does this degrade differently?

You have a 10-server cluster behind a load balancer. During a rolling deploy, you drain Server 1 (stop new connections, wait for existing to close), then restart it. But the LB continues to send new requests to Server 1 while it's draining. Traffic to Server 1 increased by 50% mid-drain. Explain the race condition and fix.

A drain command (e.g., HAProxy's "set server ... state drain" via the runtime API) takes time to propagate. Race condition: (1) Admin issues the drain command for Server 1. (2) The LB config updates to exclude Server 1 *asynchronously*. (3) During the update window (100-500ms), requests may still route to Server 1. (4) Load shifts to the other 9 servers. (5) But if the LB update hasn't propagated to all processes/caches yet, new requests still hit Server 1. Net: more requests route to Server 1 than expected because the LB isn't aware it's draining. Fix options: (1) Two-phase drain: (a) LB config excludes Server 1 *immediately* (synchronously). (b) Then issue the drain command on the backend and wait for existing connections to close (max timeout 30s). (c) Then restart. This ensures the LB stops routing before the backend drains. Cost: requires LB + backend coordination. (2) Longer drain timeout: increase from 30s to 120s. More time for in-flight requests to complete, fewer collisions. Cost: deploy time increases. (3) Connection draining at the LB level: HAProxy's "drain" mode stops accepting new connections on Server 1 but completes in-flight ones. Implement via: set server drain + wait for connections to reach zero before the next action. (4) SIGTERM handler: the backend catches SIGTERM, stops accepting requests immediately, and completes in-flight ones. With an LB health check timeout of 500ms, the LB marks the server down quickly; new requests route elsewhere while existing requests continue. Recommended: **(1) Two-phase** or **(4) SIGTERM handler**. With (4): Admin sends SIGTERM → backend stops accepting → LB detects it down in <500ms via health check → traffic stops → existing requests complete → restart. Total time: ~30 seconds. Prevents the 50% traffic spike.
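Option (4) can be sketched as follows; the function names are illustrative, and a real server would wire `accept_request` into its request loop:

```python
import signal
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # Flip into draining mode: stop advertising healthy, reject new work.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def health_handler():
    # LB health check: 503 while draining, so the LB marks us down fast.
    return 503 if draining.is_set() else 200

def accept_request(handler):
    # New requests are refused once draining; in-flight requests (already
    # past this check) run to completion.
    if draining.is_set():
        return 503
    return handler()
```

The key property: the health check flips to 503 in the same instant the drain starts, so the LB's next probe (within one check interval) removes the server, rather than waiting for config propagation.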

Follow-up: If Server 1 restart takes 2 minutes (database migrations), how do you handle the 9-server cluster being overloaded during restart?

Your LB uses HTTP health checks (GET /health every 5 seconds). One backend flaps: alternates between 200 OK and 500 Error every 3-4 seconds. LB oscillates, adding/removing server from pool. Requests fail mid-stream when server is removed. How do you stabilize this?

Flapping server causes oscillation. LB health check interval = 5 seconds. Server flaps every 3-4 seconds. Possible timing: (1) T=0s, server healthy (200 OK), LB adds it. (2) T=3s, server fails (500), but LB hasn't checked yet. (3) T=5s, LB checks, sees 500, removes server. (4) T=6s, server recovers (200 OK), but LB won't check for another 5 seconds. (5) T=10s, LB checks, sees 200, adds server back. Meanwhile, requests sent to server between T=3-5 fail. Fix: (1) **Flap detection**: require N consecutive failures (or successes) before changing state, not 1. E.g., require 3 consecutive 500s before removal, 3 consecutive 200s before addition. This prevents single flake from changing state. Cost: 15-30 second detection latency for real failures. (2) **Longer health check interval**: increase to 10 seconds. Fewer checks, less oscillation. Cost: slower to detect real failures. (3) **Health check smoothing**: use weighted averaging. Server flaps 50% of checks → assign weight 0.5. Route 50% to it, 50% to other servers. Prevents hard remove/add. (4) **Stabilization timeout**: once a server is marked down, keep it down for minimum 30 seconds before allowing re-add. Prevents rapid flapping. (5) **Root cause debugging**: flapping often indicates: memory pressure (GC pause), CPU throttle, or application deadlock/recovery cycle. Investigate why server flaps instead of band-aiding. Recommended: **(1) + (5)**. Set consecutive_failures=3, consecutive_successes=3. If still flapping, trigger alert to investigate root cause. Meanwhile, weight-based routing (3) reduces impact. Net: flapping server gets ~10-20% traffic (degraded but not removed), doesn't fail mid-stream.
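Flap detection from option (1) is a small state machine; the thresholds below are illustrative, and `FlapDamper` is a hypothetical name:

```python
class FlapDamper:
    """Change a server's health state only after N consecutive identical results."""

    def __init__(self, up_threshold=3, down_threshold=3):
        self.up_threshold = up_threshold      # consecutive OKs to re-add
        self.down_threshold = down_threshold  # consecutive failures to remove
        self.healthy = True
        self._streak = 0                      # consecutive results disagreeing with state

    def observe(self, check_ok):
        """Feed one health-check result; return the current (damped) state."""
        if check_ok == self.healthy:
            self._streak = 0          # result agrees with current state: reset
        else:
            self._streak += 1
            needed = self.down_threshold if self.healthy else self.up_threshold
            if self._streak >= needed:
                self.healthy = check_ok
                self._streak = 0
        return self.healthy
```

A server flapping every 3-4 seconds against a 5-second check interval never accumulates 3 consecutive failures, so it stays in the pool; a genuinely dead server flips state after 3 checks (~15 seconds).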

Follow-up: Your flapping server is actually in a GC pause cycle (pause every 3 seconds). Does consecutive_failures=3 help, or do you need a different strategy?

You scale from 10 to 100 backends during a traffic spike. LB algorithm is round-robin. But traffic distribution is uneven: some servers get 2% of traffic, others get 0.8%. Why? And how do you fix it?

Round-robin distributes equally across the *current* pool, but the pool is dynamic during scale-up. Causes of uneven distribution: (1) **Distributed round-robin counters**: if load balancing spans multiple LB instances (common at scale), each instance maintains its own round-robin counter. Instance A: counter at 50. Instance B: counter at 25. When new servers are added, the counters don't reset uniformly, so some servers see ~2x the average share and others ~0.8x. (2) **Pseudo-random routing deviation**: some load balancers use pseudo-random routing that approximates round-robin but is nondeterministic per request source. If source IPs hash unevenly, distribution is skewed. (3) **Health check timing**: new servers are added but health checks haven't completed. Some servers receive traffic before they're ready, fail requests, and get re-routed. Temporarily uneven load. (4) **Connection pooling**: clients maintain persistent connections. If a client connected to a server before scale-up, it stays there even after the scale. New clients distribute across all 100, but old clients are sticky. Net: if 50% of traffic is old (sticky) clients, half the load stays concentrated on the original servers while the rest splits across all 100. Fixes: (1) **Consistent hashing**: map clients onto a hash ring. When the server set changes, only ~1/100 of clients remap (note: a naive hash(client_id) mod 100 is *not* consistent hashing—changing the modulus remaps almost every client). Stable distribution. Cost: not true round-robin, slightly uneven due to hash distribution. (2) **Centralized counter**: LB instances share a distributed counter (Redis, etcd). All instances increment the same counter and route to counter % num_servers. Cost: adds latency (a network round-trip) and introduces a new failure point. (3) **Sticky sessions with re-balance**: clients stay on the same server for the session duration (5-30 min). After a session ends, new connections route uniformly. Cost: less balanced during the scale, but converges post-scale. Recommended: **(1) Consistent hashing** for stateless services.
For stateful services, **(3) sticky sessions with a longer scale-up window** (5-10 min instead of 30 seconds) to allow clients to naturally rebalance.
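The difference between a hash ring and naive modulo placement (relevant to the follow-up below) can be demonstrated directly. A sketch using MD5 purely as a convenient uniform hash; server and key names are made up:

```python
import bisect
import hashlib

def h(s):
    """Uniform integer hash (MD5 used only for illustration)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(n_servers, vnodes=50):
    """Hash ring: sorted (point, server) pairs, vnodes points per server."""
    ring = sorted((h(f"server-{s}#{v}"), s)
                  for s in range(n_servers) for v in range(vnodes))
    return [p for p, _ in ring], [s for _, s in ring]

def lookup(points, owners, key):
    # Route to the first ring point clockwise of the key's hash.
    i = bisect.bisect(points, h(key)) % len(points)
    return owners[i]

keys = [f"client-{i}" for i in range(5000)]

# Naive modulo: growing 100 -> 101 servers remaps almost every key.
moved_mod = sum(h(k) % 100 != h(k) % 101 for k in keys) / len(keys)

# Hash ring: only keys landing on the new server's arcs remap (~1/101).
p100, o100 = build_ring(100)
p101, o101 = build_ring(101)
moved_ring = sum(lookup(p100, o100, k) != lookup(p101, o101, k)
                 for k in keys) / len(keys)
print(f"modulo: {moved_mod:.0%} of keys remap; ring: {moved_ring:.0%}")
```

If a "consistent hashing" deployment remaps ~90% of traffic on a one-server change, it is almost certainly doing modulo placement under the hood.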

Follow-up: Consistent hashing causes 90% of traffic to rehash when you go from 100 to 101 servers. Is this expected or a bug?

Your LB health check connects to backend via TCP port 8080. Network is flaky: 2% of health check requests timeout after 5 seconds. These timeouts cause backends to be marked down/up rapidly. How do you prevent timeouts from triggering false negatives?

A timeout is not necessarily a failure. Network flakiness causes legitimate requests to time out; the health check treats a timeout as equivalent to a 500 error and marks the server down, then the server "recovers" and is marked back up. Oscillation results. Fixes: (1) **Timeout grace period**: require multiple timeouts (e.g., 3 consecutive) before marking down; a single timeout is ignored. Cost: ~15-second detection latency. (2) **Separate health check port**: use a different TCP port (e.g., 8081) for health checks that's less congested than the main port 8080, making health-check timeouts less likely. (3) **Increase health check timeout**: if 2% of requests time out at 5 seconds, increase to 10 seconds. Fewer timeouts observed. Cost: slower to detect real failures, and each check ties up LB resources longer. (4) **Retry on timeout**: the LB retries the health check immediately on timeout and counts only 2 consecutive timeouts as a failure. Handles transient network flakes. (5) **Circuit breaker + exponential backoff**: first timeout, back off 1 second before retry. Second timeout, back off 2 seconds. Third timeout, mark down. Exponential spacing reduces noise. (6) **Adaptive timeout**: the LB measures health check latency (e.g., 500ms median) and sets the timeout to 3x the median (1.5 seconds) plus jitter. Adapts to network conditions. Recommended: **(1) + (4) + (6)**. Require 3 consecutive failures with a retry on the first timeout, plus an adaptive timeout based on recent latency. This filters out the 2% transient flakes but still detects real server failures in ~30 seconds.
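Option (6), the adaptive timeout, sketched with an illustrative window and multiplier (the `AdaptiveTimeout` class is hypothetical, not an LB feature):

```python
import random
import statistics
from collections import deque

class AdaptiveTimeout:
    """Health-check timeout = multiplier x median of recent latencies + jitter."""

    def __init__(self, multiplier=3.0, jitter_ms=100, window=50,
                 floor_ms=200, ceil_ms=5000):
        self.samples = deque(maxlen=window)  # sliding window of latencies
        self.multiplier = multiplier
        self.jitter_ms = jitter_ms
        self.floor_ms = floor_ms
        self.ceil_ms = ceil_ms

    def record(self, latency_ms):
        """Record the latency of a completed health check."""
        self.samples.append(latency_ms)

    def timeout_ms(self):
        """Timeout for the next check, clamped to [floor, ceil]."""
        if not self.samples:
            return self.ceil_ms          # no data yet: be generous
        base = self.multiplier * statistics.median(self.samples)
        base += random.uniform(0, self.jitter_ms)
        return min(self.ceil_ms, max(self.floor_ms, base))
```

With a 500ms median, the next timeout lands around 1.5-1.6 seconds; if the network degrades and the median drifts up, the timeout tracks it instead of producing a burst of false negatives.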

Follow-up: Your LB logs show health check success rate = 99.8% (very high), but live traffic shows 0.2% failures to same backends. Are these different issues?

You deploy a new LB algorithm: consistent hashing with virtual nodes. Before deploy, traffic distribution is perfectly even (100 servers, each gets ~1% traffic). After deploy, 5 servers get 3-5% traffic each, 95 servers get 0.5-1%. The total traffic is the same. Explain the bug and how you'd debug it.

Consistent hashing with virtual nodes should distribute evenly if virtual nodes are properly replicated. Uneven distribution suggests: (1) **Virtual node misconfiguration**: if 5 servers have 10 virtual nodes each and 95 servers have 1 virtual node each, the hash space is skewed: those 5 servers own 50/145 ≈ 34.5% of the hash space (~6.9% each), while the others own ~0.69% each. Fix: equal virtual nodes per physical server (e.g., 150 vnodes × 100 servers = 15K total vnodes). (2) **Hash function bias**: the hash function is not uniformly distributed; some hash ranges cluster near the 5 servers. Debug: generate random keys, hash them, and plot the distribution across vnodes. It should be a uniform histogram; if 5 vnodes get 10x the traffic, the hash function is broken. (3) **Client routing bug**: clients don't hash consistently. Some clients use the old config (round-robin) while new clients use consistent hashing, and the old clients happen to concentrate on 5 specific servers. Debug: log client routing decisions and compare each hash to the server actually chosen. (4) **Load wasn't actually even before**: if the previous distribution looked "perfectly even" due to measurement rounding (e.g., each server got 1% ±0.5%), the new algorithm may expose real imbalance. Measure the true distribution using entropy or variance. Debug steps: (1) Dump the hash ring: print the virtual node → physical server mapping and visually inspect for clustering. (2) Generate test keys: hash 1M random keys and count the distribution across physical servers. Plot a histogram—it should be flat (each server ~10K keys). (3) Compare to live traffic: grab traffic logs, extract client IDs, hash them, and verify they match the LB's routing decisions. (4) If there's a mismatch, check whether some clients are bypassing the LB (direct connections to the 5 servers?). Likely root cause: misconfigured virtual nodes or a biased hash function. Solution: regenerate the hash ring with equal vnodes per server, verify the hash function with test keys, and redeploy.
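Debug step (2) can be run offline against a reconstructed ring. A sketch using MD5 as a stand-in hash; the vnode counts mirror the misconfiguration described in (1), and all names are made up:

```python
import bisect
import hashlib
from collections import Counter

def h(s):
    """Uniform integer hash (MD5 used only for illustration)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(vnodes_per_server):
    """Build a ring from {server: vnode_count}; returns (points, owners)."""
    ring = sorted((h(f"{srv}#{v}"), srv)
                  for srv, n in vnodes_per_server.items() for v in range(n))
    return [p for p, _ in ring], [s for _, s in ring]

def share_of(servers, vnode_config, keys):
    """Fraction of test keys landing on the given servers."""
    points, owners = build_ring(vnode_config)
    hits = Counter(owners[bisect.bisect(points, h(k)) % len(points)]
                   for k in keys)
    return sum(hits[s] for s in servers) / len(keys)

keys = [f"key-{i}" for i in range(20000)]
hot = [f"srv-{i}" for i in range(5)]
skewed = {f"srv-{i}": (10 if i < 5 else 1) for i in range(100)}  # the bug
even = {f"srv-{i}": 100 for i in range(100)}                     # the fix
print(f"skewed ring: {share_of(hot, skewed, keys):.1%} of keys on 5 servers")
print(f"even ring:   {share_of(hot, even, keys):.1%} of keys on 5 servers")
```

On the skewed ring the 5 over-replicated servers capture far more than their fair 5% share; with equal vnode counts their share drops back to ~5%, confirming the misconfiguration hypothesis without touching production.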

Follow-up: You increase virtual nodes from 10 to 100 per server. Distribution now even again. But startup latency increased 50% (hash ring construction takes 2 seconds). What's the tradeoff?

Your LB is behind another LB (nested LB architecture, common in CDN: client -> CDN LB -> origin LB -> backends). A backend update happens: drain one server. CDN LB's health check sees origin LB still healthy (because it's up), but that origin LB is now sending traffic to 9 servers instead of 10. Origin LB is saturated, causing 30% request timeouts. CDN LB doesn't know origin is degraded. How do you propagate backend saturation upwards?

Nested LB architecture hides degradation. CDN LB sees origin LB as single entity: "healthy" or "down". No visibility into origin LB's backend saturation. When origin LB loses a backend (drain), CDN LB doesn't adjust, keeps routing same traffic. Origin LB's remaining 9 backends get overloaded. Request queues grow, timeouts increase. CDN LB won't detect this for 30-60 seconds (health check interval). Solutions: (1) **HTTP 503 Service Unavailable from origin LB**: if origin LB detects saturation (queue depth > threshold), return 503 to CDN LB. CDN LB sees 503, marks origin as degraded, routes traffic elsewhere. Cost: 10-30 second detection latency. (2) **Custom metrics in health check response**: origin LB health endpoint returns JSON: {"status": "healthy", "backend_count": 9, "avg_latency_ms": 450, "queue_depth": 1500}. CDN LB parses metrics, adjusts weight. If backend_count drops from 10 to 9, CDN reduces weight by 10%. Cost: requires custom parsing, more complex. (3) **Weighted load balancing with latency feedback**: CDN LB sends health checks with feedback. Origin LB responds with latency percentile. If P99 > threshold, CDN LB reduces weight. Cost: more complex negotiation, potential feedback oscillation. (4) **API-driven draining**: when origin LB wants to drain a backend, it proactively notifies CDN LB: "reducing capacity by 10%." CDN LB pre-drains a proportional share of traffic. Then origin LB drains backend. Requires CDN LB to trust and obey origin LB's commands. (5) **In-band signaling**: origin LB sends status code 429 (Too Many Requests) or 503 when saturated. CDN LB interprets 429 as "slow down, I'm overloaded" and reduces traffic immediately. Recommended: **(1) + (2)**. Origin LB returns 503 if avg_latency > 500ms or queue_depth > 1000. CDN LB detects 503 + reduced weight immediately. Within 30 seconds, traffic load-shifts. Net: timeout reduction from 30% to <5%.
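Option (2) on the CDN side reduces to parsing the health payload into a weight. A sketch; the field names and thresholds are illustrative, not a standard format:

```python
import json

def weight_from_health(payload, full_backends=10,
                       latency_limit_ms=500, queue_limit=1000):
    """Map an origin-LB health JSON to a routing weight in [0.0, 1.0]."""
    health = json.loads(payload)
    if health.get("status") != "healthy":
        return 0.0
    # Saturation signals: treat like a 503 and stop routing here.
    if (health.get("avg_latency_ms", 0) > latency_limit_ms
            or health.get("queue_depth", 0) > queue_limit):
        return 0.0
    # Scale weight by remaining backend capacity (9 of 10 -> 0.9).
    return health.get("backend_count", full_backends) / full_backends

print(weight_from_health(
    '{"status": "healthy", "backend_count": 9, '
    '"avg_latency_ms": 450, "queue_depth": 800}'))  # 0.9
```

The CDN LB would multiply this weight into its per-origin routing table on every health-check cycle, so a drained backend shows up as a proportional traffic reduction within one interval instead of after a full outage.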

Follow-up: If origin LB can't change its response code (legacy system), how would you instrument saturation detection at CDN LB level without modifying origin?
