Your billing system runs a daily cron job: it charges customers, generates invoices, and updates the accounting ledger. You run 3 cron instances for high availability. Two weeks ago the instances weren't synchronized: two of them ran the same billing cycle, charging customers twice. $2M in erroneous charges accrued before auto-detection kicked in 4 hours later. How do you prevent this?
You need distributed mutual exclusion—only one instance runs billing at a time. Solution: distributed lock. Before running billing, acquire a lock; if you can't acquire it within 5 seconds, skip this run. Architecture: (1) Lock storage: Redis or etcd. Redis is simpler; etcd is more reliable (Raft consensus). Use Redis SET with NX: `SET lock:billing:daily {instance_id} NX EX 3600`. This atomically sets the key only if it does not already exist, with a 1-hour expiry. Only one instance succeeds; the others fail (lock exists). (2) Lock owner: store the instance ID (`i-123abc`) as the value. Allows monitoring: "which instance has the lock?" (3) Lock safety: use a TTL (expiry = 1 hour). If an instance crashes while holding the lock, the lock auto-expires after 1 hour. Prevents deadlock. (4) Lock release: only the instance holding the lock should release it (not another instance). A naive `GET` followed by `DELETE` is not atomic, so use a Lua script: `if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end`, invoked with the lock key and the instance ID. (5) Billing logic: (a) Try to acquire the lock. (b) On success, run billing (charge, invoice, ledger). (c) Release the lock. (d) On failure (lock held by another instance), skip this run. (6) Monitoring: alert if the lock is held >2 hours (indicates a hung instance). Manual intervention: release the lock if needed. Cost: Redis lock = negligible (one write + one delete per run). Expected outcome: only one instance runs billing per day; no double-charging. Trade: if the lock holder crashes mid-billing, the next run retries, so idempotency is critical—billing must be idempotent (same invoice ID, same amount, repeated charges are no-ops). Implementation: store billing_cycle_id in the ledger and check whether the cycle was already processed before charging again.
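A minimal Python sketch of this acquire/run/release flow. The lock client `r` is injected so any redis-py-compatible object works, and `run_billing` is a hypothetical stand-in for the real charge/invoice/ledger step:

```python
import uuid

# Atomic release: delete the lock only if we still own it.
# Runs server-side in Redis, so no other client can interleave
# between the GET and the DEL.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def acquire(r, key: str, owner: str, ttl_seconds: int) -> bool:
    # SET key owner NX EX ttl: succeeds only if the key does not exist.
    return bool(r.set(key, owner, nx=True, ex=ttl_seconds))

def release(r, key: str, owner: str) -> bool:
    return r.eval(RELEASE_SCRIPT, 1, key, owner) == 1

def run_daily_billing(r, run_billing) -> bool:
    owner = str(uuid.uuid4())  # or the instance ID, e.g. "i-123abc"
    if not acquire(r, "lock:billing:daily", owner, 3600):
        return False  # lock held by another instance: skip this run
    try:
        run_billing()  # hypothetical: charge, invoice, ledger
    finally:
        release(r, "lock:billing:daily", owner)
    return True
```

Because the release script compares the stored value against the caller's owner token, an instance that lost its lock to TTL expiry cannot delete a lock that a different instance has since acquired.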
Follow-up: Instance A acquires billing lock, starts charging. Instance A crashes at 50% billing completion (500 customers charged, 500 not). Lock expires after 1 hour. Instance B acquires lock, runs billing again. Now first 500 customers are charged twice. How do you prevent this?
You run a distributed database with 5 nodes. To detect node failures and recover, you need to elect a new leader automatically. You tried implementing leader election via frequent polling (every 10 seconds), but nodes competing for leadership caused 30 seconds of downtime. You need faster detection and failover. How do you design this?
Polling-based detection is slow (up to 10s latency per poll cycle). Better: heartbeat + consensus. Architecture: (1) Heartbeat mechanism: the leader sends a heartbeat to all followers every 1 second (via gRPC/TCP keep-alive). Followers reset a timer on each heartbeat. If the timer exceeds 3 seconds without a heartbeat, assume the leader is dead. (2) Consensus algorithm: use Raft (the industry standard) or Paxos. Raft is simpler: when a follower's election timeout fires, it waits an additional randomized 150-300ms (randomized to avoid split votes), then becomes a candidate and broadcasts a vote request: "I'm a candidate, vote for me." Followers grant their vote to the first valid candidate they hear from in that term. The candidate needs a majority of votes (3 of 5) to become leader; the election itself typically completes in <300ms. (3) Implementation: use a Raft library (etcd, Consul, or hashicorp/raft). Don't implement Raft from scratch—it's complex. (4) Deployment: 5-node cluster. The leader is node-1; followers are node-2 through node-5. When node-1 fails, heartbeats stop; followers time out after up to 3 seconds and trigger an election; node-2 or node-3 becomes leader within ~300ms. (5) Data consistency: Raft only applies log entries replicated to a majority (strong safety guarantee), so the new leader has all committed data. Expected result: leader failure detected within 1-3 seconds, new leader elected in <300ms. Total downtime ~1-3 seconds (acceptable for HA systems). Cost: Raft implementations are open source and free. Trade: adds operational complexity (you need to monitor Raft state and debug election issues). Caveat: during a network partition, the cluster can split into minority and majority partitions. The minority partition has no leader (it can't reach a quorum). The majority partition elects a leader and continues operating. When the partition heals, minority nodes rejoin as followers and any uncommitted entries they hold are overwritten by the leader's log.
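The timing side of (1) and (2) can be sketched as follows. This covers only the failure-detection timer, the randomized pre-election wait, and the majority arithmetic; a real election also needs terms, log comparison, and RPCs, which is exactly why the answer says to use a library:

```python
import random
import time

HEARTBEAT_INTERVAL = 1.0   # leader sends a heartbeat every 1s
ELECTION_TIMEOUT = 3.0     # follower assumes the leader is dead after 3s of silence

class Follower:
    """Failure-detection side of a Raft follower (illustrative sketch only)."""
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        # Randomized extra wait before campaigning, to avoid split votes.
        self.election_jitter = random.uniform(0.150, 0.300)

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()  # heartbeat resets the timer

    def leader_suspected_dead(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        return now - self.last_heartbeat > ELECTION_TIMEOUT

def has_majority(votes: int, cluster_size: int) -> bool:
    # A candidate needs a strict majority, e.g. 3 of 5.
    return votes > cluster_size // 2
```

Note that the randomized jitter only decides *when* a follower starts campaigning; safety comes from the majority vote, not from the timing.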
Follow-up: Your 5-node cluster partitions: nodes 1, 2 vs nodes 3, 4, 5. Majority partition (3, 4, 5) elects node-3 as leader. Minority partition (1, 2) has no leader (can't achieve majority). After 10 seconds, nodes 1 and 2 have uncommitted writes that nodes 3, 4, 5 don't have. Partition heals. How do you reconcile the diverged state?
You're building a distributed task scheduler: tasks are submitted to a queue (Redis), workers poll the queue and execute tasks. You want to ensure each task is executed exactly once. Currently, after a worker executes a task, it deletes the task from queue. But if the worker crashes before acknowledging completion to a persistent log, the task is lost. How do you add durability?
This is a state machine coordination problem. Current issue: task execution isn't durable. Fix: use a task log (a persistent append-only log). Architecture: (1) Task states: Pending → Processing → Complete. (2) Workflow: (a) Task submitted: written to the queue + the persistent task log (PostgreSQL). Initial state: Pending. (b) Worker polls the queue: sees the task, acquires a work lock (distributed lock with TTL = 10 minutes), and updates the state in the DB: Processing. (c) Worker executes the task. On success, it writes the result to the persistent log. If it crashes, the lock expires after 10 minutes and another worker picks up the task (its state is still Processing). (d) Worker updates the state: Complete. Deletes from the queue. (3) Durability: the task log is the source of truth. All state changes are logged. On recovery, query the log: tasks in Processing state >10 minutes have timed out; reset them to Pending and re-queue. (4) Idempotency: ensure task execution is idempotent (same input = same output). Store the task execution ID in the log: `{task_id, execution_id, result}`. If a worker retries the same execution, the result is cached (idempotent). (5) Monitoring: track the task state distribution. Alert if tasks are stuck in Processing state >30 minutes (indicates a worker crash). Cost: task log storage (~$1/month); distributed lock and TTL coordination (~$0). Expected outcome: no lost tasks. Even if a worker crashes, the task is retried by another worker. Durability: 100% (as long as the DB survives). Trade: slight latency increase (a DB query per state transition), but acceptable (<50ms per transition).
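The state machine in (1)-(4) can be sketched in Python, with a plain dict standing in for the PostgreSQL task log and `fn` as the hypothetical task body:

```python
import time

# In-memory stand-in for the persistent task log (illustrative only;
# in production each transition would be a DB update).
PENDING, PROCESSING, COMPLETE = "pending", "processing", "complete"
LOCK_TTL = 600  # work-lock TTL: 10 minutes

def claim(log: dict, task_id: str, now: float) -> bool:
    """Move a task Pending -> Processing, or reclaim one whose lock expired."""
    row = log[task_id]
    stuck = row["state"] == PROCESSING and now - row["claimed_at"] > LOCK_TTL
    if row["state"] == PENDING or stuck:
        row.update(state=PROCESSING, claimed_at=now)
        return True
    return False

def complete(log: dict, task_id: str, result):
    log[task_id].update(state=COMPLETE, result=result)

def execute(log: dict, task_id: str, fn, now=None):
    now = time.time() if now is None else now
    row = log[task_id]
    if row["state"] == COMPLETE:
        return row["result"]  # idempotent: return the cached result, no rerun
    if not claim(log, task_id, now):
        return None           # another worker holds the task
    result = fn()
    complete(log, task_id, result)
    return result
```

The idempotency in (4) shows up as the cached-result check: a retried execution returns the stored result instead of running the task body again.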
Follow-up: Your task is "send email to user". It's idempotent because you store the email ID and don't resend if it exists. But after 6 task retries, email still wasn't sent (email service is down). User never receives email. How do you detect and alert on this?
You're implementing a job queue with 100 workers across 5 datacenters. Each job needs exclusive access to a resource (e.g., backing up database DB-01). Multiple jobs can run in parallel, but no two jobs can touch the same resource. How do you design resource locking?
This requires distributed per-resource locking. Architecture: (1) Lock namespace: each resource has a lock key: `lock:resource:db-01`, `lock:resource:db-02`, etc. (2) Job acquisition: before starting a job, the worker tries to acquire the lock for the resource: `SET lock:resource:db-01 {worker_id} NX EX 1200`. On success, the job starts. On failure (lock exists), the job waits in a queue or retries later. (3) Lock scope: TTL = job duration + buffer (e.g., a backup takes 10 minutes, so set TTL = 20 minutes). If the job finishes early, explicitly release the lock. On crash, the lock expires after 20 minutes and another worker can acquire it. (4) Ordering: if multiple workers want the same resource, decide how to queue them. Option A (fair): a FIFO queue per resource. A worker joins the queue: `LPUSH queue:resource:db-01 {worker_id}`. When the lock is released, dequeue the next worker and grant it the lock. Option B (simpler, lock-free): exponential backoff. A worker retries lock acquisition with exponential backoff (1s, 2s, 4s, 8s). Works well under low contention, but can starve unlucky workers. (5) Monitoring: track lock hold times per resource. Alert if any lock is held >30 minutes (hung job). Alert on queue depth (if >10 workers are waiting, the resource is a bottleneck). Cost: Redis keys + queues = ~$0/month. Expected outcome: mutual exclusion per resource. No two jobs touch DB-01 simultaneously. Scale: up to 100 workers and 50 resources with negligible latency (<1ms per lock acquisition). Trade: if a resource is over-subscribed (many jobs need it), workers queue up (not ideal, but safe). Mitigation: add more resources (hardware), or deprioritize low-value jobs.
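Option B's retry loop can be sketched like this. `try_acquire` is any zero-argument callable wrapping the Redis SET NX call, and `sleep` is injectable so the backoff schedule can be tested without real delays:

```python
import time

def acquire_with_backoff(try_acquire, retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry lock acquisition with exponential backoff (1s, 2s, 4s, 8s, ...).

    Returns True once the lock is won, or False after `retries` attempts.
    Note the starvation risk: an unlucky worker can keep losing the race.
    """
    for attempt in range(retries):
        if try_acquire():
            return True
        sleep(base_delay * (2 ** attempt))  # wait longer after each failure
    return False
```

Adding random jitter to each delay is a common refinement to keep retrying workers from waking up in lockstep.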
Follow-up: Two workers deadlock: Worker A holds lock:resource:db-01, waits for lock:resource:db-02. Worker B holds lock:resource:db-02, waits for lock:resource:db-01. How do you detect and break this deadlock?
You operate an ad auction platform: when a user visits a page, 100 advertisers bid simultaneously via gRPC. The highest bidder wins and the ad is shown. But during high-traffic events, auction crashes because all bidders try to acquire a lock on the auction object (lock contention). Current solution: all bidders wait in a queue, only one acquires lock at a time = slow auction (500ms instead of 50ms). How do you redesign this?
Lock-per-auction is a bottleneck. Redesign: use lock-free data structures. Instead of a lock, use atomic compare-and-swap (CAS). Architecture: (1) Auction state: stored in memory (Redis or in-process). Current high bid: `auction_123 → {bidder: advertiser_A, price: $5.00, timestamp: t}`. (2) Bidding protocol: a bidder reads the current high bid, computes a new bid (e.g., $5.01), and tries a CAS: `if auction_123.bid == $5.00 then set auction_123.bid = $5.01`. The CAS is atomic (no lock needed). If it succeeds, the bidder holds the high bid. If it fails (another bidder already outbid), the bidder retries against the new high bid. (3) Implementation (Redis): use a Lua script for an atomic compare-and-set, or a MULTI/EXEC transaction with WATCH (optimistic locking). Or simpler: use an in-process atomic (Go sync/atomic, Java java.util.concurrent classes). (4) Contention handling: if the CAS fails, the bidder retries immediately (exponential backoff under high contention). Expected: <10 retries per bidder under normal load. (5) Auction timeout: bidding window = 50ms. At t=50ms, close the auction and award the impression to the winner. Trade: no explicit mutual exclusion, but atomicity via CAS prevents corruption. All 100 bidders can bid in parallel—no queue. (6) Correctness: assume bidders are well-behaved (don't lie about bids). If a malicious bidder lies about its price, that's a different problem (fraud detection). Monitoring: track the CAS retry rate. A >50% CAS failure rate indicates high contention. Scale: 100 bidders, a 50ms window, each bidder makes 1-10 bids = 100-1000 total CAS operations. CAS latency: 1-5μs per in-memory operation (negligible). Expected auction time: 50ms (design constraint, not bottleneck). Cost: negligible (in-memory, no lock coordination). Alternative (simpler, higher latency): local leader election. One process is designated "auctioneer" (via a lock), collects all bids, determines the winner, and announces it. Latency: 50-100ms (lock acquisition + collection + announcement). Not as fast as lock-free, but simpler to reason about.
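A sketch of the read-compute-CAS loop from (2) and (4). In this CPython version a tiny internal mutex stands in for the atomicity that real hardware (or a server-side Redis Lua script) would provide for the CAS itself; from the bidders' point of view there is no lock, only retry-on-conflict:

```python
import threading

class Auction:
    """Current high bid plus a CAS primitive (illustrative sketch)."""
    def __init__(self):
        self._guard = threading.Lock()  # stands in for hardware/Redis atomicity
        self.bid = (None, 0.0)          # (bidder, price)

    def compare_and_swap(self, expected, new) -> bool:
        # Atomically: install `new` only if the bid is still `expected`.
        with self._guard:
            if self.bid == expected:
                self.bid = new
                return True
            return False

def place_bid(auction: Auction, bidder: str, price: float,
              max_retries: int = 10) -> bool:
    """Read, compute, CAS; retry on conflict. Gives up if outbid."""
    for _ in range(max_retries):
        current = auction.bid
        if price <= current[1]:
            return False  # already outbid; stop retrying
        if auction.compare_and_swap(current, (bidder, price)):
            return True   # our bid is now the high bid
    return False
```

A failed CAS costs one retry, not a queue wait, which is why all 100 bidders can proceed in parallel.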
Follow-up: Your auction high bid is `$5.00`. Three bidders try CAS: bidder_A → $5.01, bidder_B → $5.01, bidder_C → $5.02. What's the order of CAS operations and which bidder wins?
You operate a sharded database (10 shards, hash-based partitioning). You want to rebalance: move shard-5's data to shard-5b (new hardware). During rebalancing, both old and new shards handle reads/writes. You need to ensure consistency—no read misses updated data, no write conflicts. How do you handle this?
Shard rebalancing requires a coordinated protocol to transition from the old shard to the new shard safely. Architecture: (1) Rebalancing protocol: (a) Phase 1 (Initialize): mark shard-5 as "rebalancing". New shard-5b comes online as a read-only replica. (b) Phase 2 (Catch-up): copy all existing data from shard-5 to shard-5b; writes arriving during this phase go to both shards (dual-write, see (3)). (c) Phase 3 (Transition): switch clients over. Clients no longer send writes to shard-5, only to shard-5b. Remaining reads can come from either (stale reads acceptable). (d) Phase 4 (Cutover): once all traffic is on shard-5b, decommission shard-5. (2) Consistency: use version vectors or timestamps to detect stale reads. Each object has a version: `{key: user_123, version: 42, data: ...}`. During rebalancing, the old shard's version may lag the new shard's. A client reading from the old shard can get stale data (version 40 instead of 42). Mitigation: read from the new shard if available, falling back to the old shard (prioritize new). (3) Write routing: during catch-up, both shards accept writes (dual-write protocol): the client sends each write to both shard-5 and shard-5b and both apply it. Requires idempotent writes (applying the same write twice has no extra effect). (4) Monitoring: track write-propagation lag. If lag >10 seconds, that indicates a rebalancing issue; alert on it. (5) Fallback: if rebalancing fails, roll back: redirect all traffic back to shard-5 and keep shard-5b as a replica. Expected rebalancing time: hours to days (depending on shard size). Cost: dual storage (data on both shards during rebalancing) and dual writes (network overhead). Trade-off: consistency vs availability during rebalancing (you lose some availability during the transition). Mitigation: schedule rebalancing during a low-traffic window (nighttime, lower SLA impact).
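The dual-write and version-preferring read from (2) and (3) can be sketched with plain dicts standing in for the two shards; the row layout (`version`, `data`) follows the example object above:

```python
def dual_write(old_shard: dict, new_shard: dict, key, value, version: int):
    """Apply the write to both shards during catch-up. Idempotent:
    a shard ignores writes at or below the version it already holds,
    so replays and late duplicates are harmless."""
    for shard in (old_shard, new_shard):
        current = shard.get(key)
        if current is None or current["version"] < version:
            shard[key] = {"version": version, "data": value}

def read(old_shard: dict, new_shard: dict, key):
    """Prefer the newer copy: take the highest version seen on either
    shard, so replication lag on one shard can't serve stale data."""
    candidates = [s[key] for s in (new_shard, old_shard) if key in s]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["version"])
```

The version comparison is what makes the dual-write safe to retry: re-sending a write after a partial failure converges both shards instead of corrupting them.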
Follow-up: During Phase 3 transition, you redirect writes from shard-5 to shard-5b. But there's a 5-second lag in data replication from shard-5 to shard-5b. A client reads from shard-5 (old), sees stale data, makes a write decision based on that stale data. What's the consistency issue?
You're building a collaborative document editor (like Google Docs). Multiple users can edit simultaneously. You need distributed locks to coordinate edits: when user A edits line 5, user B can't edit line 5 at the same time. But if every character edit requires a lock, performance is horrible (1000 edits/second = 1000 lock acquisitions, very high contention). How do you design this to be efficient?
Fine-grained locks (per-character) create contention. Better: use operational transformation (OT) or conflict-free replicated data types (CRDTs) to avoid locks entirely. Architecture with a CRDT: (1) No locks needed. Each user's edit is a separate operation (insert, delete). Operations are replicated to all clients and the server. Each client maintains a local copy of the document. (2) Conflict resolution: use a CRDT algorithm (e.g., Yjs, Automerge) to merge concurrent edits deterministically. If user A inserts "hello" at position 5 and user B inserts "world" at position 5 concurrently, the CRDT uses unique operation IDs (client ID + logical clock) to order the insertions and merge deterministically: "helloworld" or "worldhello", but consistent across all clients. (3) Latency: no lock acquisition = sub-millisecond latency per edit. 1000 edits/second handled easily. (4) Implementation: use a library (Yjs for JavaScript, Automerge for JSON-like documents). Don't implement a CRDT from scratch. (5) Storage: an append-only log of operations. Each operation has a unique ID (client_id + clock). Replaying the log always produces the same document (deterministic). (6) Monitoring: track operation replication lag. If lag >1 second, clients see a stale document. Acceptable for docs (users tolerate ~1s of eventual consistency). Benchmark: Yjs handles on the order of 100K operations/second on a single machine, and scales further when sharded across servers. Cost: CRDT library (open source) + replication infrastructure (~$2K/month for the document store + replication). Expected result: low latency (<100ms per edit), high throughput (1000+ edits/sec per user), no lock contention. Trade: eventual consistency (edits apply locally immediately and replicate to others in <100ms). Caveat: CRDTs require immutable operations (you can't change an edit retroactively). If a user needs to "undo", that is a separate operation (undo) which is itself replicated.
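A toy illustration of the determinism in (5), not a real CRDT (positions here are integer indices, whereas real CRDTs like Yjs or Automerge anchor inserts to neighbor IDs): every replica sorts operations by (clock, client_id) before applying them, so any delivery order converges to the same text:

```python
def merge_document(operations: list) -> str:
    """Replay operations in a globally deterministic order.

    Each op is {"clock", "client", "pos", "char"}; sorting by
    (clock, client_id) gives every replica the same total order,
    so concurrent inserts resolve identically everywhere.
    """
    text = []
    for op in sorted(operations, key=lambda o: (o["clock"], o["client"])):
        # Clamp the index so replay never fails on a short document;
        # real CRDTs avoid this by addressing neighbors, not indices.
        text.insert(min(op["pos"], len(text)), op["char"])
    return "".join(text)
```

The point is the convergence property: shuffle the delivery order of the same operation set and the merged document is unchanged.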
Follow-up: Your CRDT-based editor has 1000 users editing a document. After 1 hour, the operation log grows to 10GB. Replaying all operations to a new client takes 30 seconds. How do you optimize this?
You're building a real-time collaborative whiteboard (like Miro or FigJam). Multiple users draw shapes simultaneously. When user A moves a rectangle while user B resizes it, you need both operations to succeed without conflicts or lost updates. You tried using a single distributed lock on the rectangle object, but lock contention makes the whiteboard feel sluggish (100ms latency for simple operations). How do you optimize for responsiveness?
Single lock on object is a bottleneck (all operations serialize). Better: use operation-level conflicts (not object-level locks). Approach: (1) Operational transformation (OT) or CRDT to merge concurrent edits. Each user's edit is an operation: `{user_id, operation_id, shape_id, action: "move", x: 100, y: 150, timestamp}`. Operations are replicated to all clients. (2) Conflict resolution: some operations commute (don't conflict). If user A moves rectangle and user B resizes it, these are independent (no lock needed). If both users move same shape simultaneously, use deterministic merge: "smallest user_id wins" (for move operations, use A's position). (3) Local immediacy + eventual consistency. User A's move is applied locally immediately (sub-millisecond), then sent to server + other clients. Other clients apply move with 100-200ms latency (network + processing). User gets instant feedback locally, consistency emerges globally. (4) Lock only on conflicts. If two users try to delete same shape simultaneously, that's a conflict—lock the deletion. Use optimistic locking: both attempt delete, only first succeeds, second gets "object already deleted" error (retries or shows toast: "Shape was deleted by another user"). (5) Batching: collect operations for 50ms (one render frame), send batch to server. Reduces network calls 20x. Server merges batch atomically. (6) Cost: CRDT library (Yjs) handles this automatically. No additional cost beyond library. Expected result: 100ms latency down to <50ms (local operations apply instantly). Users don't feel lock contention. Scale: 100 concurrent users, 1000 shapes, no bottleneck (all operations processed independently). Trade: eventual consistency (conflict resolution happens on server, client sees local copy first). Acceptable for drawing tools (users tolerate temporary inconsistency if they see immediate feedback).
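Two pieces of the design above lend themselves to small sketches: the "smallest user_id wins" merge rule from (2) and the 50ms render-frame batching from (5). Field names (`user_id`, `t_ms`, etc.) are hypothetical:

```python
def resolve_concurrent_moves(ops: list) -> tuple:
    """Deterministic merge for concurrent moves of the same shape:
    the operation from the smallest user_id wins, so every client
    settles on the same final position without any lock."""
    winner = min(ops, key=lambda o: o["user_id"])
    return (winner["x"], winner["y"])

def batch_operations(ops: list, frame_ms: int = 50) -> list:
    """Group operations into render-frame buckets before sending,
    so one network call carries a whole frame's worth of edits."""
    batches = {}
    for op in ops:
        batches.setdefault(op["t_ms"] // frame_ms, []).append(op)
    return [batches[k] for k in sorted(batches)]
```

Any total order over user IDs would do for the merge rule; what matters is that all clients apply the same rule, so they converge without coordinating.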
Follow-up: User A draws a complex shape (100 points). Network is slow (1.5 Mbps). Operation payload is 5KB. Broadcasting to 100 other users takes 30 seconds. Meanwhile, user B edits the same shape. How do you handle operation ordering and causality?