System Design Interview Questions

URL Shortener and Unique ID Generation


Your URL shortener handles 1B writes/day and 50B reads/day. At scale, you're generating short codes from a database sequence, but your database is now the bottleneck—the sequence lock is causing write latency to spike from 5ms to 200ms at peak times. Your architecture is a single primary. How do you fix this without a full rewrite?

The issue is that database sequences serialize writes—only one writer can increment at a time. This is a classic bottleneck at scale. Several solutions, from quick to robust: (1) Batch sequence allocation: instead of asking the DB for one ID per request, ask for a range (e.g., IDs 1000-1999). Your app caches this range in memory and allocates from it. When exhausted, fetch the next batch. This reduces DB hits by 100x. Implement as a pool of ranges in each API server, with background refresh threads. (2) Snowflake-like distributed ID generation: each API server is assigned a shard ID (0-1023, a 10-bit field) and generates IDs locally using timestamp + shard + sequence counter. No DB coordination needed. Trade-off: IDs are no longer strictly sequential, but they're sortable by time and unique. Implement: each server has an atomic counter that resets every millisecond. The 12-bit counter allows 4,096 IDs per server per millisecond—far above the ~12K writes/sec that 1B writes/day averages across the whole fleet—so IDs don't collide. Test this under load. (3) If you must keep sequential IDs, use a dedicated ID service (not your main DB): a small service cluster that owns range allocation, using ZooKeeper or etcd to coordinate which instance owns which range. This scales far better than contending on a single DB sequence lock. (4) Redis: allocate ID ranges from Redis INCR (single-threaded but optimized for throughput—handles 500K ops/sec easily). Use Redis replication for high availability. The quick fix: option (1) works immediately and gets you 90% of the way. Deploy in 1-2 hours. For the long term: Snowflake IDs (option 2) are the industry standard, pioneered at Twitter and widely adopted since. Trade a tiny bit of ordering for massive scalability.
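A minimal sketch of the Snowflake approach in option (2), assuming the classic 41-bit timestamp / 10-bit shard / 12-bit sequence layout (the custom epoch and shard ID values here are illustrative):

```python
import threading
import time

EPOCH_MS = 1577836800000  # illustrative custom epoch: 2020-01-01T00:00:00Z

class SnowflakeGenerator:
    def __init__(self, shard_id: int):
        assert 0 <= shard_id < 1024, "shard ID must fit in 10 bits"
        self.shard_id = shard_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit counter
                if self.sequence == 0:  # exhausted 4096 IDs this millisecond
                    while now_ms <= self.last_ms:  # spin to the next ms
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # 41 bits timestamp | 10 bits shard | 12 bits sequence
            return ((now_ms - EPOCH_MS) << 22) | (self.shard_id << 12) | self.sequence

gen = SnowflakeGenerator(shard_id=7)
ids = [gen.next_id() for _ in range(1000)]
assert len(set(ids)) == 1000   # unique
assert ids == sorted(ids)      # time-sortable
```

Note the sketch does not guard against the clock moving backward, which is exactly the failure mode the follow-up asks about.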

Follow-up: If you go with Snowflake IDs, how do you handle clock skew across servers? What happens if a server's clock moves backward?

Your URL shortener URLs have a 99.99% availability SLA. But you've discovered that 0.1% of short URLs are returning stale/incorrect long URLs—users get the wrong website. This only started happening 2 days ago. Root cause analysis shows it's a caching layer issue. Walk through diagnostics and fix.

Stale cache hits are typically caused by: (1) incorrect TTL logic, (2) cache invalidation bugs, (3) multiple database writers without coordination, or (4) race conditions during URL updates. Diagnostics: (1) Pull all incorrect URL hits from logs. Compare short codes with database values. If the DB has the correct URL but the cache returns the wrong one, it's a cache staleness issue. (2) Check cache hit rates before/after the 2-day mark—did something change? Compare RPS, cache size, eviction rate. (3) Look for URL update patterns: did someone bulk-update URLs via API? Are there multiple services writing to the same short code (should be immutable, but bugs exist)? (4) Check your cache TTL: if you're using a 30-day TTL and the URL was changed, the cache won't update for 30 days. (5) Examine cache layers: Redis, CDN, browser cache? The staleness could be at any layer. Root cause is usually one of: (a) Missing cache invalidation: when you update a URL, you don't purge the cache. (b) Write-ordering race: the cache is updated before the DB commits, or a concurrent read repopulates the cache with the old value mid-update. (c) Eventual consistency lag: you're using Redis replication and a replica was stale. Immediate fixes: (1) Reduce TTL from 30 days to 5 minutes. This doesn't fix the bug but reduces blast radius—stale data expires faster. Deploy now. (2) Add explicit cache invalidation: when a URL is updated (rare, but possible for some users), delete from Redis immediately. Use cache keys: `short_code -> (long_url, version)`. On update, increment version. (3) Implement read-through validation: on cache hit, randomly verify the URL against the DB (1% sampling). If mismatch, purge the cache. (4) Long-term: switch to a write-through cache where writes go to the DB first, then the cache. Never the reverse. To prevent recurrence: audit all URL mutation endpoints (update, delete) and ensure they purge the cache. Add alerts: if the cache hit rate suddenly drops or response-time variance spikes, page on-call.

Follow-up: How would you detect this issue in staging before production? What tests would catch stale cache bugs?

Your product team wants to add custom short codes: users can request a vanity URL like "https://short.app/mycoupon" instead of random "https://short.app/a7k2j". How do you handle collisions? What about reserved words? Design the system to support both random and custom codes with zero conflicts.

This adds a policy layer on top of ID generation. Architecture: (1) Database schema: separate table for custom short codes with a unique constraint on code. Add a column for owner (user_id). (2) Collision handling: when a user requests a custom code, check if it exists. If yes, reject with a conflict response. To reduce rejections, suggest alternatives (append a number: mycoupon1, mycoupon2). (3) Reserved words: maintain a blocklist (admin, api, status, support, etc.). Before accepting a custom code, query this list. Store it in Redis for fast lookup. (4) Rate limiting: users might abuse custom codes. Limit each user to 100 custom codes per day. Implement a sliding window counter. (5) Namespace separation: allow custom codes only for premium users or specific use cases (marketing campaigns). Free users get random codes. (6) Lookup logic: when a short code is requested, check the custom table first (indexed by code). If not found, check the random table. This adds ~2-3ms latency but keeps code allocation simple. (7) Expiration: custom codes might need TTL (e.g., campaign codes expire after 30 days). Add created_at and expires_at. During lookup, check expiration. (8) Analytics: track custom code usage separately. They're high-value (intentional), so provide detailed reports. Implement as: custom_clicks table with timestamp, referer, device. (9) Scarcity handling: 6-character base-62 codes give 62^6 ≈ 57B combinations—at 1B writes/day, that's only ~57 days of headroom, so plan for 7-character codes (62^7 ≈ 3.5T, roughly a decade) and monitor the allocation rate. For collision detection: use a bloom filter in Redis to quickly reject obviously taken codes before hitting the DB—size it to your code count and target false-positive rate (roughly 10 bits per code at a 1% false-positive rate). Test edge cases: what if a custom code matches a random code that was generated before? Unlikely, but ensure your lookup doesn't double-count. The key is: custom codes are a UI feature layered on top of a robust random ID system. Don't let it compromise the core.
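Steps (2) and (3) combine into a small validation routine; a sketch with in-memory sets standing in for the Redis blocklist and the DB's unique constraint (the format rules and all names are illustrative):

```python
import re

RESERVED = {"admin", "api", "status", "support", "login"}  # blocklist stand-in
taken = {"a7k2j"}   # codes already allocated (random or custom)
CODE_RE = re.compile(r"^[0-9a-zA-Z_-]{4,32}$")  # assumed format policy

def claim_custom_code(code: str) -> tuple[bool, str]:
    """Returns (accepted, code-or-reason)."""
    if not CODE_RE.match(code):
        return False, "invalid characters or length"
    if code.lower() in RESERVED:
        return False, "reserved word"
    if code in taken:
        # suggest an alternative instead of a bare rejection
        alt = next(f"{code}{i}" for i in range(1, 100) if f"{code}{i}" not in taken)
        return False, f"taken; try {alt}"
    taken.add(code)  # in production: INSERT relying on the unique constraint
    return True, code

assert claim_custom_code("mycoupon") == (True, "mycoupon")
ok, msg = claim_custom_code("mycoupon")     # second claim collides
assert not ok and "mycoupon1" in msg
assert claim_custom_code("admin")[0] is False
```

The real race protection is the DB unique constraint, not the in-memory check—the check only exists to fail fast and suggest alternatives.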

Follow-up: How would you handle the case where a user wants to reclaim a custom code they released (e.g., update a campaign URL)? Are there any issues?

Your analytics team reports that the analytics for URL clicks are wildly inaccurate—some URLs show 5K clicks in your dashboard but database audit logs show 50K actual clicks. Investigation shows duplicate click logs are being stored. How did this happen, and how do you prevent it?

Duplicate click logs happen when: (1) the client retries requests without idempotency, (2) the server doesn't deduplicate before logging, (3) the event streaming platform (Kafka) has duplicate messages, or (4) async queue workers re-process messages. Diagnostics: (1) Pull duplicate click records—what do they have in common? Same timestamp? Same click_id? Same user_session? (2) Check client logic: is the browser sending multiple requests for one click? Inspect the network tab. (3) Check server request handlers: do they validate idempotency tokens? If a client sends the same click twice with the same request ID, does the server deduplicate? (4) Check Kafka: are consumer groups committing offsets correctly? If a worker crashes after logging but before committing, Kafka redelivers the message. (5) Check async jobs: if clicks are logged in real time but aggregated later by batch jobs, are they deduplicating? Root causes in order of likelihood: (a) No idempotency—client retries, server processes both. (b) Kafka rebalancing—workers crash, messages reprocessed. (c) Double-write in the async pipeline—click logged twice (once realtime, once batch). Immediate fixes: (1) Implement idempotency tokens. The client generates a UUID for each click (click_id). Server check: before logging, query a dedup table keyed on click_id. If it exists, return the cached response. If not, log and cache. Use Redis for low-latency dedup lookups (24-hour TTL). (2) Add a unique constraint in the database: CREATE UNIQUE INDEX click_dedup ON clicks(click_id) (or use a separate dedup table). This ensures DB-level protection. (3) Atomic offset handling: store the consumer's Kafka offset in the same database transaction as the click insert—Kafka's transactional exactly-once guarantees only cover Kafka-to-Kafka pipelines, not external sinks. If a worker crashes mid-batch, the replayed messages hit the dedup constraint instead of double-counting. (4) Audit: add a run_id to each batch aggregation job. Before re-running a job, check if the run_id exists in the output. If yes, skip. Long-term: event sourcing—append every click to an event log keyed by its dedup ID, and have all aggregations read from that single log so they stay consistent with each other. This is more robust but requires weeks to implement.
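Fixes (1) and (2) can be sketched with sqlite standing in for the clicks store (in production the first dedup check would hit Redis before the DB; table and column names are illustrative):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (click_id TEXT, short_code TEXT)")
conn.execute("CREATE UNIQUE INDEX click_dedup ON clicks(click_id)")

def log_click(click_id: str, short_code: str) -> bool:
    """Returns True if the click was newly recorded, False if it was a duplicate."""
    try:
        conn.execute("INSERT INTO clicks VALUES (?, ?)", (click_id, short_code))
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # the unique index rejected the duplicate
        return False

cid = str(uuid.uuid4())            # client-generated idempotency token
assert log_click(cid, "a7k2j") is True
assert log_click(cid, "a7k2j") is False   # retry deduplicated at the DB level
assert conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0] == 1
```

The unique index is the backstop: even if the Redis dedup layer misses (eviction, restart), the same click_id can never be stored twice.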

Follow-up: If you're using Redis for dedup, what's your fallback if Redis is unavailable? Can you still accept clicks?

Your company acquires a competitor with 10B URLs already shortened. You need to merge their database into yours—no conflicts, maintain both shortener codes (theirs and yours), and users from both should be able to access all URLs. How do you execute this merge with zero downtime?

This is a complex data migration with zero downtime. Architecture: (1) Parallel systems during the migration window: competitor URLs are served from their shard (read-only mode), yours from your shard. On lookup, try your shard first; if not found, try the competitor's shard. (2) Setup: deploy a routing layer (edge service or API gateway) that can route to multiple backends based on URL origin. (3) Migration: a background job runs 24/7 copying competitor URLs into your database. Don't do it all at once—batch 100K URLs at a time, sleeping between batches to avoid DB strain. (4) Collision handling: keep each company's codes in a separate internal namespace (a namespace column keyed by the originating domain) rather than rewriting the public code—prepending a marker like "c_" would break every link the competitor has already distributed. Lookup is domain-qualified: requests arriving on their domain resolve in their namespace first, then fall back to yours. (5) DNS/routing: during the merger, short URLs should redirect through your infrastructure. The competitor's shard remains read-only. Update DNS to point all traffic to your system. (6) Deduplication: if both systems have shortened the same long URL, consolidate. Map all duplicate long URLs to a single short code (prefer your code; keep theirs redirecting rather than deleting it, since their links are already in the wild). Use a fingerprint (hash of the long URL) to detect duplicates. (7) Analytics merge: combine click counts. If the competitor's URL has 1M clicks and yours has 500K, merge to show 1.5M. Use a migration version field to avoid double-counting during phased rollout. (8) Validation: before completing the migration, audit: (a) all competitor URLs accessible via your system, (b) no missing clicks, (c) redirects working, (d) analytics consistent. Run for 1 week in parallel mode. (9) Cutover: set the competitor's shard to read-only, then gradually route 100% of traffic to your system. Keep the competitor's shard as backup for 30 days. Rollback: if critical issues arise, route back to parallel mode. Estimated timeline: 2 weeks of parallel mode, then a 1-week cutover. Key: never delete competitor data until you're 100% confident in your system. Keep immutable backups.

Follow-up: During the parallel mode, how do you handle updates or new URLs? Can competitors' users still shorten new URLs?

Your URL shortener database is 500 GB and growing 1 GB/day. Queries for click lookups are starting to slow down despite indexes. Your main query: "get total clicks for a short code" scans millions of rows. You're running out of disk space in 6 months. Design a solution for write amplification, query performance, and storage.

This is a classic analytics-at-scale problem. Three issues: (1) raw click data is too large (1B clicks/day = huge write volume), (2) queries have to scan massive tables, (3) the disk is filling up. Solutions: (1) Time-series database for clicks: instead of raw click records (timestamp, user_id, referer, device), use a time-series DB (InfluxDB or TimescaleDB—Prometheus is a pull-based metrics system and a poor fit for high-cardinality per-code event data). Compress click data by time bucket (1-hour buckets). Store: (short_code, bucket_hour, click_count, unique_users). This cuts storage by orders of magnitude: 1B raw click rows per day collapse to one row per active short code per hour, and both engines compress time-ordered data well. (2) Materialized aggregations: every hour, run a batch job: group clicks by short_code, compute total, unique users, top referrers, devices. Store in an aggregation table. User queries hit this table, not raw clicks. This is 1000x faster. (3) Retention policy: raw clicks expire after 7 days (enough for debugging). Hourly aggregates are kept for 1 year. Yearly summaries are kept indefinitely. (4) Sharding: split click data by short_code hash. If short_code hashes to 0-100K, data goes to shard 1; 100K-200K to shard 2. Each shard is a separate database. Queries parallelize across shards. (5) Archive: old data (> 1 year) moves to cold storage (S3 Parquet). Load on demand for historical reports. Cost: roughly $0.01 per GB/month in cold storage vs roughly $0.10 per GB/month on provisioned SSD. (6) Indexing: on the aggregates table, index by (short_code, bucket_hour). Queries are sub-100ms. For raw clicks, don't index—they're append-only logs. Query with Presto/Trino (columnar scan). Implementation: (1) Deploy an InfluxDB or TimescaleDB cluster (3 nodes for HA). (2) Update the click-logging service to send to the time-series DB instead of the relational DB. (3) Run a backfill job: copy the last 7 days of PostgreSQL clicks to the time-series DB. Verify that queries match. (4) Update the analytics dashboard to query the time-series DB. (5) Monitor: ensure the time-series DB is healthy, compression ratios are as expected, and query latency is <100ms. Estimated savings: the 500 GB of raw clicks shrinks to a few GB of aggregates plus 7 days of raw data. Disk problem solved for 1+ year.
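The hourly rollup in solution (2) is essentially a group-by over (short_code, bucket_hour); a sketch with illustrative in-memory events of the form (short_code, epoch_ms, user_id):

```python
from collections import defaultdict

raw_clicks = [
    ("a7k2j", 1_700_000_000_000, "u1"),
    ("a7k2j", 1_700_000_100_000, "u2"),
    ("a7k2j", 1_700_000_200_000, "u1"),   # repeat visitor, same hour
    ("xyz99", 1_700_003_700_000, "u3"),   # lands in the next hour bucket
]

def rollup(clicks):
    buckets = defaultdict(lambda: {"clicks": 0, "users": set()})
    for code, ts_ms, user in clicks:
        hour = ts_ms // 3_600_000          # truncate timestamp to the hour
        buckets[(code, hour)]["clicks"] += 1
        buckets[(code, hour)]["users"].add(user)
    # what you'd store: (short_code, bucket_hour, click_count, unique_users)
    return {k: (v["clicks"], len(v["users"])) for k, v in buckets.items()}

agg = rollup(raw_clicks)
assert agg[("a7k2j", 1_700_000_000_000 // 3_600_000)] == (3, 2)
```

In production this group-by runs as the hourly batch job (or as the time-series DB's continuous aggregate), and dashboards query only its output.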

Follow-up: If a user requests analytics for a very old time range (1+ year), how do you efficiently query archived data in S3?

You're building a new product: scheduled URL expiration (short URLs auto-delete after N days). You need to implement cleanup without scanning millions of rows every night. Design an efficient TTL system.

TTL systems at scale require careful design. The naive approach (a full table scan nightly) doesn't work beyond 100M records. Better approaches: (1) Redis key TTLs: when a short URL is created, set a presence key in Redis with EXPIRE equal to the TTL; Redis evicts it automatically. Store the actual URL data in PostgreSQL, but use Redis for TTL checks. On lookup: check Redis first (does the key exist?). If yes, return from the DB. If no, it's expired. (A sorted set scored by expiration timestamp also works, with a worker popping due entries—but note that individual sorted-set members don't expire on their own; only whole keys do.) (2) Lazy expiration: don't proactively delete. On each lookup, check created_at + ttl_days. If expired, return 404 and soft-delete (mark deleted = true). This is cheaper—only one query per hit. (3) Background cleanup (async): every hour, delete expired rows in batches, e.g. DELETE FROM urls WHERE id IN (SELECT id FROM urls WHERE created_at + interval 'N days' < now() AND deleted = false LIMIT 100000)—PostgreSQL's DELETE has no LIMIT clause of its own, and deletes must run on the primary, so schedule them off-peak and keep batches small to avoid long locks. (4) Partitioning by TTL: separate tables for different TTLs (permanent, 7-day, 30-day). Insert into the right table. Every day, drop old partitions (drop table 7_day_20260407). This is orders of magnitude faster than deleting rows. PostgreSQL table partitioning supports this. (5) Event-based: when creating a URL with a 30-day TTL, enqueue an event for 30 days later. Use a task queue (e.g., Celery, Bull) that delivers the event at the scheduled time. A worker deletes the URL then. This is reliable and exact. Disadvantage: requires a robust queue. (6) Hybrid: use Redis + lazy expiration. Redis stores short-TTL URLs (< 7 days), PostgreSQL stores long-TTL URLs (> 7 days). For short TTLs, Redis handles expiration automatically. For long TTLs, lazy expiration on read. Best approach: partitioning by TTL bucket (part 1: permanent, part 2: 7-day, part 3: 30-day, etc.). Every day at midnight, drop the oldest partition and create a new one. This is O(1) and doesn't touch the active data. Estimated cost: nearly zero, since you're just dropping a partition (a metadata operation), not scanning/deleting rows. Test: verify that expired URLs return 404, analytics for expired URLs are still accurate (they're soft-deleted), and cleanup doesn't cause DB locks during business hours.
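A sketch of the partition bookkeeping in approach (4), assuming the 7_day_YYYYMMDD naming used above (the SQL string is built but not executed, and the urls_ table prefix is illustrative):

```python
import datetime

def partition_for(created: datetime.date, ttl_days: int) -> str:
    """Name of the partition a new URL lands in, keyed by its expiry date."""
    expires = created + datetime.timedelta(days=ttl_days)
    return f"{ttl_days}_day_{expires:%Y%m%d}"

def nightly_drop_statement(today: datetime.date, ttl_days: int) -> str:
    """Everything expiring today lives in exactly one partition: drop it."""
    return f"DROP TABLE IF EXISTS urls_{ttl_days}_day_{today:%Y%m%d}"

created = datetime.date(2026, 3, 31)
assert partition_for(created, 7) == "7_day_20260407"
stmt = nightly_drop_statement(datetime.date(2026, 4, 7), 7)
assert stmt == "DROP TABLE IF EXISTS urls_7_day_20260407"
```

Because the partition is named by expiry date rather than creation date, the nightly job never needs to inspect row contents—it drops one table per TTL bucket and moves on.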

Follow-up: If a user wants to extend the TTL of a URL (e.g., keep it alive for another 30 days), how do you handle that without data fragmentation?

Your URL shortener has a 99.9% availability SLA but you're running a single PostgreSQL primary with one read replica. The primary fails catastrophically (corrupted WAL file, hardware failure). Recovery takes 2 hours. You've breached SLA. Design a multi-region, active-active architecture to prevent this.

Single primary is a single point of failure. For a 99.9% SLA (43.2 minutes of downtime/month), you need automatic failover (< 1 minute) and ideally active-active reads. Architecture: (1) Multi-region setup: PostgreSQL primary in US-East, async replica in US-West (for durability), and a separate PostgreSQL instance in the EU (hot standby for failover). Use streaming replication with synchronous mode for critical writes (short code creation), async for less critical (analytics). (2) Failover automation: use Patroni (or a similar HA tool) to manage failover. It detects primary failure and promotes a replica to primary in < 30 seconds. The application updates DNS (or uses a connection pool that detects failover and reconnects). (3) Geo-routing: US users route to the US-East primary, EU users to the EU instance. This spreads load and reduces latency. Both can serve reads; only the primary accepts writes. (4) Write coordination: writes (short code creation) go to the current primary. Use a circuit breaker: if the primary is unreachable, fail fast and queue the write locally (in memory or Redis). Retry periodically. (5) Consensus for dual writes: if you want true active-active (both regions accept writes), you need explicit conflict resolution. PostgreSQL logical replication can be configured bidirectionally, but conflicts must be resolved by hand-written rules. This is complex; simpler is to designate one region as primary and the others as read replicas, promoting a replica if the primary fails. (6) Testing failover: run chaos drills—kill the primary weekly (in staging first, then production game days), verify failover works, and measure recovery time. Ensure < 1 minute. (7) Monitoring: alert on replication lag (> 1 second), failed primaries, connectivity issues. Response SLA: < 5 minutes to page on-call. (8) Backup: daily snapshots to S3 (or a cloud equivalent). If catastrophic loss occurs, restore from backup in 30 minutes. Implementation: (a) Set up 3 PostgreSQL instances (US-East primary, US-West replica, EU replica). (b) Deploy a Patroni cluster manager. (c) Update application connection strings to use the Patroni DNS endpoint, which always points to the current primary. (d) Run a disaster recovery drill monthly. This architecture targets 99.99% uptime (~52 minutes of downtime/year), because a primary failure costs < 1 minute (automatic failover) and secondary failures don't cause an outage (replicas take over).

Follow-up: If you have bidirectional replication between US and EU, how do you handle write conflicts when two regions simultaneously create short codes?
