System Design Interview Questions

Microservices Decomposition and Boundaries

You have 200 engineers on a 5-year-old monolith (1.2M LOC, Node.js). Deployment is slow (30 minutes), test suite takes 20 minutes. Teams frequently block each other: payments team and orders team share the same database schema, and schema migrations take 2 days to plan. Leadership wants microservices. Where do you draw the first boundaries, and what's your rollout plan?

Don't start with perfect microservices boundaries; start with organizational alignment and pain points. Your bottlenecks: (1) schema coupling (payments-orders), (2) deployment bottleneck (monolith deploy). Strategy—Phase 1 (Weeks 1-4): Extract Payments as first service. Why: payments has clear API (charge, refund, reconciliation), owns distinct schema, independent of other teams. Extract to Node.js service using strangler fig pattern: new payment requests route to service via API gateway, existing requests continue hitting monolith. Monolith still handles order creation, but calls payment-service synchronously. This gives you isolated deployment (payments deploys independently, 5 min vs 30 min). Phase 2 (Weeks 5-12): Extract Orders service. Now Orders and Payments are separate; orders call payments-service for charges. Phase 3 (Weeks 13+): Extract User, Inventory, Analytics. Boundaries: (1) Orders service owns order state machine (created, paid, shipped), order database (PostgreSQL shard per region). (2) Payments service owns payment records, reconciliation, refunds. (3) Inventory service owns stock levels, reservations. (4) User service owns profiles, authentication. Each service has its own database (no cross-service joins). Communication: synchronous RPC (gRPC) for critical paths (order→payment→inventory), async events (Kafka) for async workflows (order→notification, order→analytics). Cost: infrastructure up 15% (separate services, databases, monitoring). Operational: 3-5 services can be managed by one platform team initially. Trade-off: you lose some consistency (now eventual consistency between services), but gain deployment velocity. Expected benefit: deployment time down to 5 min, test time per service down to 3 min (parallelizable). Risk: distributed tracing and debugging complexity increases 3x—budget for observability (Datadog/New Relic).
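
The strangler-fig routing in Phase 1 can be sketched as a prefix table at the API gateway: paths that have been migrated go to the new payment service, and everything else falls through to the monolith. A minimal sketch; the path prefixes and upstream names are hypothetical, not from the actual system.

```typescript
type Upstream = "payment-service" | "monolith";

// Routes migrated so far; everything else stays on the monolith.
const migratedPrefixes: { prefix: string; upstream: Upstream }[] = [
  { prefix: "/api/payments/", upstream: "payment-service" },
];

function routeRequest(path: string): Upstream {
  for (const rule of migratedPrefixes) {
    if (path.startsWith(rule.prefix)) return rule.upstream;
  }
  return "monolith"; // default: un-migrated traffic keeps hitting the monolith
}
```

As each phase completes, the table grows one entry; the monolith shrinks without a big-bang cutover.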

Follow-up: Payments service and Inventory service both need to update during order creation (charge payment, deduct stock). If inventory deduction fails after payment succeeds, how do you handle this transaction without a distributed transaction coordinator?

Your company is a B2B SaaS platform with 500 customers, multi-tenant architecture. Your monolith has modules: tenant management, billing, API gateway, customer admin panel, data pipeline. 50 engineers. Latency SLA: admin operations (tier changes, user management) should respond in <500ms. Data pipeline runs daily, 4-hour batch jobs. API gateway needs 99.99% uptime. Which modules should be separate services, and why?

Use SLA requirements and failure isolation as boundaries. Your services should be: (1) API Gateway (dedicated). Reason: 99.99% SLA is non-negotiable; if the admin panel crashes, don't take down the production API. A separate service lets you scale independently (100s of RPS for API, 10s for admin). One engineer manages the gateway. (2) Billing (dedicated). Reason: billing logic is regulated (audit trails), has independent scaling (runs once/month = low-traffic but high-criticality), needs reconciliation workflows. (3) Data Pipeline (dedicated batch job, not a service). Run as a managed job (AWS Step Functions, Kubernetes CronJob). Reason: batch workloads shouldn't share resources with online services. (4) Tenant Management + Admin Panel (keep together initially). Reason: low traffic (admin usage, not user-facing), tightly coupled (admin creates tenants, billing references tenants). Keep them as one service until the owning team naturally grows to 10+ engineers. Expected layout: (a) API Gateway Service (2 engineers). (b) Billing Service (2 engineers). (c) Tenant + Admin monolith service (3 engineers). (d) Data Pipeline jobs (1 engineer). This layout needs only 8 service owners, with no over-engineering. Infrastructure: 3 services × 2 deployment environments (staging, prod) = 6 containers. Load balancers: 1 for API (99.99% uptime SLA = multi-AZ, health checks every 5s). Cost: extra monitoring (~$200/month), dedicated databases ($500/month). Expected benefit: billing deploys independently (daily deploy windows), API has a clear SLA, admin outages don't affect customers. Risk: data consistency across services (if tenant creation fails partway through billing setup). Mitigation: use the saga pattern—tenant creation is a workflow (create-in-DB → publish event → billing-service subscribes → creates billing profile). If billing fails, compensate (roll back the tenant).
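
The saga-style mitigation can be sketched as a list of steps, each paired with a compensating action; when a later step fails, completed steps are undone in reverse order. Step names here are illustrative, and real steps would be asynchronous service calls rather than local functions.

```typescript
// A saga step: run() throws on failure; compensate() undoes its effect.
type Step = {
  name: string;
  run: () => void;
  compensate: () => void;
};

function runSaga(steps: Step[]): boolean {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      // Failure: undo every completed step, newest first
      // (e.g. delete the tenant row if billing setup fails).
      for (const s of [...done].reverse()) s.compensate();
      return false;
    }
  }
  return true;
}
```

The key property is that the system never ends in a half-created state: either every step commits, or every committed step is compensated.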

Follow-up: Your billing service needs to know the customer's tier to calculate monthly invoice. Tenant service owns the tier field. How do you query it without creating runtime dependency?

You run a social media platform: 100M monthly users, feeds, messaging, recommendations. You're currently monolithic. You want to migrate to microservices to scale independently (feed service sees 10K RPS, messaging sees 5K RPS). Your challenge: feeds and messaging both need user profiles (name, avatar). How do you prevent coupling between feed and messaging services when both query user data?

This is a classical domain boundary problem. User profile is shared data, not a service. Your options: (1) Replicate user data in both services (denormalization). Feed service stores user cache (Redis, expires 1 hour): `user_id → {name, avatar, bio}`. Messaging service replicates same cache. When user updates profile, publish event to both services (invalidate cache). Pro: zero network calls, 10μs lookup. Con: eventual consistency (user changes avatar, takes 1-2 seconds to propagate). Works if users tolerate 2-second staleness. (2) Create dedicated User service (microservice). Feed and Messaging call User service for profile data on-demand. Feed query becomes: (a) get feed IDs from cache (10ms), (b) hydrate user data via User service (50ms per user), (c) return to client (60ms total). Downside: +1 network hop per request, hits User service at 10K * 2 users/feed = 20K RPS (user service must be highly available). Mitigate with local caching: feed service maintains 1-hour TTL cache of user profiles. (3) Event-driven hybrid (best). User updates publish to Kafka. Both Feed and Messaging services subscribe to user events, locally cache profile. Feed and Messaging serve requests from cache, no User service call. Cache miss → async update via event listener. Pro: both services can handle traffic independently, no synchronous call between services. Con: slightly stale data (event lag ~100ms). At 100M users, this is the standard pattern. Implementation: (a) User service publishes `UserUpdated` events (Kafka topic: `user-events`). (b) Feed service subscribes, updates local Redis cache. (c) Feed service's user profile endpoint returns cached value or async-updates if not found. (d) Same for Messaging. This scales to 100K RPS per service without bottleneck. Cost: Kafka cluster (~$2K/month), Redis instances per service (~$500/month each, 2 services = $1K), User service as lightweight writer (~$1.5K/month). Total: ~$4.5K/month. 
Alternative (simpler, not scalable to 100M): single User service with heavy caching and read replicas. Works to ~10M users.
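
Option 3's event-driven cache can be sketched as below, with the Kafka consumer replaced by a direct method call for illustration; the type and field names are assumptions, not the platform's actual schema.

```typescript
type UserProfile = { userId: string; name: string; avatar: string };

class ProfileCache {
  private cache = new Map<string, UserProfile>();

  // Invoked by the event consumer for each UserUpdated event;
  // in production this would be a Kafka subscriber on `user-events`.
  onUserUpdated(event: UserProfile): void {
    this.cache.set(event.userId, event);
  }

  // Read path: serve from the local cache only.
  // No synchronous call to the User service on the hot path.
  get(userId: string): UserProfile | undefined {
    return this.cache.get(userId);
  }
}
```

Feed and Messaging each run their own instance of this cache, which is what lets them scale independently of the User service.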

Follow-up: Your user profile cache in Feed service is now 200GB. Booting a new Feed instance takes 45 minutes to warm cache. How do you make cache bootstrap faster?

Your company sells IoT devices. You have a backend with ~15 services: device management, telemetry ingestion, alerts, configuration, over-the-air updates, customer portal. 60 engineers split across teams. Some services are not clearly bounded—multiple teams accidentally own overlapping logic (configuration has 3 implementations). You're bleeding $50K/month on cloud costs. Where's the rot, and how do you reorganize?

15 services for 60 engineers is over-microserviced (Amazon's "two-pizza rule": 1 service ≈ 6-8 engineers). The rot: (1) No clear domain boundaries (configuration has 3 implementations = duplication). (2) Cost from redundant infrastructure (15 services = 15 load balancers, 15 monitoring dashboards, 15 deployment pipelines). (3) Operational sprawl (debugging requires coordinating across 15 services). Your fix: consolidate to 6-8 core services aligned with teams. Strategy: (a) Map services to business domains (not technical layers): Device domain owns device management + configuration + OTA updates (device state machine is a natural boundary). Alert domain owns alert logic. Telemetry domain owns ingestion + analytics. Portal domain owns web UI. (b) Consolidate configuration: pick strongest implementation, migrate other 2 implementations' logic into it, deprecate redundant services. (c) Reorganize teams (not systems): assign Device team (10 engineers) to own Device domain service. Alert team (5 engineers) owns Alert service. Telemetry team (8 engineers) owns Telemetry service. Portal team (6 engineers) owns Portal. Reserve 6 engineers for infra/platform. Now services align with team ownership (no more accidental duplication). (d) Merge lightweight services: if a service is <1000 LOC and has <100 RPS, fold it into a neighboring service. For example, Over-the-Air Updates likely <1000 LOC → merge into Device service. Cost reduction: consolidation from 15 to 7 services saves ~$20K/month (fewer load balancers, reduced monitoring overhead, shared databases where appropriate). Latency: Device service handles both <500ms OTA queries and slower configuration updates via separate thread pools (async queues for slow tasks). Risk: Device service becomes large (100K LOC) → mitigate with clear module boundaries within the service (device.go, config.go, ota.go packages). Expected timeline: 6 weeks to consolidate safely (migrate data, update consumers incrementally).
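
The fast/slow split inside the consolidated Device service can be sketched with an in-process queue standing in for a real job queue: fast OTA status reads answer inline within the latency budget, while slow configuration updates are acknowledged immediately and drained asynchronously. Task kinds and return values are hypothetical.

```typescript
type Task = { kind: "ota-status" | "config-update"; payload: string };

const slowQueue: Task[] = [];

function handle(task: Task): string {
  if (task.kind === "ota-status") {
    // Fast path: answer inline, inside the <500ms budget.
    return `status:${task.payload}`;
  }
  // Slow path: acknowledge now, process from the queue later,
  // so slow config work never blocks fast OTA reads.
  slowQueue.push(task);
  return "accepted";
}

// Drained by a background worker (or separate thread pool) in production.
function drainSlowQueue(): number {
  let processed = 0;
  while (slowQueue.length > 0) {
    slowQueue.shift();
    processed++;
  }
  return processed;
}
```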

Follow-up: After consolidation, Device service has 3 separate databases (devices, config, OTA state). A customer's OTA update fails because of a config mismatch. How do you guarantee consistency across 3 databases during OTA?

Your organization has grown from 50 to 500 engineers. You started with a monolith, migrated to microservices 2 years ago. Now you have 80+ services. Teams don't know what other teams' services do. Deployment is slower (waiting for dependent services). You've lost organizational velocity. How do you decide if you should consolidate back to fewer services or re-engineer for better autonomy?

This is Conway's Law in action: org structure maps to system architecture. 500 engineers suggests ~60-70 independent teams (8-person squads). 80+ services means some services are under-owned or redundant. Diagnosis first: (a) Count active services (deployed weekly). If <50 of 80 are actively deployed, consolidate the rest. (b) Measure inter-service dependencies (how many services does a typical request call?). If average is >3, you have too much coupling—services are too fine-grained. (c) Ask: could any service be owned by one engineer? If yes, it's too small. Decision tree: (1) If inter-service dependencies are high (>3) AND services are small (<1 team): CONSOLIDATE. Merge into bounded contexts. Target: 60-70 services for 500 engineers (1 service per team, some teams own 2). (2) If inter-service dependencies are low AND services are well-owned: ASYNC-IFY. The problem isn't the number of services; it's synchronous coupling. Move from RPC to events (Kafka). Orders call Payments via RPC (blocking) → Orders publish OrderCreated event, Payments listens asynchronously. This breaks coupling, allows independent deployment. (3) If neither (random chaos): REORGANIZE. Implement platform team (10 engineers) to own shared infrastructure: API gateway, auth, observability, deployment pipeline. Assign business teams to own end-to-end domains (Order Management team owns Order, Payment, Fulfillment services). This creates clear ownership and reduces architectural chaos. For your case (500 engineers, 80 services, slower deployments): I'd guess inter-service coupling is the issue. Action: (a) Map dependency graph (which services call which). (b) Identify high-degree nodes (services called by 10+ others). These are bottlenecks—extract to async patterns or consolidate. (c) Consolidate services with <30 RPS to their caller (unless they have strict SLA). This drops 80 → 50 services. 
(d) Create platform team to enforce deployment parallelization (services deploy only if dependencies are met, not blocking on other teams). Expected outcome: deployment time 30 min → 10 min (parallelized), teams unblocked from each other. Cost: platform team (~$200K/year), but velocity up 3x (worth it).
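
The dependency-graph diagnosis in step (b) can be sketched as a fan-in count over call edges: services called by many distinct others are the coupling hotspots to async-ify or consolidate. In practice the edge list would come from distributed-tracing data; nothing below is from a real system.

```typescript
// Count fan-in per service: how many distinct callers each service has.
function fanIn(edges: [caller: string, callee: string][]): Map<string, number> {
  const callers = new Map<string, Set<string>>();
  for (const [caller, callee] of edges) {
    if (!callers.has(callee)) callers.set(callee, new Set());
    callers.get(callee)!.add(caller);
  }
  const counts = new Map<string, number>();
  for (const [callee, set] of callers) counts.set(callee, set.size);
  return counts;
}

// Services with fan-in at or above the threshold are the high-degree
// bottleneck nodes the diagnosis looks for.
function hotspots(edges: [string, string][], threshold: number): string[] {
  return [...fanIn(edges).entries()]
    .filter(([, n]) => n >= threshold)
    .map(([svc]) => svc)
    .sort();
}
```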

Follow-up: Your Order service now publishes OrderCreated events to Kafka. Payment service subscribes. If Payment service crashes before processing the event, how do you prevent order from being stuck without payment?

You're designing a new product line: an internal tools platform. You'll have maybe 30 engineers working on 10 different tools (documentation, deployment, monitoring dashboards, etc.). You want to avoid the complexity of your consumer platform (80+ services). How do you structure this as a monolith, but keep it decomposable for future growth?

Start with a modular monolith (one codebase, multiple modules, clear boundaries). This avoids microservices operational overhead while keeping the system decomposable. Architecture: (a) Single Git repository, organized as: `tools/docs/`, `tools/deploy/`, `tools/monitoring/`, `tools/auth/`. (b) Each tool is a separate module: own database schema (separate tables, no cross-module joins), own API routes (`/api/docs/*`, `/api/deploy/*`), own frontend pages. (c) Shared layer: auth, logging, observability (shared across all tools). (d) Each module enforces clear boundaries (Go packages with no circular imports, Node.js with strict eslint rules). Deployment: single codebase builds to one Docker image, deploys as one unit. Database: shared PostgreSQL, but tables are namespaced per tool (`docs_articles`, `deploy_runs`, etc.). Trade-off: you can't scale Documentation independently from Deployment if one gets hit with traffic. But for internal tools (low traffic), this is fine. If a specific tool grows: extraction is straightforward—move `tools/docs/` to a separate service, add a gRPC endpoint for cross-tool queries. Growth path: 30 engineers → modular monolith (5-6 logical teams). 100 engineers → extract high-traffic tools to services (the Deployment tool if heavily used; keep low-traffic tools in the monolith). Cost: one deployment pipeline, one ops team, ~$3K/month infrastructure (shared database, shared compute). If you extracted to microservices from day 1: $10K+/month (separate infrastructure per tool). Risk: as the codebase grows past 500K LOC, compilation and test time increase. Mitigation: build tooling (Bazel, Turborepo) to parallelize builds, and keep the monolith focused (only 10 tools, not 80).
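
The namespaced-route convention can be sketched as a small module shell: each tool registers its handlers under its own `/api/<tool>/` prefix, so modules cannot collide with each other. Handler names and return values here are hypothetical.

```typescript
type Handler = () => string;

class ToolShell {
  private routes = new Map<string, Handler>();

  // Each tool registers under its own /api/<tool>/ namespace;
  // the shell enforces the prefix so modules can't claim foreign paths.
  registerModule(name: string, moduleRoutes: Record<string, Handler>): void {
    for (const [path, handler] of Object.entries(moduleRoutes)) {
      this.routes.set(`/api/${name}${path}`, handler);
    }
  }

  dispatch(path: string): string | undefined {
    return this.routes.get(path)?.();
  }
}
```

Because the namespace is the module name, extracting a tool later means moving its route table behind its own service with the public paths unchanged.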

Follow-up: Your monolith now has 200K LOC across 8 tools. Documentation team wants to deploy twice a day, but Deployment tool is very stable (deploy once a month). How do you enable Documentation's velocity without risking Deployment?

You have a mobile app (iOS, Android, web) communicating with an API backend. The API was monolithic, and you're redesigning it into services. But mobile teams want to stay decoupled from backend service boundaries (they don't want to know about Orders, Payments, Inventory separately). They want one unified API (BFF—Backend For Frontend) that abstracts away backend complexity. How do you structure this, and what are the tradeoffs?

Use BFF (Backend For Frontend) pattern: add an API layer (Mobile BFF service) that sits between mobile clients and backend microservices. Mobile BFF orchestrates calls to Orders, Payments, Inventory services, presents a unified API surface to mobile. Architecture: (a) Mobile clients call only Mobile BFF (single endpoint, unified contract). Mobile BFF is responsible for orchestration: fetch order details → call Orders service → call Payments service to get invoice → combine response. (b) Backend services (Orders, Payments, Inventory) expose internal gRPC APIs (not public REST). Only Mobile BFF and other internal tools call them. (c) API versioning: Mobile BFF owns API version (v1, v2). When you add a new backend service, Mobile BFF updates without changing API version (backward compatible). Pro: mobile teams stable, can release independently from backend. Con: Mobile BFF becomes a bottleneck (all mobile traffic flows through it). For 100K concurrent users, this requires scaling (multiple instances, load balancer). Cost: additional service ($2K/month), but wins independence. Scale: Mobile BFF handles 1-2K RPS per instance. If mobile traffic is 10K RPS, deploy 5-10 instances. Alternative (no BFF): mobile clients call backend services directly. Pro: fewer hops, simpler. Con: client logic breaks if backend services change boundaries (Orders service splits into Orders + OrderStatus). Your choice depends on team org: if mobile team is separate from backend team, BFF is worth it (clear ownership, decoupled deployments). If single team owns both, skip BFF (premature complexity). For most orgs at 500 engineers: BFF makes sense because mobile and backend teams are separate. Implementation: BFF in Node.js/Go/Rust (stateless, auto-scales). Cache layer (Redis) for frequently-accessed data (user profiles, product catalogs, real-time status). Timeout all backend calls at 5-second SLA (if Orders service is slow, mobile requests timeout gracefully rather than cascade).
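
The timeout discipline at the end can be sketched as a race between each backend call and a fallback value: both upstream calls run in parallel, and each degrades independently instead of stalling the whole response. The service stubs and fallback strings are assumptions for illustration.

```typescript
// Resolve with the call's value, or with the fallback once the deadline passes.
function withTimeout<T>(call: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      () => { clearTimeout(timer); resolve(fallback); }, // upstream errors also degrade
    );
  });
}

// BFF orchestration: fan out to both backends in parallel.
async function getOrderScreen(
  fetchOrder: () => Promise<string>,
  fetchInvoice: () => Promise<string>,
  timeoutMs = 5000, // the 5-second SLA from the text
): Promise<{ order: string; invoice: string }> {
  const [order, invoice] = await Promise.all([
    withTimeout(fetchOrder(), timeoutMs, "order-unavailable"),
    withTimeout(fetchInvoice(), timeoutMs, "invoice-unavailable"),
  ]);
  return { order, invoice };
}
```

A slow Orders service then produces a partial screen rather than a stalled mobile request.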

Follow-up: Your Mobile BFF times out calling Orders service (P99 latency 8 seconds). Mobile requests are stalled. How do you handle this without breaking mobile user experience?

You're operating a multi-tenant SaaS where each tenant has their own database (tenant-per-database model). You have 500 tenants, and ops is complex: schema migrations must run on 500 databases, monitoring 500 instances is hard, and scaling heterogeneously (one tenant needs 10x resources) is awkward. You're considering consolidating to a single database per region with logical isolation. What are the tradeoffs?

This is a fundamental architecture trade-off. Tenant-per-DB (pros): maximum isolation (data and performance), tenant outage doesn't affect others, easy scaling per tenant (just spin new DB), regulatory compliance simpler (data residency). Cons: operational overhead (500 schema migrations), resource waste (many small DBs with fixed overhead), higher cost (500 DBs cost more than 5 shared DBs). Single-DB (pros): operational simplicity (1 migration), resource efficiency (data packed together, bulk operations faster), lower cost (1 DB cluster, 10x cheaper). Cons: noisy neighbor (slow tenant can starve others), potential data leaks if row-level-security (RLS) fails, harder per-tenant scaling (can't allocate specific resources). For 500 tenants, evaluate: (1) Blast radius: if one tenant corrupts data, does it affect others? Single-DB = higher risk. Mitigate with proper RLS + audit logs. (2) Compliance: does each tenant need legal data segregation? Tenant-per-DB is simpler (data is literally separate). Single-DB requires legal review (RLS might not suffice). (3) Scaling: do tenants have wildly different resource needs (biggest tenant 100x smallest)? Tenant-per-DB wins. If relatively uniform, single-DB works. (4) Migration: if you're currently at tenant-per-DB and want to consolidate, this is risky (data migration, schema conflicts, RLS integration). Budget 3 months. My recommendation: hybrid approach. Small/medium tenants (450) in shared DB with RLS. Enterprise/high-priority tenants (50) in dedicated DBs. This gives you 95% operational efficiency gains while keeping enterprise customer isolation. Cost: 50 DBs (small clusters) + 1 large shared DB = $5K/month. Vs 500 tenant DBs = $50K/month. 10x savings. Implementation: (a) Shared DB schema: add tenant_id column to all tables. (b) Row-level security: PostgreSQL RLS policies enforce SELECT/INSERT/UPDATE only for current tenant. (c) Enterprise DBs: dedicated clusters for tier-1 tenants. 
(d) Monitoring: per-tenant metrics (CPU, queries) across all DBs. Expected outcome: 90% cost reduction, 50% ops time savings, acceptable risk profile.
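
The hybrid layout can be sketched as a connection resolver: tier-1 tenants map to dedicated clusters, and everyone else lands on the shared cluster where per-tenant scoping (enforced by PostgreSQL RLS in a real deployment) must apply to every query. Tenant IDs and connection strings below are made up.

```typescript
// Tier-1 tenants with dedicated clusters; all other tenants share one cluster.
const dedicatedClusters = new Map<string, string>([
  ["acme-corp", "postgres://db-acme/prod"],
]);

const sharedCluster = "postgres://db-shared/prod";

function connectionFor(tenantId: string): { dsn: string; rlsScoped: boolean } {
  const dsn = dedicatedClusters.get(tenantId);
  if (dsn) return { dsn, rlsScoped: false }; // isolated by a physically separate database
  // Shared cluster: every query must be scoped to tenant_id
  // (row-level security policies in PostgreSQL).
  return { dsn: sharedCluster, rlsScoped: true };
}
```

Promoting a growing tenant to a dedicated cluster is then a one-line map entry plus a data migration, with no application-code change.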

Follow-up: Your largest tenant (10% of revenue) needs their data deleted for GDPR compliance. In the shared DB, you delete their rows. But you later discover a backup from 2 days ago still has their data. How do you ensure permanent deletion across all backups?
