Your microservice latency spiked from 50ms P99 to 500ms P99 overnight. CloudWatch metrics show CPU at 80%, memory at 60%, and no errors. You're flying blind: which service is slow? Is it Lambda cold starts, RDS query latency, or something else? Walk through your troubleshooting steps using X-Ray.
Troubleshoot the latency spike using X-Ray (vs CloudWatch alone): (1) Enable X-Ray on all services immediately (if not already enabled). Requires: (a) the X-Ray SDK (auto-patched via AWS Lambda Powertools or explicit SDK calls). (b) X-Ray write permissions in IAM (`xray:PutTraceSegments`, `xray:PutTelemetryRecords`). (2) X-Ray service map (first 30 seconds): (a) Open the X-Ray console → Service Map. Visual graph: API Gateway → Lambda → RDS → S3 (example). (b) Color code: red = errors, yellow = throttled/slow, green = healthy. Immediately identify which service(s) are yellow/red. Example: RDS is red. (3) Deep dive: X-Ray → Traces (last 5 min): (a) The segment breakdown shows per-service timing: API Gateway (2ms) → Lambda (480ms) → RDS (420ms) → Cache (5ms). (b) Root cause: RDS at 420ms (8x normal). Next question: why? (4) RDS investigation: (a) RDS metrics in CloudWatch: query latency jumped from 5ms to 50ms; CPU at 95% (the database is overloaded, not the application). (b) RDS Performance Insights: query patterns → identify slow queries. Example: `SELECT * FROM large_table WHERE status='pending'` (no index) scanning 10M rows. (c) This query is now called 100x per request (a regression in application code). (5) Root cause: (a) Last night's deployment introduced an N+1 query bug (loop over 100 customers, 1 query each = 100 queries). (b) RDS can't keep up, queries queue, latency builds. (6) Fix (immediate): (a) Revert the deployment (1 min). Latency drops back to 50ms. (b) Or: patch the code to batch queries (5 min of dev). (7) X-Ray advantage vs CloudWatch alone: (a) CloudWatch shows "system load 80%" but doesn't explain why (could be compute, I/O, memory, network). (b) X-Ray shows the service breakdown with exact timings, narrowing the root cause to RDS in 60 sec instead of 30 min of debugging. (c) Trace visualization: see the full request flow with latency per segment. Invaluable for microservices.
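The drill-down in step (3) can be sketched as a tiny function over per-service timings. The trace shape below is simplified and the figures hypothetical (real segment data comes back from the X-Ray API, e.g. `BatchGetTraces`); the point is the mechanical step of ranking segments by duration.

```javascript
// Simplified per-service timings as pulled from an X-Ray trace
// (illustrative figures matching the walkthrough above).
const trace = [
  { name: 'API Gateway', durationMs: 2 },
  { name: 'Lambda', durationMs: 480 }, // wrapper: includes the RDS call below
  { name: 'RDS', durationMs: 420 },
  { name: 'Cache', durationMs: 5 },
];

// Return the segment with the largest duration.
function slowestSegment(segments) {
  return segments.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst));
}

// Exclude the Lambda wrapper so we rank its children, not the envelope.
const culprit = slowestSegment(trace.filter((s) => s.name !== 'Lambda'));
console.log(`${culprit.name} accounts for ${culprit.durationMs}ms`); // RDS accounts for 420ms
```

In a real trace the Lambda segment nests its subsegments; flattening them first (as here) keeps the ranking honest.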
Follow-up: X-Ray is enabled but sampling is set to 1% (log 1 in 100 requests). The slow query happened at 2:30 PM, but by 3:00 PM when you checked X-Ray, no trace of the slow request (wasn't in the 1% sample). How do you ensure X-Ray captures critical latency?
Your application uses Lambda, RDS, DynamoDB, and S3. CloudWatch shows: Lambda P99 latency = 200ms, but customers report "my page loads in 3 seconds." The disconnect suggests front-end or network issues. But you want to see full end-to-end latency (browser → API Gateway → Lambda → RDS → response). How do you trace this?
End-to-end latency tracing across browser + backend: (1) Backend instrumentation (X-Ray + CloudWatch): (a) Lambda: X-Ray automatically captures Lambda execution (200ms). (b) RDS: add an X-Ray subsegment around RDS queries (shows DB calls in the trace). (c) API Gateway: enable CloudWatch Logs + X-Ray to capture API latency. (2) Front-end timing (browser JavaScript): (a) Use the Navigation Timing API (browser-native, no library needed):

```javascript
// Navigation Timing API (Level 1 — deprecated in favor of
// PerformanceNavigationTiming, but still universally supported).
const t = performance.timing;
const pageLoadTime = t.loadEventEnd - t.navigationStart; // total: ~3000ms
console.log({
  domContentLoaded: t.domContentLoadedEventEnd - t.navigationStart, // HTML parsed
  loadComplete: pageLoadTime,                                       // everything loaded
  // These two measure the main document fetch only; time API calls
  // separately by wrapping fetch() or via performance.getEntriesByName(url).
  docRequestStart: t.fetchStart - t.navigationStart,
  docResponseEnd: t.responseEnd - t.navigationStart,
});
```

(b) Breakdown: navigationStart (0ms) → domContentLoaded (800ms) → loadComplete (3000ms). So: 800ms for HTML, then 2.2 sec for JavaScript/CSS/images. (c) Identify where the 2.2 sec goes: API call (200ms) + render (1 sec) + image load (1 sec). (3) API timing breakdown: (a) Request: 5ms (network latency, browser → API GW). (b) API GW: 10ms. (c) Lambda cold start: 50ms (suspected culprit). (d) Lambda execution: 100ms (query logic). (e) RDS: 20ms. (f) Response: 5ms. Total: 190ms backend (matches X-Ray). The browser says 3 sec, so the front end is the issue. (4) Front-end profiling: (a) Image loading: 1 sec wasted on an unoptimized PNG (5MB). Compress to 500KB (~100ms instead). (b) JavaScript bundle: 1 sec to parse/execute a 2MB minified bundle. Tree-shake unused code (30% smaller, saves ~300ms). (c) CSS: render-blocking; 600ms before first contentful paint. Load CSS async and inline the critical CSS (saves ~400ms). (d) Combined: 2.2 sec → 0.5 sec (~80% faster). (5) Monitoring: (a) Send browser timings to the backend (DataDog, New Relic, or a DIY beacon into CloudWatch). (b) CloudWatch dashboard: API latency (200ms) vs total page load (3 sec) visualizes the front-end gap. (c) Alerts: if page load >2 sec, alert (likely a front-end regression). (6) Root-cause summary: backend (X-Ray) = 200ms, front end (browser timings) = 2.8 sec. 93% of latency is front-end, not backend. Fix images + bundle + CSS, not Lambda. Implementation: 2 weeks (profile, optimize, test). Cost: image CDN (CloudFront) + lazy-loading library + bundle analyzer ≈ $50/month.
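For the DIY-beacon route in step (5a), one low-friction way to land browser timings in CloudWatch is the embedded metric format (EMF): the Lambda receiving the beacon just prints a structured JSON line and CloudWatch materializes the metrics. A minimal sketch — the namespace and field names here are illustrative, not from the original setup:

```javascript
// Sketch: turn a browser timing beacon into a CloudWatch embedded-metric-format
// (EMF) log line. Printing this from a Lambda creates the metrics automatically.
function toEmf(beacon) {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'Frontend',          // illustrative namespace
        Dimensions: [['Page']],
        Metrics: [
          { Name: 'PageLoadMs', Unit: 'Milliseconds' },
          { Name: 'ApiCallMs', Unit: 'Milliseconds' },
        ],
      }],
    },
    Page: beacon.page,
    PageLoadMs: beacon.pageLoadMs,
    ApiCallMs: beacon.apiCallMs,
  });
}

console.log(toEmf({ page: '/checkout', pageLoadMs: 3000, apiCallMs: 200 }));
```

With both `PageLoadMs` and `ApiCallMs` in the same namespace, the "front-end gap" dashboard in (5b) is a single metric-math subtraction.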
Follow-up: You optimized images and JavaScript. Page load now 1.2 sec. But X-Ray shows Lambda cold start is still 50ms every 2-3 requests. Customers complain of unpredictable latency (spiky). How do you eliminate Lambda cold starts?
You enabled X-Ray, but traces show 5,000+ segments per request (too much data). X-Ray costs explode to $5K/month (AWS charges per segment recorded). CloudWatch Logs show the same information but cheaper. Should you stick with X-Ray or switch to CloudWatch Logs to save cost?
X-Ray vs CloudWatch Logs trade-off analysis: (1) Cost comparison: (a) X-Ray: $5.00 per million recorded segments. Illustration at 100K requests/day: 5,000 segments/request × 100K req/day × 30 days = 15B segments/month = $75K/month (your observed $5K/month implies lower volume, but the trajectory is the same). (b) CloudWatch Logs: $0.50 per GB ingested. If each trace = 1KB, 100K req/day × 30 days × 1KB = 3GB/month = $1.50/month. 50,000x cheaper. (2) Root cause of 5K segments per request: (a) Over-instrumentation. X-Ray captures every function call, SQL query, and HTTP call. A standard microservice produces 50-200 segments per request. 5K suggests: (i) recursive function tracing (each call = a segment). (ii) A database loop (100 queries = 100 segments). (iii) Nested service calls (too many intermediate layers). (b) Fix: reduce instrumentation to the normal ~150-200 segments per request, then add X-Ray sampling (e.g., 1%): 100K req/day × 0.01 = 1K sampled req/day × ~170 segments × 30 days ≈ 5M segments/month ≈ $25/month. (3) Hybrid approach (recommended): (a) X-Ray sampling: 1-5% of traffic for latency profiling (statistical visibility). Cost: $25-100/month. (b) CloudWatch Logs: structured logging for 100% of traffic. Cost: $2-5/month. (c) Correlation: link X-Ray trace IDs to CloudWatch Logs. When investigating an incident, drill from the X-Ray trace → detailed CloudWatch Logs. (4) Decision tree: (a) If you need latency profiling + a service map (visualize dependencies): X-Ray with sampling. (b) If you need detailed logs (debugging failures): CloudWatch Logs at 100%. (c) If you need both: use both (hybrid, $50-100/month total). (5) Implementation: (a) Configure the X-Ray recorder in Lambda to sample 2% of requests (not 100%). (b) CloudWatch Logs: capture all requests/responses + errors (100%). (c) Correlate: add the X-Ray trace ID to CloudWatch Logs. Insights query: `fields @timestamp, @message, trace_id | filter trace_id = '<trace-id>'`.
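The sampling arithmetic in this answer is easy to get wrong by a factor of 30, so it helps to sanity-check it with a tiny cost function (figures are illustrative; $5 per million recorded segments is the published X-Ray rate assumed throughout):

```javascript
// Estimate monthly X-Ray cost at $5 per million recorded segments.
function xrayMonthlyCost({ segmentsPerRequest, requestsPerDay, sampleRate }) {
  const segmentsPerMonth = segmentsPerRequest * requestsPerDay * sampleRate * 30;
  return (segmentsPerMonth / 1e6) * 5; // USD
}

// Unsampled, over-instrumented: 5K segments × 100K req/day
console.log(xrayMonthlyCost({ segmentsPerRequest: 5000, requestsPerDay: 100000, sampleRate: 1 }));   // 75000
// Trimmed to ~170 segments and sampled at 1%
console.log(xrayMonthlyCost({ segmentsPerRequest: 170, requestsPerDay: 100000, sampleRate: 0.01 })); // ≈ 25.5
```

Note that sampling alone (1% of 5K-segment traces) still leaves ~150M segments/month ≈ $750; you need both the instrumentation trim and the sampling to reach the ~$25 figure.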
Follow-up: At 2% sampling, the latency spike at 2:30 PM (99.9th percentile) wasn't captured by X-Ray (unlucky). CloudWatch Logs also didn't flag it (no threshold alert). How do you catch rare, extreme outliers?
Your application is distributed across 20 Lambda functions, 5 DynamoDB tables, 3 RDS databases, and 2 S3 buckets. A customer reports: "My transaction failed after 10 minutes of waiting. No error message." CloudWatch shows no errors (status 200 OK). X-Ray shows trace ended prematurely. How do you debug silent failures?
Debug silent failures (no errors, but the request hangs): (1) Symptoms: (a) Request times out after 10 min. Note the limits: a synchronous API Gateway call caps at a 29-sec integration timeout, while Lambda itself can run up to 15 min, so a 10-min wait implies an async invocation path. (b) Status 200 OK (success), but no result returned (orphaned request). (c) X-Ray trace cut off mid-request (incomplete trace). (2) Root-cause analysis: (a) Check the Lambda timeout: if set to 5 min, execution stops at 5 min and can return an incomplete response (HTTP 200 but missing data) when async work wasn't awaited; Lambda itself doesn't error. (b) Check database connections: if Lambda exhausts the RDS connection pool, subsequent requests hang waiting for a connection (timeout, no error). (c) Check DynamoDB throttling: when DynamoDB hits capacity, the SDK silently retries throttled requests, which looks like a hang rather than an error. (d) Check for deadlock: if services wait on each other in a circular dependency (A waits for B, B waits for A), they hang forever. (3) Investigation steps: (a) X-Ray trace inspection: the trace shows a "Pending" segment (still executing). Click the segment → see: (i) Duration: 10:00+ (far beyond the expected <100ms). (ii) Subsegment: stuck in an RDS query (e.g., a SELECT waiting on a lock). (iii) Inference: RDS deadlock or hung query. (b) RDS Performance Insights: query execution history. Find the 10-min hanging query; it shows "query blocked by another query (ID: xyz)", indicating a locked table. (c) CloudWatch Logs: Lambda logs show "Connecting to RDS..." with nothing after it, indicating a connection-acquisition timeout. (4) Common causes + fixes: (a) Lambda connection-pool exhaustion: (i) Lambda opens a new RDS connection per invocation (an anti-pattern). (ii) After ~100 concurrent invocations, the RDS connection limit is reached. (iii) New invocations hang waiting for an available connection. (iv) Fix: use RDS Proxy (a managed connection pooler): Lambda → RDS Proxy (pools connections) → RDS. Scales with concurrency. Cost: ~$0.015 per vCPU-hour of the target instance ≈ $11/month for a 1-vCPU database. (b) DynamoDB throttling: (i) DynamoDB has provisioned capacity (e.g., 100 WCU); requests beyond it are throttled and retried. (ii) Fix: use on-demand billing (automatic scaling, no throttling). Cost: higher per request, but no hanging. (c) Query deadlock: (i) Two transactions update the same rows in different orders. (ii) Fix: add a lock timeout to RDS transactions (e.g., 5 seconds). If the lock isn't acquired in 5 sec, fail fast instead of hanging for 10 min. (iii) Retry with exponential backoff. (5) Monitoring: (a) CloudWatch alarm: if Lambda invocation duration > 5 sec, alert. (b) X-Ray: track invocation-to-completion latency; alert if any invocation > 10 sec. (c) RDS: monitor connection count; alert if utilization > 80%. (6) Implementation: RDS Proxy + timeout configuration = 1 week. Cost: ~$11/month, plus no more hanging requests (happier customers).
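The fail-fast-then-retry pattern from step (4c) can be sketched as a small wrapper; `runTransaction` below is a stand-in for your actual DB call, not a real API:

```javascript
// Retry a failed transaction with exponential backoff.
// Pair this with a short DB-side lock timeout (e.g. MySQL's
// innodb_lock_wait_timeout = 5) so contention fails in seconds, not minutes.
async function withBackoff(runTransaction, { retries = 3, baseMs = 100 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      if (attempt === retries) throw err;  // out of retries: surface the error
      const delayMs = baseMs * 2 ** attempt; // 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Adding random jitter to `delayMs` is a common refinement so that colliding transactions don't retry in lockstep.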
Follow-up: You enabled RDS Proxy but transaction went from hanging 10 min to failing in 5 sec (timeout). Deadlock is now visible (errors thrown). Do you increase timeout to hide the deadlock, or fix the root cause (transaction logic)?
Your company has CloudWatch but not X-Ray. You want to implement distributed tracing on a budget. Can you use CloudWatch Logs + correlation IDs as a DIY X-Ray replacement? What are the trade-offs?
DIY distributed tracing via CloudWatch Logs + correlation IDs (a budget-friendly X-Ray alternative): (1) Architecture: (a) Correlation ID: generate a UUID at the edge (API Gateway or the first Lambda) and pass it to all downstream services via header/context. (b) Log injection: every log statement includes the correlation ID: `{ timestamp, level, message, correlation_id, service, duration_ms }`. (c) CloudWatch Logs Insights: query all logs for a correlation_id → reconstruct the full request trace. (2) Implementation (example):

```javascript
// Shared logging pattern for every Lambda in the request path.
const { randomUUID } = require('crypto');

exports.handler = async (event) => {
  const startTime = Date.now();
  const correlationId = event.headers?.['x-correlation-id'] || randomUUID();
  const log = (message, level = 'INFO') => console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    correlation_id: correlationId,
    service: 'api-service',
    duration_ms: Date.now() - startTime,
  }));

  log('Request started');           // => CloudWatch Logs
  const result = await queryRDS();  // RDS client also injects correlation_id via an SDK interceptor
  log('RDS query complete');
  return result;
};
```

(3) Query reconstruction (CloudWatch Logs Insights):

```query
fields @timestamp, @message, duration_ms
| filter correlation_id = 'abc-123'
| sort @timestamp asc
```

Output: all events for abc-123 in chronological order, with per-step timings (a separate `stats sum(duration_ms)` query gives the total). A pseudo X-Ray. (4) Trade-offs: (a) Pros: (i) Near-zero cost: at $0.50/GB ingested, 100K requests/day × 1KB ≈ 3GB/month ≈ $1.50/month. (ii) No external service (X-Ray) to manage. (iii) Logs are queryable (CloudWatch Logs Insights is powerful). (b) Cons: (i) Manual propagation: you must wire correlation-ID injection into 20+ services; the X-Ray SDK auto-injects. (ii) No automatic service-map visualization (X-Ray has the graph). (iii) Slower to query: Logs Insights takes 5-10 sec to scan logs vs a near-instant X-Ray trace lookup. (iv) No automatic error tracking: you must parse error logs yourself.
(v) Not free at 100%: if you log all requests, it still costs roughly $2-5/month at scale (10x+ cheaper than X-Ray, but not zero). (5) Hybrid recommendation: (a) Use CloudWatch Logs for 100% of requests (cheap). (b) Use X-Ray at 1-2% sampling for latency outliers + the service map. (c) Put the correlation ID in both. (d) When debugging, drill: X-Ray trace → CloudWatch Logs filtered by correlation ID. (e) Cost: $5/month CloudWatch + $5/month X-Ray (1% sampling) = $10/month. Far cheaper than X-Ray at 100%. (6) Timeline: DIY logging = 2 weeks (implement the correlation ID in all services). ROI shows up at the first P0 incident: a correlation-ID drill-down takes ~30 min vs ~10 min with X-Ray, but either beats hours of blind log grepping. Pragmatic path: start with DIY, add X-Ray sampling when the team scales.
Follow-up: Your DIY logging is working, but correlation IDs aren't propagating across async operations. A Lambda triggers SQS, and SQS triggers another Lambda; each Lambda ends up with a different correlation_id. How do you maintain traceability across async boundaries?
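One common approach is to carry the ID explicitly in SQS message attributes: the producer attaches it to the outgoing message, and the consumer reads it off the Lambda event record before generating its own. The helper names below are illustrative; the attribute shapes (`MessageAttributes` on send, `messageAttributes.<name>.stringValue` on the Lambda SQS event) are the standard ones:

```javascript
// Producer Lambda: attach the correlation ID to the outgoing SQS message.
function buildSqsParams(queueUrl, payload, correlationId) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(payload),
    MessageAttributes: {
      correlationId: { DataType: 'String', StringValue: correlationId },
    },
  };
}

// Consumer Lambda: recover the ID from the SQS event record.
// Returns null when absent so the caller can fall back to a fresh UUID.
function extractCorrelationId(sqsRecord) {
  return sqsRecord.messageAttributes?.correlationId?.stringValue ?? null;
}
```

The same idea generalizes: SNS supports message attributes too, and for EventBridge or Step Functions the ID can ride inside the event payload itself.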
You enabled X-Ray and CloudWatch Logs together. Now you have duplicate data: X-Ray shows trace, CloudWatch Logs shows detailed error. But storing both costs $1K/month (X-Ray $900, Logs $100). You're asked to cut costs to $200/month. Kill X-Ray or Logs?
Cost optimization: keep CloudWatch Logs, downsample X-Ray to 2% (total ≈$118/month): (1) Cost breakdown: (a) X-Ray: $5/M segments × 180M segments/month (e.g., ~2K requests/day at ~3K segments each — heavily over-instrumented) = $900/month. (b) CloudWatch Logs: $0.50/GB × 200GB/month = $100/month. Reasonable. (c) Total: $1,000/month. Target: $200/month (80% reduction). (2) Analysis: (a) Do you need X-Ray at all? X-Ray is for latency profiling + the service map; CloudWatch Logs is for detailed debugging. (b) Service-map benefit: see the Lambda → RDS → S3 dependency graph. But you can maintain this manually in architecture docs. (c) Latency profiling: CloudWatch Logs can carry a duration field; query the slowest requests. Less elegant, but far cheaper. (3) Recommendation (cheapest): (a) Kill X-Ray entirely (save $900). Cost: $100/month. (b) Enhance CloudWatch Logs with duration metrics + sampling. (c) When you need latency visibility (an incident), enable X-Ray temporarily (~$10 for a week). (d) Service dependencies: document in architecture diagrams (static, not real-time). (4) Hybrid (if the service map is critical): (a) X-Ray sampling at 2% of requests. Cost: $900 × 0.02 = $18/month. (b) CloudWatch Logs at 100%: $100/month. (c) Total: $118/month (88% reduction). (5) Implementation: (a) Configure the X-Ray recorder to sample 2%. (b) Add duration_ms to CloudWatch Logs. (c) Logs Insights query: `stats avg(duration_ms), max(duration_ms) by service`. (d) Keep a static service-map doc in Git. (6) Monitoring: (a) Weekly dashboard: latency by service (from CloudWatch Logs). (b) On a latency anomaly, raise X-Ray sampling temporarily. (c) Alert: if latency > threshold, auto-enable 1% X-Ray sampling for 1 hour. Cost: ~$5 per incident. An acceptable trade-off.
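The 2% sampling in step (5a) can be expressed as a local sampling-rules document in the X-Ray SDK's version-2 rule format. This is a sketch — how the document is loaded varies by SDK and runtime, so check the X-Ray SDK docs for your language:

```javascript
// Local sampling rules: record 2% of requests, plus a 1 req/sec floor
// (the floor keeps quiet services visible on the service map).
const samplingRules = {
  version: 2,
  default: {
    fixed_target: 1, // always record up to 1 request per second
    rate: 0.02,      // then sample 2% of the remainder
  },
  rules: [],         // per-route overrides could go here
};

console.log(JSON.stringify(samplingRules));
```

For the auto-escalation in step (6c), centrally managed sampling rules (configured in the X-Ray console or via API rather than in code) are the natural fit, since they can be changed at runtime without a redeploy.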
Follow-up: You cut X-Ray to 2% sampling. A P0 incident at 3:30 PM, but no X-Ray trace (wasn't in the 2% sample). CloudWatch Logs don't have enough context (only summary metrics). 1-hour incident response delay. How do you prevent this?