Grafana Interview Questions

Tempo and Distributed Tracing


Your microservices (50 services, 5K RPS) generate roughly 5K traces/second, averaging 100 spans per trace (about 500K spans/second). You're storing traces in Tempo, but the cluster is at 95% disk usage after just 1 week. At this rate, you'll run out of storage in days. Design a trace sampling and retention strategy that keeps storage costs manageable while maintaining enough data for production debugging.

Implement multi-level trace sampling with storage tiering:

1. Head-based sampling: at ingestion, sample on trace properties. Keep 100% of error traces, 100% of traces with high latency (>1s), 50% of traces with warnings, and 10% of normal traces. Tempo does not sample on its own, so implement these rules in the ingest pipeline (e.g., the OpenTelemetry Collector).
2. Tail-based sampling: after a trace completes, decide whether to keep it based on full context (if any span experienced an error, keep the entire trace). This requires buffering complete traces in a collector tier before the keep/drop decision.
3. Adaptive sampling: monitor storage growth; as you approach capacity, tighten sampling dynamically (e.g., drop to 5% of normal traces). Drive the feedback loop from Tempo cluster metrics.
4. Time-based retention: store 7 days of sampled traces hot (queryable), 30 days in a warm tier (S3, slower queries), and 1 year in a cold archive tier. Configure Tempo's retention and object-store lifecycle policies accordingly.
5. Service-level SLOs: define per-service retention. Critical services (payment, auth) keep 100% of error traces; experimental services keep 1% of normal traces. Store the policies in Git.
6. Cost tracking: expose storage cost per service and per sampling strategy, and use it to guide team decisions on acceptable trace volume. Enforce cost budgets: "You have 1TB/month; adjust sampling if you exceed it."

For critical debugging, maintain an "always sample" list of high-interest trace attributes (customer_id, request_id, order_id) so that traces containing them are always kept.
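The head-sampling tiers above can be sketched as a deterministic policy function. This is a minimal illustration, assuming a simplified trace dict (`trace_id`, `has_error`, `latency_ms`, `has_warning`, `attributes`) rather than a real collector processor:

```python
import hashlib

# Hypothetical sampling tiers mirroring the strategy above; in practice this
# logic would live in an OpenTelemetry Collector sampling processor.
RULES = [
    ("error",        lambda t: t.get("has_error"),            1.00),
    ("high_latency", lambda t: t.get("latency_ms", 0) > 1000, 1.00),
    ("warning",      lambda t: t.get("has_warning"),          0.50),
]
DEFAULT_RATE = 0.10
ALWAYS_SAMPLE_ATTRS = {"customer_id", "request_id", "order_id"}

def should_sample(trace: dict) -> bool:
    # Always keep traces carrying high-interest attributes.
    if ALWAYS_SAMPLE_ATTRS & set(trace.get("attributes", {})):
        return True
    rate = DEFAULT_RATE
    for _name, predicate, tier_rate in RULES:
        if predicate(trace):
            rate = tier_rate
            break
    # Hash the trace ID so every instance makes the same keep/drop decision
    # for the same trace, without coordination.
    bucket = int(hashlib.sha256(trace["trace_id"].encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Hashing the trace ID (rather than calling a random generator) keeps the decision consistent across collectors, which matters once a trace's spans arrive at more than one instance.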

Follow-up: Your sampling is too aggressive—an incident happens and the critical trace that would have caught it was sampled out (1% sampling rate). How do you balance cost and debugging completeness, and what safety margins would you build in?

A user is debugging a latency issue in a microservice. They run a trace query: "find traces with latency > 1s." Tempo returns 50K matching traces. Exporting and analyzing this many is impractical. Design a trace analysis system that summarizes and correlates large trace sets, surfacing root causes instead of overwhelming users with raw data.

Implement intelligent trace analytics:

1. Trace aggregation: group traces by key attributes (service, endpoint, error_type, client_region) and summarize statistics per group: median latency, 95th percentile, error rate. Show the top-N groups by latency or error count.
2. Flame graph aggregation: instead of individual flame graphs, show an aggregated flame graph of average time spent in each service, identifying which service is the bottleneck across all traces.
3. Span-level analytics: across all 50K traces, identify which spans consistently contribute to high latency. Show span type, median duration, and variance; flag spans with high variance (e.g., a database query that is sometimes 10ms and sometimes 2000ms, since high variance indicates instability).
4. Error correlation: group traces by error type; show which errors are most common, which services originate them, and what the recovery pattern is.
5. Critical path analysis: compute the critical path (the longest chain of dependencies) across aggregated traces. Show which services sit on the critical path and where optimization would have the most impact.
6. Comparison queries: "compare latencies from 2 hours ago vs. now." Show metric divergence, pinpointing anomalies.
7. Regression detection: identify services/endpoints where latency recently increased versus the historical baseline, using statistical significance testing.

Implement a trace analysis dashboard showing all of the above, with filtering by service, region, client, and error type. For exporting, provide CSV summaries instead of raw traces, enabling analysis in spreadsheets or BI tools.
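The grouping-and-summarizing step can be sketched in a few lines. The record shape (`service`, `endpoint`, `latency_ms`, `is_error`) is an assumption for the sketch, and nearest-rank is used for the 95th percentile:

```python
import math
from collections import defaultdict
from statistics import median

def summarize(traces):
    """Group trace records by (service, endpoint) and summarize latency/errors."""
    groups = defaultdict(list)
    for t in traces:
        groups[(t["service"], t["endpoint"])].append(t)
    rows = []
    for (svc, ep), ts in groups.items():
        lat = sorted(t["latency_ms"] for t in ts)
        p95 = lat[max(0, math.ceil(0.95 * len(lat)) - 1)]  # nearest-rank p95
        rows.append({
            "service": svc, "endpoint": ep, "count": len(ts),
            "median_ms": median(lat), "p95_ms": p95,
            "error_rate": sum(t["is_error"] for t in ts) / len(ts),
        })
    # Surface the worst groups first instead of dumping 50K raw traces.
    return sorted(rows, key=lambda r: r["p95_ms"], reverse=True)
```

The same shape extends to error_type or client_region grouping by changing the key tuple.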

Follow-up: Your aggregated flame graph shows 40% time in a database service, but when you drill into individual traces, you see variance: some traces spend 10%, others 80%. The aggregated view hides this variance. How would you surface outliers and multimodal distributions?

Your company integrates with a third-party API (payment processor). Traces show requests to the API that sometimes fail with no context about why. The API provider says "go check your logs." You need better observability into cross-service boundaries. Design a solution for tracing through third-party services that don't expose traces.

Implement trace context propagation and enrichment:

1. W3C Trace Context propagation: embed trace ID, span ID, and trace flags in all outbound requests (HTTP headers, gRPC metadata) so the third party can echo them in error responses or logs.
2. Error response parsing: when a third-party API call fails, parse the response for trace IDs or request IDs and log them. Use these correlation IDs to link your trace to theirs.
3. Synthetic tracing: if the third party doesn't return trace context, create synthetic "remote spans" representing the call, estimating duration from request/response timing.
4. Call-out logging: log every third-party API call with request details, response details, duration, and error (if any). Ingest these logs into Loki/ELK and correlate with traces via trace ID.
5. API contract versioning: document expected latency SLOs and error rates per API endpoint, and use traces to detect violations: "the API returned 500 errors on 10% of requests; the SLO is 0.1%."
6. Timeout behavior: if an API call times out (no response), generate a synthetic error span describing the timeout, so you can visualize what the system was waiting on when it failed.
7. Fallback logging: implement retry/fallback logic and trace it: if the primary API fails, record that the fallback was attempted and show the latency impact.

Implement a dashboard showing the health of each integration: call success rate, median latency, and error rate by error type. Alert on SLO violations.
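Points (1), (3), and (4) combine naturally: inject a W3C `traceparent` header, time the call as a synthetic span, and emit a structured log line joinable to the trace. A sketch, where `session_post` is a hypothetical stand-in for your HTTP client:

```python
import json
import logging
import time
import uuid

def traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context header layout: version-traceid-spanid-flags
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def call_third_party(session_post, url, payload, trace_id):
    """Wrap a third-party call in a synthetic span plus a correlation log line.

    `session_post` is any callable(url, headers, payload) -> (status, body).
    """
    span_id = uuid.uuid4().hex[:16]
    headers = {"traceparent": traceparent(trace_id, span_id)}
    start = time.monotonic()
    try:
        status, body = session_post(url, headers, payload)
        error = None if status < 500 else f"upstream {status}"
    except Exception as exc:  # network error, timeout, etc.
        status, body, error = None, None, str(exc)
    duration_ms = (time.monotonic() - start) * 1000
    # Structured log line: joinable to the trace in Loki via trace_id.
    logging.info(json.dumps({
        "trace_id": trace_id, "span_id": span_id, "url": url,
        "status": status, "duration_ms": round(duration_ms, 2), "error": error,
    }))
    return status, body, error
```

Even when the vendor ignores the header, the log line plus measured duration is enough to reconstruct a "remote span" in your own trace.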

Follow-up: The third-party API doesn't propagate trace context back in responses. You can't correlate your traces to theirs. The vendor refuses to implement trace context. What's your strategy for observability?

You're implementing distributed tracing for a polyglot environment: Go (gRPC), Python (FastAPI), Node.js (Express), Java (Spring Boot). Each team uses a different tracing library (Jaeger SDK, OpenTelemetry, custom). Traces don't correlate across services because trace IDs are generated inconsistently. How would you standardize tracing across languages and libraries without forcing team rewrites?

Implement a unified tracing standard using OpenTelemetry (OTel):

1. Standardize trace context: use W3C Trace Context (the traceparent header) across all services so trace IDs follow the same format. Deploy OTel SDKs in each language (Go, Python, Node, Java), configured to use Tempo as the backend.
2. SDK integration: provide pre-built OTel integration packages for each framework (FastAPI middleware, Express middleware, gRPC interceptors, Spring Boot auto-configuration) so developers add 2-3 lines of code to enable tracing.
3. Backwards compatibility: for teams with existing tracing (Jaeger, Datadog), implement a translation layer that converts their traces to OTel format before sending to Tempo.
4. Auto-instrumentation: for languages with robust OTel support (Python, Node, Java), use auto-instrumentation agents that inject tracing without code changes; Go requires more manual instrumentation.
5. Standardized attributes: define company-wide semantic conventions for trace attributes (service.name, service.version, deployment.environment, user.id, request.id), documented in a Git-tracked conventions file.
6. Baggage propagation: implement W3C Baggage for cross-cutting concerns (user_id, request_context, feature_flags); libraries auto-propagate baggage across service calls.
7. Validation and enforcement: in CI/CD, run OTel compliance checks ("are trace IDs properly formatted? are standard attributes present?") and fail builds that don't meet the standard. Provide auto-fix tools to migrate legacy tracing to OTel.

Create per-team onboarding guides with code examples and host weekly office hours. Measure adoption: % of requests traced and trace completeness (all services instrumented). Set targets ("100% tracing adoption in 3 months") to incentivize teams to upgrade.
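The CI compliance check in point (7) can be as small as a regex plus an attribute allow-list. The exported-span shape here is an assumption for the sketch; the `traceparent` format and the all-zero trace ID rule come from the W3C Trace Context spec:

```python
import re

# version 00: "00-" + 32 hex trace-id + "-" + 16 hex span-id + "-" + 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")
REQUIRED_ATTRS = {"service.name", "service.version", "deployment.environment"}

def check_span(span: dict) -> list:
    """Return compliance violations for one exported span (empty list = pass)."""
    problems = []
    tp = span.get("traceparent", "")
    if not TRACEPARENT_RE.match(tp):
        problems.append("malformed traceparent")
    elif tp[3:35] == "0" * 32:
        problems.append("all-zero trace ID (invalid per W3C Trace Context)")
    missing = REQUIRED_ATTRS - set(span.get("attributes", {}))
    if missing:
        problems.append("missing attributes: " + ", ".join(sorted(missing)))
    return problems
```

Run this against a sample of spans exported during integration tests and fail the build on any non-empty result.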

Follow-up: Your Python team auto-instruments FastAPI, but performance degrades 15% due to tracing overhead. They want to disable tracing in production. How would you balance observability needs with performance?

A customer reports "my requests are slow." You query traces from 2 hours ago, but the trace doesn't show internal service details—only that the request took 5 seconds total. You can't see which internal service is the culprit. Design a solution for capturing end-to-end traces that include both external (customer-perceived) and internal (service-to-service) latency.

Implement comprehensive trace visibility across the boundary:

1. Customer request entry point: when a customer request enters your system (load balancer, API gateway), inject trace context and trace the entire customer journey: load balancer → API gateway → microservices → database, with each hop adding a span.
2. Gateway-level tracing: in the API gateway, extract trace context from the customer request, log it, and propagate it to downstream services. Use the gateway to add metadata (customer_id, request_source, geographic region).
3. End-to-end flow visualization: in Tempo, display the full trace: the customer request at the top, branching into service calls, with database calls nested within. Aligned timestamps make it easy to see where time was spent.
4. Latency breakdown: at the API gateway, compute and expose a latency breakdown: network latency to the first service, service processing time, and queuing time waiting for resources.
5. Bottleneck identification: flag spans where waiting dominates useful work (e.g., "waited 2 seconds for a database connection, but the actual query took 100ms") and surface them in the trace visualization.
6. Synthetic monitoring: run regular synthetic customer requests through production, collecting full traces, and alert if end-to-end latency exceeds the SLA.
7. Real-time tracing dashboard: show the top-N slowest requests in real time and drill into traces to see where time is spent. Enable grouping by customer, region, and endpoint to identify patterns.

Implement alerts on latency regression: "P95 latency on the checkout endpoint increased 50% in the last hour."
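The latency breakdown in point (4) can be computed from span timestamps alone. A sketch over flat span records (shape assumed for illustration): any root time not covered by a child span shows up as "gap" time, which often points at queuing or lock waits that were never instrumented:

```python
def latency_breakdown(spans):
    """spans: list of {"service", "start_ms", "end_ms"}, root span first.

    Returns per-service busy time plus unattributed gap time within the root.
    Overlapping children are merged via a cursor so time is not double-counted.
    """
    root = spans[0]
    total = root["end_ms"] - root["start_ms"]
    children = sorted(spans[1:], key=lambda s: s["start_ms"])
    busy, cursor, per_service = 0, root["start_ms"], {}
    for s in children:
        start = max(s["start_ms"], cursor)  # skip already-covered time
        if s["end_ms"] > start:
            covered = s["end_ms"] - start
            busy += covered
            per_service[s["service"]] = per_service.get(s["service"], 0) + covered
            cursor = s["end_ms"]
    return {"total_ms": total, "gap_ms": total - busy, "by_service": per_service}
```

A large `gap_ms` is exactly the signal to chase: the request was waiting on something no span describes.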

Follow-up: Your trace shows a request took 5 seconds. 4.9 seconds were spent waiting for a lock (database concurrency limit). But the lock wait wasn't instrumented as a span—you see it as a gap in the trace. How would you capture resource contention as spans?

You have 100M traces stored in Tempo. A user wants to search for traces where a specific value (customer_id = 12345) appears in any span attribute. Current trace indexing doesn't support arbitrary-field search; you'd need to scan all 100M traces. This is too slow. Design a flexible trace indexing and search system.

Implement multi-dimensional trace indexing:

1. Span attributes as index keys: for high-cardinality attributes (customer_id, user_id, request_id, order_id), build dedicated indices. When ingesting traces, extract these attributes and add them to the indices.
2. Attribute configuration: define which span attributes are searchable in a Git-tracked config file. Teams can add new searchable attributes; config changes are deployed with trace ingest.
3. Index distribution: use a distributed search backend (Elasticsearch, Bleve) for arbitrary-field search. Mirror span attributes to it as documents, preserving the trace ID linkage; full-text search returns matching trace IDs, and the full traces are then fetched from Tempo.
4. Probabilistic indices: for very high-cardinality attributes, use probabilistic data structures (Bloom filters) so blocks that definitely don't contain a value can be skipped without being read.
5. Query optimization: support complex queries ("traces where customer_id = 12345 AND status = 'error' AND latency > 1s") by filtering on indices first, then applying the remaining predicates.
6. Asynchronous indexing: index span attributes asynchronously after trace ingestion, not during it, to reduce ingestion latency.
7. Index freshness: accept a short lag (~1-2 seconds) between trace ingestion and searchability for complex queries, but guarantee searchability on primary attributes within 10 seconds.

Track search query latency with metrics and alert if it exceeds 10s. For very large searches, allow pagination ("limit results to the first 10K matching traces"), returning results in batches.
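The probabilistic pre-filter in point (4) is a per-block Bloom filter: "might this block contain customer_id=12345?" A minimal sketch with illustrative, untuned parameters:

```python
import hashlib

class BloomIndex:
    """Tiny Bloom filter over attribute values in one storage block.

    False positives are possible (you fetch the block and find nothing);
    false negatives are not, so a "no" answer lets you skip the block safely.
    """
    def __init__(self, bits: int = 8192, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, value: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, value: str):
        for p in self._positions(value):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))
```

One filter per block, built at compaction time, answers "which of my 10K blocks could contain this value?" in memory before any block is read.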

Follow-up: Mirroring all span attributes to Elasticsearch inflates storage 5x and adds 30% ingestion overhead. The cost increase is too high. How would you reduce mirroring overhead while maintaining search functionality?

Your Tempo deployment spans 3 cloud regions (us-east, us-west, eu-central) for compliance. Traces from multi-region requests (customer in the US, backend in Europe) are split across regions, so querying a complete trace requires hitting multiple Tempo instances. Design a globally distributed tracing system that makes cross-region traces queryable as if they were local.

Implement federated, distributed tracing:

1. Trace routing policy: define which traces go to which region based on customer location, service location, or compliance requirements. For multi-region traces, designate a "primary region" (where the request originated) and replicate span chunks to secondary regions.
2. Distributed span collection: when a request touches services in multiple regions, each region ingests spans from its local services; trace context propagation links all spans to the same trace ID.
3. Query federation: implement a query proxy that accepts queries and fans them out to all regional Tempo instances, merging the results locally to reconstruct the complete trace. Measure per-region latency; if one region is slow, return partial results with a note.
4. Trace assembly: the query proxy reconstructs the full trace by collecting span chunks from all regions, ordering them by timestamp, and presenting a unified trace view, with cross-region spans highlighted and annotated with latency.
5. Cross-region latency estimation: if a span crosses regions (starts in the US, ends in the EU), estimate the network latency between regions and surface it as "network latency: 100ms."
6. Compliance-aware routing: enforce data residency: PII cannot leave certain regions. If a query crosses regions, redact PII from non-primary regions before returning results.
7. Caching: cache commonly queried distributed traces locally in each region, reducing cross-region latency for repeated queries.

Add analytics ("X% of traces are multi-region; average federated query latency is Y") and, for performance, query result caching and pre-warming: popular customer traces are replicated to all regions. Monitor federated query latency and alert if it exceeds 5 seconds.
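The fan-out-and-merge proxy in point (3) can be sketched with threads. `region_fetchers` maps region names to hypothetical fetch callables; a real proxy would use non-blocking I/O and proper cancellation, which `shutdown(cancel_futures=True)` only approximates:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

def federated_trace(trace_id, region_fetchers, timeout_s=2.0):
    """Query every regional Tempo for one trace ID and merge the span chunks.

    Regions that miss the deadline are reported in partial_regions instead of
    blocking the whole query; the caller can render a partial trace plus a note.
    """
    spans = []
    pool = ThreadPoolExecutor(max_workers=len(region_fetchers))
    futures = {pool.submit(fetch, trace_id): region
               for region, fetch in region_fetchers.items()}
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            spans.extend(fut.result())
    except FuturesTimeout:
        pass  # deadline hit: fall through with whatever arrived in time
    missing = sorted(region for fut, region in futures.items() if not fut.done())
    pool.shutdown(wait=False, cancel_futures=True)
    spans.sort(key=lambda s: s["start_ms"])  # reassemble into one timeline
    return {"trace_id": trace_id, "spans": spans, "partial_regions": missing}
```

The `partial_regions` field is the "note" the answer describes: the UI can show the merged trace immediately and mark which regions are still missing.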

Follow-up: Your query federation is slow—merging trace chunks from 3 regions takes 8 seconds. A developer wants to debug a latency issue right now, not wait 8 seconds. How would you provide fast-path queries while maintaining completeness?

You're using Tempo with Loki for logs and Prometheus for metrics. A trace shows a spike in latency, but you can't correlate it to logs (no error logs) or metrics (CPU/memory normal). How would you implement unified correlation across traces, logs, and metrics to enable root cause analysis?

Implement unified observability correlation:

1. Correlation IDs: traces, logs, and metrics all reference a common trace ID. Include a trace_id field in logs; for metrics, avoid trace_id as a regular label (it explodes cardinality) and use Prometheus exemplars to link individual samples to traces.
2. Unified query interface: expose a search interface that accepts a trace ID and returns matching traces in Tempo, matching logs in Loki, and related metrics in Prometheus, displayed in a unified dashboard.
3. Context propagation: ensure trace context (trace ID, span ID, baggage) reaches all telemetry: every log line emitted within a trace carries its trace_id, and metrics carry exemplar links.
4. Automatic correlation: when viewing a trace, the dashboard automatically surfaces logs from the same time window with a matching trace ID and metrics from the same window with anomalies.
5. Causality inference: use service dependency maps and correlation analysis to infer likely root causes: if a trace shows high latency, Prometheus shows a CPU spike at the same time, and logs show a deployment event, surface that deployment as the probable cause.
6. Drill-down navigation: clicking a span in a trace takes you to the Loki logs for that service; clicking a metric anomaly takes you to matching traces and logs.
7. Timeline visualization: show a unified, time-aligned timeline: traces on top, logs in the middle, metrics below, with zooming and panning across all three.

Implement a "magic query": paste a trace ID and get a full root-cause analysis report covering metric changes, log anomalies, and dependent services. This unified view reduces investigation time from hours to minutes.
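Points (4) and (5) reduce to a time-window join. The record shapes here are assumptions for the sketch, and the z-score cutoff is a naive stand-in for real anomaly detection:

```python
from statistics import mean, pstdev

def correlate(trace, logs, metrics, window_s=30):
    """Join logs by trace_id and flag metric anomalies near the trace window.

    trace:   {"trace_id", "start_s", "end_s"}
    logs:    [{"ts_s", "trace_id", "line"}, ...]
    metrics: {name: [(ts_s, value), ...]}
    All shapes are hypothetical; a real implementation would query Loki and
    Prometheus instead of in-memory lists.
    """
    lo, hi = trace["start_s"] - window_s, trace["end_s"] + window_s
    matched_logs = [l for l in logs
                    if l["trace_id"] == trace["trace_id"] and lo <= l["ts_s"] <= hi]
    anomalies = []
    for name, points in metrics.items():
        values = [v for _, v in points]
        mu, sigma = mean(values), pstdev(values)
        for ts, v in points:
            # Flag points in the window that deviate >3 sigma from the series.
            if lo <= ts <= hi and sigma > 0 and abs(v - mu) / sigma > 3:
                anomalies.append({"metric": name, "ts_s": ts, "value": v})
    return {"logs": matched_logs, "metric_anomalies": anomalies}
```

This is the core of the "magic query": everything time-adjacent and trace-linked, gathered into one report instead of three separate UIs.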

Follow-up: Your correlation is showing 50 "probable causes" for the latency spike (logs, metrics, deployments, all time-correlated). The team can't determine which is the actual root cause. How would you rank and surface the most likely root causes?
