Prometheus Interview Questions

Exemplars and Trace Correlation


Your team uses Prometheus for metrics and Jaeger for distributed traces, but there's a gap: when you see a latency spike in Prometheus (P95 = 500ms), you manually search Jaeger for traces from that time window to investigate. This is tedious. How do you use exemplars to link metrics to traces automatically?

An exemplar is a reference from a metric sample to a trace. When an application records a metric, it attaches the current trace ID (and optionally span ID) as exemplar data. Prometheus stores exemplars separately from the time series, and UIs (Grafana, the Prometheus UI) can link from a metric point to the trace. Implementation:
(1) In your application, extract the trace ID from the current span context: traceID = tracer.currentSpan().traceID().
(2) When recording a metric, attach the trace ID: histogram.observe(latency, exemplar={traceID}).
(3) Exposition format: exemplars exist only in the OpenMetrics text format (the classic Prometheus text format has no exemplar syntax). An exemplar follows the sample after a '#': http_request_duration_seconds_bucket{le="0.5"} 100 # {traceID="abc123"} 0.43 1625000000 — after the exemplar's label set come the observed value and an optional timestamp.
(4) Prometheus scrapes exemplars and keeps them in memory only (never on disk), in a fixed-size buffer, so only recent exemplars are retained.
(5) Grafana dashboards: when querying histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])), enable exemplars on the panel and hover over a data point. Grafana shows the exemplars for that bucket with a link to Jaeger/Tempo; click to open the trace.
(6) Configure Prometheus: start it with --enable-feature=exemplar-storage to enable the in-memory exemplar store.
(7) For applications: use a Prometheus client library that supports exemplars (the Go, Java, and Python libraries do). For collectors: most modern exporters support exemplars in recent versions.
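The '#' exemplar syntax above can be made concrete with a small formatter. This is an illustrative sketch — openMetricsLine is a hypothetical helper, not part of any client library; in practice the client library renders this line for you:

```go
package main

import "fmt"

// openMetricsLine is a hypothetical helper that renders one histogram
// bucket sample with an attached exemplar, following the OpenMetrics
// text format: name{labels} value # {exemplar-labels} exemplar-value timestamp
func openMetricsLine(name, le string, count int, traceID string, observed, ts float64) string {
	return fmt.Sprintf(`%s_bucket{le=%q} %d # {traceID=%q} %g %.0f`,
		name, le, count, traceID, observed, ts)
}

func main() {
	fmt.Println(openMetricsLine("http_request_duration_seconds", "0.5", 100, "abc123", 0.43, 1625000000))
	// → http_request_duration_seconds_bucket{le="0.5"} 100 # {traceID="abc123"} 0.43 1625000000
}
```

Note that the exemplar's value is the individual observation (0.43s here), while the sample value before the '#' is the cumulative bucket count.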

Follow-up: If an exemplar trace ID is sampled (not all requests are traced), how does this affect latency percentile calculations?

You've implemented exemplars, and now Prometheus stores 10 exemplars per metric bucket per scrape. With 1000 metrics and 10 scrapes/minute, that's 100k exemplars/minute. Prometheus memory is growing. How do you manage exemplar memory and retention?

Exemplars are stored in memory and use non-trivial space. Manage with:
(1) Limit total exemplars: --storage.exemplars.max-exemplars caps the in-memory store (default 100,000 for the entire instance). With 1000 metrics, that averages ~100 exemplars per metric, which is usually acceptable.
(2) Eviction: the store is a fixed-size circular buffer, so once it is full the oldest exemplars are overwritten automatically; there is no separate retention-duration setting.
(3) For histogram metrics with many buckets (20+), each bucket series stores exemplars independently. This multiplies usage: 1 metric × 20 buckets × 100 exemplars = 2,000 exemplars per metric family. Monitor the prometheus_tsdb_exemplar_* metrics to track usage.
(4) Sampling: not all metrics need exemplars. Attach them to high-value metrics (latency histograms, error rates) and skip low-importance ones (static counters, gauges). In instrumentation, only emit exemplars for sampled traces: if span.IsSampled() { emit_exemplar() }.
(5) Filtering: Prometheus has no exemplar-specific relabeling, so suppress exemplars for low-priority metrics at the source (in instrumentation), or drop the whole series with metric_relabel_configs — exemplars attached to dropped series are dropped with them.
(6) Monitoring: alert if exemplar storage approaches its cap. Example: prometheus_tsdb_exemplar_series_with_exemplars_in_storage > 10000 indicates exemplars on more series than expected.
(7) For production at very large scale, treat the tracing backend (Grafana Tempo, Jaeger) as the durable store and keep Prometheus exemplars small — they are short-lived pointers, not the record.
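The effect of a max-exemplars style cap can be illustrated with a toy circular buffer. This is a sketch, not Prometheus's actual implementation: once the buffer is full, each new exemplar overwrites the oldest one.

```go
package main

import "fmt"

// exemplarBuffer is a toy fixed-capacity circular buffer illustrating
// oldest-first eviction under a --storage.exemplars.max-exemplars style cap.
// Not Prometheus code — just a model of the behavior.
type exemplarBuffer struct {
	buf  []string
	next int  // index that the next Add will overwrite
	full bool // true once the buffer has wrapped around
}

func newExemplarBuffer(capacity int) *exemplarBuffer {
	return &exemplarBuffer{buf: make([]string, capacity)}
}

func (b *exemplarBuffer) Add(traceID string) {
	b.buf[b.next] = traceID
	b.next = (b.next + 1) % len(b.buf)
	if b.next == 0 {
		b.full = true
	}
}

// Oldest returns the oldest exemplar still retained.
func (b *exemplarBuffer) Oldest() string {
	if b.full {
		return b.buf[b.next]
	}
	return b.buf[0]
}

func main() {
	b := newExemplarBuffer(3)
	for _, id := range []string{"t1", "t2", "t3", "t4"} {
		b.Add(id)
	}
	// Capacity 3, four adds: t1 was overwritten; the oldest survivor is t2.
	fmt.Println(b.Oldest())
}
```

With a shared buffer like this, retention is a function of ingest rate, not wall-clock time: the busier the instance, the shorter exemplars survive.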

Follow-up: If Prometheus runs out of exemplar memory (exceeds max-exemplars), which exemplars are evicted first?

You're using exemplars to link metrics to traces, but not all endpoints have traces (some services don't use tracing). For endpoints with traces, exemplars work. For others, clicking an exemplar link returns 404 in Jaeger. How do you handle heterogeneous tracing coverage in a system with exemplars?

Exemplars are optional; not every metric needs one. In heterogeneous environments:
(1) Conditional exemplar emission: apps emit exemplars only when tracing is enabled. Check at runtime: if tracer.isActive() { emit_exemplar() }.
(2) Graceful degradation: configure Grafana to handle missing traces, so a Jaeger 404 shows "trace not found" instead of a broken panel.
(3) Filtering: Prometheus has no exemplar-specific relabeling, so suppress exemplars at the source for untraced services — e.g. gate emission on a per-service 'tracing' configuration flag — rather than in scrape config.
(4) Fallback: for untraced services, attach a correlation key that links somewhere else, e.g. a request ID that resolves in structured logs (ELK, Splunk). The exemplar is just a label set, so it can carry a trace ID, span ID, or any other correlation key.
(5) Migration path: as you add tracing to more services, exemplars gradually cover more endpoints. Until full coverage, use multiple linking methods: (a) metrics → traces via exemplars for traced services; (b) metrics → logs via job + timestamp for untraced services; (c) dashboards can show both options.
(6) Monitoring: track the percentage of exemplars that resolve to actual traces — conceptually count(exemplars_with_valid_traces) / count(total_exemplars), which requires tooling that follows the links. Alert if the resolution rate drops below ~80%; that indicates trace collection or retention issues.
(7) For large-scale systems, use probabilistic sampling: emit exemplars for ~1% of traced requests, ensuring coverage without overwhelming exemplar memory.
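The conditional-emission gate can be sketched as follows. renderSample is a hypothetical helper (no client library exposes exactly this); the point is that a sample line only grows an exemplar when the request was actually traced, so untraced services never produce dangling links:

```go
package main

import "fmt"

// renderSample is a hypothetical helper: it renders a counter sample and
// attaches a trace-ID exemplar only when the current request was sampled
// by the tracer, so untraced services never emit dangling exemplar links.
func renderSample(name string, value float64, traceID string, sampled bool) string {
	line := fmt.Sprintf("%s %g", name, value)
	if sampled && traceID != "" {
		// Traced request: attach the exemplar so dashboards can deep-link.
		line += fmt.Sprintf(" # {traceID=%q} %g", traceID, value)
	}
	return line
}

func main() {
	fmt.Println(renderSample("http_requests_total", 42, "abc123", true))  // exemplar attached
	fmt.Println(renderSample("http_requests_total", 42, "", false))       // plain sample
}
```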

Follow-up: If exemplar trace IDs refer to traces that are sampled out and no longer in Jaeger, how long before the exemplar link breaks?

You've set up exemplars for a high-cardinality distributed system (100 services, 1000 endpoints). Exemplars work locally, but when you upgrade Prometheus from 2.40 to 2.45, exemplar format changes and old exemplars become unreadable. How do you handle exemplar format migration?

Exemplar handling can change between Prometheus versions, but the volatility of exemplars makes this easy to absorb. Mitigation:
(1) Check the Prometheus release notes for exemplar-related changes; behavior is normally backward-compatible within a major version.
(2) Exemplars don't persist across restarts (they live in memory, not on disk like time series). When you restart Prometheus after the upgrade, the old exemplars are flushed and new ones are stored by the new version.
(3) No migration is needed: after upgrade and restart, Prometheus starts fresh. You lose the pre-upgrade exemplars, which matters little since the in-memory store only holds recent ones anyway.
(4) For rolling upgrades with multiple Prometheus replicas, upgrade them one at a time. Metrics continue flowing, and exemplars reset per instance.
(5) If something does break: (a) test the upgrade on a non-production Prometheus first to verify exemplars still work; (b) monitor the prometheus_tsdb_exemplar_* metrics after the upgrade to confirm exemplars are being stored; (c) if Grafana links break, inspect the exemplar API output, e.g. curl 'http://prometheus:9090/api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=...&end=...'; (d) update Grafana's exemplar/trace datasource configuration if the response shape changed.
(6) For zero-downtime upgrades: put a proxy in front of the Prometheus instances so they can be upgraded behind a stable API endpoint.

Follow-up: If exemplar format changes and Grafana's exemplar link format also changes, how do you ensure compatibility between old Prometheus and new Grafana?

Your team is using exemplars to correlate metrics with traces. However, you notice that during high-traffic periods (1M+ requests/sec), exemplars are sampled heavily (only 1% of traces are included as exemplars). P95 latency shows 200ms, but the exemplars linked show 5s latency (outliers). Are exemplars representative of typical behavior?

Exemplars are a sample of traces, not a complete view, and under heavy load sampling bias is inevitable:
(1) Sampling strategy: at a 1% sampling rate, exemplars represent 1% of requests. Worse, an exemplar attached to a high histogram bucket is by definition a slow request, so the exemplars surfaced on a P95 panel skew toward the tail.
(2) Bias: if your tracer preferentially samples errors or slow requests (e.g. sampling_priority="error"), exemplars skew further toward pathological cases rather than typical requests.
(3) Mitigation: (a) use uniform sampling (e.g. 1% of all requests), not sampling biased toward slow/error requests, if you want representative exemplars; (b) display exemplar context in Grafana: count, sampling rate, and a disclaimer that they are not the full distribution; (c) cross-check with percentiles: if histogram_quantile shows P95 = 200ms but linked exemplars cluster around 5s, you are looking at top-bucket exemplars — query Jaeger directly over the time range to see the true distribution; (d) retain more exemplars: a larger --storage.exemplars.max-exemplars budget keeps more recent exemplars per series, giving broader coverage; (e) for production SLO monitoring, don't rely on exemplars for the numbers — use histogram_quantile for accurate percentiles and exemplars only for debugging individual traces.
(4) Transparency: document exemplar limitations on dashboards. Add a note such as "Exemplars sampled at 1%; may not represent typical behavior."
(5) Monitoring: track the exemplar sampling rate and alert if it drops below the expected threshold (a sign the system or tracer is under stress).

Follow-up: If exemplars are skewed toward slow traces, and you use them to debug latency, will you always focus on the slowest edge cases instead of typical slow behavior?

You're implementing OpenTelemetry (OTel) for your microservices and want exemplars in Prometheus to link to OTel traces. However, your app traces use OTel trace IDs (128-bit hex), but your Prometheus exemplar implementation expects simple IDs. How do you adapt exemplars to OTel trace IDs?

OpenTelemetry uses W3C Trace Context: a 128-bit trace ID and a 64-bit span ID. Exemplars handle both, because exemplar labels are plain strings. Adaptation:
(1) OTel trace ID format: '4bf92f3577b34da6a3ce929d0e0e4736' (32 hex characters). Pass it directly as the exemplar label value.
(2) In Go with prometheus/client_golang, use the ExemplarObserver interface: histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(duration, prometheus.Labels{"traceID": span.SpanContext().TraceID().String()}). (OTel SDK exporters targeting Prometheus can also attach exemplars automatically from the active span context.)
(3) In the OpenMetrics exposition, the exemplar carries the ID verbatim: metric_bucket{le="1.0"} 100 # {traceID="4bf92f3577b34da6a3ce929d0e0e4736"} 0.05 1625000000.
(4) Scraping: Prometheus only parses exemplars from the OpenMetrics text format. If your app exports the classic Prometheus text format, exemplars are silently ignored — check that the exporter negotiates OpenMetrics.
(5) Configure Prometheus with --enable-feature=exemplar-storage so scraped exemplars are kept.
(6) Grafana linking: on the Prometheus datasource, configure the exemplar's trace-ID label and the trace datasource (Jaeger/Tempo) so clicking an exemplar opens the right trace, regardless of ID format.
(7) For mixed OTel and non-OTel services: both ID styles are hex strings, so they coexist in exemplars; if traces live in different backends, route links per service (e.g. keyed on a label identifying the tracing backend).
(8) Testing: verify end-to-end: emit a metric with an exemplar → Prometheus scrapes → Grafana displays → click the link → Jaeger/Tempo opens the correct trace.
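Before attaching an ID as an exemplar, it can be cheaply validated against the W3C shape. validTraceID is a hypothetical guard, not a library function; the all-zeros check comes from the W3C Trace Context spec, which defines that value as invalid:

```go
package main

import (
	"fmt"
	"regexp"
)

// w3cTraceID matches a W3C Trace Context trace ID: exactly 32 lowercase
// hex characters.
var w3cTraceID = regexp.MustCompile(`^[0-9a-f]{32}$`)

// validTraceID is a hypothetical guard used before attaching an exemplar,
// so malformed or unset IDs never reach the exposition output. The W3C
// spec defines the all-zeros trace ID as invalid.
func validTraceID(id string) bool {
	return w3cTraceID.MatchString(id) && id != "00000000000000000000000000000000"
}

func main() {
	fmt.Println(validTraceID("4bf92f3577b34da6a3ce929d0e0e4736")) // valid W3C ID
	fmt.Println(validTraceID("abc123"))                           // too short
	fmt.Println(validTraceID("00000000000000000000000000000000")) // all zeros: invalid
}
```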

Follow-up: If you have both Jaeger (internal format) and OTel (W3C format) traces, can Prometheus exemplars link to both?

You're using exemplars for debugging latency spikes, but the exemplars are being sampled based on the tracing backend's sampling decision, not Prometheus's. Some high-latency requests aren't traced at all, so they never become exemplars. How do you ensure critical exemplars (slow requests, errors) are always captured?

Tracing and metrics sampling are independent: if a request isn't traced, it can't become an exemplar. To ensure critical exemplars:
(1) Head-based sampling in OTel: the decision is made at request start. 'always_on' samples everything, 'always_off' samples nothing, and 'parentbased_traceidratio' samples a fraction; pick a rate high enough that slow and failing requests have a reasonable chance of being traced at all.
(2) Tail-based sampling: use the OTel Collector's tail_sampling processor to keep traces based on what actually happened — latency over a threshold, error status — after the full trace has been seen.
(3) Critical-path sampling: force-sample important operations (payments, authentication), e.g. by setting a span attribute that your sampler is configured to honor.
(4) Exemplar budgeting: since Prometheus exemplar storage is capped, don't spend it on low-value metrics. Emit exemplars only from the metrics you actually debug with, preserving buffer space for high-value ones.
(5) Dual pipelines: run separate collector pipelines — a low-rate one for bulk volume and a high-rate one for critical paths — correlated via trace ID.
(6) Monitoring: track exemplar representativeness, conceptually count(exemplars with latency > p95) / count(high-latency requests). Alert if it falls below ~50%: important slow requests aren't being traced.
(7) Cost-benefit: a higher sample rate costs more in trace storage and processing. Balance visibility (high sample rate) against cost (low sample rate).
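The tail-sampling policies above look roughly like this in OTel Collector configuration. This is a sketch — the policy names are illustrative, and the receiver/exporter wiring is a placeholder to verify against your collector version:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    policies:
      - name: high_latency      # keep any trace slower than 1s
        type: latency
        latency:
          threshold_ms: 1000
      - name: errors            # keep any trace containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]

service:
  pipelines:
    traces:
      receivers: [otlp]         # placeholder pipeline wiring
      processors: [tail_sampling]
      exporters: [otlp]
```

Policies are OR-ed: a trace matching any policy is kept, which is why slow-but-successful and fast-but-failing requests both survive sampling.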

Follow-up: If tail-based sampling drops slow requests (not matching any policy), and you want exemplars for slow requests, how do you fix the sampling policy?

Your Prometheus + Grafana + Jaeger setup uses exemplars. However, Prometheus stores exemplars in memory only. If Prometheus crashes and restarts, exemplars are lost. For compliance (audit trail), you need exemplars to persist. How do you make exemplars persistent?

Exemplars in Prometheus are volatile (in-memory only, never persisted to disk). For persistence:
(1) Accept ephemeral exemplars: the exemplar buffer only holds recent data anyway. For debugging live issues this is acceptable; accept that exemplars are lost on restart.
(2) Export exemplars: periodically pull them through the HTTP API — GET /api/v1/query_exemplars?query=<selector>&start=<t0>&end=<t1> — and write them to a durable store (S3, a database) on a schedule.
(3) Forward via remote_write: Prometheus can send exemplars alongside samples to a remote backend that supports them (such as Grafana Tempo's metrics pipeline or another Prometheus-compatible store) by setting send_exemplars: true in the remote_write configuration.
(4) Tracing backend as source of truth: if Jaeger/Tempo already stores all traces, exemplars in Prometheus are just pointers — the real persistence is the tracing backend. When Prometheus restarts and exemplars reset, you can still query the tracing backend directly by time range.
(5) Dual-write by design: the app sends metrics to Prometheus (ephemeral exemplars) and traces to the tracing backend (persistent). Exemplars are convenience links, not the record of truth.
(6) For compliance audit trails: attach audit metadata to spans (e.g. via OpenTelemetry baggage propagated into span attributes) so the durable record lives in the tracing backend and remains queryable even if Prometheus crashes.
(7) High availability: run multiple Prometheus replicas. Each builds its own exemplar state from scrapes, so if one crashes the others still hold recent exemplars; note there is no built-in mechanism for replicating exemplars between instances.
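The remote_write forwarding option is a one-line addition to the Prometheus configuration. The URL below is a placeholder; the receiving backend must itself accept exemplars in the remote-write protocol:

```yaml
remote_write:
  - url: https://metrics-backend.example.com/api/v1/write  # placeholder endpoint
    send_exemplars: true   # forward exemplars alongside samples
```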

Follow-up: If you export exemplars to a database for compliance, how do you handle PII in trace IDs or exemplar metadata?
