Your application measures HTTP request latency and you're debating whether to use Prometheus Histogram or Summary. You need P95 latency for SLO monitoring. A histogram uses 10 buckets (1ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s, 10s, 50s, +Inf), but you're worried about losing precision for requests between 50ms and 100ms. Should you add more buckets, use a Summary, or aggregate differently?
Histograms aggregate by bucket; buckets represent ranges, not exact values, so with boundaries at [50ms, 100ms] you lose precision inside that range. Choose based on query patterns. Histograms are better for server-side aggregation and percentile queries via histogram_quantile(); Summaries are better when the client (exporter/app) computes percentiles and you only need to read them. For P95 latency across instances, use histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)). The function assumes observations are evenly spread within a bucket and interpolates linearly between its boundaries, so a P95 that falls in the [50ms, 100ms] bucket comes back as an interpolated value between 50ms and 100ms. Add more buckets if you need finer granularity, but this increases cardinality: each bucket is a separate time-series. For latency, common bucket layouts are exponential ([.001, .01, .1, 1, 10]) or tailored to your SLA (e.g., [.05, .1, .25, .5, 1, 2.5, 5, 10]). A Summary is simpler but less flexible: the app computes P95 directly and you just query it. However, Summaries don't aggregate well across instances.
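A minimal sketch of the instrumentation side, assuming the Python client (prometheus_client); the metric name and SLA-aligned bucket boundaries here are illustrative, not prescriptive:

```python
# Sketch: a latency Histogram with buckets chosen around the SLA range,
# so the 50-100ms region gets its own boundary instead of being lumped
# into one wide bucket.
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],  # +Inf is added automatically
    registry=registry,
)

# A 75ms request: it increments every cumulative bucket with le >= 0.1.
REQUEST_LATENCY.observe(0.075)
```

Each bucket becomes one `http_request_duration_seconds_bucket{le="..."}` series, plus `_sum` and `_count`, which is the cardinality cost the answer above describes.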
Follow-up: If you query histogram_quantile(0.95, metric_bucket) with only 2 buckets [50ms, 100ms], what value does it return for requests at 75ms?
Your team's Python app uses Prometheus Python client and exposes histogram metrics. You notice cardinality is exploding: 1,000 services × 10 buckets × 5 status codes = 50,000 time-series from a single metric. How do you reduce cardinality without losing observability?
Cardinality explosion with histograms is common because buckets multiply the series count. Reduce by: (1) Dropping unnecessary labels: does every bucket need 'status_code'? Label only by the attributes you actually query (service, endpoint) and track errors with a separate, cheaper counter. (2) Using recording rules to pre-aggregate: e.g., record 'http:latency:bucket' as sum without (status) (rate(http_request_duration_seconds_bucket{status=~"2.."}[5m])), which keeps only successful requests and drops the status label from the stored series. (3) Reducing bucket count: if you have 20 buckets but only chart P50/P95, 10 well-placed buckets are usually enough; fewer buckets = fewer time-series, and histogram_quantile's interpolation error with coarser buckets is often acceptable. (4) Using Summaries for high-cardinality metrics: a Summary computes percentiles in-process and exposes only _count, _sum, and one series per configured quantile, instead of 10+ bucket series per label combination. Trade-off: you can't aggregate Summary quantiles across instances. (5) Enforcing cardinality limits: client libraries don't cap series for you, so sanitize label values at instrumentation time (e.g., map raw URLs to route templates) and add server-side guards such as sample_limit on the scrape config or metric_relabel_configs to drop offending labels. (6) Using exemplars (optional): attach trace IDs to histogram buckets to correlate metrics with traces without adding label cardinality.
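A sketch of point (1), assuming the Python client; names are illustrative. Keeping only the labels you query by cuts the series count by the number of distinct values of every label you drop:

```python
# Sketch: the same latency metric labelled only by service and endpoint.
# Dropping status_code removes a 5x multiplier on every bucket series.
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
latency = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["service", "endpoint"],  # no status_code label on the histogram
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5],
    registry=registry,
)

latency.labels(service="checkout", endpoint="/pay").observe(0.12)
```

Error rates can then live in a separate Counter labelled by status code, which adds one series per status instead of one per status-times-bucket combination.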
Follow-up: If you drop high-cardinality labels from histograms, can you still drill down in Grafana to see per-endpoint latency breakdowns?
You're computing P99 latency for an SLO dashboard using histogram_quantile(0.99, histogram_bucket), but the value changes dramatically (swings between 100ms and 500ms) when you refresh the dashboard. You suspect it's due to empty buckets or insufficient data. How do you debug and stabilize quantile queries?
Quantile instability can be caused by: (1) Empty or sparse buckets: if a bucket has few or zero samples, histogram_quantile interpolates across wide gaps, leading to jumps. Check the distribution with count(rate(histogram_bucket[5m]) > 0) to see how many buckets actually receive data. (2) Aggregating after the quantile: if you compute P99 per instance and some instances have little data, the result is skewed. Aggregate first, then take the quantile: histogram_quantile(0.99, sum(rate(histogram_bucket[5m])) by (le)). (3) Missing or too-short rate windows: histogram_quantile over raw cumulative counters (no rate()) gives meaningless, jumpy values, and a [1m] window is noisy; use rate() with [5m] or longer to smooth the signal. (4) Bucket precision: if buckets are [1ms, 10ms, 100ms, 1s] and nearly all requests land in the 1-10ms bucket, P99 is interpolated toward the 10ms upper bound; add finer boundaries (1ms, 5ms, 10ms) where the distribution actually lives. (5) Inconsistent bucket layouts across instances (e.g., after a deploy changed the boundaries) break aggregation by le; keep bucket sets identical fleet-wide. (6) For dashboards, use a longer query step (e.g., 1m instead of 15s) to reduce sample-to-sample jitter.
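To make point (2) concrete, here is a pure-Python model of histogram_quantile's linear interpolation (assumed behavior, finite buckets only; the bucket boundaries and counts are made up). Merging per-le counts across instances before interpolating mirrors sum(rate(...)) by (le):

```python
# Sketch: aggregate cumulative bucket counts by `le` across instances,
# then interpolate once -- the stable way to compute a fleet-wide quantile.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the target bucket,
            # as Prometheus's histogram_quantile does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Per-instance cumulative bucket counts over the query window (synthetic):
instance_a = {0.05: 90, 0.1: 98, 0.5: 100}
instance_b = {0.05: 10, 0.1: 15, 0.5: 20}

# Sum by le first (the equivalent of sum(rate(...)) by (le)), then quantile:
merged = sorted((le, instance_a[le] + instance_b[le]) for le in instance_a)
p99 = histogram_quantile(0.99, merged)  # interpolated inside the 0.1-0.5 bucket
```

Taking the quantile per instance and then combining the results has no such well-defined meaning, which is why thin instances skew it.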
Follow-up: If histogram_quantile returns NaN (e.g., no data in buckets), how do you handle this in Grafana without the panel breaking?
Your team uses both Prometheus Histogram (bucketed latency) and Jaeger traces (individual request latency). Histograms show P95 = 200ms, but Jaeger traces show individual requests with 1s latency. The percentiles don't match. Why?
Histograms and traces sample differently. Histogram buckets count every request (unless your app itself samples), so P95 is computed from the full distribution. Traces cover a sampled subset (often 1% by default), so the tail of the distribution may simply not be visible. Discrepancies also occur if: (1) The histogram uses different label filters than the trace query; for example, the histogram includes all statuses (200, 500) while the trace query filters to status=200 only. (2) Time windows differ: histogram_quantile over rate(...[5m]) covers 5 minutes, while the trace query shows the last hour. (3) Bucket truncation: if the top finite bucket is 1s but some requests are slower, quantiles that fall above it are reported as that 1s boundary. (4) Aggregation differs: histogram P95 is interpolated between bucket boundaries, while trace P95 is computed exactly from sorted requests. To reconcile: (a) Use exemplars: attach trace IDs to histogram buckets so you can click through from the histogram to actual traces and confirm high-percentile requests are captured. (b) Extend histogram buckets to cover the full range of observed latencies. (c) Raise the trace sampling rate (or use tail-based sampling) so slow requests are reliably captured. (d) Query both with identical filters: same time window, labels, and status codes.
Follow-up: If histogram_quantile and trace P95 differ by 50%, how do you determine which is "correct" for SLO calculation?
You're implementing SLO monitoring for a customer-facing API. The SLO is "95% of requests < 200ms". You have a histogram with buckets up to 1s. You query: 'count(rate(api_duration_bucket{le="0.2"}[5m])) / count(rate(api_duration_bucket{le="+Inf"}[5m]))' to compute the success ratio, but the result is sometimes > 1. How is this possible and how do you fix it?
This query has two problems: (1) count() vs sum(): count() returns the number of time-series, not the total request rate. Use 'sum(rate(api_duration_bucket{le="0.2"}[5m])) / sum(rate(api_duration_bucket{le="+Inf"}[5m]))'. (2) Why the result exceeds 1: with count(), numerator and denominator count series, and their cardinalities can differ, e.g. some instances briefly expose le="0.2" samples but miss le="+Inf" after a failed scrape, so nothing forces the ratio to stay <= 1. The bucket semantics themselves are sound: le="0.2" is cumulative (all requests <= 200ms) and le="+Inf" equals the total request count, so with sum() over consistent samples the ratio is always <= 1. For the SLO, the summed ratio is the global success ratio; alert or report on 'sum(rate(api_duration_bucket{le="0.2"}[5m])) / sum(rate(api_duration_bucket{le="+Inf"}[5m])) >= 0.95'. You can cross-check with 'histogram_quantile(0.95, sum(rate(api_duration_bucket[5m])) by (le)) < 0.2', which only returns a result while P95 is under 200ms, but prefer the direct bucket ratio for SLO math: it is exact, while the quantile is interpolated.
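The corrected ratio reduces to plain arithmetic on cumulative bucket counts (the numbers below are illustrative per-window rates already summed across instances):

```python
# Sketch: the SLO success ratio from cumulative histogram buckets.
# le="0.2" counts all requests <= 200ms; le="+Inf" counts all requests,
# so the ratio cannot exceed 1 when both come from the same samples.
fast = 950.0    # sum(rate(api_duration_bucket{le="0.2"}[5m]))
total = 1000.0  # sum(rate(api_duration_bucket{le="+Inf"}[5m]))

success_ratio = fast / total     # fraction of requests under 200ms
slo_met = success_ratio >= 0.95  # the "95% of requests < 200ms" target
```

A ratio above 1 in production is therefore a signal that the two sides of the division are not drawn from the same set of samples, not that the math is wrong.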
Follow-up: If you have multiple instances and some report 500 requests, others report 400, does histogram_quantile aggregate correctly or can this introduce bias?
Your Go microservice uses Prometheus instrumentation and you're deciding between Summary and Histogram for response time. The service is high-volume (100k RPS). You want P99 latency. Summaries are simpler but you've heard they don't aggregate well. Histograms use more disk but aggregate. What's the right choice for high-volume production?
For high-volume production: choose Histogram. Summaries compute percentiles in-process and expose pre-computed quantiles (e.g., 0.5, 0.9, 0.99). This is convenient but has drawbacks: (1) Quantiles are fixed at instrumentation time; you can't compute other percentiles later. (2) Summaries don't compose: the P99 across 100 instances is not the average of 100 per-instance P99s, and there is no correct way to merge pre-computed quantiles; you must recompute from raw distributions, which is exactly what histogram buckets preserve. (3) Cardinality (per metric, independent of RPS): a Summary with 3 quantiles exposes ~5 series per instance (quantiles plus _sum and _count), so ~500 series across 100 instances, while a 15-bucket Histogram exposes ~17 series per instance, so ~1,700. The Histogram costs roughly 3x the series but stays re-aggregatable. For high volume, use Histogram with: (a) A reasonable bucket count (10-15 buckets, not 100). (b) Recording rules to pre-aggregate: 'record: service:p99_latency = histogram_quantile(0.99, sum(rate(histogram_bucket[5m])) by (service, le))'. This reduces query load. (c) Downsampling of very old data: use Thanos with downsampling to keep disk usage low for long-term storage. Summary is acceptable only if you need a few fixed quantiles per instance and never need to re-aggregate.
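A small synthetic demonstration of point (2): averaging per-instance exact P99s (which is all you could do with Summary output) diverges from the P99 of the pooled traffic whenever instances see different load. Data and the quantile definition below are illustrative:

```python
# Sketch: why per-instance Summary quantiles cannot be re-aggregated.
def exact_quantile(q, samples):
    """Nearest-rank style quantile over raw samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

busy = [0.05] * 900 + [2.0] * 100  # 1000 requests with a heavy tail
idle = [0.05] * 10                 # 10 requests, no tail

# What you'd get by combining two instances' pre-computed P99s:
avg_of_p99s = (exact_quantile(0.99, busy) + exact_quantile(0.99, idle)) / 2

# What the fleet's P99 actually is, computed over the pooled traffic
# (this is what aggregating histogram buckets by le recovers):
pooled_p99 = exact_quantile(0.99, busy + idle)
```

Here the averaged value sits halfway between 0.05s and 2.0s while the pooled P99 is 2.0s: the idle instance drags the average down even though almost all traffic hits the busy one.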
Follow-up: If you use histogram_quantile with buckets that have few samples (e.g., < 10 requests per bucket), how accurate is the percentile calculation?
You're writing a Prometheus alerting rule: 'alert if P95 latency > 500ms'. You write: 'ALERT HighLatency if histogram_quantile(0.95, histogram_bucket) > 0.5'. However, the alert fires sporadically during normal traffic. Investigating, you see the P95 value is noisy (swings from 100ms to 800ms). Why does quantile volatility cause false alerts?
Quantile volatility is caused by: (1) Low sample count per bucket: with few requests, each observation materially moves the interpolated percentile. Example: if a 50ms bucket has 5 requests and one request lands in the 51ms bucket instead, the percentile shifts visibly. (2) Bucket quantization: interpolation between coarse buckets amplifies small distribution shifts into large value swings. (3) Short time windows: evaluating the alert every 15 seconds over a [1m] window makes the percentile sensitive to how requests happen to cluster. To stabilize: (a) Use a longer window in the rule: 'histogram_quantile(0.95, rate(histogram_bucket[5m]))' instead of [1m]; longer windows smooth out noise. (b) Add a hold period: 'for: 5m' on the alert, so single spikes don't fire it. (c) Gate on traffic volume before alerting, e.g. require 'sum(rate(histogram_count[5m]))' to exceed a minimum request rate so the percentile is backed by enough samples. (d) Increase histogram bucket precision in the ranges you care about (below 500ms), so interpolation is more accurate. (e) For P99+, be more conservative: high percentiles are inherently noisy, so use a higher threshold or a longer hold time.
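The 'ALERT ... if' form in the question is the pre-2.0 rule syntax; a modern YAML rules-file version of the stabilized alert might look like this (metric names, thresholds, and labels are illustrative):

```yaml
# Sketch: stabilized latency alert -- 5m rate window, aggregation by le,
# a minimum-traffic gate, and a 5m hold before firing.
groups:
  - name: latency
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(histogram_bucket[5m])) by (le)) > 0.5
          and
          sum(rate(histogram_count[5m])) > 10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "P95 latency above 500ms for 5 minutes"
```

The `and` clause keeps the alert silent during low-traffic periods, when the quantile is least trustworthy.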
Follow-up: If you set for: 5m on a histogram_quantile alert, does Prometheus compute the quantile fresh every 15s, or cache it?
You're migrating from custom percentile calculations (manually collecting and sorting latencies every minute) to Prometheus histograms. Your old system exported a metric 'custom_p99_latency = 450ms'. Your new histogram with histogram_quantile(0.99, ...) returns 380ms. The values don't match. Did you configure the histogram wrong or is the difference expected?
Differences between custom percentile calculations and histogram_quantile are expected, because the aggregation methods differ. A custom calculation that sorts all samples computes exact percentiles; histogram_quantile uses linear interpolation between bucket boundaries, which is approximate. Causes of discrepancy: (1) Bucket boundaries: if the custom system effectively had 1ms resolution but the histogram uses [50ms, 100ms, 500ms], precision is lost. (2) Windows: the custom calculation might cover a different span (e.g., the last 60 seconds) than histogram_quantile(rate(...[5m])), which covers 5 minutes. (3) Missing high values: if the top finite bucket is 1s but the custom calc sees requests above 1s, the histogram's P99 is capped at that 1s boundary. To align: (a) Choose histogram buckets that match the custom system's precision where it matters (around P99). (b) Query with the same time window (if the custom calc uses 1m, use rate(...[1m])). (c) Validate both systems on the same traffic across several percentiles (P50, P95, P99); differences under roughly 5-10% are normal. (d) Spot-check with exemplars: attach trace IDs to high-percentile requests and verify a few by hand. histogram_quantile is a production-acceptable approximation; the trade-off buys you bounded cardinality and cross-instance aggregation.
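A small synthetic comparison of the two methods on identical traffic, showing the kind of bounded disagreement described above (bucket boundaries and the sample distribution are made up):

```python
# Sketch: exact sorted-sample P99 vs. bucket-interpolated P99 on the same data.
import random

random.seed(42)
samples = [random.uniform(0.0, 1.0) for _ in range(10_000)]

# Exact P99, the way the old custom exporter computed it: sort and index.
exact_p99 = sorted(samples)[int(0.99 * len(samples))]

# Histogram-style P99: count into coarse cumulative buckets, then
# interpolate linearly inside the target bucket (histogram_quantile's model).
bounds = [0.25, 0.5, 0.75, 1.0]
cumulative = [sum(s <= b for s in samples) for b in bounds]
rank = 0.99 * len(samples)
for i, count in enumerate(cumulative):
    if count >= rank:
        lo = bounds[i - 1] if i else 0.0
        lo_count = cumulative[i - 1] if i else 0
        approx_p99 = lo + (bounds[i] - lo) * (rank - lo_count) / (count - lo_count)
        break
```

With this (uniform) distribution the two agree closely because linear interpolation matches the data; skewed real-world latency inside a wide bucket is where the gap grows, which is why bucket placement around P99 matters.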
Follow-up: If histogram buckets are not uniform (e.g., [1ms, 10ms, 100ms, 1000ms]), does histogram_quantile's linear interpolation still work correctly?