Your Prometheus dashboard for 500 services takes 30 seconds to load because it runs 50 PromQL queries against billions of raw samples. Each query (like 'avg by (service) (rate(http_requests_total[5m]))') scans the underlying series from scratch. How do you pre-compute and cache these queries using recording rules?
Recording rules evaluate PromQL expressions at a regular interval (typically 15s-1m) and store the results as new time-series, avoiding re-computation at query time. Define them in a rule file: a group 'http_metrics' with interval '30s' containing, for example, record: 'job:http_requests:rate5m' with expr: 'sum(rate(http_requests_total[5m]))' and record: 'service:http_requests:rate5m' with expr: 'sum by (service) (rate(http_requests_total[5m]))'. Note that 'rate(...) by (service)' is not valid PromQL — the 'by' clause must attach to an aggregation operator like 'sum'. Queries against 'service:http_requests:rate5m' now return instantly without scanning billions of raw samples. Evaluation happens every 30 seconds, adding a modest CPU overhead proportional to the expression's cost. For dashboards, replace raw queries with the pre-aggregated recordings. Trade-off: up to a 30-second delay between a metric change and the dashboard reflecting it, in exchange for a 10-100x query speedup. For real-time dashboards, keep a few raw queries; for historical/SLO dashboards, use recordings throughout.
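As a concrete sketch (metric and rule names are illustrative, following the question's examples), the rule file referenced from prometheus.yml's rule_files list might look like:

```yaml
# recording.yml — load via `rule_files: ["recording.yml"]` in prometheus.yml,
# then reload Prometheus (SIGHUP or POST /-/reload if enabled).
groups:
  - name: http_metrics
    interval: 30s                 # evaluation interval for this group
    rules:
      # Total request rate across all services.
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m]))
      # Per-service request rate; 'by' must hang off an aggregation operator.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
```

Dashboard panels then query `service:http_requests:rate5m` directly instead of re-running the `rate()` over raw samples.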
Follow-up: If a recording rule fails to evaluate (e.g., high cardinality explosion), does Prometheus skip that rule or halt all rule evaluation? How do you handle evaluation errors?
You've added recording rules, and now Prometheus CPU usage jumped from 20% to 80%. Rule evaluation is taking too long. You have 100 recording rules evaluating every 15 seconds, and some queries involve joins (group_left) that have high computational cost. How do you optimize rule performance?
Recording-rule CPU overhead scales with query complexity and evaluation frequency. Optimize by: (1) Increasing the evaluation interval: move from 15s to 30s or 60s (trades freshness for reduced CPU). For non-critical metrics, use 5m intervals. (2) Simplifying queries: avoid group_left joins inside hot rules. Instead of joining an aggregation against a metadata series like 'sum by (service, zone) (metric1) * on (service) group_left(team) teams_table', record the plain aggregation 'sum by (service) (metric1)' and perform the enrichment join in the dashboard or in a slower rule group. (3) Parallelizing rule evaluation: rules within a group evaluate sequentially, but separate groups evaluate concurrently. Split 100 rules into 5 groups of 20, keeping dependent rules together. (4) Reducing cardinality: use metric_relabel_configs at scrape time (e.g. the labeldrop action) to remove high-cardinality labels before they ever reach the rules. (5) Factoring out repeated sub-expressions: if multiple rules use 'rate(http_requests_total[5m])', record it once as an intermediate series and reference that in downstream rules. (6) Upgrading: recent Prometheus releases have improved rule evaluation, including optional concurrent evaluation of independent rules. (7) For extreme load, use Thanos Ruler, Cortex/Mimir, or M3DB, which can distribute rule evaluation across instances.
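Points (3) and (5) combined might look like the following sketch (group and metric names are assumptions for illustration); note that a rule referencing another group's output can lag it by up to one evaluation interval:

```yaml
groups:
  - name: base_rates            # shared sub-expression, computed once
    interval: 30s               # widened from 15s to cut CPU
    rules:
      - record: service_instance:http_requests:rate5m
        expr: sum by (service, instance) (rate(http_requests_total[5m]))

  # The groups below run concurrently with each other and with base_rates.
  # They read base_rates' output, so they may see data up to one interval old.
  - name: derived_service_views
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (service_instance:http_requests:rate5m)
  - name: derived_instance_views
    interval: 30s
    rules:
      - record: instance:http_requests:rate5m
        expr: sum by (instance) (service_instance:http_requests:rate5m)
```

Because the expensive `rate()` runs once in `base_rates`, the derived rules are cheap sums over already-recorded series.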
Follow-up: If a recording rule's query runs too long (rule evaluations are bounded by Prometheus's global query timeout — there is no per-rule evaluation_timeout setting), how long does evaluation run before it's killed? Can slow rules block other rules?
You're using recording rules to pre-compute SLO metrics. A recording rule computes 'slo:request_success_ratio' as 'sum by (service, endpoint) (rate(request_success_total[5m])) / sum by (service, endpoint) (rate(request_total[5m]))'. This creates 10,000 new time-series. Then you query 'slo:request_success_ratio > 0.95' to find high-performing services. However, the cardinality of the recording output equals the (service, endpoint) cardinality of the raw metrics, defeating the purpose of aggregation. How do you reduce cardinality in recording rules?
Recording rules don't reduce cardinality by default; the output inherits the cardinality of the input unless you aggregate it away. To reduce cardinality, aggregate aggressively: instead of keeping '(service, endpoint)', aggregate by '(service)' alone (dropping endpoint), or choose a small label set like '(service, region)'. For SLO tracking, use a rule such as record: 'service:request_success:ratio5m' with expr: 'sum by (service) (rate(request_success_total[5m])) / sum by (service) (rate(request_total[5m]))'. This produces one series per service (~100 series), not 10k. Monitor the cardinality of a recording with 'count(service:request_success:ratio5m)'. Note that metric_relabel_configs applies at scrape time, not to rule output — to drop a label from a recording, drop it in the aggregation's 'by'/'without' clause, or drop it at the source before it is ever scraped. For fine-grained queries (service + endpoint), use Thanos or a metrics warehouse for post-hoc analysis rather than keeping high-cardinality recording rules.
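A sketch of the per-service SLO rule (metric names `request_success_total` and `request_total` are taken from the question; the recording name is an assumption following the level:metric:operation convention):

```yaml
groups:
  - name: slo_rules
    interval: 1m
    rules:
      # ~1 output series per service instead of 10k (service, endpoint) pairs.
      - record: service:request_success:ratio5m
        expr: >
          sum by (service) (rate(request_success_total[5m]))
          /
          sum by (service) (rate(request_total[5m]))
```

To keep an eye on the output's cardinality, run `count(service:request_success:ratio5m)` in the expression browser; if it creeps up, a new label has leaked into the aggregation.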
Follow-up: If you aggressively drop labels in recording rules, can you still drill down in dashboards to service + endpoint level later?
Your recording rules work well for general queries, but alerts need slightly different aggregations (e.g., sum by (instance) for alerting, but sum by (service) for dashboards). You don't want to duplicate rules or have alerts depend on dashboard recordings. How do you manage multiple recording rule flavors for different use-cases?
Create separate recording-rule groups for each use-case. Instead of one generic recording, define a 'http_metrics_dashboard' group recording 'service:http_requests:rate5m' as 'sum by (service) (rate(http_requests_total[5m]))', a 'http_metrics_alerts' group recording 'instance:http_requests:rate5m' as 'sum by (instance) (rate(http_requests_total[5m]))', and a 'http_metrics_slo' group recording the success ratio. Each group can have a different interval (dashboard: 1m, alerts: 15s, SLO: 5m). This adds CPU overhead — the underlying rate() is computed once per flavor — but keeps each consumer independent. Alternatively, create a single "canonical" recording at the highest resolution you need, e.g. 'sum by (service, instance) (rate(http_requests_total[5m]))', then derive coarser views from it: 'service:http_requests:rate5m = sum by (service) (service_instance:http_requests:rate5m)'. These "meta-recordings" (recording rules that reference other recordings) should live in the same group as, and after, the rule they depend on — otherwise they can lag it by one evaluation cycle.
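The canonical-plus-derived approach can be sketched as follows (rule names are illustrative); placing all three rules in one group exploits in-group sequential evaluation, so the derived views always read the canonical series produced in the same cycle:

```yaml
groups:
  - name: http_metrics_canonical
    interval: 15s
    rules:
      # Canonical high-resolution recording — evaluated first.
      - record: service_instance:http_requests:rate5m
        expr: sum by (service, instance) (rate(http_requests_total[5m]))
      # Dashboard flavor, derived from the canonical series.
      - record: service:http_requests:rate5m
        expr: sum by (service) (service_instance:http_requests:rate5m)
      # Alerting flavor, also derived — the expensive rate() runs only once.
      - record: instance:http_requests:rate5m
        expr: sum by (instance) (service_instance:http_requests:rate5m)
```

The trade-off versus separate groups is that everything shares one interval; pick the fastest interval any consumer needs.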
Follow-up: If meta-recordings (recordings that reference other recordings) fail, can you end up with stale data from the previous evaluation cycle?
You've deployed recording rules and noticed that alerts depending on these rules sometimes don't fire because the recording rule evaluation is slower than alert evaluation. A critical alert depends on a recording rule that takes 10 seconds to evaluate, but the alert evaluates every 15 seconds. The alert might fire before the latest recording is available. How do you ensure consistency between recording rules and dependent alerts?
Recording rules and alert rules in different groups are evaluated independently, so they can race: an alert uses whatever samples exist at evaluation time, which may be one cycle stale if the recording hasn't finished yet. Ensure consistency with: (1) Recording-rule interval <= alert evaluation interval: if the alert evaluates every 15s, the recording should evaluate at least that often. (2) Exploiting in-group ordering: rules within a single group are evaluated sequentially, in the order they are listed. Put the recording rule and the dependent alert rule in the same group, recording first — the alert then always sees the recording produced in the same cycle. (Prometheus guarantees no ordering across groups or across rule files, regardless of filename.) (3) Keeping rule queries fast: rule evaluations are bounded by the global query timeout, and a slow recording delays every rule after it in its group, so a 10-second recording eats most of a 15-second cycle. (4) Making the alert tolerant of one late cycle, e.g. with a 'for:' duration, so a single missed or delayed evaluation doesn't flap the alert. (5) For mission-critical alerts, avoiding the recording-rule dependency entirely: inline the expression in the alert rule so it is always computed fresh.
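Point (2) in practice — recording and dependent alert in one group, recording listed first (names and the 5% threshold are illustrative assumptions):

```yaml
groups:
  - name: error_rate_slo
    interval: 15s
    rules:
      # Evaluated first within the group.
      - record: service:http_errors:ratio5m
        expr: >
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # Evaluated second, in the same cycle, so it reads fresh data.
      - alert: HighErrorRate
        expr: service:http_errors:ratio5m > 0.05
        for: 5m                     # tolerate a single late/missed cycle
        labels:
          severity: critical
```

If the alert instead lived in another group or file, it could read the previous cycle's recording.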
Follow-up: If a recording rule depends on a metric that has an up to 5-minute staleness (metric from a batch job), how do you prevent stale data from triggering false alerts?
You're monitoring a large multi-tenant platform with 1,000 customers. Each customer has slightly different SLA thresholds. You could create 1,000 recording rules (one per customer), but that would be inefficient. How do you scale recording rules for multi-tenancy?
Creating 1,000 individual recording rules is unsustainable (1,000 * num_rules evaluations per interval). Instead: (1) Use label-based aggregation: ensure all metrics carry a 'customer_id' label, then write a single recording rule such as record: 'customer:request_latency:p99' with expr: 'histogram_quantile(0.99, sum by (le, customer_id, service) (rate(http_request_duration_seconds_bucket[5m])))' — note the 'le' label must be preserved in the aggregation for histogram_quantile to work. One rule, N output series (one per customer_id/service pair). (2) Query recording rules with label filters: the dashboard query 'customer:request_latency:p99{customer_id="CUST_123"}' returns only that customer's SLO. (3) For per-customer alerting thresholds, expose the thresholds as a metric (e.g. 'customer_latency_threshold_seconds{customer_id=...}') and compare in a single alert expression: 'customer:request_latency:p99 > on (customer_id) group_left customer_latency_threshold_seconds'. (4) Alternatively, use Mimir or Cortex with per-tenant rule evaluation and storage, isolating each customer's rules and cardinality. (5) For extreme scale (10k+ customers), use a metrics warehouse or distributed backend instead of trying to scale a single Prometheus.
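A sketch combining (1) and (3) — all names (the recording, the threshold metric, the alert) are assumptions for illustration:

```yaml
groups:
  - name: multitenant_slo
    interval: 1m
    rules:
      # One rule fans out to one series per (customer_id, service) pair.
      # 'le' must survive the aggregation for histogram_quantile to work.
      - record: customer:request_latency:p99
        expr: >
          histogram_quantile(0.99,
            sum by (le, customer_id, service)
              (rate(http_request_duration_seconds_bucket[5m])))
      # One alert covers all customers: many (customer_id, service) series
      # on the left match one threshold series per customer on the right.
      - alert: CustomerLatencySLOBreach
        expr: >
          customer:request_latency:p99
            > on (customer_id) group_left
          customer_latency_threshold_seconds
        for: 10m
```

The `customer_latency_threshold_seconds` series would be published by a small exporter or via the textfile collector, one sample per customer.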
Follow-up: If you have 1,000 customers each with their own on-call schedule, and you need to route alerts differently per customer, how do you implement this without creating 1,000 rule groups?
You've set up recording rules to cache expensive queries, but a heavily used dashboard is still slow: dozens of panels each fire a range query on every refresh, many panels still reference the raw metrics instead of the recordings, and a small resolution step means each query returns tens of thousands of points per series. Render times stretch to tens of seconds. How do you optimize Grafana's interaction with recording rules?
Grafana normally issues a single range query (query_range) per panel target, returning all time-steps in one round-trip — if you see one query per time-step, a panel or custom client is misusing the instant-query API. Optimize by: (1) Pointing every panel at the recordings, not the raw metrics; a recording rule only helps if the dashboard actually queries it. (2) Raising the panel's minimum step/interval (e.g. 30s or 1m) so a range query returns hundreds of points per series instead of tens of thousands. (3) Keeping recording output low-cardinality: if a recording returns 1M series, Grafana will be slow regardless. Pre-aggregate to ~100 series in the rule so rendering is instant, instead of having Grafana aggregate 10k series in the browser. (4) Adding a caching layer: Prometheus itself has no query-result cache, so place a caching query frontend (Thanos Query Frontend, Cortex/Mimir query-frontend, or Trickster) between Grafana and Prometheus to cache and split range queries. (5) Enabling Grafana's own query caching where available (a Grafana Enterprise/Cloud feature), so repeated dashboard loads within the cache TTL skip Prometheus entirely. (6) For very expensive periodic views, compute them on a schedule — with recording rules or an external job — and have the dashboard read the cheap pre-computed series.
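Points (1), (3), and (6) reduce to the same pattern: record exactly the low-cardinality series the panels plot. A sketch (names and the 1m interval are illustrative, matched to a dashboard min step of 1m):

```yaml
groups:
  - name: dashboard_precompute
    interval: 1m                  # align with the dashboard's min step
    rules:
      # ~1 series per service: trivial for Grafana to fetch and render.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: >
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
```

The panel queries then become bare selectors like `service:http_errors:ratio5m`, so each range query is a cheap read of pre-computed samples rather than a fresh aggregation.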
Follow-up: If Grafana's query cache is enabled but the Prometheus backend state changes (new targets discovered), can Grafana serve stale cached data indefinitely?