Your organization runs separate Prometheus instances across three datacenters (US-East, US-West, EU). Engineers need a global dashboard showing the union of all metrics. However, running queries manually on each instance is tedious. How do you implement federation to expose a unified metrics endpoint?
Set up a global Prometheus instance that scrapes the /federate endpoint of each regional instance. The /federate endpoint returns the subset of series matching the match[] query parameters. In the global Prometheus scrape config, set metrics_path to '/federate', pass the match[] selectors via params, and enable honor_labels: true so labels from the regional instances are preserved rather than overwritten: scrape_configs: [{ job_name: 'federation-us-east', metrics_path: '/federate', honor_labels: true, params: { 'match[]': ['up', 'node_cpu_seconds_total'] }, static_configs: [{ targets: ['prometheus-us-east:9090'] }] }]. Add external_labels to each regional Prometheus (under the global section of its config file) to tag metrics by region: external_labels: { region: 'us-east' }. These labels are attached to every series served by /federate and flow into the global Prometheus. Now query the global instance: up{region="us-west"} shows metrics only from US-West. The /federate endpoint doesn't copy historical data; it serves the most recent sample of each matching series from the regional instance's local TSDB at scrape time.
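Expanded into full YAML, the two sides of this setup might look as follows (hostnames and the 30s interval are illustrative):

```yaml
# Global Prometheus: federation scrape job
scrape_configs:
  - job_name: 'federation-us-east'
    scrape_interval: 30s
    metrics_path: '/federate'
    honor_labels: true              # keep labels set by the regional instance
    params:
      'match[]':
        - 'up'
        - 'node_cpu_seconds_total'
    static_configs:
      - targets: ['prometheus-us-east:9090']

# Regional Prometheus (US-East), in its own config file:
# global:
#   external_labels:
#     region: 'us-east'
```

Repeat the job for each region (federation-us-west, federation-eu) with the matching target and a distinct region external label on each regional instance.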
Follow-up: If the /federate endpoint is called with no match[] parameters, what metrics are returned? How do you prevent federation from accidentally exposing sensitive internal metrics?
You've set up federation, but queries are slow: a global query like 'rate(http_requests_total[5m])' across three regional instances takes 45 seconds. The query itself is simple, but federation is a bottleneck. How do you optimize the federation layer and query performance?
Federation is a pull-through query gateway, not a cache. Each /federate scrape re-evaluates the match[] selectors against the regional Prometheus's current data. Optimize by: (1) Pre-aggregate metrics at the regional level using recording rules. Add a rule that computes rate(http_requests_total[5m]) as a new metric global:http_requests:rate5m, then federate only the pre-computed metric, not the raw data. (2) Reduce cardinality: federate only the labels the dashboard actually needs (aggregate away high-cardinality labels in the recording rules). (3) Increase the federation scrape_interval to 1-2 minutes (trades freshness for load). (4) Move to a distributed query engine (Thanos, Mimir) that handles multi-instance queries efficiently; Mimir's queriers fan queries out to ingesters and store-gateways in parallel and return results in seconds. (5) Put a caching query frontend (e.g., Thanos Query Frontend or Mimir's query-frontend) in front of the query path to cache and split range queries.
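Step (1) might be sketched as a regional recording-rule file; the rule name and grouping labels here are illustrative:

```yaml
groups:
  - name: federation_aggregates
    interval: 30s                   # evaluate ahead of the federation scrape
    rules:
      - record: global:http_requests:rate5m
        expr: sum by (job, region) (rate(http_requests_total[5m]))
```

The global federation job then lists only the pre-computed series, e.g. params: { 'match[]': ['global:http_requests:rate5m'] }, so the expensive rate() over raw series is paid once per region instead of on every global query.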
Follow-up: If a regional Prometheus goes down, how does the global instance handle federated scrapes? Does it cache stale data, or does the query fail for that region?
Your company has a parent-child Prometheus hierarchy: global (parent) scrapes regional (children), regional scrapes leaf instances. A metric is duplicated at each level, and you're querying the global instance but getting inconsistent aggregations. How do you ensure consistency across the hierarchy?
Multi-level federation introduces duplication: the same metric (e.g., http_requests_total) exists at leaf, regional, and global levels. To maintain consistency, implement tiered aggregation: at the leaf level, expose raw metrics. At the regional level, define recording rules that sum across leaves, and expose only those aggregates upward, not raw metrics. At the global level, federate only the regional aggregations, never raw leaf metrics. For example, the regional Prometheus defines a recording rule region:http_requests:sum = sum by (region) (http_requests_total); the global Prometheus then federates only region:http_requests:sum, not http_requests_total. Querying the global instance with sum(region:http_requests:sum) gives the true global total, not a triple-count. Also use match[] carefully: a match[] parameter takes a series selector, so scrape /federate?match[]={__name__=~"region:.*"} to pull only the regional aggregations and exclude raw leaf data.
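The tiering could be sketched in two fragments (metric and job names are illustrative). First, the regional rollup, evaluated once at the regional tier only:

```yaml
# Regional Prometheus: recording-rule file
groups:
  - name: tier_aggregates
    rules:
      - record: region:http_requests:sum
        expr: sum by (region) (http_requests_total)
```

Then the global federation job selects only series whose names carry the regional prefix:

```yaml
# Global Prometheus: federation job params
params:
  'match[]':
    - '{__name__=~"region:.*"}'
```

Because the raw http_requests_total never crosses the regional boundary, no query at the global tier can accidentally double-count it.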
Follow-up: How does the scrape_interval at each level affect aggregation consistency? If leaf instances scrape every 15s, regional every 1m, and global every 5m, can you end up with "stale" aggregations?
You're using Prometheus federation across 5 datacenters, but compliance requires that metrics for customer data are never transmitted across datacenters—only aggregated summaries can leave. How do you enforce data locality in federated queries?
Implement two-tier federation: local Prometheus instances run within each datacenter with raw metrics (including PII or sensitive data). Each datacenter's Prometheus has recording rules that aggregate sensitive data into summary metrics (cardinality-reduced, no PII). These summary metrics are tagged with a compliance label or separate job. The /federate scrape is configured to expose only summary metrics, blocking raw data: the match[] query parameters explicitly allow-list the recording-rule outputs (e.g., match[]=customer_revenue_sum, match[]=request_count_by_region) and never raw 'customer_*' metrics. The global Prometheus scrapes only these summary metrics. If regional instances also use remote_write, apply write_relabel_configs to drop sensitive series or labels before they leave the datacenter. Put a reverse proxy in front of /federate that validates the match[] parameters and rejects requests selecting raw customer metrics. For stronger isolation, run separate Prometheus instances: one for local ops (all metrics), one for federation (summary only).
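A sketch of the two enforcement points (job names, metric names, and the remote_write URL are illustrative):

```yaml
# Global Prometheus: federation job with an explicit allow-list
scrape_configs:
  - job_name: 'federate-dc1-summaries'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - 'customer_revenue_sum'       # recording-rule outputs only
        - 'request_count_by_region'    # no raw customer_* series listed
    static_configs:
      - targets: ['prometheus-dc1:9090']

# Regional Prometheus: drop raw customer series at remote_write egress
remote_write:
  - url: 'https://global-metrics.example.com/api/v1/write'
    write_relabel_configs:
      - source_labels: ['__name__']
        regex: 'customer_.*'
        action: drop
```

Note the allow-list only constrains what the global instance asks for; the reverse proxy (or a federation-only instance holding no raw data) is what prevents anyone else from crafting a broader match[] request.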
Follow-up: If someone queries the federated Prometheus with a PromQL query like count({__name__=~".*customer.*"}) to discover sensitive metrics, how do you prevent this?
Your global Prometheus federation is set up, but you notice that alerts defined on the global instance sometimes fail because metrics have different label sets across regions (US instances have 'az' label, EU instances have 'zone' label). How do you handle label schema inconsistencies across federated regions?
Label schema inconsistencies cause queries to return partial results or fail across regions. Solutions: (1) Standardize labels at the source: enforce a label naming convention (e.g., always 'availability_zone', never 'az' or 'zone'). At each regional Prometheus, add metric_relabel_configs to the scrape config to rename incoming labels to the standard name at ingestion time. (2) Add recording rules that emit a normalized metric carrying the standard label, then federate the normalized metric instead of the raw one. (3) In global PromQL queries, use label_replace() to unify; its source label must be a single label name (not a regex over names), so coalescing two labels takes two nested calls: label_replace(label_replace(metric, "availability_zone", "$1", "az", "(.+)"), "availability_zone", "$1", "zone", "(.+)"). (4) Add external_labels at each regional Prometheus to identify the region, then combine selectors with the or operator: {region="us-east", az!=""} or {region="eu", zone!=""}.
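Step (1) might look like this on the EU regional Prometheus (job name and target are illustrative): copy the non-standard 'zone' label into 'availability_zone', then drop the original.

```yaml
scrape_configs:
  - job_name: 'eu-services'
    static_configs:
      - targets: ['app-eu:8080']
    metric_relabel_configs:
      # Copy a non-empty 'zone' value into the standard label name
      - source_labels: ['zone']
        regex: '(.+)'
        target_label: 'availability_zone'
        replacement: '$1'
      # Remove the non-standard label so only the standard one federates
      - regex: 'zone'
        action: labeldrop
```

The same pair of rules with 'az' in place of 'zone' goes on the US instances, after which every region exposes a uniform 'availability_zone' label.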
Follow-up: If you have 100 services and each defines labels inconsistently, how do you enforce schema without redeploying every service?
You're implementing a "read-only" global Prometheus that scrapes metrics from three regional instances for a public dashboard. The global instance must accept no writes or configuration changes, yet regional instances sometimes have stale or missing data. How do you handle consistency and staleness in a federated read-only setup?
In a federated setup, the global Prometheus is indeed read-only at the query API level (no /api/v1/admin endpoints enabled). However, staleness is inherent to pull-based federation: if a regional Prometheus hasn't been scraped recently or itself has stale data (e.g., a target is down), the federated data reflects that. Mitigate with: (1) Set keep_firing_for (Prometheus 2.42+) on alert rules in the global instance, e.g. keep_firing_for: 5m, so alerts don't resolve immediately on a transient data gap. (2) Use on(instance) to correlate with the 'up' metric and filter out stale targets: rate(http_requests_total[5m]) and on(instance) (up == 1). (3) At the global level, prefer aggregations such as sum without(instance)(...) that degrade gracefully when some series are missing during a partial outage. (4) For critical dashboards, add red/yellow status indicators for regions with stale data (check sample age using time() - timestamp(metric)).
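A staleness-tolerant alert rule on the global instance might be sketched as follows (alert name and threshold are illustrative; keep_firing_for requires Prometheus 2.42 or later):

```yaml
groups:
  - name: federated_alerts
    rules:
      - alert: HighErrorRate
        # Only evaluate instances the federation layer currently sees as up
        expr: rate(http_errors_total[5m]) and on(instance) (up == 1) > 0.05
        for: 5m
        keep_firing_for: 5m      # hold the alert through transient federation gaps
        labels:
          severity: warning
```

Without keep_firing_for, a single missed federation scrape can make the expression return no data, silently resolving and then re-firing the alert.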
Follow-up: If a regional Prometheus goes completely offline, how long before the global instance detects it's missing data, and how do you alert on this loss of federated data?
You've built a global Prometheus federation layer, but it's now a critical single point of failure: if the global instance is down, dashboards show no data. You can't query individual regional instances (too many URLs), and Grafana dashboards are hardcoded to the global URL. How do you add HA to the federation layer?
Make the global Prometheus HA by running multiple replicas behind a load balancer (e.g., Nginx, HAProxy). Each replica scrapes the same regional Prometheus instances independently, stores metrics locally, and exposes the same query API. From Grafana's perspective, the load balancer is a single URL, while each replica has its own time-series database and can answer queries independently. For consistency, ensure all replicas have the same scrape config and retention policy. Alternatively, use a distributed metrics backend: migrate from Prometheus federation to Mimir (scales horizontally with its distributor/querier/ingester architecture and handles replication automatically). Or use Thanos: add object storage (S3), deploy Thanos sidecars on regional Prometheus instances (uploading blocks to S3), and deploy Thanos Querier, which is stateless, so you can run multiple replicas behind a load balancer while it fans queries out to all sidecars in parallel. Thanos is built for global querying and HA.
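The replica approach needs nothing beyond an identical config on every global instance; a sketch (hostnames and the catch-all selector are illustrative):

```yaml
# Identical config deployed to global replica A and replica B.
# An LB (Nginx/HAProxy) round-robins Grafana's queries between them.
scrape_configs:
  - job_name: 'federate-regions'
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - '{job=~".+"}'          # federate all jobs; narrow in production
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-us-west:9090'
          - 'prometheus-eu:9090'
```

Because each replica scrapes independently, their sample timestamps differ slightly; dashboards tolerate this, but sticky sessions on the LB avoid graphs "jittering" between replicas on refresh.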
Follow-up: If you run multiple global Prometheus replicas, how do you handle cardinality explosion when the same metrics are ingested multiple times?
You've set up federated Prometheus, but queries are returning results from stale data. For example, a 5-minute-old metric is being returned even though the regional instance has fresher data. How do you control data freshness in federated queries?
Staleness in federation is governed by the scrape intervals at both levels: the global instance's federation scrape_interval plus the regional instance's own scrape_interval bound how old a federated sample can be. When the global Prometheus scrapes regional-instance:9090/federate, it fetches the latest sample of each matching series from the regional instance. If the regional instance itself has old data (because its targets are down or slow), the federated data is old too. Check freshness with two strategies: (1) Sample age: time() - timestamp(metric) in the global instance; an age over, say, 5 minutes indicates staleness (note this is bounded by the query lookback window). (2) 'up' correlation: combine with and on(instance) up == 1 to ensure targets are active. For stricter freshness guarantees, reduce the global scrape_interval (at the cost of higher load) or gate queries: metric and (time() - timestamp(metric)) < 300. To debug staleness, query scrape_duration_seconds at the regional level to see if scrapes are slow or failing. Retention does not affect federation freshness: /federate serves only the most recent samples, so the worst-case lag is roughly the sum of the regional and global scrape intervals, not the retention window.
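These checks can be codified as rules on the global instance (names are illustrative; this assumes federation jobs are named federate-*):

```yaml
groups:
  - name: federation_freshness
    rules:
      # Fires when the federation scrape itself fails (regional instance unreachable)
      - alert: FederationScrapeDown
        expr: up{job=~"federate-.*"} == 0
        for: 2m
        labels:
          severity: critical
      # Per-series lag; timestamp() only sees samples inside the lookback
      # window (default 5m), so this measures lag up to that bound
      - record: federated:sample_age_seconds
        expr: time() - timestamp(up)
```

The recorded age series is useful for dashboard freshness indicators; the up-based alert is the reliable signal for a fully offline region, since stale series eventually drop out of the lookback window and vanish from age queries entirely.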
Follow-up: How does the --query.lookback-delta flag affect federation? Can it mask staleness issues?