Your application updates a metric every 60 seconds, but your Prometheus scrape_interval is 15 seconds. Between updates, the metric is stale (hasn't changed). How does Prometheus handle stale data, and how does this affect queries and dashboards?
A metric that changes every 60 seconds but is exposed on every scrape is not stale: Prometheus records a sample at every 15-second scrape, repeating the unchanged value, so dashboards show a step pattern rather than gaps. Staleness is a different condition: (1) If a scrape succeeds but a series that was present in the previous scrape is missing from the output, Prometheus immediately writes a stale marker for it; queries then return no result for that series. (2) If no stale marker exists (for example, the target was removed from service discovery), instant vector selectors look back up to 5 minutes for the most recent sample (--query.lookback-delta, default 5m); beyond that, queries return nothing. (3) So the 60-second metric only becomes a staleness problem if the exporter omits it from some scrapes. If it appears at t=0 and t=60 but is missing at t=15, t=30, and t=45, the scrape at t=15 writes a stale marker, and a query at t=30 returns no result. (4) The fix is on the exporter side: keep exposing the metric on every scrape, even when its value hasn't changed. If the metric can only exist intermittently (e.g., a batch job), raise --query.lookback-delta (say to 10m), or push via the Pushgateway, which holds the last value. (5) In dashboards, configure 'no data' handling: Grafana can connect across null points, show them as gaps, or treat them as zero. (6) For alerting, a 'for' clause tolerates brief gaps: 'metric > 100 for 2m' won't flap on a single missed sample, though if the series disappears entirely, the expression returns no result and the pending timer resets.
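A minimal sketch of the two knobs involved (job name and target are hypothetical); note the lookback delta is a startup flag, not a config-file setting:

```yaml
# prometheus.yml — each 15s scrape records a sample even if the exported
# value hasn't changed since the last scrape; "unchanged" is not "stale".
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "app"                # hypothetical job
    static_configs:
      - targets: ["app:9100"]      # hypothetical target

# To widen the staleness lookback instead, start Prometheus with:
#   prometheus --query.lookback-delta=10m
```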
Follow-up: If a metric stops reporting entirely (target down), does Prometheus mark it stale after 5 minutes or immediately?
You're querying HTTP request latency (rate(http_requests_total[5m])) across 10 services, but some services had a deployment and temporarily had no traffic. You see gaps in the graph for those services. How do you handle gaps and missing time-series in multi-service queries?
Gaps occur when time-series are intermittently absent (stale markers or a down target). Handling depends on query type: (1) Instant queries simply omit missing series from the result. (2) Range functions such as 'rate(...[5m])' return no point at steps where the window holds fewer than two samples, which renders as a gap. (3) Aggregations such as 'sum(rate(...[5m]))' sum only the series present: with 10 services and 2 missing, the result is the sum of 8, so dashboard totals silently understate traffic. (4) PromQL has no fill() function; to carry the last value forward, use last_over_time over a subquery: 'last_over_time(rate(http_requests_total[5m])[10m:1m])'. To default an empty aggregate to zero, use 'sum(rate(...[5m])) or vector(0)' (this works when the left side has an empty label set, as sum() without 'by' does). (5) To detect missing series, use absent(): 'absent(http_requests_total{job="api"})' returns 1 when no matching series exists, which is the standard way to alert on expected-but-missing metrics. (6) Note that rate() never returns NaN for a gap, it returns nothing, so comparisons simply don't match; there is no NaN to clamp away. (7) In Grafana, configure the panel's null handling: the time series panel has a 'Connect null values' option, and the legacy graph panel offered 'null', 'connected', or 'null as zero'. (8) Alert 'for' clauses tolerate gaps only partially: if the expression returns no result during the gap, the pending timer resets rather than pausing.
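The zero-default and absence-detection idioms can be sketched as rule-file entries (job and metric names are assumptions):

```yaml
groups:
  - name: gap-handling
    rules:
      # Aggregate defaults to 0 when no series match, so panels and
      # downstream math never see an empty result. Works because sum()
      # without "by" and vector(0) both have empty label sets.
      - record: job:http_requests:rate5m_or_zero
        expr: sum(rate(http_requests_total{job="api"}[5m])) or vector(0)

      # Fires when the expected series is missing entirely (stale or
      # never scraped) — absent() returns 1 only when nothing matches.
      - alert: RequestMetricMissing
        expr: absent(http_requests_total{job="api"})
        for: 5m
```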
Follow-up: If you use 'last_over_time(metric[1h])' over a range in which the series never appeared at all, what does it return?
Your alert is triggering on stale data: the metric hasn't reported for 10 minutes (past the 5-minute staleness threshold), but your alert 'up == 0 for 5m' is firing. Is the alert incorrect, or is there a stale data issue in Prometheus alerting?
The key fact is that 'up' is synthesized by Prometheus itself on every scrape attempt: 1 on success, 0 on failure. It therefore never goes stale while the target remains in the scrape configuration, no matter how long the target is unreachable. (1) So if the target has been down for 10 minutes, 'up' has a fresh 0 sample at every scrape interval, 'up == 0' matches continuously, and 'up == 0 for 5m' firing is correct behavior, not a staleness bug. (2) Staleness only enters the picture if the target is removed from service discovery or the config: then 'up' stops being written, a stale marker is recorded, 'up == 0' returns no result, and the alert resolves with its 'for' timer reset. There is no cached-value mechanism; an alert expression either matches current samples or returns nothing. (3) The application's own metrics behave differently from 'up': on a failed scrape Prometheus writes stale markers for them, so alerts on those metrics stop matching almost immediately. (4) To debug, query the API directly and check what 'up' returns: 'curl http://prometheus:9090/api/v1/query?query=up'. A value of 0 means scrapes are failing (alert correct); an empty result means the target is no longer being scraped at all. (5) Check scrape status with 'curl http://prometheus:9090/api/v1/targets': if the target shows as 'down', 'up' is being set to 0 on every attempt, exactly as the alert expects.
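As a sketch, the two complementary alerts (job name is an assumption): one catches failing scrapes via the synthesized up metric, the other catches the target disappearing from scrape configuration entirely:

```yaml
groups:
  - name: target-health
    rules:
      # up is written by Prometheus on every scrape attempt, so this
      # expression keeps matching for as long as scrapes keep failing.
      - alert: TargetDown
        expr: up{job="api"} == 0
        for: 5m

      # up has no samples at all (not even 0) only when the target left
      # service discovery; absent() catches that case.
      - alert: TargetAbsent
        expr: absent(up{job="api"})
        for: 10m
```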
Follow-up: If a target is removed from service discovery entirely, how long does its final 'up' sample remain visible to instant queries before staleness hides it?
You're implementing a recording rule for 'service_availability = count(up{job="api"} == 1) / count(up{job="api"})' to compute service availability. However, when one instance is down, the denominator changes (count(up{job="api"}) decreases), making the availability percentage artificially higher. How do you handle missing instances in availability calculations?
First, note that 'count(up{job="api"})' already includes down instances: a failed scrape still produces an up=0 sample, so the denominator shrinks only when an instance disappears from service discovery entirely (e.g., a terminated pod). When that happens, a dynamic denominator makes availability climb as instances vanish, which is backwards. Options: (1) Static denominator: if the expected instance count is fixed, divide by a literal: 'count(up{job="api"} == 1) / 3'. Simple, but it must be edited whenever capacity changes. (2) Keep targets in the configuration: with static_configs or file-based service discovery, a dead instance stays in the target list, keeps producing up=0, and therefore stays in the denominator. The problem mainly arises with dynamic discovery (Kubernetes), where deleted pods leave the target list. (3) Best for dynamic environments: divide by a metadata metric that reflects desired rather than observed state. With kube-state-metrics, for example: 'count(up{job="api"} == 1) / scalar(kube_deployment_spec_replicas{deployment="api"})'. The denominator is the declared replica count regardless of what is currently scrapeable. (4) For environments without such a source (autoscaling groups, CMDB-managed fleets), export an equivalent expected-instances metric yourself via a small exporter or the Pushgateway and divide by it. Avoid recording-rule tricks that synthesize a constant from live series: a rule whose input disappears disappears with it.
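A sketch of the desired-state denominator, assuming kube-state-metrics is scraped (job, deployment, and rule names are illustrative):

```yaml
groups:
  - name: availability
    rules:
      # Numerator: instances currently up; "or vector(0)" covers the
      # all-down case, where count() over an empty vector would return
      # nothing. Denominator: replicas the Deployment *should* have,
      # so an instance gone from service discovery still lowers the ratio.
      - record: job:api:availability_ratio
        expr: |
          (count(up{job="api"} == 1) or vector(0))
          /
          scalar(kube_deployment_spec_replicas{deployment="api"})
```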
Follow-up: If you have 100 instances, but only 50 are running, and you calculate availability as (50 / 100), is that SLO-appropriate or should you count only running instances?
Your Prometheus instance scraped a target for 5 minutes, but the network went down and the target became unreachable for 1 hour. After the network recovered, scraping resumed. What traces of the 1-hour gap remain in Prometheus, and how do queries handle this gap?
During the outage Prometheus keeps attempting scrapes on schedule; each attempt fails, so 'up' is written as 0 at every interval for the full hour, then jumps back to 1 on recovery. For the target's other metrics: (1) Prometheus writes stale markers for a target's series when a scrape fails, so those series effectively end at the first failed scrape; there is no hour of carried-forward values. (Even without markers, the 5-minute lookback would cap any carry-forward at 5 minutes.) (2) Queries during the gap return nothing, not NaN: 'rate(requests_total[5m])' evaluated mid-outage has no samples in its window and produces no point. (3) After recovery, counters resume from their in-process values; rate() and increase() compute over whatever samples are in the window, though the first post-recovery windows that span the gap can yield misleadingly averaged rates. (4) The permanent trace is the gap itself: the TSDB holds samples up to the first failed scrape, stale markers, then nothing until recovery; graphs show a line break for the hour (or a 1-hour run of up=0 for 'up'). (5) Alerts on the target's metrics stop matching during the gap, so their 'for' timers reset; alerts on 'up == 0' fire as intended. (6) There is no ifnone()-style function in PromQL; to avoid false alerts around recovery, use 'expr or vector(<default>)' for empty-result defaults, absent() to alert on missing series explicitly, and 'for' clauses long enough to ride out the first post-recovery evaluations.
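To bridge short gaps in graphs (not genuine hour-long outages), one sketch is a recording rule that carries the last computed rate forward via a subquery; names and windows are illustrative:

```yaml
groups:
  - name: bridged
    rules:
      # last_over_time over a 10m subquery re-emits the most recent rate
      # point from the last 10 minutes, papering over gaps shorter than
      # that. A one-hour outage still shows as a gap — by design.
      - record: job:requests:rate5m_bridged
        expr: last_over_time(rate(requests_total{job="api"}[5m])[10m:1m])
```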
Follow-up: Prometheus is pull-based, so nothing is recorded during the outage. If instead Prometheus itself crashes mid-outage, can samples it ingested shortly before the crash be lost, or does the write-ahead log protect them?
You're building an alert: 'fire if requests_total hasn't increased in 1 hour' (to detect when application stops working). However, your query 'increase(requests_total[1h]) == 0' sometimes doesn't fire when it should. Why does staleness affect change-detection queries?
Change-detection queries (increase, rate, changes) operate on the samples inside their range window, and absence breaks them in a specific way: (1) If the application stops being scraped, a stale marker ends the series. For a while, 'increase(requests_total[1h])' still sees the old samples in its window and returns the increase that happened earlier in the hour, which is typically nonzero. (2) Once the samples age out of the 1-hour window (or fewer than two remain), the expression returns no result at all, not 0. "No result == 0" never matches, so the alert silently loses the ability to fire, exactly when you need it. (3) Fix: alert on absence explicitly alongside the zero-increase condition: 'increase(requests_total[1h]) == 0 or absent(requests_total)'. (Note the function for counting value changes is changes(), not change().) (4) You can also check sample freshness directly: 'time() - timestamp(requests_total) > 240' fires when the newest sample is older than 4 minutes, but this only works within the 5-minute lookback window, after which requests_total itself returns nothing. (5) For crash detection, also monitor scrape health: 'up{job="app"} == 0' matches from the first failed scrape, with no staleness delay. (6) Combine them: 'alert if (increase(requests_total[1h]) == 0 or up{job="app"} == 0 or absent(up{job="app"})) for 10m' catches a frozen counter, a failing scrape, and a vanished target.
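A sketch of the combined rule (metric and job names are assumptions); the 'for' duration should span several scrape intervals:

```yaml
groups:
  - name: liveness
    rules:
      # Three failure modes in one rule: counter frozen, scrape failing,
      # target gone from discovery. "== 0" alone misses the last two,
      # because an absent series matches nothing.
      - alert: AppStoppedServing
        expr: |
          increase(requests_total{job="app"}[1h]) == 0
            or up{job="app"} == 0
            or absent(up{job="app"})
        for: 10m
```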
Follow-up: If you have a metric that updates every 5 minutes (batch job), and staleness threshold is 5 minutes, does increase() work for 1-hour windows?
You're implementing a dashboard that shows API latency over the last 7 days. However, you notice the dashboard graph has gaps for specific hours when the service was redeployed (scraping was paused). You want to fill these gaps to show a continuous line. What strategies work for gap-filling in long-term dashboards?
Gap-filling strategies depend on the use-case and gap type: (1) PromQL has no fill() function; to carry the last value forward across short gaps, use last_over_time over a subquery: 'last_over_time(rate(http_request_duration_seconds_count[5m])[15m:1m])' re-emits the most recent point from the last 15 minutes. (2) Where zero is the honest value (genuinely no traffic), default an empty aggregate: 'sum(rate(...[5m])) or vector(0)'. (3) For SLO dashboards, decide explicitly how missing data counts: filling with a sentinel such as 'or vector(-1)' makes missing intervals visible and lets you treat them as SLO-violating rather than silently ignored. (4) Recording rules with defaults: 'record: api_latency_with_default, expr: sum(rate(http_request_duration_seconds_sum[5m])) or vector(0)' always produces a sample, so downstream queries never see an empty result. (5) For 7-day dashboards, raise Grafana's 'Min step' (min interval) so each rendered point covers a wider window; at hourly resolution, a few missing minutes no longer produce visible breaks. (6) Downsampling (Thanos, Cortex/Mimir) aggregates existing raw data into 5m/1h resolutions; short gaps get averaged over at coarse resolution, but a genuine hour-long hole in the raw data remains a hole when downsampled. (7) Accept gaps philosophically where appropriate: if data is genuinely missing, a visible gap is more honest than a fabricated line. Grafana's time series panel exposes this as 'Connect null values' (never, always, or up to a threshold gap duration).
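The SLO sentinel idiom as a sketch (metric and label names assumed): missing data becomes an explicit -1 series that panels can color differently instead of vanishing:

```yaml
groups:
  - name: slo-fill
    rules:
      # When the aggregate is empty (no samples at all), emit -1 so the
      # dashboard renders "missing" distinctly from "zero errors".
      # Caveat: 0/0 yields NaN (a present sample), so "or" does not
      # trigger in that case — only true absence produces the -1.
      - record: job:api_errors:ratio_or_missing
        expr: |
          (
            sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="api"}[5m]))
          ) or vector(-1)
```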
Follow-up: If you bridge a gap with 'last_over_time(metric[7d])' and the most recent sample is 7 days old, is the carried-forward value meaningful?
You have a metric that spikes occasionally (error_rate_high), triggering an alert 'error_rate > 50% for 1m'. However, if the metric is stale (> 5m without update) and then recovers (new scrape shows error_rate = 5%), the alert doesn't resolve immediately because it was never evaluated during staleness. How does alert state persist across stale data?
Alert state is driven entirely by what the expression returns at each evaluation, and rules keep being evaluated on schedule regardless of staleness. (1) When the metric goes stale, the alert expression returns no result for that series. (2) A firing alert whose series disappears resolves at the next evaluation: no match means the alert element is gone, and Prometheus sends a resolved notification. A pending alert is dropped and its 'for' timer resets. There is no cached alert condition. (3) So in your scenario: the alert fires at t=0, the metric goes stale, and the alert resolves at the first evaluation after the stale marker, before the recovery scrape even arrives. If it appears to stay firing, check the rule evaluation interval and Alertmanager's notification settings (group_interval, resolve handling) rather than Prometheus staleness. (4) When the metric returns with error_rate = 5%, the condition simply doesn't match, and nothing re-fires. (5) If the flapping itself is the problem, i.e. you don't want short stale gaps to resolve and re-fire alerts, use 'keep_firing_for' (Prometheus 2.42+): 'alert: name, expr: ..., for: 1m, keep_firing_for: 5m' holds the firing state through gaps of up to 5 minutes before resolving. (6) Alternatively, pair the alert with an absent()- or up-based alert, so stale data surfaces as its own condition rather than silently resolving the error alert.
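A sketch of keep_firing_for (requires Prometheus 2.42 or later; the recorded metric name is an assumption):

```yaml
groups:
  - name: error-alerts
    rules:
      - alert: HighErrorRate
        expr: job:api_errors:ratio > 0.5   # hypothetical recorded ratio
        for: 1m
        # Hold the firing state through gaps (e.g. staleness) of up to
        # 5m instead of resolving the moment the series disappears.
        keep_firing_for: 5m
```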
Follow-up: If an alert has 'for: 5m' and the metric becomes stale for 10m, then recovers, does the 'for' counter reset or does the alert fire immediately?