Prometheus Interview Questions

PromQL Fundamentals and Advanced Queries


A dashboard shows a spike in p99 latency at 14:32 UTC. Correlate it with query-rate changes across 50+ dashboards to find resource-heavy PromQL queries.

Use a multi-step approach: first examine the server's own metrics, such as `prometheus_engine_query_duration_seconds` (query evaluation latency) and `prometheus_tsdb_symbol_table_size_bytes`. Measure query volume via `rate(prometheus_http_requests_total{handler="/api/v1/query"}[5m])`. Narrow down offenders with regex-based metric selectors like `{__name__=~"metric_.*",job="prod"}` combined with aggregation. Enable query logging via the `query_log_file` setting to capture slow queries, and use the `/api/v1/status/tsdb` endpoint for cardinality stats (`--web.enable-admin-api` is only required for the admin TSDB endpoints).
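The first two diagnostics can be run as queries against the server itself (handler and metric names follow current Prometheus conventions; verify against your version):

```promql
# p90 query-evaluation latency, as reported by the query engine
prometheus_engine_query_duration_seconds{quantile="0.9"}

# Request rate against the instant-query API over the last 5 minutes
rate(prometheus_http_requests_total{handler="/api/v1/query"}[5m])
```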

Follow-up: How would you optimize that PromQL query if it times out on a 1TB dataset?

A customer's alert isn't firing for a trending metric. The PromQL expression uses a 5-minute lookback, but you suspect staleness or data gaps. Debug this.

Check for missing data with `absent(metric_name)`, which returns 1 when the instant vector is empty, i.e. no sample within the lookback delta (5 minutes by default). Examine raw samples over increasing time windows to spot gaps. Compare `increase()` against `rate()` to surface counter resets and scrape irregularities. Temporarily raise `--query.max-samples` (e.g. to 100000000) if wide raw-data queries hit the sample limit. Check the target's health on the /targets page directly.
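A couple of gap-hunting queries, sketched with a hypothetical `metric_name` series:

```promql
# Returns 1 when no sample exists within the lookback delta (5m by default)
absent(metric_name{job="prod"})

# Samples ingested per 10m window; dips below the expected
# count (scrape interval x 10m) reveal scrape gaps
count_over_time(metric_name{job="prod"}[10m])
```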

Follow-up: What's the safest way to adjust the global staleness lookback (`--query.lookback-delta`) in production?

Optimize a complex dashboard with 200+ queries per page load and nested aggregations across 10+ label dimensions. Performance is degrading.

Implement a tiered query strategy: (1) Pre-compute expensive aggregations with recording rules evaluated every 30s. (2) Route by time range: queries spanning >30 days hit coarser, downsampled data via Thanos. (3) Reduce label dimensionality via relabeling. (4) Cache query results at the Grafana level with a 1-5m TTL. (5) Split long-range queries into multiple shorter windows.
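Step (1) can be sketched as a recording-rule file (rule and metric names here are illustrative):

```yaml
# rules.yml -- pre-computes a dashboard aggregation every 30s
groups:
  - name: dashboard_precompute
    interval: 30s
    rules:
      # Dashboards query this cheap pre-aggregated series instead of raw data
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```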

Follow-up: How would you implement automated query optimization detection?

A financial trading system needs real-time PromQL queries on high-cardinality data (tickers, exchanges, timestamps), with <200ms latency required. Queries time out.

Multi-layer architecture: (1) Use Thanos or Mimir for distributed querying with federation across multiple Prometheus shards. (2) Implement vector matching efficiently with `on()` and `ignoring()` to avoid cross-product explosions. (3) Use pre-computed recording rules for common queries. (4) Query pushdown: move operations to scrape time. (5) Deploy dedicated query nodes with higher resources.
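Point (2) in practice: restrict the matching labels explicitly so the join stays one-to-one (metric names are hypothetical):

```promql
# Fill ratio per ticker; on(ticker) keeps unrelated labels
# (exchange, instance, ...) from producing a many-to-many match
sum by (ticker) (rate(orders_filled_total[1m]))
  / on (ticker)
sum by (ticker) (rate(orders_submitted_total[1m]))
```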

Follow-up: How would you handle backpressure when query volume spikes during market hours?

Migrate 500 dashboards from Graphite to Prometheus. PromQL syntax differs significantly. How do you validate correctness of converted queries?

Build multi-stage validation: (1) Create a translation layer mapping Graphite functions to PromQL equivalents. (2) Run side-by-side validation over overlapping time ranges and compute correlation (target >0.95). (3) Enforce strict limits during validation with `--query.timeout`. (4) Write unit tests with `promtool test rules` and YAML test files. (5) Run a canary: point 5% of dashboards at the converted queries and monitor divergence.
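Step (4) might look like this (rule file and series are illustrative; run with `promtool test rules tests.yml`):

```yaml
# tests.yml -- asserts a converted query against synthetic data
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Counter rising by 60 per minute, i.e. exactly 1 request/second
      - series: 'http_requests_total{job="api"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: sum by (job) (rate(http_requests_total[5m]))
        eval_time: 10m
        exp_samples:
          - labels: '{job="api"}'
            value: 1
```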

Follow-up: What monitoring would you set up to catch query regressions automatically?

Build an SLO dashboard computing 99.9th-percentile latencies across 50 microservices with 100+ instances each. Computing this live would be expensive.

Implement hierarchical recording rules: (1) Emit histogram buckets at scrape time so quantiles can be derived from bucket rates. (2) Create level-1 rules aggregating to service level with `histogram_quantile(0.999, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))`, evaluated every 30s. (3) Create level-2 rules for global aggregates. (4) Use native histograms (exponential buckets) for better precision at the tail. (5) Store results in a low-cardinality metric, e.g. `slo_latency_p999{service="api"}`.
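The two rule levels can be sketched as follows (rule names are illustrative):

```yaml
groups:
  - name: slo_latency
    interval: 30s
    rules:
      # Level 1: per-service bucket rates; keep the le label for quantile math
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
      # Level 2: cheap p99.9 per service from the pre-aggregated buckets
      - record: slo:latency_p999
        expr: histogram_quantile(0.999, service:http_request_duration_seconds_bucket:rate5m)
```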

Follow-up: How would you handle inaccuracy from long-tail latencies?

A bug causes a metric exporter to emit 10x cardinality for 30 minutes (100M → 1B series). Recovery takes 4 hours. How do you prevent this architecturally?

Implement multi-layer cardinality guards: (1) At scrape level, set hard limits via `sample_limit` and `label_limit` in `prometheus.yml`, and drop known-bad labels with `metric_relabel_configs`. (2) Enforce metric-level limits in custom scrapers and exporters. (3) Monitor cardinality via `prometheus_tsdb_head_series` and alert at 80% of planned capacity. (4) Implement circuit breakers: >50% series growth in 5 minutes triggers auto-disable of the offending scrape job. (5) Use Thanos/Cortex with per-tenant cardinality limits. (6) Tune `--storage.tsdb.max-block-duration` to force earlier compaction.
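Layer (1) as a scrape config (job name and limits are illustrative; note that a scrape exceeding `sample_limit` fails entirely rather than ingesting partially):

```yaml
scrape_configs:
  - job_name: prod-app
    sample_limit: 500000           # whole scrape fails if exceeded
    label_limit: 30                # max labels per series
    label_value_length_limit: 200
    metric_relabel_configs:
      # Drop a known unbounded label before it reaches the TSDB
      - action: labeldrop
        regex: request_id
```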

Follow-up: How would you design automated rollback for scrapers violating cardinality contracts?

The monitoring layer needs to alert on data exfiltration and abuse (e.g., querying millions of series). /api/v1/query receives 50k req/s. Design rate limiting and abuse detection.

Defense-in-depth: (1) Reverse proxy rate limits via Nginx: 1000 req/s per IP, burst 5000. (2) Token-based quotas: "10k queries/min, 1M series max". (3) Query cardinality estimation before execution. (4) Log and alert on `prometheus_http_requests_total` spikes. (5) Result set limits: `--query.max-samples=1000000` with `--query.timeout=30s`.
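Point (4) as an alerting rule (threshold and labels are illustrative):

```yaml
groups:
  - name: query_abuse
    rules:
      - alert: QueryRateSpike
        # Fires when the 1m query rate is more than double the 1h baseline
        expr: >
          sum(rate(prometheus_http_requests_total{handler="/api/v1/query"}[1m]))
            > 2 * sum(rate(prometheus_http_requests_total{handler="/api/v1/query"}[1h]))
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Instant-query rate more than doubled vs. the 1h baseline
```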

Follow-up: What zero-trust architecture prevents insider threats?
