You have batch jobs that run every 6 hours and report completion status. The jobs are ephemeral (destroyed after completion) so Prometheus can't pull metrics via scrape. How do you use Pushgateway to capture metrics from batch jobs?
Pushgateway is a metrics sink for jobs that can't be scraped (ephemeral jobs, batch jobs, private networks). In your case:
(1) The batch job runs and, before exiting, pushes its metrics to Pushgateway: 'curl -X POST --data-binary @metrics.txt http://pushgateway:9091/metrics/job/batch_job_6h'. (A PUT replaces the whole group; a POST replaces only metrics with the same names.)
(2) Prometheus scrapes Pushgateway (not the job directly) at its regular interval (e.g., every 30s). Pushgateway stores metrics in memory, grouped by the job and instance labels encoded in the push URL.
(3) Metrics are retained in Pushgateway until they are deleted via the API or Pushgateway restarts without persistence enabled. There is no built-in TTL, so a job that ran 6 hours ago still appears as current.
(4) Configure Prometheus to scrape Pushgateway with 'honor_labels: true' so the pushed job label isn't overwritten by the scrape job's: scrape_configs: [ { job_name: 'pushgateway', honor_labels: true, static_configs: [ { targets: [ 'pushgateway:9091' ] } ] } ].
(5) Pushgateway exposes a 'push_time_seconds' metric per group, so you can identify stale metrics.
(6) To clean up, delete via the API: 'curl -X DELETE http://pushgateway:9091/metrics/job/batch_job_6h', either manually or from a scheduled job; Pushgateway will not expire metrics on its own.
Recommendation: use Pushgateway for important batch-completion metrics (success/failure, runtime), but don't rely on it for continuous monitoring.
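A minimal sketch of the push step, assuming metrics are written in the Prometheus text exposition format. The metric names, job name, and Pushgateway URL are illustrative, and the actual curl is left commented so the sketch runs without a live Pushgateway:

```shell
#!/bin/sh
# Write the job's metrics in the text exposition format; TYPE lines are
# optional but recommended. (metric names are illustrative)
cat > metrics.txt <<'EOF'
# TYPE batch_job_success gauge
batch_job_success 1
# TYPE batch_job_duration_seconds gauge
batch_job_duration_seconds 183.4
EOF

# The grouping key is encoded in the URL path:
#   /metrics/job/<job>[/<label>/<value>]...
PUSH_URL="http://pushgateway:9091/metrics/job/batch_job_6h/instance/$(hostname)"

echo "would push to: ${PUSH_URL}"
# curl -X POST --data-binary @metrics.txt "${PUSH_URL}"
```

Everything pushed in one request lands in one group, so a later push to the same job/instance path replaces this whole set.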
Follow-up: If a batch job crashes and never pushes metrics, how long does Pushgateway keep the old metrics from the previous run?
Your team is using Pushgateway for application metrics in production. Multiple app instances push metrics every minute. However, Pushgateway is a single point of failure. If it goes down, Prometheus has no metrics and alerts fire. How do you add HA to Pushgateway?
Pushgateway stores all metrics in memory (optionally snapshotted to disk with '--persistence.file'). Replicas do not replicate state to each other, so naively running several just creates independent single points of failure. Options:
(1) Put a load balancer (Nginx, HAProxy) in front of multiple replicas. Apps push to the load balancer, which routes each push to one replica; if a replica is down, pushes go to another. Downside: each replica now holds a different, partial set of metrics.
(2) Have Prometheus scrape every replica directly (not through the load balancer): scrape_configs: [ { job_name: 'pushgateway', honor_labels: true, static_configs: [ { targets: [ 'pushgateway1:9091', 'pushgateway2:9091' ] } ] } ]. Use 'up{job="pushgateway"}' to watch both. Each scrape target gets its own identity, so series from the two replicas don't collide, but each pushed metric still appears only on the replica that received the push.
(3) For full coverage, have clients push to all replicas (fan-out) so every replica holds every metric. Give each scrape target a distinguishing label (e.g., 'labels: { replica: "a" }' in its static_config) and collapse the duplicates at query time with 'max without (replica) (...)'.
(4) Alternatively, skip Pushgateway and push via remote_write into a distributed backend built for HA ingestion (Thanos Receive, Mimir).
(5) For production, avoid Pushgateway for HA-critical metrics. If you must use it, accept that replicas hold separate state and merge at query time (fan-out pushes plus dedup, or federation/remote_write into a central store).
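The fan-out approach can be sketched as a small push helper; replica hostnames are illustrative, and a dead replica only degrades coverage instead of losing the push entirely:

```shell
#!/bin/sh
# Fan-out push sketch: send the same payload to every Pushgateway replica so
# each one holds the full metric set. Replica addresses are illustrative.
REPLICAS="pushgateway1:9091 pushgateway2:9091"

push_all() {
  payload="$1"   # file in the text exposition format
  group="$2"     # e.g. /metrics/job/batch_job_6h
  failures=0
  for replica in $REPLICAS; do
    # --max-time keeps one dead replica from stalling the job's exit
    curl -fsS --max-time 5 -X POST --data-binary "@${payload}" \
      "http://${replica}${group}" || failures=$((failures + 1))
  done
  return "$failures"   # 0 only if every replica accepted the push
}

# Usage (against real replicas):
# push_all metrics.txt /metrics/job/batch_job_6h
```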
Follow-up: If you have 3 Pushgateway replicas and app 1 pushes to replica 1, app 2 pushes to replica 2, can Prometheus query metrics from app 1?
Your ops team is pushing metrics to Pushgateway for manual tracking (e.g., 'manual_deployment{version="1.2.3"} = 1'). However, Pushgateway is retaining these metrics indefinitely, and old deployments (from months ago) still appear in Prometheus. You want Pushgateway to auto-expire old metrics. How do you implement TTL for Pushgateway metrics?
Pushgateway has no built-in per-metric TTL; metrics persist until deleted or until Pushgateway restarts without persistence. Workarounds:
(1) Manual deletion: the ops team runs 'curl -X DELETE http://pushgateway:9091/metrics/job/deployment_tracking/instance/v1.2.3' after a set period. Not scalable.
(2) Scheduled cleanup: a cron script scrapes 'http://pushgateway:9091/metrics', reads the 'push_time_seconds' series (one per group, set at push time), and deletes any group older than X days via the DELETE API.
(3) Prometheus-side retention: limit Prometheus storage (e.g., '--storage.tsdb.retention.time=30d') so samples older than 30 days are pruned. Pushgateway still serves the metric, though, so it reappears as "current" on every new scrape; this bounds history, not staleness.
(4) Staleness alerting: alert on 'time() - push_time_seconds > 604800' (older than 7 days) to flag stale groups, then clean them up.
(5) Best practice: don't use Pushgateway as a state store for things like deployment history. Keep that in a proper store (database, file) and expose only current state as metrics.
(6) For batch/event metrics, a wrapper service can queue pushes and refresh or expire groups centrally, ensuring only fresh data sits in Pushgateway.
(7) Consider alternatives: deployment events fit an event/log system (ELK, Splunk) better than long-lived metrics.
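The scheduled-cleanup workaround can be sketched as below. It parses 'push_time_seconds' from the scrape output and lists stale groups; the sample data stands in for a live scrape (hostnames and job names are illustrative), and the DELETE is left commented:

```shell
#!/bin/sh
# Cleanup sketch: list groups whose push_time_seconds is older than MAX_AGE.
# Pushgateway exposes one push_time_seconds series per group.
MAX_AGE=604800   # 7 days
NOW=$(date +%s)

# Sample scrape output standing in for: curl -s http://pushgateway:9091/metrics
cat > scrape.txt <<EOF
push_time_seconds{instance="",job="fresh_job"} ${NOW}
push_time_seconds{instance="",job="stale_job"} 1600000000
EOF

stale_jobs() {
  awk -v now="$NOW" -v max="$MAX_AGE" '
    /^push_time_seconds/ && (now - $2) > max {
      # pull the job="..." value out of the label set
      if (match($0, /job="[^"]*"/)) {
        print substr($0, RSTART + 5, RLENGTH - 6)
      }
    }' scrape.txt
}

for job in $(stale_jobs); do
  echo "would delete: http://pushgateway:9091/metrics/job/${job}"
  # curl -X DELETE "http://pushgateway:9091/metrics/job/${job}"
done
```

A real version would also walk any extra grouping labels (instance, run_id) into the DELETE path, since deletion addresses a whole group.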
Follow-up: If Pushgateway keeps metrics indefinitely and you have 100k metrics, what happens to memory usage over time?
Your developers are using Pushgateway incorrectly: they're pushing metrics for long-running services (web servers, databases) every second, effectively using it as an exporter replacement. Prometheus is scraping Pushgateway continuously, and you see duplicate metrics and performance issues. How do you prevent Pushgateway misuse?
Pushgateway is for batch/ephemeral jobs, not continuous services. Misuse patterns and fixes:
(1) Long-running services should expose a /metrics HTTP endpoint and be found via service discovery. Educate teams: "If your service runs for more than an hour, get scraped, don't push."
(2) Duplicate metrics: if a service both exports metrics directly (port 8080) and pushes to Pushgateway (port 9091), Prometheus scrapes both and doubles the data. Fix: scrape the service directly and delete its Pushgateway groups.
(3) Push frequency too high: apps pushing every second load Pushgateway and the network for no benefit, since only the latest push per group survives anyway. Set policy: batch jobs push once, at completion. Use the grouping key in the URL so each instance maps to exactly one group: 'curl -X POST --data-binary @metrics.txt http://pushgateway:9091/metrics/job/myjob/instance/instance1'.
(4) Rate limiting: Pushgateway has none built in, so enforce it in a reverse proxy in front of the push endpoint (e.g., max 100 pushes/minute per client).
(5) Monitoring: watch Pushgateway's own metrics (e.g., 'pushgateway_http_requests_total' broken down by handler) to detect abusive push rates, and alert when they spike.
(6) Define the policy explicitly: "Pushgateway is for batch jobs with infrequent updates; continuous services get scraped." Publish examples and best-practice documentation.
(7) Migration: audit current Pushgateway usage and move inappropriate services to scrape-based monitoring.
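Since Pushgateway has no rate limiting of its own, the per-client limit has to live in a reverse proxy in front of it. A hedged Nginx sketch (zone size, rate, and upstream name are illustrative assumptions):

```nginx
# In the http {} context: track clients by IP, allow ~100 pushes/minute each.
limit_req_zone $binary_remote_addr zone=pgw_push:10m rate=100r/m;

server {
    listen 9091;

    # Only the push path is limited; Prometheus scrapes of /metrics
    # (which is not under /metrics/job/) stay unthrottled.
    location /metrics/job/ {
        limit_req zone=pgw_push burst=20 nodelay;
        proxy_pass http://pushgateway-backend:9091;
    }

    location / {
        proxy_pass http://pushgateway-backend:9091;
    }
}
```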
Follow-up: If a service pushes to Pushgateway every second but crashes, leaving Pushgateway with stale metrics, how long do they remain?
You're pushing metrics from 100 batch jobs to a single Pushgateway instance. Each job pushes 1000 metrics every hour under a unique per-run grouping key. After 6 months, Pushgateway has accumulated ~440M series (100 jobs × 1000 metrics × ~4380 hourly runs). Memory usage is 50GB and Pushgateway is slow. How do you address Pushgateway scalability?
Pushgateway is not designed for this scale, and the accumulation itself is a smell: pushes to a fixed grouping key overwrite each other, so metrics only pile up like this when every run pushes under a unique grouping key (e.g., a per-run instance or run_id label). Solutions:
(1) Stop putting run-specific values in the grouping key where possible; a stable job/instance grouping caps Pushgateway at the latest run per job.
(2) Aggressive cleanup: a scheduled job deletes groups older than 7-30 days via 'curl -X DELETE http://pushgateway:9091/metrics/job/{job}/instance/{instance}'.
(3) Cardinality reduction: push only essential metrics per job (success/failure, runtime) instead of 1000; drop high-cardinality debug metrics.
(4) Sharding: run multiple Pushgateway instances, each owning a subset of jobs (e.g., hash the job name to pick a shard, so pushes and deletes for a job always hit the same instance). Prometheus scrapes all shards: scrape_configs: [ { job_name: 'pushgateway', honor_labels: true, static_configs: [ { targets: [ 'pushgateway1:9091', 'pushgateway2:9091' ] } ] } ].
(5) Batching: combine related jobs' metrics into a single push and group, cutting push traffic and the number of groups.
(6) Migration: move to a backend designed for write scale (Mimir, Cortex, Thanos Receive); apps push to its remote_write endpoint instead.
(7) For extreme scale, reconsider the architecture: per-run detail may belong in event logging (ELK) or structured logs, with only aggregate health metrics in Prometheus.
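The job-to-shard routing in the sharding approach can be sketched with a stable hash, so a given job always lands on (and is deleted from) the same instance. Hostnames and the shard count are illustrative:

```shell
#!/bin/sh
# Shard-routing sketch: map each job name to a fixed Pushgateway instance.
# (hostnames illustrative)
SHARDS="pushgateway1:9091 pushgateway2:9091"
NSHARDS=2

shard_for() {
  # cksum gives a stable CRC of the job name; mod picks the shard
  sum=$(printf '%s' "$1" | cksum | awk '{print $1}')
  idx=$(( sum % NSHARDS + 1 ))
  echo "$SHARDS" | awk -v i="$idx" '{print $i}'
}

echo "job nightly_etl -> $(shard_for nightly_etl)"
```

Because the mapping is deterministic, the cleanup job from the earlier workaround can compute the same shard when issuing DELETEs.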
Follow-up: If you're pushing metrics for 100 different job types and want to clean up old metrics, how do you write a script that handles all variations?
You're monitoring batch job performance with Pushgateway. You push job_duration_seconds and job_success metrics every hour. However, when analyzing SLO (99% of jobs succeed), you notice Prometheus has duplicate metrics from the same job run (pushed twice due to retry logic). How do you handle duplicate pushes in Pushgateway?
Pushgateway groups metrics by the grouping key in the push URL (job plus any extra labels such as instance). A push to an existing group replaces that group's metrics, so two pushes with identical grouping keys leave only the second. Duplicates appear when:
(1) The retry uses a different grouping key (e.g., a fresh timestamp or random id in the instance label): Pushgateway then keeps both groups and Prometheus sees the same run twice. Fix: keep grouping labels stable across retries, e.g., 'curl -X POST --data-binary @metrics.txt http://pushgateway:9091/metrics/job/batch_job/instance/run_id_12345', reusing the same run_id on retry.
(2) Detecting duplicates in Prometheus: count series per job and compare to the expected instance count, e.g., 'count by (job) (job_success)'; alert when extra series appear.
(3) Prevention: make pushes idempotent by deriving the run identifier deterministically (e.g., a hash of job_id plus the scheduled start time) so every retry addresses the same group.
(4) Avoid encoding timestamps as labels (e.g., 'job_duration_seconds{timestamp="1620000000"}'): each run mints a brand-new series and cardinality grows without bound. (Pushgateway also rejects pushes that carry explicit sample timestamps.)
(5) For the SLO calculation, benign duplicates (two pushes of job_success=1) don't change the result; if a retried run pushes conflicting values (1 then 0) to the same group, only the last push survives, so investigate the retry logic.
(6) Best practice: a deterministic 'run_id' grouping label uniquely identifies each job run without letting retries fan out into duplicates, and makes post-run cleanup (DELETE per run_id) straightforward.
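A sketch of the stable-run_id idea: derive the grouping URL once per run and reuse it for every retry, so a retried push replaces the same group instead of creating a new series. Job name and URL are illustrative; the curl is commented so the sketch runs offline:

```shell
#!/bin/sh
# Stable-grouping sketch: RUN_ID is fixed for the lifetime of this process,
# so every retry addresses the same Pushgateway group.
JOB="batch_job"
RUN_ID="${RUN_ID:-$(date +%Y%m%d%H%M%S)-$$}"

group_url() {
  echo "http://pushgateway:9091/metrics/job/${JOB}/run_id/${RUN_ID}"
}

push_with_retry() {
  payload="$1"
  for attempt in 1 2 3; do
    echo "attempt ${attempt}: POST $(group_url)"
    # curl -fsS -X POST --data-binary "@${payload}" "$(group_url)" && return 0
    # sleep "$attempt"
  done
  return 1
}
```

Because each run still creates its own group, pair this with a DELETE of the run's group once Prometheus has scraped it, or the per-run groups accumulate (the scalability problem above).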
Follow-up: If you push the same metric with the same label but different value (e.g., job_success=1 then job_success=0), which value does Pushgateway store?
You're using Pushgateway for monitoring cron jobs in a distributed system (100 servers, each running 50 cron jobs). Each job pushes metrics when it completes. However, some servers have network issues and fail to push. You need visibility into which servers/jobs failed to push (not just jobs that didn't emit success metric). How do you implement push verification?
Pushgateway doesn't acknowledge pushes back into Prometheus: if a push fails, nothing records that it was even attempted. Implement verification with:
(1) Heartbeat metrics: each cron job pushes 'job_started_at' (a Unix timestamp) on start and 'job_completed_at' plus 'job_success' on completion. A started-but-never-completed run shows up as: '(time() - job_started_at > 3600) unless on (job, instance) (job_completed_at >= job_started_at)', i.e., alert on any start older than an hour with no completion at or after it.
(2) Separate push failure from job failure: when the network is down, nothing reaches Pushgateway at all, so also alert on heartbeats going stale: 'time() - job_started_at' exceeding the expected schedule interval catches jobs (or servers) that stopped reporting entirely.
(3) A push client library with retry logic: (a) push with exponential backoff, (b) spool failed pushes to a local file, (c) periodically re-attempt spooled pushes.
(4) Use remote_write instead of Pushgateway: cron jobs write metrics to a file or local endpoint, and a local agent (e.g., Telegraf) forwards them to a central backend via remote_write, which has built-in retry and buffering.
(5) Or run a per-server collector: node_exporter's textfile collector reads metric files the cron jobs drop on disk, and Prometheus scrapes each server directly, so an unreachable server is visible as 'up == 0'.
(6) Monitoring: watch Pushgateway's own HTTP metrics for push latency and error spikes (a sign of network issues), alongside any 'job_push_failed'-style metric your client emits.
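The retry-and-spool client can be sketched as follows. 'try_push' is a stand-in for the real curl so the failure path can be exercised offline; paths, URLs, and metric names are illustrative:

```shell
#!/bin/sh
# Push-client sketch: retry with exponential backoff, then spool failed
# payloads to disk so a later run can re-attempt them.
SPOOL_DIR="${SPOOL_DIR:-./pgw-spool}"
mkdir -p "$SPOOL_DIR"

try_push() {
  # real version: curl -fsS --max-time 5 -X POST --data-binary "@$1" "$2"
  return 1   # always fail here, to exercise the spool path offline
}

push_or_spool() {
  payload="$1"; url="$2"; delay=1
  for attempt in 1 2 3; do
    try_push "$payload" "$url" && return 0
    sleep 0               # real version: sleep "$delay"
    delay=$((delay * 2))  # 1s, 2s, 4s between attempts
  done
  # every retry failed: keep the payload and target for a later re-push
  base="${SPOOL_DIR}/$(date +%s)-$$"
  cp "$payload" "${base}.prom"
  echo "$url" > "${base}.url"
  return 1
}

echo 'job_success 1' > payload.txt
push_or_spool payload.txt "http://pushgateway:9091/metrics/job/cron_backup" \
  || echo "spooled for retry"
```

A companion cron task would walk the spool directory, re-push each '.prom' file to its '.url', and delete the pair on success.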
Follow-up: If 100 servers all fail to push simultaneously due to network partition, and you're using heartbeat metrics, how do you distinguish network failure from actual job crashes?
You're implementing a custom monitoring dashboard for batch jobs using Pushgateway metrics. You want to show job execution timeline (start → duration → completion) for the last 7 days. However, Pushgateway only stores current metrics, not historical timelines. How do you track historical job execution with Pushgateway?
Pushgateway holds only the latest value; history lives in Prometheus, which timestamps every scrape. To track timelines:
(1) Rely on Prometheus's native history: once Prometheus scrapes 'job_duration_seconds' from Pushgateway, a range query such as 'job_duration_seconds[7d]' (or a Grafana panel over the last 7 days) returns the timeline of values. Don't encode timestamps as labels ('job_duration_seconds{completed_at="1620000000"}'): label matchers support only equality and regex, not numeric comparison, and every run would mint a new series.
(2) Push timestamps as values, not labels: 'job_last_start_timestamp', 'job_last_completion_timestamp', 'job_duration_seconds', 'job_success'. 'changes(job_last_completion_timestamp[1d])' then approximates completions per day, and start/end pairs reconstruct each run's window.
(3) Use recording rules to pre-aggregate per-job daily counts, so 7-day dashboard panels stay cheap.
(4) If you need per-run records, give each run its own grouping key ('.../metrics/job/batch/instance/job_id_12345') and delete the group once Prometheus has scraped it; otherwise groups accumulate without bound (see the scalability question above).
(5) For long-term or ad-hoc analysis, export run records to a warehouse (ClickHouse, TimescaleDB) where SQL-style queries over execution history are natural.
(6) For production job tracking, prefer a scheduler that records execution history natively (Kubernetes CronJob, Airflow) and keep Pushgateway for coarse health metrics.
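A hedged sketch of such a recording rule, assuming each run pushes its completion time as the value of a hypothetical 'job_last_completion_timestamp' metric. Counting raw samples of a pushed gauge would count scrapes of Pushgateway rather than runs, which is why the rule counts value changes instead:

```yaml
groups:
  - name: batch_job_history
    rules:
      # Pushgateway re-serves the last pushed value on every scrape, so
      # count *changes* in the completion timestamp, not samples.
      - record: job:completions:count1d
        expr: changes(job_last_completion_timestamp[1d])
```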
Follow-up: If you push 1000 metrics per job and 100 jobs complete daily, how much storage does Prometheus need to store a year of historical timelines?