You're configuring Prometheus to scrape 500 targets from a dynamic environment (Kubernetes, EC2 with auto-scaling). Targets come and go frequently. How do you configure scrape_configs to handle dynamic target discovery efficiently?
Dynamic target discovery uses service discovery (SD) mechanisms. Configuration: (1) Kubernetes SD: scrape_configs: [ { job_name: 'kubernetes', kubernetes_sd_configs: [ { role: 'pod', namespaces: { names: ['default', 'prod'] } } ], relabel_configs: [ { source_labels: [__meta_kubernetes_pod_annotation_scrape], action: 'keep', regex: 'true' } ] } ]. This discovers all pods in the 'default' and 'prod' namespaces and keeps only those carrying the annotation. (2) EC2 SD: ec2_sd_configs: [ { region: 'us-east-1', port: 9100 } ]. Discovers EC2 instances and exposes their tags as __meta_ec2_tag_* labels. (3) Consul SD: consul_sd_configs: [ { server: 'consul.service.consul:8500' } ]. Discovers services registered in Consul. (4) SD refresh interval: for polling-based SDs (EC2, file_sd), refresh_interval controls how often the target list is refreshed; EC2 defaults to 60s. Kubernetes and Consul SD use watch APIs and pick up changes almost immediately, so no polling interval applies. Lowering refresh_interval speeds up target pickup at the cost of more API calls. (5) Relabeling: after SD discovers targets, use relabel_configs to filter or transform them. Keep only targets with specific labels or drop unwanted ones. (6) Drop targets: if a target doesn't have the annotation 'scrape=true', the 'action: keep' rule above excludes it; alternatively use 'action: drop' with a regex matching unwanted targets. (7) Monitoring: track prometheus_sd_discovered_targets (total discovered per SD mechanism) against the active target count on the Status → Targets page (or count(up) per job). If discovered is much higher than active, targets are being filtered by relabeling or are failing. (8) Scaling: for 500+ targets, SD can be slow. Optimize by: (a) filtering at the SD level (namespaces, tags) to reduce API calls, (b) increasing scrape_interval slightly if freshness can tolerate it, (c) running multiple Prometheus instances with sharded target lists.
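Put together, a minimal sketch of the Kubernetes SD job described above (the prometheus.io/scrape annotation name is a common convention, assumed here):

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['default', 'prod']   # filter at the SD level to cut API load
    relabel_configs:
      # Keep only pods that opt in via the annotation prometheus.io/scrape: "true"
      # (slashes and dots in annotation names become underscores in the meta label)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Pods without the annotation are discovered but dropped before any scrape happens, so they cost SD API calls but no scrape bandwidth.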
Follow-up: If a target is discovered but never scraped (relabel drops it), does it count toward cardinality or is it ignored?
You have 1000 targets, but only 100 should be scraped (others are backups or test instances). Scraping all 1000 would waste bandwidth and CPU. How do you implement efficient target filtering in Prometheus?
Target filtering at the scrape level: (1) Relabel keep/drop: use relabel_configs with 'action: keep' or 'action: drop'. Example: only scrape targets tagged with 'monitoring: true': relabel_configs: [ { source_labels: [__meta_ec2_tag_monitoring], action: 'keep', regex: 'true' } ]. Targets without this tag are not scraped. (2) Filter at the SD source: reduce the SD query scope. For Kubernetes SD in prometheus.yml: namespaces: { names: ['prod'] } discovers only the 'prod' namespace, ignoring 'test' and 'staging'. (The namespaceSelector: { matchNames: [...] } form is Prometheus Operator ServiceMonitor syntax, not plain Prometheus config.) (3) Multiple scrape jobs: create separate jobs for different target groups. Example: job 'prod_targets' for production, job 'staging_targets' for staging. Each job has its own scrape_interval and target filters. (4) Negative filter (drop): use a regex to drop unwanted targets. Example: drop all targets named 'test_*': relabel_configs: [ { source_labels: [__meta_kubernetes_pod_name], action: 'drop', regex: 'test_.*' } ]. (5) Service discovery filtering: some SD backends support filtering at discovery time. Consul: service filters. Kubernetes: label and field selectors. (6) Label mutation: set a filtering label during relabeling, e.g. 'monitoring: true' for production targets and 'false' for test targets, then apply a keep rule on that label. (7) Cost-benefit: filtering reduces scrape volume and storage. Trade-off: if you later want to monitor 'test' targets, you have no historical data for them. Use filtering for genuinely unwanted targets (backups, decommissioned instances).
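A sketch combining a keep and a drop rule for the EC2 case above (tag names are illustrative; note that relabel regexes are fully anchored):

```yaml
scrape_configs:
  - job_name: 'ec2-prod'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      # 1. Keep only instances tagged monitoring=true; everything else is never scraped
      - source_labels: [__meta_ec2_tag_monitoring]
        action: keep
        regex: "true"
      # 2. Of those, drop anything whose Name tag marks it as a backup instance
      - source_labels: [__meta_ec2_tag_Name]
        action: drop
        regex: "backup-.*"
```

Rules run in order, so the keep narrows the set first and the drop prunes it further; a target must pass every keep and no drop to be scraped.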
Follow-up: If you use 'action: drop' to exclude a target, can you later enable it without restarting Prometheus?
You're scraping a target with non-standard port and path. Exporter is at 'app.example.com:8443/monitoring/metrics'. Standard scrape config uses '/metrics' on port 9090. How do you configure custom scrape paths and ports?
Custom paths and ports are configured per job or per target: (1) Static targets: static_configs: [ { targets: ['app.example.com:8443'] } ]. The port is part of the target address. (2) Custom path: metrics_path: '/monitoring/metrics' (the config key is metrics_path, not scrape_path; the default is '/metrics'). (3) Custom scheme (http vs https): scheme: 'https'. (4) Full example: scrape_configs: [ { job_name: 'custom_target', scheme: 'https', metrics_path: '/monitoring/metrics', scrape_interval: '30s', static_configs: [ { targets: ['app.example.com:8443'] } ] } ]. (5) Multiple targets with the same custom path: targets: [ 'app1.example.com:8443', 'app2.example.com:8443' ]. (6) Dynamic targets (service discovery): with Kubernetes SD, use relabeling to set the path from a pod annotation: relabel_configs: [ { source_labels: [__meta_kubernetes_pod_annotation_scrape_path], target_label: __metrics_path__, action: 'replace', regex: '(.+)' } ]. This reads the 'scrape_path' annotation from the pod and uses it as the metrics path; the regex '(.+)' ensures the rule only applies when the annotation is non-empty. (7) Port mapping: if SD discovers targets on port X but the exporter listens on port Y, rewrite __address__: relabel_configs: [ { source_labels: [__address__], target_label: __address__, action: 'replace', regex: '([^:]+)(?::\\d+)?', replacement: '${1}:8443' } ]. (8) Parameters: if the exporter requires query parameters (e.g., 'format=prometheus'), use params: params: { 'format': ['prometheus'], 'timeout': ['10s'] }. This appends ?format=prometheus&timeout=10s to the scrape URL.
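A sketch of a job with custom scheme, path, port, and query parameters (hostnames illustrative; the config key for the path is metrics_path):

```yaml
scrape_configs:
  - job_name: 'custom_target'
    scheme: https
    metrics_path: /monitoring/metrics   # default would be /metrics
    scrape_interval: 30s
    params:
      format: ['prometheus']            # appended as ?format=prometheus
    static_configs:
      - targets: ['app.example.com:8443']  # port lives in the target address
```

The resulting scrape URL is https://app.example.com:8443/monitoring/metrics?format=prometheus.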
Follow-up: If a custom metrics path returns 404, does Prometheus retry or mark the target as down?
Your scrape targets sometimes timeout (network slow, exporter hangs). Prometheus waits for the full scrape_timeout (default 10s) before giving up, blocking other scrapes. How do you handle slow/hanging targets?
Scrape timeout and performance tuning: (1) Reduce scrape_timeout: set it to 5s or lower. If a target doesn't respond within the timeout, the scrape fails and the next attempt happens at the next interval. Note that scrape_timeout must not exceed scrape_interval. (2) Global timeout: global: { scrape_timeout: '5s' } applies to all jobs. Override per job: scrape_timeout: '3s' for critical targets, '15s' for known-slow targets. (3) Scrape parallelization: Prometheus runs an independent scrape loop (goroutine) per target, so a hanging target does not block scrapes of other targets; the premise of "blocked scrapes" doesn't apply, but each hanging scrape holds a connection and leaves a gap in that target's series. Scrape concurrency is not a tunable setting. (4) Identify slow targets: 'prometheus_target_interval_length_seconds' is a summary, so query it with a quantile label, e.g. prometheus_target_interval_length_seconds{quantile="0.99"}, and check the per-target 'scrape_duration_seconds' metric to see scrape latency. (5) Timeout-sensitive scrapes: for targets that often time out, set a longer timeout: scrape_timeout: '30s' (with a matching or longer scrape_interval). Better: contact the exporter owner to optimize (add caching, reduce cardinality). (6) Scrape result caching: if a target times out but returned data last time, some monitoring systems reuse the previous result. Prometheus doesn't; a timed-out scrape is simply a failed scrape. (7) Circuit breaker: after N consecutive timeouts, stop scraping a target temporarily. This isn't built into Prometheus but can be implemented via a proxy/sidecar. (8) Monitoring: alert when scrape duration approaches the timeout, e.g. 'scrape_duration_seconds > 9' with a 10s timeout. Investigate and optimize those targets.
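A sketch of global vs per-job timeouts (target addresses illustrative):

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 5s          # default for all jobs; must not exceed scrape_interval
scrape_configs:
  - job_name: 'fast-critical'
    scrape_timeout: 3s        # fail fast on normally-quick critical targets
    static_configs:
      - targets: ['api.example.com:9100']
  - job_name: 'slow-exporter'
    scrape_interval: 60s      # interval raised so the longer timeout fits inside it
    scrape_timeout: 30s       # known-slow exporter gets more time
    static_configs:
      - targets: ['legacy.example.com:9100']
```

Prometheus rejects a config where a job's scrape_timeout exceeds its effective scrape_interval, so raise the interval alongside the timeout.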
Follow-up: If a target consistently times out, does Prometheus continue attempting to scrape it or does it give up?
You're scraping targets behind a corporate proxy. Your Prometheus needs to route all scrape traffic through the proxy (HTTP_PROXY environment variable doesn't work). How do you configure Prometheus for proxy scraping?
Proxy configuration for Prometheus scrapes: (1) Per-job proxy (the fix when environment variables don't work): set proxy_url in the scrape config's HTTP client options: proxy_url: 'http://proxy.corp:3128'. Different jobs can point at different proxies, and jobs without proxy_url scrape directly. (2) Environment variables: HTTP_PROXY and HTTPS_PROXY set when starting Prometheus ('HTTP_PROXY=http://proxy.corp:3128 HTTPS_PROXY=http://proxy.corp:3128 prometheus') apply process-wide, but per-job proxy_url is explicit and more reliable. (3) Routing different targets through different proxies can also be done with a local proxy layer, e.g. an Nginx instance that routes to different upstream proxies based on URL patterns. (4) Proxy authentication: put credentials in the proxy URL: proxy_url: 'http://user:pass@proxy.corp:3128' (or the same form in HTTP_PROXY). There is no relabel-based proxy auth mechanism. (5) SOCKS proxy: recent Prometheus versions accept socks5:// URLs in proxy_url; alternatively create a local tunnel ('ssh -D 1080 proxy.corp' gives a SOCKS5 tunnel on localhost:1080) and point the proxy config at it. (6) Verify proxy: test connectivity with 'curl -x http://proxy.corp:3128 http://target:9090/metrics'. If the proxy blocks Prometheus, ask the proxy admin to whitelist Prometheus's IPs. (7) Monitoring: check the per-target 'up' metric and the Status → Targets page to verify scrapes succeed through the proxy; 'up == 0' with connection errors in the target's last-error field suggests the proxy is dropping traffic.
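A sketch of per-job proxy_url configuration (proxy host and targets illustrative):

```yaml
scrape_configs:
  - job_name: 'behind-proxy'
    proxy_url: 'http://proxy.corp:3128'   # only this job's scrapes go via the proxy
    static_configs:
      - targets: ['remote.datacenter.example:9100']
  - job_name: 'direct'
    static_configs:
      - targets: ['local.internal:9100']  # no proxy_url: scraped directly
```

Because proxy_url is part of the per-job HTTP client config, no environment variables or Prometheus restarts with modified env are needed; a config reload picks it up.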
Follow-up: If proxy authentication credentials are in the HTTP_PROXY env var, how do you prevent them from being exposed in process listings?
You're scraping targets from multiple datacenters with different networks. Some targets require scraping from a specific network gateway. How do you implement source IP pinning or route-based scraping?
Source IP pinning for Prometheus scrapes: (1) Prometheus cannot select a source IP per target; there is no config option for it. Note that --web.listen-address=192.168.1.10:9090 binds only Prometheus's own web interface, not the scrape client. (2) For scrape source routing, use OS-level routing on the Prometheus host: route traffic to 10.1.0.0/16 through gateway A and 10.2.0.0/16 through gateway B with static routes. This is kernel routing, not Prometheus config. (3) Container/Pod approach: if Prometheus runs in Docker/Kubernetes, attach the needed networks to the container. Docker: --ip=192.168.1.10 --network my_network. Kubernetes: multi-network CNI plugins (e.g., Multus) can attach additional interfaces to the pod. (4) Sidecar proxy: deploy a forwarder (e.g., socat, HAProxy) that routes scrapes through the correct gateway. Prometheus scrapes the sidecar (localhost:8080), which forwards to the actual target via the designated network. (5) Virtual interfaces plus policy routing: add virtual IPs to the host ('ip addr add 192.168.1.100/24 dev eth0') and use 'ip rule'/'ip route' so traffic to each destination network leaves with the right source IP and gateway; the kernel picks the source address, not Prometheus. (6) Multiple Prometheus instances: deploy one Prometheus per datacenter/network, each scraping the targets it can reach, and federate or query across them. (7) Scheduling: in Kubernetes, use node affinity or topology constraints to schedule Prometheus on nodes that have access to the required networks. (8) Practical recommendation: prefer OS-level routing or per-network Prometheus instances; per-target sidecar proxies work but are complex to maintain.
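A sketch of the sidecar approach, assuming socat forwarders bound to per-network source IPs (all addresses illustrative):

```yaml
# Run one local forwarder per network, each binding the source IP that routes
# through the right gateway, e.g. (shell commands, one per network):
#   socat TCP-LISTEN:18080,fork,reuseaddr TCP:10.1.0.5:9100,bind=192.168.1.10
#   socat TCP-LISTEN:18081,fork,reuseaddr TCP:10.2.0.5:9100,bind=192.168.2.10
scrape_configs:
  - job_name: 'dc-a-via-gateway-a'
    static_configs:
      - targets: ['localhost:18080']
        labels:
          real_target: '10.1.0.5:9100'   # keep the real address for dashboards
  - job_name: 'dc-b-via-gateway-b'
    static_configs:
      - targets: ['localhost:18081']
        labels:
          real_target: '10.2.0.5:9100'
```

The bind= option on socat's connect side is what pins the source IP; Prometheus itself only ever talks to localhost.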
Follow-up: If you have targets in 10 different networks and each requires a different gateway, can a single Prometheus instance scrape all?
Your Prometheus scrape config has a typo in a regex (e.g., 'regex: "[a-z"' missing closing bracket). The config loads without error, but the relabeling doesn't work. Targets are silently dropped or misrouted. How do you validate scrape config regex?
Scrape config validation: (1) Use promtool: 'promtool check config prometheus.yml' validates the config, including compiling relabel regexes, so a truly invalid pattern like '[a-z' is caught at check time (it would also make a config load or reload fail). The silent failures come from regexes that are valid but wrong: they compile but match nothing. (2) Relabel testing: 'promtool check service-discovery prometheus.yml <job_name>' runs service discovery and applies relabel_configs, showing which targets survive filtering and what their final labels are. (3) Separate regex tester: test patterns with an online tester (regex101.com, Golang flavor, since Prometheus uses Go's RE2 engine) or locally with 'grep -E'. (4) Prometheus debug logs: enable --log.level=debug; a failed config reload due to a bad regex is logged, and the previous config stays active. (5) Manual testing: apply the scrape config to a small set of targets and verify behavior in the Prometheus UI (Status → Targets) to see which targets are discovered and which are filtered. (6) In CI/CD: add a validation step in your deployment pipeline; run promtool check config (and check service-discovery) before deploying new scrape configs. (7) Monitoring: compare 'prometheus_sd_discovered_targets' against the active target count per job; if discovered is much higher, investigate the relabel rules for bugs. (8) Common errors: (a) relabel regexes are fully anchored (an implicit ^...$), so 'regex: "app"' will not match 'app-1:9100'; wrap substrings with '.*'. (b) unescaped dots: 'regex: "\\.metric\\."' matches literal dots, while '.metric.' lets '.' match any character in those positions. (c) capture groups: 'regex: "([a-z]+)", replacement: "$1"' works; if the regex has no groups and you reference $1, the replacement silently expands to an empty string.
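A sketch showing the validation commands alongside the anchoring pitfall (job and target names illustrative):

```yaml
# Validate before deploying (promtool ships with Prometheus):
#   promtool check config prometheus.yml                      # catches invalid regexes
#   promtool check service-discovery prometheus.yml my_job    # shows targets before/after relabeling
scrape_configs:
  - job_name: 'my_job'
    static_configs:
      - targets: ['app-1.example.com:9100']
    relabel_configs:
      # Relabel regexes are RE2 and fully anchored (implicit ^...$):
      # regex: 'app' would NOT match 'app-1.example.com:9100'.
      - source_labels: [__address__]
        action: keep
        regex: 'app-.*'     # anchored match against the whole address
```

Running the service-discovery check in CI makes valid-but-wrong regexes visible as an empty surviving-target list instead of a silent production gap.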
Follow-up: If a regex is invalid and a relabel rule fails, does the entire job fail or just that rule?
You're scraping cloud infrastructure (AWS, GCP) where targets have thousands of tags/labels. Your scrape config exposes all of them as metric labels (cardinality explosion). How do you selectively include/exclude labels during scrape?
Label filtering during scrape: (1) Meta labels: the __meta_* SD labels (including thousands of cloud tags) are discarded automatically after target relabeling; they only become metric labels if your relabel_configs copy them. So the first rule is: map only the tags you need, and the rest never reach storage. (2) Label drop: for unwanted labels that do make it onto metrics (e.g. via a broad labelmap), use the labeldrop action: metric_relabel_configs: [ { regex: 'aws_tag_.*', action: 'labeldrop' } ]. (3) Label keep (inverse): metric_relabel_configs: [ { regex: 'job|instance|__name__|env', action: 'labelkeep' } ]. Keeps only job, instance, __name__, env and drops everything else; use with care, since it also drops labels alerts may depend on. (4) Selective relabel: during target relabeling, explicitly extract the desired tags: relabel_configs: [ { source_labels: [__meta_aws_tag_team], target_label: team, action: 'replace' }, { source_labels: [__meta_aws_tag_env], target_label: env, action: 'replace' } ]. (5) For cloud SD (EC2, GCE): filter at the source. EC2 SD supports API-level filters, e.g. filters: [ { name: 'tag:Name', values: ['production*'] } ], which reduces both the discovered targets and the tags exported. (6) Cardinality control: track 'prometheus_tsdb_head_series' to monitor active series. If cardinality grows, find the offending labels with 'count by (<label>) (<metric>)' or the /api/v1/status/tsdb cardinality stats. (7) Exporter config: configure the exporter to drop unnecessary labels before exposing metrics, so filtering happens before the wire. (8) Recording rules: pre-aggregate away high-cardinality labels, e.g. record: app:requests:total with expr: sum by (job, service) (app_requests_total), which drops pod/instance/hostname. (9) For production: profile your exporter: 'curl -s http://target:9090/metrics | wc -l' shows the exposed line count. If it is in the tens of thousands, reduce cardinality at the source.
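A sketch tying the pieces together for EC2 (tag and label names illustrative): filter at the API, map only the wanted tags, and drop a high-cardinality metric label:

```yaml
scrape_configs:
  - job_name: 'ec2'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
        filters:                       # API-level filter: fewer targets discovered at all
          - name: 'tag:env'
            values: ['production']
    relabel_configs:
      # Copy only the tags you want as metric labels...
      - source_labels: [__meta_ec2_tag_team]
        target_label: team
      - source_labels: [__meta_ec2_tag_env]
        target_label: env
      # ...the remaining __meta_* labels are discarded automatically after relabeling.
    metric_relabel_configs:
      # Drop a high-cardinality label the exporter itself attaches to samples
      - regex: 'request_id'
        action: labeldrop
```

relabel_configs shape the target (before the scrape); metric_relabel_configs run on every scraped sample (after the scrape, before storage), so keep the latter cheap.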
Follow-up: If you drop labels via metric_relabel_configs that are needed for alerting, can you recover them?