Your team is deploying Prometheus to a Kubernetes cluster using Prometheus Operator. You've defined a PrometheusRule CRD with 50 alert rules, but alerts aren't firing. You check the Prometheus UI and see "0 loaded rules". What steps do you take to debug why the operator isn't loading rules?
PrometheusRule CRDs are discovered by the Prometheus Operator via label selectors. Debugging: (1) Verify the PrometheusRule exists: 'kubectl get prometheusrule -n monitoring'. (2) Check the Prometheus resource's ruleSelector: 'kubectl get prometheus prometheus-k8s -n monitoring -o yaml | grep -A5 ruleSelector'. The PrometheusRule's labels must match it. Also check ruleNamespaceSelector, which controls which namespaces are searched for rules — if unset, only the Prometheus resource's own namespace is searched. (3) Check the operator logs for reconciliation errors: 'kubectl logs -n monitoring prometheus-operator-xxx'. Look for "reconciling" messages and rejected-rule warnings. (4) Validate the PrometheusRule contents: 'kubectl get prometheusrule -n monitoring -o yaml'. Each rule needs an 'alert' (or 'record') field and a valid PromQL 'expr'; validate locally with 'promtool check rules rules.yaml'. (5) Check RBAC permissions: the operator needs get/list/watch on PrometheusRules. Verify: 'kubectl auth can-i list prometheusrules --as=system:serviceaccount:monitoring:prometheus-operator'. (6) As a last resort, restart the Prometheus pod to force a reload: 'kubectl delete pod -n monitoring prometheus-k8s-0'. (7) Check the Prometheus pod logs for rule-loading or parsing errors: 'kubectl logs -n monitoring prometheus-k8s-0 -c prometheus'.
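A minimal sketch of the label match the operator requires — the names and the `role: alert-rules` label here are illustrative, not defaults of any particular install:

```yaml
# Prometheus resource: only PrometheusRules carrying a matching label are loaded.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-k8s
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      role: alert-rules
---
# PrometheusRule: must carry the matching label, or it is silently ignored.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    role: alert-rules
spec:
  groups:
    - name: app.rules
      rules:
        - alert: InstanceDown
          expr: up == 0
          for: 5m
          labels:
            severity: critical
```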
Follow-up: If PrometheusRules are loaded but some alerts have syntax errors, does the entire rule file fail to load or just that rule?
You've configured Prometheus Operator to scrape metrics from all pods in the cluster using PodMonitor CRD. However, Prometheus is discovering and scraping 50,000 targets, consuming 40GB RAM and causing OOM kills. The cluster has 2,000 pods but scraping 25x that. What's causing this cardinality explosion?
The PodMonitor's selectors are almost certainly too broad, so the same pods are matched many times over. Likely causes: (1) Multiple ports per pod: each entry in podMetricsEndpoints creates a separate target per pod, multiplying the target count. (2) namespaceSelector set to any: true: this matches every namespace. By default a PodMonitor only selects pods in its own namespace; to widen deliberately, list namespaces explicitly: namespaceSelector: { matchNames: ['monitoring', 'prod'] }. (3) Pod selector too broad: selector: {} matches all pods. Use specific labels: selector: { matchLabels: { monitoring: 'true' } }. (4) Duplicate PodMonitors: 'kubectl get podmonitor --all-namespaces'. If multiple PodMonitors select the same pods, those pods are scraped once per PodMonitor under different jobs. (5) Pods without a real metrics endpoint still become (failing) targets if matched, further inflating the list. (6) Separately from target count, audit per-target series cardinality: verify each exporter only exposes necessary metrics, not high-cardinality application labels (user IDs, request paths) mixed in. Track growth with prometheus_tsdb_head_series and prometheus_tsdb_head_chunks_created_total. Fix by: narrowing PodMonitor pod and namespace selectors, removing duplicate PodMonitors, and dropping high-cardinality metrics via metricRelabelings.
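A hedged sketch of a PodMonitor with narrowed selectors and a cardinality-dropping relabel rule — the names, labels, and the dropped metric are all illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: ["prod"]         # instead of any: true (all namespaces)
  selector:
    matchLabels:
      monitoring: "true"         # instead of {} (all pods)
  podMetricsEndpoints:
    - port: metrics              # one named port, not every container port
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: http_request_duration_seconds_bucket
          action: drop           # drop a high-cardinality metric at scrape time
```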
Follow-up: If you have a PodMonitor and a ServiceMonitor for the same pod, does Prometheus scrape it twice?
Your team updates a PrometheusRule to change an alert (e.g., adding a 'for: 5m' duration to an 'up == 0' alert). You apply the CRD, but the new rule doesn't take effect for 30 minutes. The Prometheus UI still shows the old rule. What's causing the delay?
Several delays stack up between 'kubectl apply' and the rule taking effect: (1) Operator reconciliation: the operator normally reacts to watch events within seconds, but if it is wedged, restarting, or erroring, the change waits until the next resync. Check its logs for reconciliation errors. (2) Volume propagation: the operator writes rule files into a ConfigMap mounted into the Prometheus pod; the kubelet syncs mounted volumes periodically, which can add a minute or more — and substantially longer on overloaded nodes — before the new file appears inside the container. (3) Config reload: the config-reloader sidecar watches the mounted files and POSTs to Prometheus's /-/reload endpoint. Check its logs ('kubectl logs -n monitoring prometheus-k8s-0 -c config-reloader') — failed or repeatedly retried reloads are a common cause of multi-minute delays. (4) Evaluation: even once the rule loads, it is only re-evaluated on the next cycle of its rule group (default --evaluation-interval, typically 15s-1m), and a 'for: 5m' clause then adds five more minutes of pending state before the alert fires. To speed things up: (a) check the config-reloader and operator logs for errors. (b) Force a restart: 'kubectl delete pod -n monitoring prometheus-k8s-0'. (c) Trigger a reload manually: 'curl -X POST http://prometheus:9090/-/reload' (requires --web.enable-lifecycle). (d) Verify the change actually reached the API server: 'kubectl get prometheusrule -n monitoring -o yaml | grep -A5 alert'.
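Note that the duration condition lives in the rule's 'for' field, not inside the expression ('up == 0 for 5m' is not valid PromQL). A minimal sketch, with illustrative names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: instance-alerts
  namespace: monitoring
  labels:
    role: alert-rules
spec:
  groups:
    - name: availability
      rules:
        - alert: InstanceDown
          expr: up == 0
          for: 5m   # fires only after the condition has held for 5 minutes
```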
Follow-up: If the PrometheusRule has a syntax error, does the operator fail the update or silently drop the rule?
You've defined a ServiceMonitor for a custom app that exposes metrics at '/custom-metrics' endpoint (not standard '/metrics'). However, Prometheus isn't scraping it. You've verified the ServiceMonitor YAML looks correct. What are the likely issues?
ServiceMonitor discovery and scraping depend on several factors. Debugging: (1) Verify the Service exists: 'kubectl get service -n app-namespace my-service'. A ServiceMonitor selects Services, not pods directly. (2) Check that the ServiceMonitor's selector matches the Service's labels: ServiceMonitor.spec.selector.matchLabels: { app: 'my-app' } must match Service.metadata.labels: { app: 'my-app' }. (3) Verify the endpoint path: ServiceMonitor.spec.endpoints[0].path must be '/custom-metrics' — the default is '/metrics', so omitting it silently scrapes the wrong path. (4) Check the port name: endpoints[0].port must match a named port in Service.spec.ports[].name. Example: port: 'metrics' requires the Service to have a port named 'metrics', not 'http'. (5) Validate the Prometheus-side selectors: Prometheus.spec.serviceMonitorSelector filters ServiceMonitors by label (e.g. it may require release: 'prometheus' on the ServiceMonitor), and serviceMonitorNamespaceSelector controls which namespaces are searched — if unset, only the Prometheus resource's own namespace. (6) Check the app pod is running and matched by the Service: 'kubectl get endpoints -n app-namespace my-service'. No endpoints means no targets. (7) Verify the metrics endpoint responds: 'kubectl port-forward -n app-namespace service/my-service 8080:8080' then 'curl http://localhost:8080/custom-metrics'.
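Putting the label, port-name, and path requirements together — a hedged sketch in which every name is illustrative:

```yaml
# Service exposing the app's metrics on a *named* port.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: app-namespace
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: metrics        # the ServiceMonitor matches this name, not the number
      port: 8080
---
# ServiceMonitor: labels, port name, and path must all line up.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: app-namespace
  labels:
    release: prometheus    # must satisfy Prometheus.spec.serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app          # must match the Service's labels
  endpoints:
    - port: metrics
      path: /custom-metrics
```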
Follow-up: If ServiceMonitor label selector is empty (matchLabels: {}), does it match all Services in the namespace or only those with no labels?
Your Prometheus Operator setup scrapes 100 targets, but you notice a specific target (Kubernetes API server) is being scraped every 5 seconds instead of every 30 seconds (the global scrape_interval). How does Prometheus Operator determine per-target scrape intervals?
Scrape interval is set globally in Prometheus.spec.scrapeInterval (default 30s) but can be overridden per endpoint via ServiceMonitor or PodMonitor endpoints[].interval. Debugging: (1) Check the Prometheus resource: 'kubectl get prometheus -o yaml | grep scrapeInterval'. (2) Find the ServiceMonitor that targets the API server: 'kubectl get servicemonitor --all-namespaces | grep -i apiserver'. (3) Inspect it: 'kubectl get servicemonitor kube-apiserver -n monitoring -o yaml | grep -B2 -A5 interval'. If endpoints[].interval is set to '5s', that overrides the global value. (4) To change it, remove the interval field (falling back to the global scrapeInterval) or edit it. Beware that a merge patch replaces list fields wholesale — patching only the interval with '--type merge' would wipe out the endpoint's port and path — so use 'kubectl edit' or a JSON patch: kubectl patch servicemonitor kube-apiserver -n monitoring --type=json -p '[{"op": "replace", "path": "/spec/endpoints/0/interval", "value": "30s"}]'. (5) After patching, the operator reconciles the change and the config-reloader reloads Prometheus; delete the Prometheus pod if you need to force it. Per-endpoint intervals are useful for critical services (scrape more frequently) or expensive exporters (scrape less frequently), but should be used judiciously to avoid unintended gaps.
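A hedged sketch of a ServiceMonitor with an explicit per-endpoint interval (names and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: monitoring
spec:
  selector:
    matchLabels:
      component: apiserver
  endpoints:
    - port: https
      interval: 30s   # overrides Prometheus.spec.scrapeInterval for this endpoint only
```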
Follow-up: If you set scrape_interval: 5s globally but a ServiceMonitor has interval: 30s, which takes precedence?
You're implementing multi-cluster monitoring using Prometheus Operator. Each cluster has its own Prometheus instance, but you need a global Prometheus to federate and query metrics from all clusters. How do you set up federation with Operator-managed Prometheus instances?
Set up a central Prometheus that scrapes the /federate endpoint of each regional Prometheus. With Operator-managed instances: (1) If the regional instances run outside the cluster, a ServiceMonitor won't discover them via an ExternalName Service — ServiceMonitors resolve to Endpoints objects, and an ExternalName Service has none. Either create a selector-less Service plus a manually maintained Endpoints object pointing at the external addresses, or (simpler) use additionalScrapeConfigs as in (6). (2) If the regional instances are in the same cluster, a plain ServiceMonitor works: spec: { selector: { matchLabels: { instance: 'regional' } }, endpoints: [ { port: 'web', path: '/federate', honorLabels: true, params: { 'match[]': ['up', 'node_cpu_seconds_total'] } } ] }. Set honorLabels: true so the regional instances' labels survive federation. (3) Set externalLabels on each regional Prometheus to tag its metrics by origin: Prometheus.spec.externalLabels: { cluster: 'us-east' }. (4) Keep the match[] selectors narrow — federate pre-aggregated recording rules rather than raw series, or the global instance inherits the full cardinality of every cluster. (5) Retention is typically short on the regional instances (e.g. '15d' of raw data) and longer on the global instance, which holds only the federated aggregates. (6) For scrape options the ServiceMonitor CRD doesn't expose, use Prometheus.spec.additionalScrapeConfigs, which references a key in a Secret (not a ConfigMap): 'kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring'.
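A hedged sketch of the additionalScrapeConfigs route: this fragment is the content of the Secret key that Prometheus.spec.additionalScrapeConfigs points at. Hostnames and match[] selectors are placeholders:

```yaml
# prometheus-additional.yaml — classic federation scrape job.
- job_name: federate-us-east
  honor_labels: true             # keep the regional instance's own labels
  metrics_path: /federate
  params:
    'match[]':
      - '{__name__=~"job:.*"}'   # pull only pre-aggregated recording rules
      - 'up'
  static_configs:
    - targets:
        - prometheus-us-east.example.com:9090
      labels:
        cluster: us-east
```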
Follow-up: If you're federating from multiple clusters and each has external_labels: { cluster: 'name' }, can you get duplicate cardinality issues?
Your Prometheus Operator setup uses a custom TLS certificate for scraping targets. You've created a Secret with the certificate, but Prometheus is rejecting the cert as invalid ("x509: certificate signed by unknown authority"). How do you debug and fix TLS issues in Operator-managed scraping?
TLS for scraping is configured via the ServiceMonitor endpoint's tlsConfig block. Debugging: (1) Verify the Secret exists: 'kubectl get secret -n monitoring my-tls-secret'. (2) Check the ServiceMonitor: 'kubectl get servicemonitor -o yaml | grep -A10 tlsConfig'. The cleanest form is a direct Secret reference, which the operator mounts automatically: tlsConfig: { ca: { secret: { name: 'my-tls-secret', key: 'ca.crt' } } }. (3) If you use caFile paths instead, the Secret must be listed in Prometheus.spec.secrets; the operator then mounts it at /etc/prometheus/secrets/{secret-name}/, so the path is caFile: '/etc/prometheus/secrets/my-tls-secret/ca.crt'. An "unknown authority" error usually means the CA in the Secret is not the one that signed the target's certificate, or the file isn't mounted where tlsConfig points. (4) Verify the certificate is PEM-encoded and is actually the right CA: 'openssl x509 -in ca.crt -text -noout', and compare against what the target serves: 'openssl s_client -connect target:port -showcerts'. (5) Check the mounts inside the pod: 'kubectl exec prometheus-k8s-0 -- ls /etc/prometheus/secrets/'. (6) Check scrape errors on the Targets page or in logs: 'kubectl logs prometheus-k8s-0 -c prometheus | grep -i tls'. (7) If the target's cert is self-signed, insecureSkipVerify: true is a temporary workaround (not for production): tlsConfig: { insecureSkipVerify: true }. (8) For mutual TLS, add cert and keySecret references alongside ca in tlsConfig.
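A hedged sketch of a ServiceMonitor using Secret references for the CA and, for mutual TLS, the client pair — the Secret name and key names are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: secure-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: secure-app
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        ca:                      # CA used to verify the target's certificate
          secret:
            name: my-tls-secret
            key: ca.crt
        cert:                    # client certificate (only needed for mTLS)
          secret:
            name: my-tls-secret
            key: tls.crt
        keySecret:               # client private key (only needed for mTLS)
          name: my-tls-secret
          key: tls.key
```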
Follow-up: If a Secret is updated after Prometheus is deployed, how quickly does Operator refresh the mounted certificate?
You're using PrometheusRule with recording rules to pre-aggregate metrics. However, you notice the recording rules are evaluating on the 'monitoring' namespace Prometheus but the alerts depend on them. If you scale to multiple Prometheus instances (one per namespace), the recording rules won't exist for all instances. How do you share recording rules across multiple Prometheus instances?
PrometheusRules are namespaced, but a single Prometheus can load rules from many namespaces: the Prometheus resource's ruleNamespaceSelector chooses which namespaces to search (unset = only its own namespace; {} = all namespaces) and ruleSelector filters by label. For a multi-instance setup: (1) Create the shared PrometheusRule in a common namespace (e.g. 'monitoring'): kubectl apply -f recording-rules.yaml -n monitoring. (2) Give each Prometheus instance a ruleNamespaceSelector that includes that namespace and a ruleSelector matching the rule's labels. Example: Prometheus in 'app-ns' has ruleSelector: { matchLabels: { shared: 'true' } }; add the label to the PrometheusRule: metadata: { labels: { shared: 'true' } }. (3) The operator discovers the rule via the label match and loads it into every matching instance. Note that each instance evaluates the rule against its own data, so a recording rule only produces meaningful series where the underlying metrics exist. (4) For namespace-specific rules, create separate PrometheusRules in those namespaces with labels that only the local instance's ruleSelector matches. (5) Verify rules load: 'kubectl get prometheusrule --all-namespaces' and check each Prometheus UI (Status > Rules). (6) If you instead want one instance to evaluate and the others to query the results, use remote_write to a shared backend (Thanos, Mimir), so all instances access the pre-aggregated data. (7) For complex setups with 10+ instances, maintain the recording rules in a GitOps repo and deploy them via Helm or kustomize, ensuring consistency.
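A hedged sketch of the cross-namespace sharing described above (all names and labels are illustrative):

```yaml
# Prometheus in an app namespace, opting in to shared rules from anywhere.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: app-prometheus
  namespace: app-ns
spec:
  ruleNamespaceSelector: {}       # {} = search all namespaces (unset = own only)
  ruleSelector:
    matchLabels:
      shared: "true"
---
# Shared recording rules, kept once in a common namespace.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: shared-recording-rules
  namespace: monitoring
  labels:
    shared: "true"
spec:
  groups:
    - name: aggregations
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
```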
Follow-up: If two PrometheusRules with the same name exist in different namespaces, does one override the other or do both load?