Prometheus Interview Questions

Security, Authentication, and TLS


Your Prometheus instance is exposed on the public internet (by accident). An engineer discovers it and starts running queries. There's no authentication, no TLS, and metrics contain sensitive information (database passwords in labels, API keys in annotations). How do you quickly secure it?

Emergency mitigation:
(1) Immediately block public access: move Prometheus behind Nginx/HAProxy with authentication, or cut off external traffic via firewall rules (AWS security groups, iptables, or Kubernetes NetworkPolicy).
(2) Enable basic auth, e.g. via an Nginx proxy: location /prometheus { auth_basic "Prometheus"; auth_basic_user_file /etc/nginx/.htpasswd; proxy_pass http://prometheus:9090; }. (Prometheus 2.24+ can also enforce basic auth natively via --web.config.file.)
(3) Add TLS: generate a self-signed cert (openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/nginx/prometheus.key -out /etc/nginx/prometheus.crt) and configure Nginx for HTTPS.
(4) Lock down mutating endpoints: make sure Prometheus is not started with --web.enable-admin-api (which exposes /api/v1/admin/tsdb/delete_series and friends) or --web.enable-lifecycle (which exposes /-/reload and /-/quit); both are disabled by default.
(5) Audit: review reverse-proxy access logs and, if query_log_file was set in the global config, the Prometheus query log to reconstruct which queries the attacker ran and how much data they pulled.
(6) Remediate: if metrics contain secrets (passwords in labels), add metric_relabel_configs to drop sensitive labels: metric_relabel_configs: [ { regex: '.*password.*|.*api_key.*', action: 'labeldrop' } ]. This only affects newly ingested samples, not data already on disk.
(7) Rotate credentials: treat every secret that appeared in labels or annotations (database passwords, API keys) as compromised; notify the owning teams and rotate them.
(8) Long-term: implement proper IAM/RBAC, VPN-only access, and audit logging for all Prometheus queries.
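Step 6 as a prometheus.yml fragment (the job name and target are placeholders; note that relabel regexes are anchored, so the pattern must match the full label name):

```yaml
scrape_configs:
  - job_name: 'app'                 # placeholder job
    static_configs:
      - targets: ['app:8080']       # placeholder target
    metric_relabel_configs:
      # labeldrop matches label NAMES, not values; drop any label whose
      # name contains "password" or "api_key" before ingestion.
      - regex: '.*password.*|.*api_key.*'
        action: labeldrop
```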

Follow-up: If an attacker downloaded Prometheus TSDB blocks, can they extract time-series data offline?

You're setting up TLS for Prometheus scraping targets. Your targets use self-signed certificates. Prometheus is rejecting scrapes with "x509: certificate signed by unknown authority". How do you configure Prometheus to accept self-signed certs securely?

TLS verification rejects self-signed certificates by default. Secure workarounds:
(1) Trust the signing CA: obtain the CA certificate that signed the target's cert. In Kubernetes, store it in a Secret: kubectl create secret generic target-ca --from-file=ca.crt=/path/to/ca.crt -n monitoring.
(2) Point the scrape config at it: scrape_configs: [ { job_name: 'target', scheme: 'https', tls_config: { ca_file: '/etc/prometheus/certs/ca.crt' }, static_configs: [{ targets: ['target:8443'] }] } ].
(3) If the certificate is truly self-signed (it is its own CA), fetch it from the target and use it as the CA file: openssl s_client -connect target:8443 -showcerts < /dev/null | openssl x509 -outform PEM > cert.pem.
(4) Temporary workaround (never in production): tls_config: { insecure_skip_verify: true } disables verification entirely and is vulnerable to MITM; use only for short-lived testing. (prometheus.yml uses snake_case keys; the camelCase insecureSkipVerify spelling belongs to the Prometheus Operator CRDs.)
(5) For mutual TLS (client authentication), add cert_file and key_file: tls_config: { ca_file: '/etc/prometheus/certs/ca.crt', cert_file: '/etc/prometheus/certs/client.crt', key_file: '/etc/prometheus/certs/client.key' }.
(6) Distribute certs: mount the Secret as a volume in the Prometheus pod. In the StatefulSet: volumeMounts: [ { name: 'tls-certs', mountPath: '/etc/prometheus/certs' } ]; volumes: [ { name: 'tls-certs', secret: { secretName: 'target-ca' } } ].
(7) Certificate rotation: use cert-manager or a sidecar to refresh certs before expiry. Prometheus reloads its configuration on SIGHUP or via POST /-/reload (when started with --web.enable-lifecycle).
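Steps 2 and 5 combined in a prometheus.yml sketch (paths and the target address are placeholders; note the snake_case keys):

```yaml
scrape_configs:
  - job_name: 'target'
    scheme: https
    static_configs:
      - targets: ['target:8443']    # placeholder target
    tls_config:
      # CA that signed the target's certificate (step 2).
      ca_file: /etc/prometheus/certs/ca.crt
      # Optional mutual TLS: present a client certificate (step 5).
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key
```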

Follow-up: If a target certificate expires and you don't rotate it, how long before Prometheus scrape fails?

You're implementing Prometheus authentication in production. You have 3 different groups (developers, ops, compliance) with different query permissions: developers can query any metric, ops can query infrastructure metrics only, compliance can query audit metrics only. How do you implement fine-grained RBAC?

Prometheus has no built-in RBAC; all queries are unrestricted. Implement RBAC via a proxy:
(1) Put a proxy layer (e.g., Nginx, Trickster, or a custom Go service) in front of Prometheus. The proxy authenticates users and checks permissions before forwarding queries.
(2) Nginx setup: location /api/v1/query { auth_request /auth; proxy_pass http://prometheus:9090; }. The /auth endpoint checks the user and their allowed metrics.
(3) Custom proxy: intercept PromQL queries, parse the expressions, and verify the user may access the queried metrics. Example role matrix: { developer: { allowed_metrics: ['.*'] }, ops: { allowed_metrics: ['node_.*', 'container_.*'] }, compliance: { allowed_metrics: ['audit_.*'] } }.
(4) Label-based filtering: tag metrics with an access-level label (team, sensitivity) and have the proxy append label filters automatically based on the user's role. Example: 'http_requests_total' is rewritten to 'http_requests_total{team=~"users_team"}' for that user.
(5) Grafana as the RBAC layer: Grafana has built-in RBAC (org, team, dashboard permissions). Users reach Prometheus only through Grafana dashboards, never directly; each dashboard is restricted to specific roles.
(6) Kubernetes NetworkPolicy: if Prometheus runs in K8s, restrict who can reach the Prometheus API so that only the proxy can.
(7) Audit logging: log all queries with user, IP, and timestamp, via Nginx access logs or the custom proxy.
(8) Per-metric encryption (optional): encrypt sensitive metrics at rest. Requires external key management and a decryption proxy; adds significant complexity.
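The allowlist check from step 3 can be sketched in Python (role names and metric patterns follow the example matrix above; a real proxy would parse the full PromQL expression rather than check one metric name at a time):

```python
import re

# Role -> allowed metric-name patterns, taken from the example role matrix.
ROLE_ALLOWED = {
    "developer": [r".*"],
    "ops": [r"node_.*", r"container_.*"],
    "compliance": [r"audit_.*"],
}

def is_allowed(role: str, metric: str) -> bool:
    """True if `role` may query `metric` (full-match against its allowlist).
    Unknown roles have an empty allowlist and are denied everything."""
    return any(re.fullmatch(p, metric) for p in ROLE_ALLOWED.get(role, ()))
```

For example, is_allowed("ops", "node_cpu_seconds_total") passes while is_allowed("compliance", "node_cpu_seconds_total") is denied; fullmatch (rather than search) keeps a pattern like 'node_.*' from accidentally matching 'mynode_secret'.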

Follow-up: If a proxy implements RBAC and denies a query, should it return 403 Forbidden or just no data?

You're implementing API authentication for Prometheus using bearer tokens. Developers need to query Prometheus programmatically. How do you issue and manage bearer tokens securely?

Bearer token authentication for programmatic access:
(1) Token generation: use a token management service or an external identity provider (Vault, AWS IAM, Keycloak). For development, JWTs (JSON Web Tokens) work: a header of { alg: 'HS256' } and claims like { sub: 'developer_app', scope: 'query:prometheus', exp: 1735689600 }.
(2) Token storage: never hardcode tokens in code. Store them in: (a) environment variables (development only); (b) Kubernetes Secrets: kubectl create secret generic prometheus-token --from-literal=token=...; (c) HashiCorp Vault, with the app retrieving the token at runtime; (d) AWS Secrets Manager / GCP Secret Manager for cloud-native setups.
(3) Token validation: Nginx can validate tokens via a Lua script (requires OpenResty or the ngx_http_lua_module) or an external authentication service. Example: location /api/v1/query { access_by_lua_file /usr/local/nginx/lua/validate_token.lua; proxy_pass http://prometheus:9090; }, with the script reading the Authorization header.
(4) Token scope: issue tokens with specific scopes (query, alert, admin) and restrict endpoints accordingly. Example: 'query:prometheus' allows GET on /api/v1/query; 'admin:prometheus' allows the /api/v1/admin/tsdb endpoints.
(5) Token expiration: short-lived tokens (e.g., 1 hour) plus a refresh token for long-term access; the app exchanges expired token + refresh_token for a new token.
(6) Rate limiting: apply limits per token to prevent abuse, e.g. Nginx limit_req keyed on the token ID.
(7) Monitoring: log all authentication attempts; alert on failures (invalid or expired tokens) and track per-token usage to detect anomalies.
(8) Revocation: maintain a blacklist of tokens revoked before their expiration and check it during validation.
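A minimal HS256 JWT issue/validate sketch for steps 1 and 4, using only the standard library (claim names follow the example above; production code should use a maintained JWT library and a blacklist check as in step 8):

```python
import base64
import hashlib
import hmac
import json
import time

def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(secret: bytes, sub: str, scope: str, ttl: int = 3600) -> str:
    """Build a signed HS256 JWT: header.payload.signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": sub, "scope": scope,
                               "exp": int(time.time()) + ttl}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def validate_token(secret: bytes, token: str, required_scope: str) -> bool:
    """Check signature, expiry, and scope; False on any failure."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):   # constant-time compare
        return False
    claims = json.loads(
        base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    return claims["exp"] > time.time() and claims["scope"] == required_scope
```

A validating proxy would call validate_token with the value from the Authorization: Bearer header and reject the request (401/403) when it returns False.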

Follow-up: If a bearer token is leaked, how do you quickly revoke it across all Prometheus replicas?

You're running Prometheus in a Kubernetes cluster with a network policy that restricts traffic to the Prometheus namespace. However, alertmanager pods (in a different namespace) need to query Prometheus for webhook validation. How do you implement cross-namespace network access securely?

NetworkPolicies are additive: pods with no policy accept all traffic, but once any policy selects a pod, everything not explicitly allowed to it is denied. For cross-namespace access:
(1) Create an explicit NetworkPolicy that allows Alertmanager → Prometheus: apiVersion: networking.k8s.io/v1, kind: NetworkPolicy, metadata: { namespace: 'monitoring', name: 'allow-alertmanager' }, spec: { podSelector: { matchLabels: { app: 'prometheus' } }, policyTypes: ['Ingress'], ingress: [ { from: [ { podSelector: { matchLabels: { app: 'alertmanager' } }, namespaceSelector: { matchLabels: { name: 'alerting' } } } ], ports: [ { protocol: 'TCP', port: 9090 } ] } ] }.
(2) This allows pods labeled app=alertmanager in the namespace labeled name=alerting to reach pods labeled app=prometheus in namespace monitoring on port 9090. Combining podSelector and namespaceSelector in one 'from' entry ANDs them; the namespace must actually carry the name=alerting label.
(3) To restrict by source IP instead, add a separate 'from' entry with ipBlock: { cidr: '10.0.0.0/8' } (ipBlock cannot be combined with pod/namespace selectors in the same entry).
(4) TLS enforcement: require TLS between Alertmanager and Prometheus, e.g. an Envoy sidecar terminating TLS on both sides.
(5) Service-to-service authentication (mTLS): use Istio or Linkerd; the mesh sidecars enforce mTLS transparently, so Alertmanager talks TLS to Prometheus without config changes.
(6) Firewall rules: on cloud (AWS, GCP), add security group/firewall rules in addition to the NetworkPolicy. Example: AWS source SG (Alertmanager) to destination SG (Prometheus) on port 9090.
(7) Monitoring: track request volume with rate(prometheus_http_requests_total[5m]) and log connections to Prometheus; alert if unexpected sources attempt to connect.
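The policy from step 1 written out as a manifest (labels and namespace names follow the example above; the name=alerting namespace label is an assumption and must exist on the alerting namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alertmanager
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus          # policy applies to Prometheus pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # podSelector AND namespaceSelector in one entry: only pods
        # labeled app=alertmanager in namespaces labeled name=alerting.
        - podSelector:
            matchLabels:
              app: alertmanager
          namespaceSelector:
            matchLabels:
              name: alerting
      ports:
        - protocol: TCP
          port: 9090
```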

Follow-up: If NetworkPolicy is created but Alertmanager still can't reach Prometheus, what debugging steps do you take?

Your organization requires compliance auditing: every query against sensitive metrics (audit_*, security_*) must be logged with user, timestamp, query, and result size. Prometheus has no built-in audit logging. How do you implement query audit logging?

Prometheus query audit logging requires a proxy or wrapper:
(1) Proxy approach: intercept all /api/v1/query* requests via Nginx or a custom service. Log timestamp, user (from the auth header), query, remote IP, response code, and response size. Example Nginx config: log_format prometheus '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" query="$query_string"'; access_log /var/log/nginx/prometheus_audit.log prometheus;
(2) Custom proxy: parse PromQL and detect sensitive metrics. If a query touches 'audit_.*' or 'security_.*', log additional fields (query complexity, estimated sample count) and forward only if the user has permission.
(3) Native query logging: set query_log_file in the global section of prometheus.yml (available since 2.16); Prometheus then writes every query as a JSON line with timestamp and client parameters. This captures the query but not your proxy-level user identity. Running with --log.level=debug also logs request details but is far too noisy for production.
(4) External audit service: after each query, the proxy sends an audit event: POST /audit { user, timestamp, query, result_size, status }.
(5) Centralized logging: forward audit logs to ELK (Elasticsearch), Splunk, or cloud logging (Stackdriver, CloudWatch) and query them independently of Prometheus.
(6) Long-term storage: audit logs must be immutable and stored separately from Prometheus, e.g. S3 with versioning/object lock or a dedicated audit DB.
(7) Monitoring: alert on suspicious query patterns (e.g., one user querying all audit_* metrics within an hour, or extracting large result sets); track query volume per user.
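The sensitive-metric detection from step 2 can be sketched in Python (the audit_/security_ prefixes come from the requirement above; field names in the JSON record are illustrative):

```python
import json
import re
import time

# Metric-name prefixes considered sensitive per the compliance requirement.
SENSITIVE = re.compile(r"\b(?:audit|security)_[a-zA-Z0-9_:]*")

def audit_record(user: str, query: str, status: int, result_bytes: int):
    """Return a JSON audit line if the query touches sensitive metrics,
    else None (non-sensitive queries are not audited)."""
    hits = sorted(set(SENSITIVE.findall(query)))
    if not hits:
        return None
    return json.dumps({
        "ts": int(time.time()),          # audit timestamp
        "user": user,                    # from the proxy's auth layer
        "query": query,                  # raw PromQL string
        "matched_metrics": hits,         # which sensitive metrics were hit
        "status": status,                # upstream HTTP status
        "result_bytes": result_bytes,    # response size
    })
```

The proxy would call this after each upstream response and append the non-None lines to an append-only log shipped to the centralized store from step 5. A regex scan is a coarse filter; a PromQL parser is more robust against obfuscated queries.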

Follow-up: If an attacker queries Prometheus directly, bypassing the audit proxy, how do you detect and prevent this?

You're implementing OAuth2 integration for Prometheus access. Users should authenticate via a corporate OAuth provider (Okta, GitHub, Google). After login, they get a session and can query Prometheus. How do you implement OAuth2 + Prometheus?

OAuth2 integration requires a reverse proxy with OAuth support:
(1) Use oauth2-proxy: an open-source reverse proxy that handles the OAuth2 flow. Deploy it as a sidecar or standalone: oauth2-proxy --provider=oidc --client-id=... --client-secret=... --cookie-secret=... --redirect-url=https://prometheus.company.com/oauth2/callback --upstream=http://prometheus:9090.
(2) OIDC provider setup: register Prometheus as an application in your OAuth provider (Okta, GitHub, Google) and supply: (a) redirect URL https://prometheus.company.com/oauth2/callback; (b) scopes openid, profile, email; (c) the client ID/secret issued by the provider.
(3) Proxy flow: (a) user visits https://prometheus.company.com; (b) proxy finds no session and redirects to the OAuth provider; (c) user logs in at the provider; (d) provider redirects to /oauth2/callback with an authorization code; (e) proxy exchanges the code for tokens and stores the session in an encrypted cookie; (f) proxy forwards requests to Prometheus.
(4) Session management: use encrypted cookies (oauth2-proxy's default cookie is named _oauth2_proxy, encrypted with the cookie secret) or a Redis session backend; rotate the cookie secret periodically.
(5) Group/role mapping: extract groups from the OAuth token (via custom claims), then use nginx auth_request to authorize. Example: allow access if the user's OAuth group is 'ops', otherwise deny.
(6) Scope restriction: request minimal OAuth scopes, e.g. profile and email only, never admin or write scopes.
(7) Monitoring: log all OAuth authentication events (success, failure, token expiration); alert on repeated failed logins (potential brute force).
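The nginx wiring for step 5 typically looks like this (a sketch based on oauth2-proxy's documented auth_request integration; upstream names, ports, and TLS setup are placeholders):

```nginx
server {
  listen 443 ssl;
  server_name prometheus.company.com;

  # OAuth2 flow endpoints (sign_in, callback) served by oauth2-proxy.
  location /oauth2/ {
    proxy_pass http://oauth2-proxy:4180;
    proxy_set_header X-Forwarded-Uri $request_uri;
  }

  # Subrequest-only endpoint used by auth_request; no body forwarded.
  location = /oauth2/auth {
    proxy_pass http://oauth2-proxy:4180;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
  }

  # Every Prometheus request is gated on a valid session.
  location / {
    auth_request /oauth2/auth;
    error_page 401 = /oauth2/sign_in;
    proxy_pass http://prometheus:9090;
  }
}
```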

Follow-up: If an OAuth provider (Okta) goes down, can users still access cached sessions in Prometheus?

You have multiple Prometheus instances (per-team shards), each with different access controls. Your data is multi-tenant: team A's metrics should never be visible to team B. However, a shared dashboard needs cross-team metrics (aggregated view for management). How do you balance isolation and visibility?

Multi-tenant Prometheus requires careful architecture:
(1) Isolation model: each team runs its own Prometheus instance holding only its metrics, with external_labels: { team: 'team_a' } and so on. Team A's instance has no visibility into team B's data.
(2) Query proxy with role-based routing: the proxy checks the user's role and routes to the matching instance. User 'team_a_member' queries only team A's Prometheus; 'admin' may query several or all instances.
(3) Aggregation for cross-team dashboards: (a) run a separate aggregation Prometheus (or Thanos) with read-only access to all team instances; (b) front every team instance with a proxy that blocks write operations (config reload, data deletion); (c) the aggregation instance pulls from each team Prometheus via the /federate endpoint.
(4) Fine-grained access control with a role matrix: { 'team_a_member': ['team_a:query'], 'manager': ['team_a:query', 'team_b:query', 'team_c:query'], 'admin': ['*:query', '*:admin'] }.
(5) Data encryption: for additional protection, encrypt team data at rest. Mimir/Cortex provide native multi-tenancy; single-instance Prometheus needs encryption at the disk or object-store layer.
(6) Audit logging: log cross-team queries; alert if a team A member attempts to query team B's Prometheus.
(7) Network isolation: each team Prometheus sits in its own network segment or VPC with no direct path between team instances; the aggregation instance lives in a separate, trusted network.
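Step 3c as a scrape config on the aggregation instance (instance addresses and the job: recording-rule prefix are assumptions; federating only pre-aggregated series keeps the cross-team view coarse and the transfer small):

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true           # keep the team instances' external_labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}' # only aggregated recording-rule series
    static_configs:
      - targets:
          - 'prometheus-team-a:9090'   # placeholder team instances
          - 'prometheus-team-b:9090'
```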

Follow-up: If a manager's credentials are leaked, how do you limit blast radius (they have access to all team Prometheus instances)?
