Prometheus Interview Questions

Cardinality Explosion and Prevention


Microservice exports metrics with incrementing timestamp label. Cardinality explodes 5M → 500M in 6 hours. Service offline. Recover?

Execute rapid remediation: (1) Stop the offending scrape target immediately by commenting out its job in prometheus.yml and reloading via SIGHUP (or `POST /-/reload`). (2) Drop the runaway label at scrape time with metric relabeling: `metric_relabel_configs: [{action: labeldrop, regex: timestamp}]` — note that the `drop` action discards whole series, while `labeldrop` removes only the label. (3) Delete the existing series via the TSDB admin API (Prometheus must be started with `--web.enable-admin-api`): `POST /api/v1/admin/tsdb/delete_series?match[]={exporter="bad_service"}`. (4) Reclaim disk and index space with `POST /api/v1/admin/tsdb/clean_tombstones`. (5) Verify recovery with `count({__name__=~".+"})` and `prometheus_tsdb_head_series`.
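Step (2) can be sketched as a scrape config fragment (the job name and target are hypothetical; only the `metric_relabel_configs` block matters):

```yaml
scrape_configs:
  - job_name: bad_service            # hypothetical job name
    static_configs:
      - targets: ['bad-service:9100']
    metric_relabel_configs:
      # Strip the ever-changing label; series with identical
      # remaining labels collapse back into one series.
      - action: labeldrop
        regex: timestamp
```

After reloading, only newly scraped samples are affected; the admin-API deletion in steps (3) and (4) is still needed to remove the series already in the TSDB.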

Follow-up: Design metric validation framework catching violations before deployment?

Kubernetes monitoring scrapes container labels (git SHA, build ID, timestamp) changing frequently. Cardinality explodes. Design relabeling.

Intelligent relabeling: (1) Drop high-cardinality labels (git SHA, build ID, timestamp) at scrape time with `metric_relabel_configs` using `labeldrop`. (2) Where an allowlist is safer than a blocklist, invert the logic with `labelkeep` so only approved labels survive. (3) Map the high-cardinality Pod UID into a fixed number of buckets with the `hashmod` relabel action. (4) If reverse lookup is required, store the pod-to-bucket mapping externally in etcd. (5) Expose pod metadata as an info metric, `kubernetes_pod_info{pod_id_hash="42"}`, and attach it at query time with a `group_left()` join. (6) For per-pod debugging, query logs via Loki, which handles high cardinality better.
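The allowlist in step (2) can be sketched as follows (the label names in the regex are illustrative; note that `labelkeep` applies to every label on a series, so `__name__`, `job`, and `instance` must be included or they are stripped too):

```yaml
metric_relabel_configs:
  # Keep only approved labels; git_sha, build_id, timestamp
  # and anything else unlisted is dropped.
  - action: labelkeep
    regex: "__name__|job|instance|namespace|pod_id_hash|le|quantile"
```

An allowlist fails closed: a new high-cardinality label added by a deploy is dropped by default instead of silently inflating the index.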

Follow-up: Automatically suggest relabeling rules based on detected patterns?

SaaS exporter with user_id label. User base grows 10K → 500K over 3 months. Prometheus memory-constrained. Scale?

Progressive scaling: (1) Immediate: drop user_id via `metric_relabel_configs` if it is not essential; if a per-user dimension is required, hash it into a bounded number of buckets with `hashmod`. (2) Mid-term: pre-aggregate with recording rules so dashboards hit low-cardinality series instead of raw ones. (3) Deploy Prometheus federation or sharding: split scrape targets across instances and use Thanos or Cortex for global queries. (4) Use downsampling (e.g. via the Thanos compactor): keep about 1 week of raw high-cardinality data, retain downsampled data longer. (5) Long-term: migrate to a horizontally scalable backend (Cortex/Mimir) with per-tenant series limits.
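The bucketing in step (1) can be sketched with the `hashmod` relabel action (job name, target, and modulus are illustrative):

```yaml
scrape_configs:
  - job_name: saas_exporter          # hypothetical job name
    static_configs:
      - targets: ['exporter:9100']
    metric_relabel_configs:
      # Hash user_id into 100 stable buckets...
      - source_labels: [user_id]
        action: hashmod
        modulus: 100
        target_label: user_bucket
      # ...then drop the raw label, capping cardinality at 100
      # regardless of how large the user base grows.
      - action: labeldrop
        regex: user_id
```

The hash is stable, so a given user always lands in the same bucket, which keeps rates and histograms per bucket meaningful over time.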

Follow-up: Design tiered query architecture handling 500K+ unique labels?

Customer requests tracking each HTTP request with request_id label. Creates millions of series/second. Explain impossibility and alternatives.

Explain limitation: Request IDs have unbounded cardinality. At 1,000 req/sec you create 1,000 new series per second, and Prometheus' TSDB keeps every active series in memory, indexed by its labels. An instance sized for roughly 1M active series is exhausted in under 17 minutes at that rate; no amount of headroom survives unbounded growth. This breaks Prometheus' core design assumption of bounded cardinality. Suggest alternatives: (1) Distributed tracing (Jaeger, Tempo) for request-level telemetry. (2) Structured logging (ELK, Splunk, Loki). (3) Exemplars: attach request_id to individual samples as trace exemplars without creating series. (4) Aggregate metrics: counts, percentiles, errors per endpoint (bounded). Recommended: Prometheus (metrics) + Jaeger (traces) + Loki (logs).
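Alternative (4) can be sketched as recording rules that keep only bounded dimensions (the underlying metric names are assumed to follow standard client-library conventions):

```yaml
groups:
  - name: http_aggregates
    rules:
      # Request rate per endpoint and status class: bounded cardinality.
      - record: job:http_requests:rate5m
        expr: sum by (job, endpoint, status) (rate(http_requests_total[5m]))
      # p99 latency per endpoint from histogram buckets.
      - record: job:http_request_duration_seconds:p99_5m
        expr: >
          histogram_quantile(0.99,
            sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The customer still gets per-request visibility, but through traces and logs keyed by request_id, with these aggregates answering the "how is the endpoint doing overall" questions.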

Follow-up: Design end-to-end correlation between metrics, traces, logs?

Designing SaaS platform monitoring for 1000 customers, each with 100K services and unique IDs. Naive approach = 100B series. Architecture?

Multi-tenant cardinality-aware: (1) Use Cortex or Mimir: multi-tenant Prometheus with per-tenant limits, horizontal scaling, sharding. (2) Tenant isolation: give each customer a separate tenant (or Prometheus instance). (3) Metric relabeling enforcing labeling contracts per tenant. (4) Write-time enforcement: validate cardinality before ingestion and reject writes that exceed limits. (5) Aggregation tiers: Tier 1 (raw, 7 days) + Tier 2 (downsampled, 30 days). (6) Shard the largest customers into dedicated clusters per region. (7) Bill by cardinality to align incentives. (8) Label mapping: replace long UUIDs with short integer IDs — cardinality is unchanged, but per-series index and symbol-table overhead drops dramatically.
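Per-tenant limits in step (1) can be sketched as a runtime overrides file; the field names follow Mimir's limits configuration, but exact names vary by version and should be checked against the release you run:

```yaml
# runtime overrides file, reloaded without restart
overrides:
  customer-a:                        # hypothetical tenant ID
    max_global_series_per_user: 300000
    max_global_series_per_metric: 20000
  customer-b:
    max_global_series_per_user: 1000000
```

Writes above the limit are rejected at ingestion with a per-tenant error, so one customer's explosion cannot degrade the other 999.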

Follow-up: Implement automatic cardinality quota enforcement with self-service overages?

Kubernetes: add service with 10K replicas. Each pod exports unique pod_id label. Need for debugging but worried about cardinality. Handle?

Pod-level isolation: (1) Hash pod IDs into low-cardinality buckets with the `hashmod` relabel action (10K pods → 100 buckets). (2) Store the pod-to-bucket mapping externally in etcd or a database for reverse lookup. (3) Expose full pod metadata as an info metric, `kubernetes_pod_info{pod_id_hash="42", pod_name="service-xyz"}`, and join it onto real metrics at query time with `group_left()`. (4) Alternatively, assign short sequential IDs and reuse them as pods churn. (5) For per-pod debugging, query logs via Loki, which handles high cardinality better.
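The info-metric join in step (3) can be sketched as a recording rule (metric names here are illustrative; the pattern is the standard `group_left` info-metric join):

```yaml
groups:
  - name: pod_joins
    rules:
      # CPU rate per bucket, enriched with pod_name from the
      # info metric at query time -- pod_name never becomes a
      # label on the high-volume CPU series themselves.
      - record: pod:container_cpu:rate5m_named
        expr: |
          sum by (pod_id_hash) (rate(container_cpu_usage_seconds_total[5m]))
            * on (pod_id_hash) group_left (pod_name)
          kubernetes_pod_info
```

The expensive metrics stay bounded at 100 buckets, while the single cheap info series per pod carries the human-readable identity.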

Follow-up: Design automated pod_id deduplication reusing IDs as pods churn?

Auditing exporters: legacy Java app exports 1000+ label combinations (thread ID). Fixing takes 3 months. Mitigate now?

Immediate tactical fixes: (1) Drop thread_id with `metric_relabel_configs` (`labeldrop`). (2) If a thread dimension is still needed, bucket thread IDs with the `hashmod` action (e.g. modulus 10), collapsing 1000+ values into 10 groups — roughly a 100x reduction. (3) Monitor: alert when a single job's series count exceeds a threshold such as 100K. (4) Progressive remediation: agree a staged upgrade plan with the owning team within the 3-month window. (5) Stage the relabeling on a canary Prometheus a week before the fix ships. (6) Automated detection: track per-job cardinality with `count by (job) ({__name__=~".+"})` or the per-target `scrape_samples_scraped` metric.
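The guardrail in step (3) can be sketched as an alerting rule; the threshold is arbitrary, and `scrape_samples_scraped` is used as a per-target proxy for active series:

```yaml
groups:
  - name: cardinality_guards
    rules:
      - alert: ExporterCardinalityHigh
        # Total samples per scrape across all targets of a job.
        expr: sum by (job) (scrape_samples_scraped) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} is exporting over 100K samples per scrape"
```

Firing on the trend (sustained for 15m) rather than a single scrape avoids paging on transient deploys while still catching a runaway exporter within one evaluation cycle.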

Follow-up: Build automated exporter validation framework?

Add detailed error tracing to monitoring via stack trace label. Unbounded cardinality. Retain this data somehow?

Separate trace data from metrics: (1) Do NOT add stack traces as metric labels (cardinality explosion). (2) Use exemplars: attach trace IDs to individual samples without creating new series; enable with the `--enable-feature=exemplar-storage` flag and size the store via `storage.exemplars.max_exemplars` in prometheus.yml. (3) Run distributed tracing separately: OpenTelemetry → Jaeger/Tempo. (4) For stack traces specifically, use continuous profiling tools: Parca, Pyroscope. (5) Smart sampling: trace only errors or high-latency requests (e.g. 1% baseline sampling). (6) Log aggregation: ship stack traces to ELK/Splunk/Loki. (7) Standard separation: metrics (low cardinality, queryable) + traces (high cardinality, linked via exemplars) + logs (unstructured).
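The exemplar setup in step (2) can be sketched as a config fragment (the capacity value is illustrative; verify the field against the Prometheus version in use):

```yaml
# prometheus.yml -- requires starting Prometheus with
#   --enable-feature=exemplar-storage
storage:
  exemplars:
    # Circular buffer of exemplars shared across all series;
    # old exemplars are evicted, series cardinality is untouched.
    max_exemplars: 100000
```

Client libraries then attach a trace_id to histogram observations, and Grafana can link from a latency panel straight to the corresponding trace in Tempo or Jaeger.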

Follow-up: Educate teams on limits of metrics vs. traces vs. logs?
