Prometheus Interview Questions

High Availability with Thanos or Mimir

Your Prometheus instance is a single point of failure. If it crashes, all monitoring and alerting stops. You need HA: continue scraping and alerting even during maintenance or failure. How do you architect Prometheus HA?

Prometheus HA requires multiple replicas scraping the same targets independently:

(1) Basic setup: run 2-3 Prometheus instances in parallel, each scraping all targets. Each has its own TSDB (no replication between them). Queries can hit any one instance; results are near-identical, differing only in scrape timestamps.

(2) Load balancer: place Nginx, HAProxy, or a cloud load balancer in front: Queries → Load Balancer → Prometheus 1, 2, 3. If one instance is down, the load balancer routes to the others.

(3) Alertmanager HA: alerts are deduplicated at the Alertmanager level (not in Prometheus). All Prometheus instances send alerts to all Alertmanager replicas; Alertmanager deduplicates and routes to receivers (PagerDuty, Slack).

(4) External labels: each Prometheus instance should set external_labels, e.g. { prometheus_replica: 'a', cluster: 'prod' }. This distinguishes replicas and enables federation and query-time deduplication.

(5) Deduplication at the query layer: Prometheus itself does not deduplicate across replicas. A query layer such as Thanos Querier merges series from both replicas (keyed on the label set minus the replica label) so only one copy is returned.

(6) For alerting: use Alertmanager's grouping. Configure group_by: [ 'alertname', 'cluster' ] and leave out 'prometheus_replica', so alerts from replica A and replica B are grouped together.

(7) Scrape efficiency: two replicas scraping the same targets doubles the scrape load on those targets. This is acceptable for small setups; at large scale, push deduplication into a query/storage layer (Thanos, Mimir).

(8) Failover: if Prometheus 1 fails, Prometheus 2 continues scraping and queries still work. After Prometheus 1 recovers, both scrape again. Alerting never stopped, but the recovered replica has a gap in its local data for the outage window.
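A minimal sketch of the per-replica configuration described above (job names, targets, and the Alertmanager addresses are illustrative):

```yaml
# prometheus-replica-a.yml -- replica B is identical except prometheus_replica: 'b'
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'prod'
    prometheus_replica: 'a'   # distinguishes this replica's series

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
```

Both replicas scrape the same scrape_configs; only the prometheus_replica external label differs, which is what a query layer or Alertmanager later uses to collapse the duplicates.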

Follow-up: If two Prometheus replicas scrape the same target and one is slow, does the load balancer route queries to the faster one?

You've set up 2 Prometheus replicas, each storing 100GB of data. Alerting is configured on both. However, you want alerts to fire only once, not from both replicas. How do you deduplicate alerts in Prometheus HA?

Alert deduplication happens at Alertmanager, not Prometheus. Setup:

(1) Each Prometheus sends alerts to the same Alertmanager (or Alertmanager HA cluster), configured under the alerting section: alerting: { alertmanagers: [ { static_configs: [ { targets: [ 'alertmanager:9093' ] } ] } ] }.

(2) Alertmanager groups alerts by the route's group_by labels and deduplicates within each group.

(3) Alertmanager config: route: { group_by: [ 'alertname', 'severity', 'instance' ] }. Alerts with the same name, severity, and instance collapse into one, including duplicates arriving from both Prometheus replicas.

(4) For true deduplication across replicas, Alertmanager must receive the alerts from both replicas with identical labels. If Prometheus 1 fires 'up{instance="x"} == 0' and Prometheus 2 fires the same alert, Alertmanager merges them into one.

(5) If you want to track which replica sent an alert, keep the replica label (the external label prometheus_replica: 'a' is attached to outgoing alerts). But if prometheus_replica takes part in grouping, each replica produces a separate alert; to deduplicate, exclude prometheus_replica from group_by.

(6) For HA Alertmanager: deploy 2-3 Alertmanager replicas with gossip clustering (--cluster.* flags). Notification state and silences are synced across replicas via gossip, so only one replica notifies.

(7) Monitoring: watch alertmanager_alerts_received_total. If the count is 2x what you expect, deduplication isn't working.

(8) Testing: trigger an alert from one monitored target (e.g. kill a monitored process), then verify only one PagerDuty incident is created, not two.
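A hedged sketch of the Alertmanager side (receiver name, timings, and the routing key are placeholders for your own values):

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity', 'instance']  # no prometheus_replica here,
                                                   # so both replicas' copies collapse
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-routing-key>'
```

The crucial line is group_by: as long as the replica-identifying label is absent, identical alerts from replica A and replica B land in the same group and produce one notification.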

Follow-up: If Alertmanager receives alerts from Prometheus replica A and B with different timestamps, which timestamp is used in the alert?

You're running Prometheus HA for 6 months. Each instance has 30 days of metrics, totaling 3TB of data (2 replicas × 1.5TB each). You want to query 6-month history for a compliance audit. How do you implement long-term retention in HA Prometheus?

Prometheus HA doesn't scale to long-term retention. Each replica keeps only 30 days locally (or whatever your retention policy is). For 6-month history:

(1) Thanos + HA Prometheus: deploy a Thanos sidecar on each Prometheus replica. Sidecars upload completed 2-hour blocks to S3. The Thanos compactor merges and deduplicates blocks. Deploy the Thanos querier as the query interface. Architecture: Prometheus 1 (30d local) + sidecar → S3; Prometheus 2 (30d local) + sidecar → S3; Thanos compactor (merges); Thanos querier (fans out to S3 for the 6-month history and to local Prometheus for recent data).

(2) Query routing: recent queries (< 30d) are served by local Prometheus (fast). Older queries (> 30d) are served from S3 via the store gateway (slower). Thanos deduplicates data from both replicas automatically.

(3) Cost: S3 Standard for a ~1.5TB footprint ≈ 1,500GB × $0.023/GB/month ≈ $35/month, on the order of $200 over 6 months; deduplication and downsampling reduce this further.

(4) Mimir alternative: use Mimir (Grafana Cloud or self-hosted). All Prometheus replicas remote_write to Mimir. Mimir handles HA deduplication (via replica external labels), replication (factor 3 by default), and long-term storage in object storage. Conceptually simpler than Thanos but adds its own operational overhead (running distributors, ingesters, queriers).

(5) Cold storage: for rarely queried compliance data, archive to cheaper storage classes (S3 Glacier, GCS Coldline) via S3 lifecycle rules, e.g. after 90 days. Thanos cannot read Glacier-class objects directly; archived blocks must be restored before querying.

(6) Compliance requirement: if the audit requires exact data (not approximations), ensure no data loss. Thanos + HA is safe (blocks are uploaded long before local retention expires). Mimir is safe (replicated). A single Prometheus with remote_write is risky: if the remote endpoint is down for longer than the WAL can buffer, samples are dropped.
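The storage-cost arithmetic in (3) can be checked directly; the footprint and price are the assumptions stated above (S3 Standard, steady 1.5TB):

```python
# Back-of-envelope S3 cost for 6 months of Thanos blocks.
# Assumes a steady 1.5 TB footprint at $0.023 per GB-month (S3 Standard pricing).
GB_PER_TB = 1000  # S3 bills per GB

size_gb = 1.5 * GB_PER_TB
price_per_gb_month = 0.023

monthly_cost = size_gb * price_per_gb_month
six_month_cost = monthly_cost * 6

print(f"monthly: ${monthly_cost:.2f}, 6 months: ${six_month_cost:.2f}")
# monthly: $34.50, 6 months: $207.00
```

In practice the footprint grows over the 6 months rather than staying flat, so the real bill lands somewhere below this steady-state upper bound once compaction and downsampling are factored in.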

Follow-up: If Thanos queries data from S3 and S3 is down, what happens to queries for old metrics?

You're planning a Mimir deployment to centralize monitoring for 100 teams. Each team should see only their metrics (multi-tenancy). Teams have different SLAs (team A needs 99.9% uptime, team B can tolerate 99%). How do you architect multi-tenant Mimir?

Mimir multi-tenancy architecture:

(1) Tenant isolation: each team gets a unique tenant ID (carried in the X-Scope-OrgID header on every request). Mimir partitions data by tenant internally; team A's queries can't touch team B's data.

(2) Data ingestion: each team's Prometheus remote_writes to Mimir with authentication, e.g.: remote_write: { url: 'http://mimir:9009/api/v1/push', basic_auth: { username: 'team_a', password: 'secret_token' } }. An auth gateway in front of Mimir maps the credentials to the tenant ID (Mimir itself only reads X-Scope-OrgID); the exact port and path depend on your deployment.

(3) Query auth: queries must carry the tenant header: curl -H 'X-Scope-OrgID: team_a' 'http://mimir:9009/prometheus/api/v1/query?query=up'. Mimir enforces isolation: team_a can only query team_a's metrics.

(4) Scaling: ingestion is isolated per tenant. If team A ingests orders of magnitude more than team B, the distributors shard each tenant's load independently, and per-tenant rate limits keep one team from starving the others.

(5) SLA differentiation: use per-tenant limits via runtime overrides, e.g. team_a gets a high ingestion rate and series limit, team_b a lower one.

(6) Replication factor: Mimir's replication_factor (default 3) applies globally, not per tenant. Different replication factors per tenant aren't built in; that would require separate Mimir deployments.

(7) Retention: ingesters hold roughly the most recent two hours in memory plus a WAL before shipping blocks to object storage; long-term retention is controlled by the compactor (compactor_blocks_retention_period), configurable globally or per tenant.

(8) Monitoring: track per-tenant series metrics (e.g. cortex_ingester_active_series, labelled per tenant) to identify runaway tenants. Alert when a tenant approaches its quota.
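A sketch of per-tenant limits via Mimir's runtime overrides file; the tenant names and numbers are illustrative, and the exact limit field names should be checked against your Mimir version:

```yaml
# runtime.yaml, referenced via -runtime-config.file
overrides:
  team_a:                           # higher-SLA tenant
    ingestion_rate: 1000000         # samples/sec
    ingestion_burst_size: 2000000
    max_global_series_per_user: 10000000
  team_b:
    ingestion_rate: 100000
    ingestion_burst_size: 200000
```

Runtime overrides are reloaded without restarting Mimir, so quotas can be adjusted per tenant as teams grow.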

Follow-up: If a team exceeds their ingestion rate quota, are metrics dropped silently or does Mimir return an error?

You're migrating from Prometheus HA to Mimir. You have 2 weeks to migrate 100 teams' Prometheus instances without downtime. How do you execute the migration safely?

Safe Prometheus to Mimir migration:

(1) Parallel running: keep the existing Prometheus HA pair scraping and storing locally, and add remote_write to Mimir so each instance dual-writes: remote_write: [ { url: 'http://mimir:9009/api/v1/push' } ]. The local TSDB keeps serving existing dashboards and alerts throughout.

(2) Dual-read validation: run the same queries on both systems and compare results. Example: query 'up{team="team_a"}' on Prometheus and on Mimir. Results should match (allow 1-2% variance from scrape-timing differences). Automate this with a validation script in your deployment pipeline.

(3) Gradual cutover: days 1-4: 20 teams dual-write, validate for 2-3 days. Days 5-9: the next 30 teams. Days 10-14: the remaining 50. Staging the rollout bounds the blast radius.

(4) Query layer update: point Grafana, alerting rules, and scripts at Mimir, keeping Prometheus as a fallback. In Grafana: add Mimir as a data source, test queries, then switch dashboards over.

(5) Alert migration: move alert rules to the Mimir ruler. Run alerts on Mimir in parallel for a few days; once confident, disable the Prometheus-side rules.

(6) Rollback plan: if Mimir has issues, switch queries and alerts back to Prometheus. Keep Prometheus running for at least 2 weeks post-cutover.

(7) Cleanup: after the overlap window (long enough that historical queries are served by Mimir), decommission the old Prometheus HA pair. Archive its blocks to cold storage if compliance requires.

(8) Monitoring: track ingestion rate, query latency, and storage growth during the migration. Alert if Mimir deviates from Prometheus (e.g. query latency 10x higher).
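The dual-read validation in (2) boils down to comparing instant-query results from both systems. A hedged sketch of just the comparison step, operating on the "result" arrays of two /api/v1/query responses (the HTTP fetching and endpoint URLs are left to your environment):

```python
# Compare two Prometheus-style instant-query results within a relative tolerance.

def series_key(sample: dict) -> tuple:
    """Identity of a series: its sorted label set, minus replica-style labels."""
    labels = {k: v for k, v in sample["metric"].items()
              if k not in ("prometheus_replica",)}
    return tuple(sorted(labels.items()))

def compare_results(old: list, new: list, rel_tol: float = 0.02) -> list:
    """Return a list of mismatch descriptions; an empty list means the systems agree."""
    old_vals = {series_key(s): float(s["value"][1]) for s in old}
    new_vals = {series_key(s): float(s["value"][1]) for s in new}
    mismatches = []
    for key in old_vals.keys() | new_vals.keys():
        if key not in new_vals:
            mismatches.append(f"missing in new system: {key}")
        elif key not in old_vals:
            mismatches.append(f"missing in old system: {key}")
        else:
            a, b = old_vals[key], new_vals[key]
            if abs(a - b) > rel_tol * max(abs(a), abs(b), 1e-9):
                mismatches.append(f"value drift for {key}: {a} vs {b}")
    return mismatches

old = [{"metric": {"__name__": "up", "instance": "x"}, "value": [0, "1"]}]
new = [{"metric": {"__name__": "up", "instance": "x"}, "value": [0, "1"]}]
print(compare_results(old, new))  # [] -> systems agree
```

Dropping the replica label before comparing matters: the old Prometheus results carry prometheus_replica while Mimir's deduplicated results don't, and without normalization every series would look "missing" on one side.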

Follow-up: If a team's data is corrupted during migration (e.g., partial write to Mimir), how do you recover?

You're running Mimir in a Kubernetes cluster with 3 replicas (for HA). One pod is evicted due to node failure. What's the impact on data availability and ingestion?

Mimir pod eviction impact depends on the pod's role:

(1) Distributor evicted: distributors are stateless load balancers for incoming pushes. The remaining distributors absorb the traffic; the Kubernetes Service stops routing to the failed pod. Impact: a brief latency blip (failed pushes are retried by remote_write), no data loss. A replacement pod is scheduled (HPA or manual).

(2) Ingester evicted: ingesters are stateful; they hold the most recent data (up to ~2 hours of TSDB head plus a write-ahead log) before shipping blocks to object storage. With replication_factor=3, each series is written to 3 ingesters, so losing one leaves two copies and no data loss. If enough ingesters are lost at once to drop below quorum before blocks are shipped, recent data can be lost unless the WALs sit on persistent volumes. Protect with Pod Disruption Budgets so only one ingester can be disrupted at a time.

(3) Querier evicted: queriers are stateless. Impact is like distributors (brief latency, no data loss).

(4) Compactor evicted: compaction for the affected tenants pauses temporarily. Impact: blocks aren't merged for a while and storage grows. A replacement compactor pod picks the work back up.

(5) Mitigation: (a) set a PDB with maxUnavailable: 1 to prevent multiple pods being evicted together. (b) run ingesters with persistent volumes so the WAL survives restarts (memory-only ingesters carry a much higher loss risk). (c) monitor pod evictions and replace failed nodes promptly. (d) keep replication_factor >= 3 so the write quorum (2 of 3) tolerates a single pod loss.
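The PDB from the mitigation list can be sketched as follows; the name and label selector are illustrative and should match your Mimir deployment's labels:

```yaml
# PodDisruptionBudget for Mimir ingesters
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mimir-ingester-pdb
spec:
  maxUnavailable: 1            # never voluntarily evict more than one ingester at a time
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
```

Note a PDB only guards against voluntary disruptions (drains, rollouts); node crashes bypass it, which is why replication and persistent WALs are still needed.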

Follow-up: If an ingester pod is evicted after flushing blocks to S3, but before sending acknowledgment, are blocks duplicated?

You're running Thanos with 2 Prometheus replicas sending blocks to S3. One replica goes down for 12 hours. After recovery, it catches up and starts uploading blocks. How does Thanos handle duplicate/overlapping blocks from different replicas?

Thanos deduplication for multi-replica setups:

(1) Block format: each Prometheus replica's sidecar uploads 2-hour blocks whose meta.json records the time range and the replica's external labels. Both replicas upload blocks covering the same windows (e.g. 12:00-14:00).

(2) Overlap detection: the Thanos compactor scans the bucket and identifies overlapping blocks (same time window, different replica labels).

(3) Compactor deduplication: with vertical compaction enabled and --deduplication.replica-label=prometheus_replica set, the compactor merges overlapping blocks from different replicas into a single block, dropping duplicate samples.

(4) Recovery scenario: replica 1 is down for 12 hours (6 missing blocks). Replica 2 uploads all of its blocks. After replica 1 recovers, it uploads its 6 blocks with timestamps from before the outage. The compactor sees replica 1's blocks [12:00, 14:00, 16:00, 18:00, 20:00, 22:00] overlapping replica 2's blocks for the same windows and merges each pair, keeping one copy.

(5) External labels for replica identification: each Prometheus needs unique external_labels, e.g. { prometheus_replica: 'a', cluster: 'prod' }. Thanos uses these labels to tell which replica a block came from.

(6) Query-time deduplication: independently of the compactor, the querier can deduplicate on the fly via --query.replica-label=prometheus_replica, merging series that differ only in the replica label.

(7) Cost: deduplication shrinks stored data (saving S3 storage), but the compactor must re-process blocks, adding compute cost.

(8) Querying during deduplication: queries may still hit blocks that haven't been deduplicated yet. Thanos offers soft consistency here: query-time replica deduplication hides most of it, but results can briefly include duplicates until compaction completes.
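A sketch of the compactor arguments for this replica-aware setup, shown as container args (the object-store config path is a placeholder; flag names should be verified against your Thanos version):

```yaml
# Thanos compactor container args enabling replica deduplication
args:
  - compact
  - --objstore.config-file=/etc/thanos/s3.yml
  - --deduplication.replica-label=prometheus_replica  # which label marks a replica
  - --compact.enable-vertical-compaction              # allow merging overlapping blocks
  - --wait                                            # run continuously, not one-shot
```

The replica label named here must match the external_labels set on each Prometheus, otherwise the compactor treats the replicas' blocks as genuinely distinct data and keeps both copies.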

Follow-up: If deduplication merges blocks and the process crashes halfway, are blocks corrupted?

You're using Thanos querier to query both local Prometheus (recent 7d) and S3 (old 6-month history). A query for the last 1 year sometimes returns incomplete results. Why?

Incomplete results from Thanos queries can be caused by:

(1) The query window spanning multiple stores: recent data (Prometheus via sidecar, 7d retention) and old data (S3 blocks via store gateway). For a 1-year range the querier fans out to both: (a) the sidecar answers roughly [7d ago, now]; (b) the store gateway answers [1 year ago, ~7d ago]; (c) the querier merges the results. If a store is missing from the querier's endpoint list, its slice of the range silently disappears.

(2) Partial responses: by default the querier returns partial results when a store times out or errors, rather than failing the whole query. An overloaded or flapping store gateway therefore shows up as "sometimes incomplete" data. Disable partial responses, or check the partial-response warnings in query results, to surface this.

(3) Block upload or compaction lag: if the sidecar is behind on uploads or the compactor is behind on merging, some windows exist neither in the sidecar's local retention nor in the bucket's compacted blocks, producing gaps.

(4) Resolution mismatch: the compactor downsamples old data into 5-minute and 1-hour resolutions. Querying a 1-year range at raw resolution is slow and, if raw data has been deleted by retention, empty for old windows. Use auto-downsampling for long ranges.

(5) Clock skew: if component clocks disagree, the time-range boundaries between stores are computed wrongly. Keep all systems synchronized via NTP.

(6) Debugging: enable debug logging on the querier (--log.level=debug) and check for missing or slow stores; monitor the querier's request-duration and per-store latency metrics.

(7) Remediation: (a) ensure each Prometheus sidecar is up and uploading blocks. (b) ensure the store gateway is healthy and responding. (c) run the compactor continuously so blocks are merged and downsampled. (d) raise the query timeout if S3 reads are slow.
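A sketch of the querier wiring under those assumptions, shown as container args (endpoint names are illustrative; flag spellings should be checked against your Thanos version):

```yaml
# Thanos querier args spanning recent (sidecar) and old (store gateway) data
args:
  - query
  - --endpoint=prometheus-sidecar:10901       # recent ~7d via the sidecar StoreAPI
  - --endpoint=store-gateway:10901            # older blocks from S3
  - --query.replica-label=prometheus_replica  # dedup across HA replicas at query time
  - --query.auto-downsampling                 # use 5m/1h data for long ranges
  - --query.timeout=5m                        # headroom for slow S3 reads
```

If either --endpoint entry is dropped, the querier still answers 1-year queries; it just silently serves only the portion of the range its remaining stores cover, which is exactly the incomplete-results symptom described above.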

Follow-up: If a query for 1-year history takes 10 minutes, is this expected or a performance issue?
