Prometheus Interview Questions

Retention, Compaction, and Storage Sizing


Your Prometheus instance stores 15 days of metrics locally with a retention policy. However, you're hitting disk limits faster than expected: after 8 days, the disk is 90% full. You have 100 million active time-series. How do you diagnose whether retention is working correctly and what's consuming disk?

Retention depends on both time and disk usage. Prometheus retains data for the specified duration (--storage.tsdb.retention.time, default 15d) and also respects --storage.tsdb.retention.size (max disk bytes, unset by default); whichever limit is hit first triggers deletion of the oldest blocks. If disk is filling faster than expected: (1) Check cardinality: the cheapest way is the TSDB stats endpoint, 'curl http://prometheus:9090/api/v1/status/tsdb | jq', which reports head-series counts and the highest-cardinality metric names and label pairs. (Avoid listing every series via /api/v1/series with a match-all selector at this scale—it is extremely expensive.) If cardinality is much higher than expected, you have a cardinality explosion—investigate metric_relabel_configs and unintended label combinations coming from instrumentation. (2) Check retention settings: 'curl http://prometheus:9090/api/v1/status/flags | jq' and look for storage.tsdb.retention.time and storage.tsdb.retention.size. (3) Monitor disk usage: 'du -sh /prometheus' (or your data mount path), and watch the prometheus_tsdb_wal_* and block-related prometheus_tsdb_* metrics. (4) Analyze block sizes: the WAL lives in /prometheus/wal; persisted blocks are ULID-named directories directly under the data directory (e.g. /prometheus/01H...). Recent blocks cover 2 hours; older blocks are compacted into larger time ranges. (5) Check for high-cardinality metrics: use 'topk(20, count by (__name__) ({__name__=~".+"}))' to see the largest metric names. (6) Calculate expected disk: ingestion rate (samples/sec) × retention (seconds) × ~1-2 bytes/sample (compressed). 100M series scraped every 15s ≈ 6.7M samples/sec; over 15d at ~1.5 bytes/sample that is roughly 13TB—likely far more than the provisioned disk. Adjust retention, scrape interval, or cardinality accordingly, and cap disk usage with --storage.tsdb.retention.size=500GB.
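The sizing arithmetic in step (6) can be sketched as a quick shell calculation. All inputs here are illustrative assumptions (series count, scrape interval, bytes per sample) to be replaced with your own measurements:

```shell
# Back-of-envelope TSDB disk sizing; every input is an assumption to tune.
SERIES=100000000        # active time-series (assumed)
SCRAPE_INTERVAL=15      # seconds between scrapes (assumed)
BYTES_PER_SAMPLE=2      # ~1-2 bytes/sample after compression; 2 is pessimistic
RETENTION_DAYS=15

SAMPLES_PER_SEC=$((SERIES / SCRAPE_INTERVAL))
RETENTION_SEC=$((RETENTION_DAYS * 86400))
DISK_BYTES=$((SAMPLES_PER_SEC * RETENTION_SEC * BYTES_PER_SAMPLE))
echo "~$((DISK_BYTES / 1024 / 1024 / 1024)) GiB needed"
```

With these numbers the estimate comes out in the mid-teens of TiB, which is why 100M series rarely fit a single local disk at 15d retention.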

Follow-up: If you set both retention.time=15d and retention.size=100GB, and disk fills to 100GB before 15 days, does Prometheus delete metrics older than some threshold?

Your Prometheus is configured with retention: 15d, but you're planning to query 1-year-old metrics for compliance audits. You need a way to keep long-term data without filling local disk. How do you architect long-term retention?

Prometheus local retention is limited by disk size. For long-term retention, use tiered storage: (1) Local Prometheus: keep recent data (15-30d) locally for fast queries and alerting. (2) Remote storage: ship metrics to a backend (Thanos, Mimir, M3DB) that stores data in object storage (S3, GCS, Azure Blob). With Thanos you can either remote_write to Thanos Receive (remote_write: [ { url: 'http://thanos-receiver:19291/api/v1/receive', queue_config: { max_shards: 200 } } ]) or use the sidecar model. (3) Thanos sidecar architecture: a sidecar next to each Prometheus uploads finished blocks from the Prometheus data directory to S3 (a block is cut from the WAL every 2 hours). The Thanos compactor then merges small blocks in the bucket into larger ones and downsamples them; the Thanos querier provides a unified query interface. Flow: Prometheus → Thanos sidecar → S3 → Thanos compactor/store gateway → Thanos querier. (4) Query routing: queries over recent data hit local Prometheus (fast). Queries over old data go through the Thanos querier and store gateway, which read from S3 (slower but far cheaper). Thanos also deduplicates when multiple Prometheus replicas write to the same bucket. (5) Retention in S3: prefer the Thanos compactor's retention flags (e.g. --retention.resolution-raw=1y) over raw S3 lifecycle policies—lifecycle rules that delete individual objects can corrupt blocks. (6) Cost estimate: S3 standard is ~$0.023/GB/month; ~7.5TB ≈ $170/month ≈ $2k/year—much cheaper than provisioned disk for long-term data.
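A minimal remote_write stanza for the Receive path might look like the following. The hostname, queue settings, and replica label value are illustrative and need tuning for your ingestion rate:

```yaml
# prometheus.yml — hypothetical remote_write to a Thanos Receive endpoint
remote_write:
  - url: http://thanos-receiver:19291/api/v1/receive
    queue_config:
      max_shards: 200            # upper bound on parallel senders (tune to load)
      capacity: 10000            # samples buffered per shard
      max_samples_per_send: 2000

# Distinct replica label per instance so Thanos can deduplicate HA pairs
global:
  external_labels:
    replica: prometheus-a
```

The queue_config knobs trade memory for send throughput; start from defaults and raise max_shards only if prometheus_remote_storage_* metrics show a growing backlog.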

Follow-up: If you're storing 1 year of data in S3 via Thanos, how long does a query for 1-year-old metrics take compared to local 15-day data?

Your Prometheus instance has retention: 30d, and you're running out of disk space due to cardinality explosion. You decide to reduce retention to 7d to save 350GB, but some alerts and dashboards query data older than 7 days. How do you safely reduce retention without breaking queries?

Reducing retention causes queries over older windows to return no data (gaps in time-series). Mitigate with: (1) First, investigate and fix the cardinality explosion—that is likely the real issue. Use metric_relabel_configs to drop unnecessary high-cardinality metrics: metric_relabel_configs: [ { source_labels: [__name__], regex: 'unwanted_metric', action: 'drop' } ]. This frees disk without reducing retention. (2) If cardinality can't be reduced, approximate tiered retention. Prometheus has no native per-metric retention, so use recording rules to pre-aggregate high-cardinality data into cheap low-cardinality series, keep the aggregates for the long window, and stop storing or forwarding the raw series. (3) For queries older than 7d, use remote storage (Thanos/Mimir) and configure remote_read so queries fall back to the long-term backend for old data automatically. (4) Update dashboards and alerts to reference only recent windows (last 7d) or the remote-read path. (5) For backward compatibility, put a query layer (e.g. Thanos querier) in front that fans out to both local Prometheus (recent) and remote storage (old) and merges the results. (6) Communicate with teams: document the new retention policy and point people at the historical-data query path via Thanos/Mimir.
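One concrete version of the pre-aggregate approach in step (2): record a low-cardinality aggregate locally, and forward only that aggregate to long-term storage, so raw series age out with the short local retention. Metric names and the backend URL are illustrative:

```yaml
# rules.yml — pre-aggregate an expensive metric into a cheap per-job series
groups:
  - name: tiered_retention
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

# prometheus.yml — ship only the aggregate to long-term storage;
# the raw high-cardinality series stays local and expires with local retention
remote_write:
  - url: http://mimir:9009/api/v1/push     # hypothetical Mimir endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: job:http_requests_total:rate5m
        action: keep
```

Note the ordering trap: the recording rule must evaluate against locally stored raw data, so drop the raw series from remote_write (via write_relabel_configs) rather than at scrape time, or the rule has nothing to aggregate.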

Follow-up: If you delete a metric entirely (via metric_relabel_configs drop), are old time-series for that metric permanently lost or can you recover them?

Prometheus's TSDB compaction creates blocks every 2 hours, but during heavy traffic (~5M samples/sec), the WAL (write-ahead log) is growing faster than compaction can truncate it. Disk usage is increasing 50GB per hour. How do you monitor and optimize compaction?

Compaction is automatic, but under extreme load the WAL can outpace it. Monitor: (1) prometheus_tsdb_wal_truncations_total and prometheus_tsdb_wal_truncations_failed_total, plus the on-disk size of /prometheus/wal. The WAL is only truncated when the head block is compacted (every ~2 hours), so sustained WAL growth means head compaction is falling behind. (2) prometheus_tsdb_compactions_total, prometheus_tsdb_compactions_failed_total, and prometheus_tsdb_compaction_duration_seconds (how often compaction runs and how long it takes). Long durations mean Prometheus is CPU/IO bound. (3) Check disk I/O: 'iostat -x' or 'iotop'. If disk utilization is saturated (> 90%), upgrade to faster SSDs or distribute metrics across multiple Prometheus instances. (4) Keep WAL compression on: --storage.tsdb.wal-compression (enabled by default in recent versions) roughly halves WAL size. There is no flag to cap the number of WAL segments—the WAL only shrinks when head compaction succeeds, so compaction throughput is the real lever. (5) Upgrade hardware: Prometheus benefits greatly from fast local NVMe SSDs; HDDs are far too slow for this write rate. (6) Reduce ingestion: ~5M samples/sec with 100M series is at or beyond the practical limit of a single Prometheus. Consider: (a) sharding across multiple Prometheus instances (e.g. hashmod relabeling on targets). (b) Using Mimir or M3DB, which scale horizontally. (c) Implementing recording rules to pre-aggregate before long-term storage.
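Sharding targets across instances, as in option (a), is commonly done with hashmod relabeling; each Prometheus replica keeps only its own shard. The shard count and shard number below are illustrative:

```yaml
# prometheus.yml on shard 0 of 4 — each replica differs only in `regex`
scrape_configs:
  - job_name: sharded-nodes
    relabel_configs:
      - source_labels: [__address__]
        modulus: 4                 # total number of shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"                 # this instance's shard number
        action: keep
```

Hashing on __address__ keeps all series from one target on one shard, so per-target queries never have to fan out across instances.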

Follow-up: If Prometheus crashes during compaction (e.g., mid-block merge), can the TSDB be corrupted or recovered?

You're running two Prometheus instances scraping the same targets (for HA). Each instance generates 100GB/day of data. After 15 days, you have 3TB total on disk (two instances × 15d × 100GB/d). However, the data is identical—you're paying for 200% redundancy. How do you deduplicate storage across HA Prometheus instances?

Two Prometheus instances scraping the same targets produce near-identical data, doubling storage cost. Solutions: (1) Thanos with object storage: both instances upload blocks to the same S3 bucket via sidecars, each tagged with a distinct replica external label. Deduplication is not automatic—the bucket stores both copies. The Thanos querier deduplicates at query time via --query.replica-label=replica, and the compactor can physically merge replicas with --deduplication.replica-label=replica (vertical deduplication, a one-way operation). Cost savings: instead of 3TB, store ~1.5TB plus overhead. (2) Single Prometheus with HA Alertmanager: Alertmanager instances cluster via gossip, so alert delivery stays redundant—but a lone Prometheus means a scrape gap if it dies, so this trades availability for storage cost. (3) Remote storage deduplication: Mimir (and Cortex) deduplicate at ingestion with an HA tracker: both replicas remote_write to the same cluster, which accepts samples from one elected replica at a time and fails over if it stops sending. (4) Query-time deduplication without merging storage: the Thanos querier can read from both Prometheus instances directly (via sidecar Store APIs) and choose one replica per time-series, so dashboards see a single clean stream even though both disks hold the data. (5) For further savings, run recording rules and ship only aggregated results to the remote backend; both instances can then run with short local retention.
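The replica-label mechanics in option (1) could be wired up as follows; the label names and values are conventional but arbitrary:

```yaml
# prometheus-a.yml — each HA replica gets the same cluster label
# but a distinct replica label
global:
  external_labels:
    cluster: prod
    replica: a          # "b" on the second instance

# Thanos side (flags shown as comments for reference):
#   Querier, query-time dedup (non-destructive):
#     thanos query --query.replica-label=replica ...
#   Compactor, physical merge in the bucket (one-way, halves storage):
#     thanos compact --deduplication.replica-label=replica ...
```

Query-time deduplication is the safe default; only enable compactor-level deduplication once you are confident both replicas scrape identical target sets.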

Follow-up: If two Prometheus instances have slightly different scrape configs (one missing a target), can Thanos deduplication cause loss of data for that target?

You've been storing metrics for 15 days locally. Now you want to migrate to a new Prometheus instance running a newer version with different compression or block format. How do you migrate historical data without losing it or downtime?

Migrating TSDB data requires careful planning: (1) API export/import is possible (query the old instance and replay into the new one) but far too slow for large datasets and loses native block efficiency—reserve it for small subsets. (2) Better: use Thanos for migration. Run sidecars on both old and new instances pointing at the same S3 bucket; when the old instance is decommissioned, its data remains in S3 and stays queryable through the Thanos querier. (3) Or copy TSDB blocks directly: stop the old Prometheus, copy the ULID block directories from its data directory (e.g. /prometheus/01H...) into the new instance's data directory, and restart. The block format is generally backward compatible—newer Prometheus versions read blocks written by older ones—but downgrades are not supported; inspect the blocks with 'promtool tsdb list <data-dir>' and test on a copy first. (4) Use remote_write during the transition: start the new instance with --web.enable-remote-write-receiver and point the old instance's remote_write at the new instance's /api/v1/write endpoint. Both systems ingest; after the transition, switch over. (5) Safest approach: deploy the new Prometheus alongside the old, let it scrape the same targets for 1-2 weeks so it accumulates its own history, then repoint Grafana/Alertmanager to the new instance. Shut down the old instance after verification. (6) For massive datasets (10TB+), the Thanos/object-storage route gives zero data loss and minimal downtime.
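The transition setup in step (4) might be sketched as below. Hostnames are placeholders; --web.enable-remote-write-receiver exists on Prometheus 2.33+, while older versions use --enable-feature=remote-write-receiver:

```yaml
# Old instance's prometheus.yml: forward all newly scraped samples
# to the new instance during the migration window.
# The new instance must be started with --web.enable-remote-write-receiver.
remote_write:
  - url: http://new-prometheus:9090/api/v1/write
```

This only forwards samples scraped from now on—historical data still needs one of the block-copy or Thanos paths above.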

Follow-up: If you copy TSDB blocks between Prometheus versions and the block format changed, what errors occur on startup?

You've sized Prometheus for 100M series × 15 days retention = ~7.5TB. But after 3 months, you've discovered that your cardinality estimate was wrong: you actually have 500M series, and disk is overflowing daily. You don't have time to rebuild infrastructure. What emergency measures can you take?

Immediate emergency measures: (1) Reduce retention aggressively: restart with --storage.tsdb.retention.time=1d or --storage.tsdb.retention.size=100GB to free disk quickly (retention is a command-line flag, so a restart is required). Accept that old data is lost. (2) Drop unwanted high-cardinality metrics at scrape time: metric_relabel_configs: [ { source_labels: [__name__], regex: '(high_cardinality_temp_.*|debug_.*)', action: 'drop' } ]. This takes effect from the next scrape after a config reload. (3) Delete existing series explicitly: enable --web.enable-admin-api, then POST to /api/v1/admin/tsdb/delete_series?match[]=... followed by /api/v1/admin/tsdb/clean_tombstones to reclaim disk. Note that a plain SIGHUP only reloads configuration—it does not trigger compaction or free disk. (4) Scale horizontally: deploy multiple Prometheus instances, each scraping a subset of targets (shard via hashmod relabeling or by namespace). (5) Offload with remote_write: point remote_write at a backend (Mimir, Thanos Receive) and keep only ~1d locally; new data flows to remote while local disk pressure drops. (6) Identify problematic exporters: some targets may expose 100k+ series each. 'curl -s http://target:port/metrics | wc -l' gives a rough line count per target; work with exporter owners to cut cardinality. (7) Long-term: enforce per-scrape cardinality limits (sample_limit and label_limit in scrape configs), shard, or upgrade to Mimir/Cortex.
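The per-scrape guardrails from step (7) can be expressed directly in scrape configs; the limits shown are arbitrary starting points, not recommendations:

```yaml
scrape_configs:
  - job_name: guarded-app
    sample_limit: 10000            # the whole scrape fails if a target exceeds this
    label_limit: 30                # max labels per series
    label_value_length_limit: 200  # reject absurdly long label values
    static_configs:
      - targets: ['app:9100']
```

Because an over-limit scrape is discarded entirely (and surfaces as up == 0 plus scrape_samples_post_metric_relabeling), pair these limits with an alert so a runaway exporter is noticed rather than silently dropped.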

Follow-up: If you reduce retention to 1d during an emergency, can Prometheus queries for 7d data still work via caching or federation?

Your Prometheus retention policy is set to 15d, but you notice the prometheus_tsdb_blocks_loaded gauge is 50—naively, 50 blocks × 2 hours = ~4 days, far short of 15 days. Why doesn't the block count match the retention window, and how does this affect queries?

Block count and retention measure different things. prometheus_tsdb_blocks_loaded is a gauge counting the persisted blocks the TSDB currently has open—and that set covers the entire retention window, not a fraction of it. The count is far smaller than retention ÷ 2h because: (1) Only the most recent data (the in-memory head block plus the WAL, up to ~3 hours) has not yet been persisted; everything older lives in blocks on disk. (2) Compaction merges adjacent 2-hour blocks into progressively larger ones, up to 10% of the retention time (capped at 31 days). With 15d retention the largest blocks span ~36 hours, so the steady-state block count is far below 180 (= 15d ÷ 2h), and 50 blocks can easily cover the full 15 days. (3) Blocks are not "loaded into RAM": Prometheus memory-maps block chunks and the OS pages data in on demand. Queries over cold blocks incur extra disk I/O; repeated queries over the same range are faster once the page cache is warm. (4) Memory to watch: prometheus_tsdb_head_chunks (in-memory head chunks) and process RSS; heavy queries over many old blocks temporarily raise memory via the page cache and query buffers. (5) If the time range actually covered by the loaded blocks (check block min/max times with 'promtool tsdb list /prometheus') really is shorter than retention, look for deleted or corrupted block directories and for errors in the startup log—otherwise the mismatch is just compaction math, and queries over the full 15 days work normally.
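The compaction arithmetic behind the block count can be sanity-checked in shell. The retention value is the one from the question; the 10%-of-retention / 31-day cap matches Prometheus's default maximum block range:

```shell
# Why 15d of retention doesn't mean 180 two-hour blocks.
RETENTION_HOURS=$((15 * 24))               # 15d retention
MAX_BLOCK_HOURS=$((RETENTION_HOURS / 10))  # largest block = 10% of retention...
CAP_HOURS=$((31 * 24))                     # ...capped at 31 days
if [ "$MAX_BLOCK_HOURS" -gt "$CAP_HOURS" ]; then
  MAX_BLOCK_HOURS=$CAP_HOURS
fi
MIN_BLOCKS=$((RETENTION_HOURS / MAX_BLOCK_HOURS))
echo "largest block spans ${MAX_BLOCK_HOURS}h; fully compacted ~${MIN_BLOCKS} blocks"
```

So a fully compacted 15d TSDB needs only on the order of ten large blocks, plus a handful of recent 2h and 8h blocks that haven't been merged yet—well under the naive 180.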

Follow-up: If you have 30 days retention but only 4 days loaded into memory, and you query 60d back in time, what happens?
