Prometheus disk filling: 500GB/week, 15-day retention. Reduce disk usage 40% without losing critical data.
Implement multi-pronged reduction: (1) Raise block duration 2h → 24h via the hidden `--storage.tsdb.min-block-duration`/`--storage.tsdb.max-block-duration` flags to cut per-block index overhead (note: this lengthens the WAL and raises memory use; see follow-up). (2) Ensure WAL compression is enabled (`--storage.tsdb.wal-compression`, on by default since v2.20). (3) Reduce cardinality via relabeling before ingestion. (4) Downsample: 15d full resolution locally + 1y downsampled via the Thanos compactor. (5) Tiered storage: ship old blocks to S3/GCS with Thanos or Cortex. (6) Increase scrape intervals from 15s to 30-60s for low-value targets.
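Step (3) can be sketched as a scrape-time relabeling rule. This is a minimal illustration; the job, target, label, and metric names here are placeholders — identify the real offenders first via the TSDB status page.

```yaml
scrape_configs:
  - job_name: api                    # hypothetical job
    static_configs:
      - targets: ['api:9100']        # hypothetical target
    metric_relabel_configs:
      # Drop a per-request ID label that explodes cardinality
      - regex: request_id
        action: labeldrop
      # Drop an entire debug-only metric family before ingestion
      - source_labels: [__name__]
        regex: 'myapp_debug_.*'
        action: drop
```

`metric_relabel_configs` runs after the scrape but before storage, so dropped series never reach the TSDB.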
Follow-up: What's the risk of 2h → 24h block duration and how mitigate?
Malformed metric corrupts WAL. Prometheus won't start. 1 hour before next incident.
Rapid recovery: (1) Stop Prometheus immediately. (2) Inspect the WAL at `{data_dir}/wal/` for truncated or zero-length segments. (3) Restart: Prometheus attempts automatic WAL repair on startup, truncating from the first corrupt record. (4) If startup still fails, move the WAL aside (`mv wal wal.bak`) and restart; only unflushed head samples (up to ~2-3h) are lost, completed blocks are safe. (5) Verify recovery by querying known metrics and checking `prometheus_tsdb_wal_corruptions_total`. (6) Backfill the gap from a federated peer or remote-write buffer if one exists.
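Step (2) can be partly automated. A minimal sketch, assuming the standard WAL layout of sequentially numbered segment files (checkpoint directories are skipped); it flags the two cheapest corruption hints, zero-length segments and numbering gaps:

```python
import os

def scan_wal(wal_dir):
    """Flag suspicious WAL segments: zero-length files or gaps in numbering.

    Prometheus names WAL segments as zero-padded sequence numbers
    (00000000, 00000001, ...); checkpoint directories are ignored here.
    """
    issues = []
    segs = sorted(
        f for f in os.listdir(wal_dir)
        if f.isdigit() and os.path.isfile(os.path.join(wal_dir, f))
    )
    for name in segs:
        if os.path.getsize(os.path.join(wal_dir, name)) == 0:
            issues.append(f"zero-length segment {name}")
    nums = [int(s) for s in segs]
    for prev, cur in zip(nums, nums[1:]):
        if cur != prev + 1:
            issues.append(f"gap between segments {prev} and {cur}")
    return issues
```

A clean scan does not prove the WAL is healthy (records inside a segment can still be torn), but a dirty scan tells you where repair will start.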
Follow-up: How would you set up automated WAL integrity checks?
Analyzing storage: using 2TB for expected 500GB data. Memory and compaction healthy. Where to investigate?
Systematic storage analysis: (1) `promtool tsdb analyze <data_dir> <block>` for block structure and label cardinality. (2) Check per-metric series counts via the `/api/v1/status/tsdb` endpoint. (3) Check exemplar storage size if exemplars are enabled (`--enable-feature=exemplar-storage`). (4) Check for tombstones left by deletions; trigger `/api/v1/admin/tsdb/clean_tombstones` if the admin API is on. (5) Check total WAL size: `du -sh {data_dir}/wal`. (6) Look for duplicate series from label variations (e.g. the same metric scraped under two jobs).
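Step (2) can be scripted against the TSDB status API, which returns a `seriesCountByMetricName` list of the top cardinality contributors. A sketch, with the ranking split into a pure helper; the base URL is an assumption:

```python
import json
from urllib.request import urlopen

def top_metric_cardinality(base_url="http://localhost:9090"):
    """Fetch the TSDB status API and return (metric, series_count)
    pairs, highest-cardinality first. Requires a reachable Prometheus."""
    with urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        payload = json.load(resp)
    return rank_by_series(payload)

def rank_by_series(payload):
    """Pure helper: rank the seriesCountByMetricName entries
    from an already-parsed /api/v1/status/tsdb response."""
    entries = payload["data"]["seriesCountByMetricName"]
    return sorted(
        ((e["name"], int(e["value"])) for e in entries),
        key=lambda kv: -kv[1],
    )
```

If one metric dominates the ranking, that is usually where the unexplained 1.5TB lives.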
Follow-up: Implement automated storage quota enforcement?
Upgrade Prometheus v2.30 → v2.45. Concerned about storage format changes and backward compatibility.
Comprehensive testing: (1) Staging environment mirroring production; restore from a snapshot of live data. (2) Run both versions in parallel: first point v2.45 at a restored snapshot with no scrape targets to confirm it reads the old block format (TSDB is backward compatible, but blocks written by v2.45 may not be readable if you later downgrade). (3) Replay identical queries against both versions and compare results (correlation, max deviation). (4) Measure performance: query latency percentiles, compaction duration, memory. (5) Cut over only after v2.45 has completed a full compaction cycle on the copied data without errors.
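For step (3), the comparison itself is just Pearson correlation over the two result vectors. A minimal sketch, assuming you have already extracted aligned sample values from the same range query run against both versions:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sample vectors,
    e.g. the same PromQL range query answered by v2.30 and v2.45.
    1.0 means the versions agree perfectly (up to scale)."""
    n = len(xs)
    assert n == len(ys) and n > 1, "need two aligned vectors of length > 1"
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```

In practice also check max absolute deviation: correlation alone can hide a constant offset.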
Follow-up: What would you do if v2.45 shows 30% query latency regression?
Dimensioning new cluster: 50 microservices, 1000 instances, 200K series/instance, 30-day retention, <100ms p99 latency.
Capacity planning (rules of thumb; validate against your own data): (1) Total series: 1000 instances × 200K series/instance = 200M active series; the 50 services just partition those instances. (2) Ingestion: at a 30s scrape interval, 200M series ≈ 6.7M samples/s. (3) Storage: at ~1.5 bytes/sample compressed, 6.7M samples/s × 2.6M s (30 days) ≈ 26TB; add 20-30% headroom for WAL and compaction churn → ~32TB. (4) Memory: ~2-3KB per active series → 400-600GB, far beyond one node; shard by service or hash across ~8-10 Prometheus instances (~20-25M series and 64-128GB RAM each). (5) Meet the <100ms p99 target with a query layer (Thanos Query or federation) over the shards plus recording rules for hot dashboards.
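The arithmetic above can be packaged so the assumptions are explicit and easy to re-run. Every constant here (bytes/sample, KiB/series, headroom) is a rule of thumb, not a measured value:

```python
def capacity_plan(instances=1000, series_per_instance=200_000,
                  scrape_interval_s=30, retention_days=30,
                  bytes_per_sample=1.5, series_overhead_kib=2.5,
                  headroom=0.25):
    """Back-of-envelope Prometheus sizing. All defaults are rules of
    thumb; replace them with numbers measured from your own cluster."""
    series = instances * series_per_instance
    samples_per_s = series / scrape_interval_s
    storage_bytes = samples_per_s * retention_days * 86_400 * bytes_per_sample
    storage_tb = storage_bytes * (1 + headroom) / 1e12
    memory_gb = series * series_overhead_kib * 1024 / 1e9
    return {"series": series,
            "samples_per_s": round(samples_per_s),
            "storage_tb": round(storage_tb, 1),
            "memory_gb": round(memory_gb)}
```

With the defaults this yields 200M series, ~6.7M samples/s, ~32TB of disk, and ~512GB of series memory — the numbers driving the sharding recommendation.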
Follow-up: How would you scale to 10x series count maintaining latency?
Compaction taking longer weekly (30m → 90m). Query latency degrading. Not keeping up.
Diagnose compaction: (1) Monitor `prometheus_tsdb_compaction_duration_seconds` and `prometheus_tsdb_symbol_table_size_bytes`; steady growth suggests unbounded cardinality. (2) Analyze block sizes with `du -sh`; blocks growing week over week (several GB and climbing) point to series growth, not a compaction bug. (3) Check CPU/IO during compaction windows: compaction is largely CPU-bound, but a starved disk stretches it too. (4) Find offenders with `topk(10, count by (__name__)({__name__=~".+"}))`, then drop or aggregate high-cardinality labels via relabeling before ingestion. (5) Keep the default block durations; there is no supported flag to parallelize a single compaction, so the fix is less data, not more knobs.
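The "30m → 90m weekly" trend in step (1) can be detected automatically with a least-squares slope over periodic readings. A sketch, assuming you collect a daily compaction-duration figure (e.g. a daily max of the histogram) as `(unix_ts, seconds)` pairs:

```python
def slope_per_day(samples):
    """Least-squares slope of (unix_ts, duration_s) points, in
    seconds/day. A persistently positive slope means compaction is
    falling behind and cardinality needs attention."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ds = [d for _, d in samples]
    mt, md = sum(ts) / n, sum(ds) / n
    num = sum((t - mt) * (d - md) for t, d in zip(ts, ds))
    den = sum((t - mt) ** 2 for t in ts)
    return (num / den) * 86_400
```

Alert when the slope exceeds a budget (say, +30s/day) rather than on any single slow compaction.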
Follow-up: Automate cardinality monitoring to prevent degradation?
Customer wants ad-hoc SQL queries on Prometheus data. API too slow for analytics aggregations.
Analytics pipeline: (1) Remote write to a warehouse (ClickHouse, BigQuery) via a remote-write adapter. (2) Batch export: periodic range queries dumped to Parquet/CSV. (3) Thanos: not SQL, but historical blocks in object storage become queryable by other engines. (4) A dedicated OLAP store (ClickHouse/Druid) fed by remote write typically runs large aggregations orders of magnitude faster than the HTTP API. (5) Grafana CSV export for smaller ad-hoc datasets.
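Option (1) is a small `remote_write` stanza. A sketch; the adapter URL and the allowlist regex are placeholders — shipping only the metric families analytics needs keeps the warehouse (and egress) small:

```yaml
remote_write:
  - url: "http://clickhouse-prom-adapter:9201/write"   # hypothetical adapter
    queue_config:
      max_samples_per_send: 5000
      capacity: 20000
    write_relabel_configs:
      # Ship only the metric families the analytics team actually queries
      - source_labels: [__name__]
        regex: 'http_requests_total|checkout_.*'
        action: keep
```

`write_relabel_configs` runs only on the remote-write path, so local storage and alerting are unaffected.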
Follow-up: Keep analytics warehouse in sync without duplication?
During an incident, determine which metrics were being scraped earlier but are now missing. Restarting would lose recent WAL state.
Rapid diagnostic capture: (1) Requires `--web.enable-admin-api` — enable it ahead of time. (2) Export state via `/api/v1/status/tsdb` for series counts per metric. (3) Query metrics directly: `curl 'http://localhost:9090/api/v1/query' --data-urlencode 'query=count({__name__=~".+"})'`. (4) Use the `/api/v1/targets` endpoint for target health JSON. (5) Capture logs with `--log.level=debug`. (6) Snapshot via the admin API: `curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot`, then archive `{data_dir}/snapshots/` (or tar the whole data dir if the admin API is disabled).
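Comparing a saved baseline of per-metric series counts against the current capture (two runs of the query in step (3), broken out with `count by (__name__)`) pinpoints what vanished. A sketch with a hypothetical drop threshold:

```python
def diff_metrics(baseline, current, drop_ratio=0.5):
    """baseline/current: {metric_name: series_count} snapshots taken
    before and during the incident. Reports metrics that vanished
    entirely and metrics whose series count collapsed below
    drop_ratio of the baseline (0.5 here is an arbitrary example)."""
    missing = sorted(set(baseline) - set(current))
    collapsed = sorted(
        name for name, n in baseline.items()
        if name in current and current[name] < n * drop_ratio
    )
    return {"missing": missing, "collapsed": collapsed}
```

Running this on a schedule (baseline refreshed daily) is also a direct answer to the follow-up: alert on any non-empty `missing` list before the data ages out.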
Follow-up: Automate metric loss detection to alert before data lost?