Your Prometheus instance is hitting disk space limits after 2 weeks of retention. You need 1 year of metrics for compliance audits, but local SSD storage is cost-prohibitive. How do you implement remote write to a long-term backend while keeping recent data locally?
Configure Prometheus with remote_write to send all metrics to a long-term storage backend (Thanos, Mimir, M3DB, or a cloud service like Grafana Cloud or Datadog). Keep local retention short (e.g. --storage.tsdb.retention.time=15d) and let the backend hold the full year. Prometheus ships samples to the remote endpoint asynchronously by reading from the WAL (write-ahead log); if the backend is slow or down, sends are retried with exponential backoff. The queue_config block controls parallelism and retry behavior: max_shards caps the number of parallel senders (raise it, e.g. to 100-200, for high-volume setups), capacity is the per-shard buffer size, and min_backoff/max_backoff bound the retry delay (defaults are roughly 30ms and 5s; there is no max_retries setting — Prometheus retries until the samples age out of the WAL). Note the WAL is time-bounded, not size-bounded: segments are truncated periodically (roughly every two hours), so an outage longer than the retained WAL window loses the unsent samples. For compliance, verify the backend supports at least 1-year retention, and enable authentication and encryption in transit.
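A minimal remote_write block in prometheus.yml might look like the following sketch; the Thanos Receive URL is the one from this scenario, and the queue_config values are illustrative tuning choices, not defaults:

```yaml
# prometheus.yml (fragment) — endpoint and tuning values are illustrative
remote_write:
  - url: "http://thanos-receiver:19291/api/v1/receive"
    queue_config:
      max_shards: 200     # upper bound on parallel senders
      capacity: 2500      # samples buffered per shard
      min_backoff: 30ms   # first retry delay
      max_backoff: 5s     # cap on retry delay (matches the default)
```

Local retention is set separately, on the command line: --storage.tsdb.retention.time=15d.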
Follow-up: If remote write fails persistently (backend down for 24 hours), what happens to the WAL and metrics? How do you prevent data loss?
You've set up remote write to Thanos, but Prometheus memory usage is growing linearly because remote write isn't keeping pace with ingestion: with every 15-second scrape interval, the backlog grows. How do you diagnose and fix this bottleneck?
Remote write backlog is visible in the prometheus_remote_storage_samples_pending gauge; if it keeps growing, remote write isn't draining as fast as ingestion. Diagnose with: (1) Check prometheus_remote_storage_samples_failed_total and samples_retried_total — if they're increasing, sends to the backend are failing (samples_dropped_total, by contrast, counts samples dropped by write_relabel_configs, not queue overflow). (2) Compare prometheus_remote_storage_shards_desired with prometheus_remote_storage_shards_max — if desired is pinned at the max, queue_config.max_shards is too low. (3) Measure network and backend latency (e.g. time a test POST against the backend's remote_write endpoint). If latency is high, increase max_shards to 200-400 and capacity to 5000. (4) Reduce sample volume: implement recording rules in Prometheus to pre-aggregate high-cardinality metrics and send only the aggregates to remote. (5) Keep backoff sane: queue_config.min_backoff and max_backoff shouldn't be too aggressive; the defaults (30ms and 5s) are reasonable, and payloads are already snappy-compressed. (6) Upgrade to a recent Prometheus (2.40+), which has better remote write parallelism. For 10M+ series, consider max_shards: 500.
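The checks in points (1)-(2) can be wired into alerting rules; the expressions use real remote-storage metrics, but the thresholds and durations here are illustrative:

```yaml
# alert-rules.yml (fragment) — thresholds are examples, tune per setup
groups:
  - name: remote-write-health
    rules:
      - alert: RemoteWriteFallingBehind
        # Pending samples still growing over 15m => the queue is not draining
        expr: delta(prometheus_remote_storage_samples_pending[15m]) > 0
        for: 15m
      - alert: RemoteWriteShardsMaxedOut
        # Desired shards pinned at the configured max => raise max_shards
        expr: prometheus_remote_storage_shards_desired >= prometheus_remote_storage_shards_max
        for: 10m
```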
Follow-up: What's the relationship between remote_write max_shards, cardinality, and memory overhead? Can excessive max_shards cause OOM?
Your organization uses Thanos for long-term storage and wants to query metrics from both local Prometheus (recent data, full resolution) and Thanos (old data, compacted blocks). However, queries are returning different results when queried from Prometheus vs Thanos. How do you reconcile data consistency between local and remote storage?
Prometheus and Thanos have different retention and resolution policies, which causes apparent inconsistencies. Prometheus stores raw samples in 2-hour blocks locally; the Thanos sidecar uploads these blocks to object storage, and the Thanos compactor merges them into larger blocks and produces 5m- and 1h-resolution downsampled copies. A Thanos querier fans out to both the sidecar (recent local data) and the store gateway (object-storage blocks), so overlapping windows appear twice unless deduplication is configured. To reconcile: (1) Keep local retention modest but don't fight the overlap — the sidecar needs completed 2-hour blocks to upload, and overlap between local data and uploaded blocks is expected and resolved by deduplication. (2) Pick one ingestion path — sidecar block upload or remote_write to Thanos Receive, not both — or you will genuinely duplicate data. (3) On the Thanos side, control per-resolution retention on the compactor with --retention.resolution-raw, --retention.resolution-5m, and --retention.resolution-1h. (4) Query exclusively through the Thanos querier (not the Prometheus API directly), with --query.replica-label set so overlapping series are deduplicated. (5) Keep label conventions identical across instances (external_labels in the Prometheus config, relabel rules on the Thanos side) so series match up for deduplication.
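The per-resolution retention from point (3) lives on the Thanos compactor, not in Prometheus. A sketch of the invocation — the flag names are real compactor flags, while the paths and durations are illustrative:

```yaml
# Thanos compactor flags (shown here as a config sketch; durations are examples):
#   thanos compact \
#     --data-dir=/var/thanos/compact \
#     --objstore.config-file=bucket.yml \
#     --retention.resolution-raw=30d \
#     --retention.resolution-5m=180d \
#     --retention.resolution-1h=365d \
#     --wait
```

Raw data is kept 30 days here, 5m-downsampled data 6 months, and 1h-downsampled data a full year — matching the idea of coarser resolution for older data.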
Follow-up: If Thanos downsampling is enabled, how do you query fine-grained metrics older than the downsampling threshold? Is raw data lost?
You're sending metrics to Grafana Cloud (managed remote storage) but want to implement tiered retention: store 1 year in Grafana Cloud, 5 years in S3 for compliance/audit. How do you architect multi-destination remote write?
Prometheus supports multiple remote_write destinations; each entry in the remote_write list is an independent queue with its own filters (note the field names: basic_auth, not basicAuth, and write_relabel_configs, not relabel_configs, inside a remote_write block). Use write_relabel_configs to filter which metrics go where: high-volume operational metrics go to Grafana Cloud, compliance metrics go to the S3-backed endpoint. This doubles send overhead for anything that matches both filters. Optimize by: (1) filtering so each backend receives only the metrics it needs; (2) relying on the snappy compression Prometheus applies to every remote write payload automatically; (3) for the compliance S3 tier, letting the Thanos compactor compact and downsample further. Alternatively, use a single backend (Thanos/Mimir) with tiered storage (local disk + S3) to avoid dual-writing at all.
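A two-destination configuration might look like this; the Grafana Cloud URL comes from the scenario, while the credentials and the metric-name regexes are placeholders to be replaced with your own routing policy:

```yaml
# prometheus.yml (fragment) — credentials and regexes are placeholders
remote_write:
  # High-volume operational metrics -> Grafana Cloud
  - url: "https://prometheus-blocks-prod-us-central1.grafana.net/api/prom/push"
    basic_auth:
      username: "<instance-id>"
      password: "<api-key>"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "node_.*|container_.*"
        action: keep
  # Compliance/audit metrics -> Thanos Receive backed by S3
  - url: "http://thanos-receiver-s3-backend:19291/api/v1/receive"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "audit_.*|billing_.*"
        action: keep
```

Because each destination has its own keep regex, a metric only reaches a backend whose filter explicitly matches it.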
Follow-up: If write_relabel_configs filters out a metric from one remote destination but not another, can this cause query inconsistencies?
Your Prometheus remote write to Mimir is working, but you notice the WAL (write-ahead log) is filling up disk quickly (~100GB per day). How do you optimize WAL behavior and prevent disk exhaustion?
The WAL stores samples until they are checkpointed and no longer needed by any remote write queue; a large WAL means remote write is falling behind ingestion, because Prometheus will not truncate segments an unfinished queue still needs. There is no fixed WAL capacity (the ~4GB figure sometimes quoted is a myth, and there is no storage.tsdb.max-wal-segments flag): segments are 128MB each by default (--storage.tsdb.wal-segment-size) and the WAL grows as far behind as remote write falls. Diagnose and fix: (1) Compare prometheus_remote_storage_samples_total against samples_failed_total and samples_retried_total to see whether sends are failing. (2) Watch prometheus_tsdb_wal_truncations_total and prometheus_tsdb_wal_storage_size_bytes — if truncations happen but the size keeps climbing, remote write is the bottleneck. (3) Measure backend latency: Mimir exposes cortex_-prefixed request metrics (e.g. cortex_request_duration_seconds); if ingest latency exceeds ~1s, add Mimir distributors/ingesters rather than throttling Prometheus. (4) Reduce cardinality: fewer series means a smaller WAL; use metric_relabel_configs to drop high-cardinality metrics at scrape time. (5) Keep WAL compression on (--storage.tsdb.wal-compression, enabled by default since Prometheus 2.20). (6) For extreme volume, shard by metric family: scrape node_* with one Prometheus and container_* with another, so each WAL stays small.
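Point (4), dropping high-cardinality series at scrape time, is done with metric_relabel_configs; the job name, target, and metric regex below are illustrative examples, not recommendations for every setup:

```yaml
# scrape_configs fragment — target and metric names are illustrative
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
    metric_relabel_configs:
      # Drop per-filesystem inode metrics that are never queried;
      # dropped series never reach the TSDB, the WAL, or remote write.
      - source_labels: [__name__]
        regex: "node_filesystem_files(_free)?"
        action: drop
```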
Follow-up: What's the trade-off between WAL capacity and recovery time after a crash? If you set max-wal-segments very high, can crash recovery become slow?
You're implementing read-from-remote-storage to query old metrics directly from the backend without storing them locally. However, queries for 1-year-old data are slow (10+ seconds). How do you optimize remote read performance?
Remote read pulls matching raw series from the remote backend (Thanos, Mimir) for data older than local retention. Slow queries usually mean: (1) backend latency is high (the Thanos querier is scanning many blocks), (2) the query selects too many series (high cardinality), or (3) object storage round-trips (S3 GetObject calls) are slow. Optimize: (1) Use recording rules to pre-aggregate: instead of querying raw metrics over a year, query a pre-aggregated series (e.g. a 5m average); evaluate the rules in Prometheus and remote-write the results so the backend stores the aggregates. (2) Query only the necessary time window: scope the time range tightly in Grafana so panels avoid full-range scans. (3) Add caching on the Thanos side: the query-frontend caches query results (in memcached or Redis) and the store gateway caches index and chunk data — Prometheus itself does not cache remote read results. (4) Scale the Thanos querier (CPU, RAM, --query.max-concurrent) to parallelize block fetches. (5) Use Thanos downsampling: the compactor produces 5m- and 1h-resolution copies of older blocks, and --query.auto-downsampling lets long-range queries use them — faster, but less granular. (6) For massive 1-year dashboards, accept eventual consistency: cache results for an hour in Grafana or the query-frontend.
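The pre-aggregation in point (1) is a standard recording rule; the rule name and source metric below are illustrative, but they follow the level:metric:operations naming convention:

```yaml
# recording-rules.yml (fragment) — rule name and source metric are examples
groups:
  - name: pre-aggregation
    interval: 5m
    rules:
      # Store a per-job request rate so year-long dashboards query one
      # small aggregated series instead of every raw per-instance series.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Remote-writing job:http_requests:rate5m means the long-term backend holds a compact series purpose-built for long-range queries.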
Follow-up: If you query 1 year of data with 1-second resolution, how many samples are returned? Can this cause OOM in Grafana or the browser?
You've configured remote write to multiple backends (Grafana Cloud + S3), but a bug in the relabel_configs sent the same metrics to both, causing high data duplication costs. How do you prevent dual-write bugs and implement cost controls?
Prevent duplication by: (1) Using separate Prometheus instances or recording rules per backend if you are splitting data. (2) Testing relabel configs thoroughly before deployment: promtool check config validates syntax, and a canary Prometheus receiving live traffic validates behavior. (3) Writing write_relabel_configs defensively: use explicit keep/drop actions per destination so a metric can only reach the backend whose regex matches it. (4) For cost control: (a) use metric_relabel_configs to drop high-cardinality or unnecessary metrics before any remote write; (b) tag metrics with external_labels identifying the source, so spend can be attributed per tenant on the backend; (c) set up CloudWatch or GCP budget alerts on data ingestion; (d) monitor prometheus_remote_storage_bytes_total, which is tracked per remote endpoint, to see bytes sent to each destination. (5) Run with --enable-feature=auto-gomemlimit so Go's memory limit tracks the container limit, making OOM crashes during write-volume spikes less likely. (6) Implement a canary rollout: route a small slice of traffic to new backends first, measure cost and performance, then roll out to 100%.
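Point (2) fits naturally in CI. promtool check config and promtool check rules are real promtool subcommands; the filenames are placeholders:

```yaml
# CI step sketch (commands shown as a fragment; filenames are placeholders):
#   promtool check config prometheus.yml   # validates config, including
#                                          # remote_write and relabel syntax
#   promtool check rules rules/*.yml       # validates recording/alerting rules
```

Syntax validation catches malformed regexes, but only a canary instance observing live traffic shows which series a relabel rule actually keeps or drops.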
Follow-up: If you accidentally send 10M series to an expensive backend for a week, how do you clean up and estimate the damage before paying the bill?
Your team is migrating from Prometheus local storage to a managed backend (Grafana Cloud / Datadog). During the migration, you need to run both systems in parallel for validation. How do you set up dual-write with consistency checks?
Run dual-write from the scraping Prometheus: one remote_write entry pointing at the old Prometheus (started with --web.enable-remote-write-receiver so it accepts pushes on /api/v1/write) and one pointing at the Grafana Cloud push endpoint. The old Prometheus keeps serving queries from local storage while the new backend fills. For consistency checks: (1) Run the same query against both systems and compare: diff = abs(old - new) / old; alert if diff > 1% for critical metrics. (2) Automate this comparison in a background job using both query APIs. (3) Maintain a suite of test queries that must return identical results on both systems. (4) During validation, route only a deterministic subset of series to the new backend — e.g. a hashmod action in write_relabel_configs keeping a fraction of series (relabeling cannot do true random percentage sampling) — then widen to everything. (5) After validation, point dashboards and alerts at the new backend and sunset the old Prometheus. If the old system serves remote read during migration, bound result sizes with --storage.remote.read-sample-limit so queries don't time out or exhaust memory.
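The background comparison job from points (1)-(2) can be a small script against the Prometheus-compatible HTTP query API (GET /api/v1/query). This is a sketch under the assumption that both backends speak that API; the base URLs, query, and 1% threshold are placeholders to adapt:

```python
import json
import urllib.parse
import urllib.request


def relative_diff(old: float, new: float) -> float:
    """Relative difference between the two backends' values (0.0 = identical)."""
    if old == 0.0:
        return 0.0 if new == 0.0 else float("inf")
    return abs(old - new) / abs(old)


def instant_query(base_url: str, promql: str) -> float:
    """Run an instant query against a Prometheus-compatible API, return the
    first result's value. base_url is e.g. "http://old-prometheus:9090"."""
    params = urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}") as resp:
        body = json.load(resp)
    # Instant query vectors look like: data.result[i].value == [timestamp, "val"]
    return float(body["data"]["result"][0]["value"][1])


def consistent(old_url: str, new_url: str, promql: str,
               threshold: float = 0.01) -> bool:
    """True if both backends agree within `threshold` (1% by default)."""
    old = instant_query(old_url, promql)
    new = instant_query(new_url, promql)
    return relative_diff(old, new) <= threshold
```

Running consistent() over a list of critical queries on a timer, and alerting when it returns False, gives a continuous validation signal during the migration window.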
Follow-up: During dual-write, if the new backend is slower (adding latency to remote_write queue), how does this affect scrape latency and query performance on the old system?