Grafana Interview Questions

Mimir and Long-Term Metrics Storage


You're running Prometheus with local storage, keeping 15 days of metrics. Your team wants to run year-over-year comparisons (current week vs. same week last year). You're also paying for expensive external storage. Migrate to Mimir for long-term metric storage while maintaining backwards compatibility with Prometheus API queries.

Implement Mimir as a long-term backend:

1. Parallel ingestion: configure Prometheus with a remote write target pointing to Mimir. Prometheus writes all metrics to both local storage and Mimir simultaneously (dual-write).
2. Query routing: configure Prometheus as the primary query target (fast, local). For queries exceeding local retention, proxy to Mimir. Use Trickster or a reverse proxy to implement the query routing logic.
3. Gradual cutover: run dual-write for 2-4 weeks, validating that Mimir data matches Prometheus local storage. Then reduce Prometheus retention to 7 days, keeping Mimir as the long-term store.
4. Mimir configuration: use Mimir's distributed architecture: scalable ingesters for the write path, queriers for the read path. Use S3 for chunk storage (cost-effective, durable).
5. Retention policies: in Mimir, store all metrics for 1+ years. Configure Mimir to delete metrics older than 1 year, or apply per-series retention based on metric type (high-cardinality metrics deleted sooner).
6. Cost optimization: use Mimir's compactor to compress metrics in S3, reducing storage 10-100x. Use S3 lifecycle policies to move old data to cheaper storage tiers.
7. Query latency: implement caching for popular queries to reduce latency. For long-range queries (year-over-year), pre-compute aggregates (daily summaries) to speed them up. Alert on query latency and document expected latency for different time ranges (1 week: <1s; 1 year: 5-10s).
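The routing decision in step 2 could be sketched as follows. This is a minimal illustration, not a production proxy; the endpoint URLs and the 7-day retention boundary are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration: local retention window and endpoints.
LOCAL_RETENTION = timedelta(days=7)
PROMETHEUS_URL = "http://prometheus:9090"          # hypothetical endpoint
MIMIR_URL = "http://mimir-query-frontend:8080"     # hypothetical endpoint

def route_query(start, now=None):
    """Route to Prometheus when the whole requested range fits inside
    local retention; otherwise route to Mimir, which holds the full
    long-term history."""
    now = now or datetime.now(timezone.utc)
    if now - start <= LOCAL_RETENTION:
        return PROMETHEUS_URL
    return MIMIR_URL
```

A reverse proxy implementing this rule keeps recent dashboard queries on the fast local path while year-over-year queries transparently hit Mimir.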

Follow-up: Your dual-write to Prometheus and Mimir is working, but a 1-second write latency to Mimir is causing Prometheus to timeout. Reducing timeouts loses metrics. How would you handle slow remote write without losing data?

Your microservices generate 5M metrics/day. After 1 year of storage in Mimir, your S3 bucket contains 10TB of compressed metrics. Queries for recent data (last 7 days) are <1s, but queries for year-old data take 30+ seconds (scanning multiple S3 objects). Design a tiered storage strategy that keeps query latency consistent across time ranges.

Implement intelligent tiered storage:

1. Hot tier (local cache): the most recent 7 days of metrics kept in Mimir's in-memory cache and on local SSD for <1s query latency.
2. Warm tier (S3 Standard): metrics 7-90 days old stored in S3 Standard. Queries scan S3 with parallel reads at 5-10s latency. Implement prefix-based partitioning (by day/week) to reduce scan scope.
3. Cold tier (S3 Glacier): metrics older than 90 days archived to Glacier. Queries against cold data are asynchronous: submit the query, receive results in 1-5 minutes.
4. Parallel object retrieval: for warm/cold queries, parallelize S3 object retrieval by requesting multiple objects simultaneously instead of sequentially.
5. Query result caching: cache query results aggressively. If "show me metrics for Jan 15 2025" is queried twice, serve the second request from cache.
6. Pre-aggregation: for long-range queries, pre-compute daily/weekly aggregates stored in the warm tier. Queries against aggregates are 90% faster.
7. Index optimization: use Mimir's index to quickly determine which time range contains data, reducing S3 scan scope.

Implement per-metric retention: store low-cardinality metrics (e.g., deployment_status) for 5 years and high-cardinality metrics (e.g., per-instance CPU) for 1 year. Set up a cost analysis dashboard showing storage cost by tier, and automate archive decisions based on query frequency: "cold metrics never queried in 30 days → delete from Glacier."
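The tier boundaries above reduce to a simple age lookup. A minimal sketch, assuming the 7-day and 90-day boundaries from the strategy:

```python
from datetime import timedelta

# Assumed tier boundaries from the strategy above.
TIERS = [
    (timedelta(days=7), "hot"),     # in-memory cache + local SSD
    (timedelta(days=90), "warm"),   # S3 Standard, parallel reads
]

def tier_for(age):
    """Pick the storage tier for a sample of the given age; anything
    past the last boundary lands in the asynchronous cold tier."""
    for boundary, name in TIERS:
        if age <= boundary:
            return name
    return "cold"                    # S3 Glacier, async retrieval
```

The query path can use the same function on a query's start time to decide whether to answer synchronously (hot/warm) or return an async job handle (cold).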

Follow-up: Your pre-aggregates for year-old data are correct 80% of the time, but occasionally show anomalies compared to original data. You can't trust year-old rollups. What validation approach would you implement?

Your Mimir cluster is experiencing high load during peak hours (9-10 AM). Queries slow down from <1s to 5+ seconds. Your infrastructure team can't add more capacity until next month. Design a query optimization strategy that reduces latency during peak load without major infrastructure changes.

Implement query optimization and load management:

1. Query profiling: identify slow queries by instrumenting Mimir with detailed metrics. Log query type, time range, cardinality, and execution time. Identify the top 10 slowest queries.
2. Query rewriting: rewrite the identified slow queries to be more efficient: break large time ranges into smaller chunks, request fewer series and labels, and prefer functions that optimize well (e.g., rate() over deriv()).
3. Result caching: cache repeated queries aggressively. Most queries repeat within 1-5 minutes; serving from cache reduces Mimir load by 50%+. Implement cache invalidation based on metric staleness.
4. Request queuing: during peak load, queue requests and serve them by priority (SLAs for critical queries). Non-critical queries might wait 5-10 seconds; critical queries get priority.
5. Request prioritization: define query tiers: P0 (dashboards showing current status, <2s), P1 (historical analysis, <10s), P2 (bulk exports, no SLA). During overload, queue P2 requests.
6. Batch query consolidation: if 100 dashboards request the same metric during peak, consolidate them into a single Mimir query and broadcast the results to all.
7. Query limits: implement rate limiting ("each tenant can issue at most 100 queries/minute"); excess queries are queued or rejected. Implement a "query budget" system: high-load tenants get lower budgets, while low-load tenants can burst.

Monitor query latency by query type and alert if P0 latencies exceed SLA. During overload events, trigger auto-scaling alerts for the infrastructure team.
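The tiered queuing in steps 4-5 is essentially a priority queue keyed on the P0/P1/P2 tier, FIFO within a tier. A minimal sketch (the tier names follow the definitions above; everything else is illustrative):

```python
import heapq
import itertools

# Query tiers from the policy above: lower number = served first.
PRIORITY = {"P0": 0, "P1": 1, "P2": 2}

class QueryQueue:
    """Serve P0 before P1 before P2; a monotonic counter keeps
    arrival order (FIFO) within each tier."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, tier, query):
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), query))

    def next_query(self):
        _, _, query = heapq.heappop(self._heap)
        return query
```

During overload, workers drain this queue so P2 bulk exports naturally wait while P0 dashboard queries keep meeting their SLA.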

Follow-up: Your query caching is working well, but cached results are stale (10 minutes old) during peak load. A team is seeing out-of-sync data between dashboards and alerts. How would you handle cache staleness vs. load?

Your Mimir cluster stores metrics from 500 microservices. One service has a runaway process generating 1M new metric combinations/hour (e.g., per-request metrics with unique request IDs as labels). This "cardinality explosion" is consuming all your storage and slowing queries. Design a cardinality management system that prevents and recovers from explosions.

Implement cardinality management and enforcement:

1. Cardinality budgets: assign each service a cardinality budget (maximum metric combinations). Monitor via Prometheus metrics and alert when a service approaches 80% of its budget.
2. Cardinality monitoring: track per-service cardinality growth over time. Identify services with anomalous growth (>10x day-over-day) and alert ops and the service owner immediately.
3. Rejecting ingestion: when a service exceeds its cardinality budget, reject new series from that service. Return HTTP 429 (Too Many Requests) to the sending client, signaling it to slow down.
4. Root cause identification: when a cardinality explosion is detected, analyze which metric and which label caused it. Log the details and alert the service owner with specifics.
5. Metric relabeling: implement Prometheus relabeling rules to drop or aggregate high-cardinality labels: replace per-request IDs with request_type, replace unique user IDs with user_tier.
6. Recovery: for services that exceeded their budget, implement a slow drain: gradually reduce stored metrics (deleting the oldest data first) over 24 hours to bring cardinality back under budget.
7. Forecasting: use time-series analysis to predict cardinality growth trends. Alert if a service is on a trajectory to exceed its budget within a week.

Build a cardinality dashboard showing the top 20 services by cardinality, growth rate, and budget utilization. Write a runbook for service owners: a troubleshooting checklist, metrics to review, and relabeling examples. Document cardinality budgets in service SLOs.
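The budget enforcement in steps 1 and 3 can be sketched as a per-service set of distinct label sets, with a warn threshold at 80% and rejection of new series over budget. A toy in-memory version (a real implementation would live in the ingestion path and use approximate counting):

```python
class CardinalityBudget:
    """Track distinct label sets per service; warn at 80% of budget,
    reject new series once the budget is exhausted (the caller would
    translate "reject" into an HTTP 429)."""

    def __init__(self, budget):
        self.budget = budget
        self.seen = {}  # service -> set of label tuples

    def admit(self, service, labels):
        series = self.seen.setdefault(service, set())
        key = tuple(sorted(labels.items()))
        if key not in series and len(series) >= self.budget:
            return "reject"          # new series over budget
        series.add(key)
        if len(series) >= 0.8 * self.budget:
            return "warn"            # approaching budget: alert
        return "ok"
```

Note that existing series are still admitted after the budget is hit; only genuinely new label combinations are rejected, which limits blast radius during an explosion.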

Follow-up: Your cardinality quota rejection is too aggressive—it's rejecting legitimate metrics from a service that auto-scales. When traffic spikes, the service generates new instances (new labels), exceeding quota, then gets rejected. How would you distinguish explosions from legitimate scaling?

Your Mimir stores metrics with different levels of detail: detailed 1-second resolution for recent data, 1-minute resolution for older data. A user queries a 1-year range expecting 1-second resolution but gets 1-minute resolution, causing confusion. Design a transparent metric downsampling system that maintains user expectations.

Implement transparent downsampling with user awareness:

1. Downsampling tiers: define explicit downsampling rules (<7 days: 1s resolution; 7-90 days: 1m; >90 days: 1h) and document them clearly.
2. Automatic downsampling: at ingest, store metrics at the configured resolution. Older metrics are automatically downsampled (aggregated) to coarser resolutions.
3. Query resolution negotiation: when a user queries a time range, Mimir determines the applicable resolution. If 1s resolution is requested for 1-year data (impossible), return a warning banner: "1-year range not available at 1s resolution. Available at 1m resolution. Fetching 1m data..."
4. Function adaptation: some functions break at coarse resolutions. For example, rate() over 1-hour samples loses meaning. Mimir should detect this and suggest alternatives: "rate() not applicable at 1-hour resolution; use a counter delta instead."
5. On-demand upsampling (limited): for small ranges, allow upsampling: if a user wants 7-day data at 1s resolution, reconstruct from raw data if available, otherwise interpolate. This is expensive, so limit it to ranges under 7 days and cache results.
6. Pre-aggregation: compute common aggregations (sum, avg, max, min) at each resolution level and store them separately. Queries asking for "average latency over 1 year" get pre-computed aggregates, avoiding a raw data scan.
7. Documentation: in dashboards and alerts, annotate the expected resolution and time range limitations. Add a "resolution" field to query responses showing the actual resolution used.

Set up training: document the downsampling strategy and show examples of appropriate queries for each time range.
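The resolution negotiation in step 3 boils down to: find the finest resolution the range allows, honour the request if possible, otherwise fall back with a warning. A minimal sketch, assuming the tier boundaries defined above:

```python
from datetime import timedelta

# Assumed tiers and resolution ordering from the policy above.
RESOLUTIONS = [(timedelta(days=7), "1s"), (timedelta(days=90), "1m")]
ORDER = ["1s", "1m", "1h"]  # finest to coarsest

def resolve(time_range, requested):
    """Return (actual_resolution, warning). Requests finer than the
    tier allows fall back to the finest available resolution and
    carry a warning for the response banner."""
    finest = "1h"
    for boundary, resolution in RESOLUTIONS:
        if time_range <= boundary:
            finest = resolution
            break
    if ORDER.index(requested) < ORDER.index(finest):
        return finest, f"{requested} not available for this range; serving {finest}"
    return requested, None
```

The returned resolution would populate the "resolution" field in the query response, and the warning string would feed the dashboard banner.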

Follow-up: Your 1-hour downsampled data for 1-year range loses important detail. A team wants to run 1-year analysis but at 1-minute resolution (not feasible). How would you support this without exploding storage?

Your Mimir deployment spans 3 cloud regions. Metrics from each region are stored locally, but you need to run company-wide queries (e.g., "what's the global request rate across all regions?"). Currently, these queries require manual aggregation or exporting data to a central store. Design a federated query system for Mimir.

Implement federated Mimir querying:

1. Query gateway: build a query proxy layer that accepts queries and routes them to regional Mimir instances. The gateway collects results from all regions and merges them.
2. Distributed execution: decompose a query like "sum(requests_total)" into per-region sub-queries ("sum(requests_total) from region-us-east", etc.) and execute them in parallel against each region's Mimir.
3. Result merging: collect the per-region results, perform the final aggregation (sum across regions), and return the merged result to the user.
4. Latency optimization: query only regions containing data for the requested time range. Use metadata caches to know which regions have data.
5. Partial failures: if 1 of 3 regions times out, return results from the other 2 with a note: "us-west region timed out; results are partial."
6. Caching: cache region-level query results. If multiple queries hit the same region and time range, serve them from cache.
7. Query planning: for complex queries, optimize execution order: run fast sub-queries first; if they return few series, skip the expensive ones.

Record query metrics (execution time per region, region latency distribution) and alert if any region is slower than SLA. Set up a federated query dashboard showing available regions, query success rate per region, and latency by region. Document use cases (global dashboards, cross-region comparisons) and provide examples. Handle data residency: for queries touching regulated data, ensure only approved regions are queried.
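Steps 2, 3, and 5 together are a parallel fan-out with partial-failure handling. A minimal sketch of the gateway's merge logic for a sum query; the `fetch` callable (one regional Mimir call) is an assumption of the example:

```python
from concurrent.futures import ThreadPoolExecutor

def federated_sum(regions, fetch, timeout=5.0):
    """Fan a sum() query out to every regional Mimir in parallel,
    merge the partial sums, and report any regions that failed or
    timed out so the caller can flag the result as partial."""
    total, failed = 0.0, []
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        futures = {pool.submit(fetch, region): region for region in regions}
        for future, region in futures.items():
            try:
                total += future.result(timeout=timeout)
            except Exception:
                failed.append(region)  # partial result: note the gap
    return total, failed
```

The gateway would attach the `failed` list to the response ("us-west region timed out; results are partial") instead of failing the whole query.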

Follow-up: Your federated queries are slow—us-west region is slower than eu-central, delaying all global queries. How would you address regional latency imbalances?

Your Prometheus sends remote writes to Mimir. During a network partition, writes to Mimir fail. Prometheus local storage fills up in 8 hours. You need a queue-and-replay mechanism to prevent metric loss during extended outages.

Implement reliable remote write with queue resilience:

1. Local WAL (write-ahead log): Prometheus writes to local disk before sending to Mimir. If the remote write fails, the local WAL preserves the data.
2. Persistent queue: in the Prometheus remote write client, maintain a persistent on-disk queue of failed writes that survives Prometheus restarts.
3. Backoff strategy: implement exponential backoff with jitter: on failure, wait 1s, then 2s, then 4s, and so on. This avoids a thundering herd when the service recovers.
4. Adaptive retention: while remote writes are failing, increase Prometheus local retention dynamically (up to a maximum). Once writes succeed again, gradually reduce retention back to the target.
5. Replay mechanism: once Mimir is available again, replay queued writes in order, at a rate Mimir can handle (adjusted based on ingest rate).
6. Monitoring: expose metrics for queue depth, failures per minute, and replay progress. Alert if the queue exceeds 80% of local storage.
7. Manual intervention tools: provide tools to inspect the queue, replay specific time ranges, or manually drop problematic batches. Document a troubleshooting runbook.

For extended outages (>24 hours), implement spillover to secondary storage (local disk, a separate S3 bucket) when local retention fills. Automate detection of outages longer than 6 hours and alert ops for manual intervention (scale Prometheus local storage or deprioritize low-priority metrics).
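The backoff in step 3 is commonly implemented as "full jitter": each retry waits a random amount between zero and the capped exponential value, so recovering senders don't retry in lockstep. A minimal sketch (base delay and cap are illustrative values):

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter: attempt n waits a random
    duration in [0, min(cap, base * 2**n)] seconds, avoiding a
    thundering herd when Mimir comes back up."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Injecting `rng` makes the schedule testable; with `rng=lambda: 1.0` the delays collapse to the deterministic 1s, 2s, 4s, ... sequence capped at 60s.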

Follow-up: Your queue has 2TB of failed writes (24 hours of outage). Replaying 2TB at normal ingest rate will take 48+ hours. Mimir will be overwhelmed. How would you handle massive queue replay without overloading?

Your Mimir cluster is at 90% capacity. Infrastructure team says adding storage takes 4 weeks. You can't reduce metrics (no cardinality explosion). Implement a strategy to reduce storage consumption immediately without losing critical data.

Implement aggressive but intelligent cost reduction:

1. Retention reduction: identify low-value metrics (debug metrics, rarely queried metrics) and reduce their retention from 1 year to 90 days. Calculate the storage savings.
2. Resolution reduction: for non-critical metrics, reduce resolution from 1s to 10s. This cuts their storage 10x.
3. Sampling: implement metric sampling: keep 100% of error metrics, 50% of anomalous metrics, and 10% of normal metrics. Use reservoir sampling to ensure the data stays representative.
4. Label dropping: review metrics for unnecessary labels: drop request_id (too high cardinality), keep service (useful for aggregation). Update Prometheus relabeling to drop these before remote write.
5. Aggregation: compute pre-aggregations for common queries (daily sum/avg/max). Store the aggregates separately and delete the raw data after 30 days.
6. Compression tuning: enable Mimir's aggressive compression, trading CPU for storage (roughly 2:1).
7. Deduplication: if metrics are ingested from multiple sources, deduplicate and keep only the primary source.

Combine the above strategically: retention cut by 50%, resolution by 5x, sampling halving volume, and label dropping reducing cardinality by 30% can together yield roughly 75% storage reduction. Run projections before implementing ("if we do X, we save Y storage and lose Z observability") and present the tradeoffs to stakeholders. Roll out gradually: apply to the least-critical services first, monitor impact, and expand if acceptable. Monitor storage growth post-optimization to ensure a sustainable trajectory.
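The tiered sampling in step 3 needs to be deterministic per series, otherwise a series flickers in and out of the sample between scrapes. One common trick is to hash the series identity into [0, 1) and compare against the category's keep rate. A sketch with assumed keep rates; the category labels ("error", "anomalous", "normal") are the example's own classification:

```python
import hashlib

# Assumed keep rates from the sampling policy above.
KEEP_RATES = {"error": 1.0, "anomalous": 0.5, "normal": 0.1}

def keep(series_id, category):
    """Deterministic sampling: hash the series ID into [0, 1) and keep
    the series when the hash falls below the category's keep rate, so
    the same series is consistently kept or dropped across scrapes."""
    digest = int(hashlib.sha256(series_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < KEEP_RATES[category]
```

Because the decision depends only on the series ID, the retained 10% of "normal" series form a stable, queryable subset rather than random gaps in every series.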

Follow-up: Your aggressive reduction (retention to 90 days, 10s resolution, 10% sampling) saves 75% storage, but teams are complaining they can't debug long-running issues (e.g., "how was our deployment 2 months ago?"). The tradeoff is too painful. What's the right balance?
