Prometheus Interview Questions

Capacity Planning and Scaling Prometheus


You're sizing a new Prometheus instance for your infrastructure. You'll monitor 500 targets, each exposing 2000 metrics, with a 15-second scrape interval and 30-day retention. Estimate CPU, memory, and disk requirements. What's your formula?

Prometheus sizing formula:

1. Time-series cardinality: 500 targets × 2,000 metrics/target = 1M active series.
2. Memory estimation: roughly 1-2 KB per series in RAM (head block, index, WAL buffers, caches). At 1M series, expect 1-2 GB; add headroom for queries and compaction, so plan ~3-4 GB minimum.
3. Disk estimation: compressed samples average ~1-2 bytes each; plan with ~5 bytes/sample to cover index and WAL overhead. At a 15-second interval, each series produces 4 samples/minute. Per day: 1M series × 4 samples/min × 1,440 min ≈ 5.76B samples. At ~5 bytes/sample that is ~29 GB/day; for 30-day retention, ~870 GB. With WAL and index overhead, provision 1.2-1.5 TB.
4. CPU: driven by scrape volume, compaction, and query load. Budget ~15-20% of capacity for scraping and compaction, plus roughly 10% per heavy concurrent query. Use 2-4 cores as a baseline.
5. Network: 1M series × 4 samples/min ≈ 4M samples/min inbound, on the order of 2 MB/s.
6. Recommended hardware: 4 cores (2.5 GHz+), 8-16 GB RAM, 2-4 TB NVMe SSD (compaction needs fast I/O).
7. For production, scale conservatively: plan ~50% headroom and size for cardinality growth (e.g. 2M series). Treat the formula as iterative: after a few months of operation, measure actual cardinality (prometheus_tsdb_head_series) and actual bytes/sample, then adjust.
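The estimate above maps to a short configuration sketch; all numbers are planning assumptions, not measured values, and the size cap is illustrative:

```yaml
# Retention is set via command-line flags, not the config file:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=1400GB   # safety cap below disk capacity
global:
  scrape_interval: 15s        # 4 samples/min/series
# Back-of-envelope disk math for the sizing above:
#   1M series x 4 samples/min x 1440 min ≈ 5.76B samples/day
#   at ~5 B/sample ≈ 29 GB/day; x 30 days ≈ 870 GB
#   provision 1.2-1.5 TB for WAL and index overhead
```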

Follow-up: If cardinality exceeds your estimation by 5x, can you upgrade the instance in-place without downtime?

Your Prometheus is now at capacity: CPU is 90%, memory is 85%, disk is 95% full. Adding more targets will break it. You can't upgrade hardware quickly. What immediate steps do you take to make room?

Emergency capacity relief:

1. Reduce retention immediately: restart with --storage.tsdb.retention.time=7d (down from 30d). Old blocks are dropped at the next compaction cycle, freeing most of the disk consumed by data older than 7 days.
2. Drop unnecessary metrics: use metric_relabel_configs to drop high-cardinality, low-value metrics (e.g. names matching debug_.*|temp_.*). This takes effect on the next scrape (~15s) and stops new samples; existing data still ages out only with retention.
3. Reduce scrape frequency: raise scrape_interval from 15s to 30s. Cuts samples per series per day by 50%, halving disk growth.
4. Cap query cost: lower --query.max-concurrency and --query.max-samples to bound the memory spent on query evaluation.
5. Delete known-bad series: with --web.enable-admin-api set, use the TSDB admin API (delete_series, then clean_tombstones) to remove high-cardinality series and reclaim disk without waiting for retention. Note that SIGHUP only reloads configuration; it does not trigger compaction.
6. Scale horizontally (temporary): deploy a second Prometheus instance for a subset of targets (shard by label), using service-discovery relabeling to split the load.
7. Implement remote_write: start shipping samples to a backend (Thanos Receive, Mimir, Grafana Cloud) and keep only 1-7 days locally, reducing local storage pressure.
8. Long-term: upgrade hardware (CPU, RAM, SSD), shard across multiple Prometheus instances, or migrate to a distributed backend (Mimir, Cortex). Prioritize: high-cardinality metric → drop it; low-importance data → shorten retention. Communicate the SLA impact to teams (lower retention = less historical data available).
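The metric-dropping step can be sketched as follows; the job name and target are hypothetical:

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:9100']            # hypothetical target
    metric_relabel_configs:
      # Drop debug/high-cardinality series before they reach the TSDB;
      # takes effect on the next scrape (~15s). Already-stored samples
      # still age out only with retention.
      - source_labels: [__name__]
        regex: 'debug_.*|temp_.*'
        action: drop
```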

Follow-up: If you reduce retention to 7d and receive a compliance audit request for 90d of metrics, what do you do?

You're planning to scale Prometheus from 1M to 10M active time-series for a growing organization. You can't run a single massive instance (hardware limitations). How do you architect sharded Prometheus?

Sharded Prometheus architecture:

1. Shard by labels: split targets across N Prometheus instances, each responsible for a subset. Shard by 'region', 'team', 'service', or by hashing the target address so targets distribute evenly.
2. Relabeling: use the hashmod relabel action on __address__ with modulus N to compute a shard number, then a keep action so each instance retains only targets whose shard number matches its own ID.
3. Query layer: place a merging query layer (Thanos Querier, Promxy) in front of all shards; queries fan out to every shard and results are merged. A plain reverse proxy like Nginx can route but cannot merge results.
4. Deduplication: tag each shard via external_labels, e.g. external_labels: { prometheus_shard: '1' }. If shards have overlapping targets (HA pairs), the query layer deduplicates on the replica label.
5. Alerting: each shard runs alert rules over its own data. For global alerts, either (a) run them on each shard independently, (b) centralize alert evaluation in a global instance that queries the shards, or (c) use a backend like Mimir that centralizes rule evaluation.
6. Recording rules: each shard evaluates rules over its own data (per-shard aggregates). Global aggregates need a central evaluator such as Thanos Ruler.
7. Capacity: 10M series across 10 shards ≈ 1M series each (~2-4 GB memory per shard). CPU is distributed too: 10 × 2-4 cores in total rather than one oversized box.
8. Tooling: use Thanos (or federation of pre-aggregated series) to give dashboards a single global view; Thanos handles query fan-out and deduplication transparently.
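The hashmod step can be sketched like this; the shard count, job name, and SD source are assumptions, and each instance sets its own shard number in the keep regex:

```yaml
# prometheus.yml for shard 0 of 4
scrape_configs:
  - job_name: sharded-targets
    file_sd_configs:
      - files: ['targets/*.json']        # hypothetical SD source
    relabel_configs:
      # Hash each target's address into one of 4 buckets...
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      # ...and keep only the bucket owned by this instance.
      - source_labels: [__tmp_hash]
        regex: '0'                       # shard 1 uses '1', and so on
        action: keep
```

Pair this with a distinct external_labels entry per shard so the query layer can tell the shards' results apart.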

Follow-up: If you have 10 Prometheus shards and one crashes, what happens to queries? Do they fail or return partial results?

Your organization runs Prometheus locally in each datacenter (10 datacenters). Each has 1M series. You need a global view (10M series total). However, consolidating into a single global Prometheus would require 50 GB RAM (impractical). How do you scale to global visibility?

Multi-datacenter monitoring requires a distributed query layer:

1. Federation: a global Prometheus scrapes each regional instance's /federate endpoint. This works for a small set of pre-aggregated series but does not scale to 10M raw series: every federation scrape pulls and re-ingests all matched series from all 10 regions.
2. Thanos: deploy a Thanos sidecar next to each regional Prometheus (serves recent data and uploads blocks to object storage such as S3). A Thanos Querier fans queries out in parallel to all sidecars (and to a Store Gateway for historical blocks), deduplicates, and merges results. This handles 10M series without a single giant instance.
3. Mimir/Cortex: a multi-tenant backend built for multi-datacenter by design. All regional Prometheus instances remote_write to Mimir; Mimir stores data in object storage (S3) and its querier provides the global query API.
4. Architecture comparison: (a) Federation: simplest, but limited to aggregates and adds scrape-cycle latency. (b) Thanos: object-storage-native; global query latency is typically in the 1-2s range depending on fan-out. (c) Mimir: optimized for high query concurrency and very large series counts; typically the fastest at scale.
5. Cost-benefit: federation adds no components. Thanos adds object storage (~$0.02/GB/month on S3) plus modest compute for the querier and store gateway. Mimir needs dedicated compute (distributors, ingesters, queriers) but handles massive scale (100M+ series).
6. For 10M series across 10 datacenters: (a) start with Thanos + S3 (storage cost on the order of a few $k/year). (b) If queries become slow or scale outgrows it, move to Mimir. (c) Track query latency via the prometheus_engine_query_duration_seconds histogram and the query layer's own metrics.
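The federation option can be sketched as below; note it is only practical when the match[] selector targets pre-aggregated recording-rule series, not all 1M raw series per region. The regional hostname is hypothetical:

```yaml
# Global Prometheus federating aggregate series from one region.
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only recording-rule aggregates
    static_configs:
      - targets: ['prometheus-us-east-1:9090']   # hypothetical regional instance
```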

Follow-up: If each datacenter's Prometheus has 1M series, but they're not identical (each has different labels), can you deduplicate across datacenters?

You're running 100 microservices, each emitting 100 metrics, totaling 10M series. Each service needs its own Prometheus for isolation (security, performance). But queries spanning multiple services fail or are slow. How do you scale to a multi-tenant Prometheus architecture?

Multi-tenant Prometheus architecture requires careful design:

1. Per-service Prometheus: each service gets its own instance (~2 GB memory, 1 core). 100 services = 100 Prometheus instances ≈ 200 GB RAM total (expensive), and 100 things to operate.
2. Query aggregation layer: place a merging proxy (Thanos Querier, Promxy, or a custom service) in front. It routes each query to the relevant Prometheus instances based on service labels and merges the results.
3. Service discovery: each Prometheus scrapes only its own service's targets, e.g. via Kubernetes namespace labels or explicit target groups. Example: service A's instance keeps only job="service_a"; service B's only job="service_b".
4. Query routing: when a user queries http_requests_total, the proxy must know which services expose that metric, query those instances, and merge results. Challenges: (a) it needs a service inventory/catalog, (b) the routing logic is complex.
5. Alternative: a shared backend with tenant isolation at the storage layer. All services remote_write to a single Mimir/Cortex deployment, which is multi-tenant by design: every write and query carries a tenant ID, and queries are automatically scoped to that tenant.
6. Hybrid: shard by team, not service. Team A's 20 services share one Prometheus (~2M series), Team B's 30 services another. Fewer instances (say 5 instead of 100) and simpler query aggregation.
7. Cost-benefit: per-service Prometheus gives hard isolation (good for security) but is expensive to run. A shared multi-tenant backend is more cost-effective but requires careful per-tenant access control. For 100 services, prefer shared Mimir with strong RBAC over 100 individual Prometheus instances.
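The shared-backend option reduces to one remote_write per team carrying a tenant ID header; the gateway URL and tenant name below are hypothetical:

```yaml
# In team A's Prometheus: ship all samples to the shared Mimir tenant.
remote_write:
  - url: http://mimir-gateway/api/v1/push    # hypothetical gateway endpoint
    headers:
      X-Scope-OrgID: team-a                  # Mimir/Cortex tenant ID
```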

Follow-up: If you're using multi-tenant Prometheus and tenant A's queries accidentally include tenant B's metrics, how do you prevent this security issue?

Your Prometheus receives an ingestion rate of 10B samples/second (from remote_write, federation, and heavy scraping). Compaction can't keep up, and the WAL is growing at 100 GB/hour. What do you do to handle extreme ingestion rates?

10B samples/sec is far beyond a single Prometheus's capacity (a well-tuned instance tops out around ~1M samples/sec in total). Solutions:

1. Vertical scaling doesn't get there: 10B/sec is four orders of magnitude above that ceiling, so no amount of hardware on one box solves this.
2. Shard ingestion: split the remote_write sources across many receiving instances (each started with --web.enable-remote-write-receiver), either by pointing different senders at different receivers or by hashing series to receivers, so each handles a survivable fraction of the stream.
3. Use Mimir/Cortex: built for horizontal ingest scaling. All sources remote_write to the distributors, which shard series across ingesters. Each ingester realistically handles on the order of hundreds of thousands to a few million samples/sec, so 10B/sec implies thousands of ingesters — feasible only in this kind of architecture.
4. Optimize the TSDB on each instance: (a) reduce cardinality by dropping high-cardinality metrics via metric_relabel_configs or write_relabel_configs; (b) enable WAL compression (--storage.tsdb.wal-compression, on by default in recent releases); (c) use fast NVMe storage so the WAL and compaction are not I/O-bound.
5. Downsampling: pre-aggregate before ingestion using recording rules, e.g. replace a raw 100k-cardinality metric with a 1k-cardinality aggregate.
6. Sampling: drop low-importance streams entirely before they are written. If debug_* metrics account for a large share of the rate, drop them.
7. At 10B samples/sec, a single Prometheus is the wrong tool. Use Mimir, Cortex, VictoriaMetrics, or M3, which are built for extreme ingest scale.
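Ingestion sharding on the sender side can be sketched like this: each series is hashed (here on metric name) into one of two buckets, and each remote_write queue keeps one bucket. The receiver URLs are hypothetical, and the receivers must run with --web.enable-remote-write-receiver:

```yaml
remote_write:
  - url: http://receiver-0:9090/api/v1/write
    write_relabel_configs:
      # Hash the metric name into 2 buckets; keep bucket 0 here.
      - source_labels: [__name__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: '0'
        action: keep
  - url: http://receiver-1:9090/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: '1'
        action: keep
```

Hashing on __name__ keeps all series of a given metric on one receiver, so per-metric queries stay single-shard; the temporary __tmp_shard label is dropped before sending, as remote write strips __-prefixed labels.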

Follow-up: If you shard Prometheus across 10 instances and one instance falls behind (higher WAL size), does it impact queries?

You're running Prometheus in Kubernetes with dynamic scaling. Pod count varies from 10 to 1000 depending on traffic. Your Prometheus scrapes all pods, so cardinality also varies (10k-100k series). How do you size Prometheus for dynamic workloads?

Dynamic workloads require elastic Prometheus sizing:

1. Vertical scaling: the HPA scales replica counts, which doesn't help a single stateful Prometheus. Use the Vertical Pod Autoscaler (VPA) to adjust CPU/memory requests as load changes, or simply size requests/limits for peak.
2. Horizontal scaling: shard Prometheus by namespace or pod label so each shard scrapes a subset of pods. Run as a StatefulSet with volumeClaimTemplates for persistent storage.
3. Size for peak cardinality: at peak (1000 pods), ~100k series; at trough (10 pods), ~1k. (a) Memory: 100k series × ~1.5 KB ≈ 150 MB for the head alone; request 0.5-1 GB to cover queries, compaction, and overhead. (b) Disk: at a 15s interval, 100k series × 4 samples/min × 1,440 min ≈ 576M samples/day; at ~5 bytes/sample that's ~3 GB/day, ~90 GB for 30-day retention; request 150 GB for headroom.
4. Reserved resources: at 100k series and 15s scrapes, request roughly 1-2 cores so scraping and compaction never starve.
5. Series churn: series from terminated pods go stale automatically and are dropped from the head block within a few hours; they do not need manual cleanup. High churn still bloats the index, though, so keep label sets stable and avoid pod-unique labels (e.g. pod UID) where possible.
6. Thanos for dynamic environments: all shards upload to S3 via Thanos sidecars. Shards can be added or removed dynamically; the Thanos Querier discovers them through its configured store endpoints.
7. Monitoring: track prometheus_tsdb_head_series (current cardinality) and prometheus_tsdb_head_chunks_created_total (churn) in real time. Alert if the growth rate exceeds a threshold — it usually indicates a newly introduced high-cardinality metric.
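The monitoring step can be expressed as an alerting rule; the 150k threshold below is an assumption (50% above the sized peak of ~100k series):

```yaml
groups:
  - name: capacity
    rules:
      - alert: CardinalityAbovePlan
        expr: prometheus_tsdb_head_series > 150000   # assumed threshold
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series exceed the sized peak of ~100k"
```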

Follow-up: If you're running Prometheus in Kubernetes with Thanos sidecars, and a pod is evicted, how long before Thanos stops uploading from that instance?

You've implemented recording rules across multiple Prometheus instances (sharded). Each instance evaluates the same rules independently, creating duplicate recorded metrics. You need a single copy of each recorded metric, not duplicates across shards. How do you deduplicate recorded metrics?

Recording rule duplication happens when multiple Prometheus instances evaluate identical rules independently. Deduplication strategies:

1. Centralized rule evaluation: instead of each shard evaluating rules, a single central instance queries all shards (via a query layer or federated inputs) and evaluates each rule once. This adds an extra instance to operate and a wider query fan-out.
2. Thanos Ruler: move the rules into Thanos's Ruler component instead of the individual Prometheus instances. Ruler evaluates each rule once against the Thanos Querier's global, deduplicated view, writes results to its own local TSDB, and uploads the resulting blocks to object storage.
3. Shard-specific recording rules: each shard evaluates rules only over its own data. Example: shard 1 records aggregates for job="service_a"; shard 2 for job="service_b". No duplication because the input data doesn't overlap; the per-shard scoping lives in the rule expressions' label selectors.
4. Deduplication at query time: accept duplicate recordings, tag each shard with an external label (e.g. prometheus_shard), and let the Thanos Querier deduplicate on that label at retrieval.
5. For global metrics (e.g. total requests across all shards), the rule must be evaluated over the global view, e.g. in Thanos Ruler: record global:requests:total as sum(rate(requests_total[5m])) across all shards.
6. Cost-benefit: centralized rule evaluation adds complexity and a potential SPOF. For most use-cases, either scope rules per shard or accept duplicates and deduplicate at the query layer via Thanos.
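A Thanos Ruler rule file for the global aggregate could look like this sketch; the metric names are illustrative:

```yaml
# Evaluated once by Thanos Ruler against the deduplicated global view,
# instead of independently on every shard.
groups:
  - name: global-aggregates
    rules:
      - record: global:requests:rate5m
        expr: sum(rate(requests_total[5m]))
```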

Follow-up: If you're using Thanos Ruler for global recording rules and it crashes, how long before metrics stop being updated?
