Grafana Interview Questions

Loki Log Aggregation and LogQL


Your microservices emit 2TB of logs daily across 50 services. Kibana (old stack) is costing $50K/month. You're migrating to Grafana Loki for log aggregation, but your team doesn't know LogQL. They're used to Kibana's Lucene syntax. Design a migration strategy that gets the team productive on LogQL while minimizing training overhead and ensuring log queryability isn't degraded during migration.

Implement a three-phase migration: (1) Parallel ingestion—send logs to both the existing Elasticsearch/Kibana stack and Loki simultaneously for 4-6 weeks, enabling side-by-side query validation. Developers run queries in both systems, learning LogQL gradually while verifying results match. (2) Query translator—build a Lucene-to-LogQL transpiler that converts simple Kibana queries to LogQL, helping teams quickly find equivalent queries. Document common patterns (field matching, regex, aggregations). (3) Runbooks and cookbook—create a LogQL cookbook with patterns such as "find errors in service X," "calculate P95 latency," "group logs by user," with three worked examples per pattern. Set up office hours for LogQL questions and rotate teaching duty through the team. Implement a gradual cutover: week 1, move 10% of logs to Loki-only and measure query performance; weeks 2-4, scale to 50%; week 5, complete the migration. For the Lucene-to-LogQL bridge, store both query syntaxes in runbooks, then deprecate Lucene over 6 months. Train team leads first (2 weeks), then the broader team (4 weeks) before full cutover. Use Grafana's Loki query builder UI for teams uncomfortable with LogQL syntax—it generates LogQL queries visually, which doubles as a teaching tool.
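The phase-2 translator can start very small. A minimal sketch, assuming AND-joined `field:value` terms only and a hypothetical `LABEL_FIELDS` set naming which fields are indexed Loki labels (real Kibana queries need a proper Lucene parser):

```python
import re

# Hypothetical minimal Lucene-to-LogQL transpiler. Fields listed in
# LABEL_FIELDS are assumed to be indexed Loki labels and become stream
# selectors; everything else becomes a post-parse label filter.
LABEL_FIELDS = {"job", "namespace", "level"}  # assumption, not from the source

def lucene_to_logql(query: str) -> str:
    terms = [t.strip() for t in re.split(r"\s+AND\s+", query)]
    selectors, filters = [], []
    for term in terms:
        field, _, value = term.partition(":")
        value = value.strip('"')
        if field in LABEL_FIELDS:
            selectors.append(f'{field}="{value}"')
        else:
            filters.append(f'{field}="{value}"')
    logql = "{" + ", ".join(selectors) + "}"
    if filters:
        logql += " | json | " + " and ".join(filters)
    return logql

print(lucene_to_logql('job:api AND level:error AND user_id:"42"'))
# {job="api", level="error"} | json | user_id="42"
```

Even this toy version covers the bulk of day-to-day Kibana queries, and the translated output is itself a teaching artifact for the cookbook.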

Follow-up: A junior dev writes a LogQL query that scans all 2TB of logs for a simple field match, taking 5 minutes. Their Kibana equivalent took 5 seconds. What's the root cause, and how would you teach query optimization?

Your Loki cluster stores 2TB of logs daily. After 30 days, your index grows to 60TB, consuming most of your storage budget. You need a data tiering strategy: hot data (last 7 days) must be queryable in <1 second, warm data (7-30 days) can tolerate 10-second queries, cold data (30+ days) should be archived to S3. Design a log retention and tiering architecture.

Implement a three-tier storage strategy: (1) Hot tier (local SSD, Loki instances)—store the last 7 days of logs on high-speed storage; set a retention policy in Loki to delete logs older than 7 days from this tier. (2) Warm tier (S3)—use Loki's schema/storage configuration so chunks older than 7 days live only in S3 (cheaper storage), with retention enforced by the Loki compactor (the older table manager is deprecated). Queries against warm data are slower but tolerable. (3) Cold tier (S3 archive)—logs older than 30 days are compressed and moved to S3 Glacier for long-term retention; S3 lifecycle policies handle the archival automatically. For querying cold data, use a search service (Athena, Spark) that scans S3 asynchronously; results are available via a separate query interface. Enforce a retention policy of minimum 30 days for compliance and maximum 1 year for cost control (customers can't exceed this). Monitor storage usage via Prometheus metrics; auto-alert if the hot tier exceeds 80% capacity. For emergency deep-dives, provide a "cold log search" API that queries S3, returning results within 5 minutes but with higher latency. Document the tiering strategy in a runbook; train the team on when to use each tier for investigations.
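The tier boundaries above can be encoded as a small routing function so tooling (and the runbook) agree on which tier serves a given time range. A sketch with the 7/30/365-day thresholds from the answer; tier names and latency notes are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Tier boundaries mirror the answer: 7 days hot, 30 days warm, 1 year cold.
TIERS = [
    ("hot",  timedelta(days=7),   "local SSD, <1s"),
    ("warm", timedelta(days=30),  "S3 via Loki, ~10s"),
    ("cold", timedelta(days=365), "S3 Glacier via async search, minutes"),
]

def tier_for(query_start: datetime, now: datetime) -> str:
    """Return which storage tier holds data starting at query_start."""
    age = now - query_start
    for name, max_age, _desc in TIERS:
        if age <= max_age:
            return name
    raise ValueError("beyond 1-year retention; data has been deleted")

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(tier_for(now - timedelta(days=3), now))    # hot
print(tier_for(now - timedelta(days=20), now))   # warm
print(tier_for(now - timedelta(days=90), now))   # cold
```

A query UI can call this before executing to warn users about expected latency instead of silently blocking for minutes.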

Follow-up: A compliance audit requires you to retain logs for 7 years. Your cold archive tier is set to delete after 1 year. How would you reconfigure tiering without exploding storage costs?

A team is running a LogQL query: `{job="api"} | json | response_time > 1000`. The query takes 30 seconds. You need it <5 seconds for real-time debugging. The Loki instance is healthy (no CPU/memory issues). Walk through your debugging approach to identify the bottleneck and optimize.

Implement systematic query optimization: (1) Profile the query—enable Loki query statistics/logging to see the per-stage latency breakdown. Check whether latency is in label matching, line filtering, JSON parsing, or metric computation. (2) Add stream matchers—the selector `{job="api"}` matches every stream with that label; if there are 100K such streams, Loki scans them all. Add more specific matchers, e.g. `{job="api", environment="prod", region="us-east"}`, to narrow the stream set. (3) Filter before parsing—a cheap line filter discards non-matching lines before the expensive JSON parse. Rewrite to `{job="api"} |= "response_time" | json | response_time > 1000`; the label filter `response_time > 1000` can only run after `| json` extracts the field, so the substring filter is what saves work. (4) Consider pre-aggregation—if the query feeds daily statistics, use a Loki recording rule to pre-compute the metric instead of querying raw logs. (5) Use labels over line filters—if response_time is a structured field, add it as a label at ingest (Promtail relabeling), enabling faster filtering. (6) Cache repeated queries—for dashboards querying the same range repeatedly, enable query result caching. (7) Narrow exploratory queries—shrink the time range or use the query API's limit parameter to return a subset faster (LogQL has no built-in sampling stage). Profile each optimization's impact; document findings in the team runbook. Set performance budgets (e.g., "real-time queries must be <5s") and alert on violations.
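Pushing a cheap line filter ahead of `| json` is usually the highest-leverage rewrite, and it is easy to flag mechanically. A toy linter sketch, using plain string matching rather than a real LogQL parser:

```python
# Flags LogQL queries that parse with `| json` before any line filter:
# a `|=` or `|~` substring filter placed first cuts the volume the JSON
# parser sees. Illustrative only; a real check would parse the query.
def warn_parse_before_filter(query: str) -> "str | None":
    json_pos = query.find("| json")
    if json_pos == -1:
        return None  # no parser stage, nothing to reorder
    filter_pos = min(
        (p for p in (query.find("|="), query.find("|~")) if p != -1),
        default=-1,
    )
    if filter_pos == -1 or filter_pos > json_pos:
        return "consider a line filter (|= / |~) before `| json`"
    return None

slow = '{job="api"} | json | response_time > 1000'
fast = '{job="api"} |= "response_time" | json | response_time > 1000'
print(warn_parse_before_filter(slow))  # prints the warning
print(warn_parse_before_filter(fast))  # None
```

Wired into CI for dashboard JSON or a pre-merge check on recorded queries, this catches the pattern before it reaches production.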

Follow-up: You added a label for response_time at ingest, reducing query time from 30s to 5s. But now all logs have inflated label cardinality (1M unique response_time values). What's the cardinality impact, and would you revert the change?

Your Loki cluster runs 4 instances with no replication. An instance crashes unexpectedly, and you lose all logs on that instance—30 hours' worth. Your team had an incident during those 30 hours and can no longer reconstruct the root cause. Design a high-availability architecture that prevents data loss and enables incident post-mortems.

Implement HA for Loki using: (1) Replication—configure a replication factor of 3 so each log entry is written to 3 Loki ingesters. If one instance fails, replicas prevent data loss. (2) Distributed storage backend—instead of local disk, use object storage (S3, MinIO) for Loki's index and chunks. All instances read/write the same shared storage; if one crashes, others access the same data. (3) Read replicas—set up read-only querier instances that read shared storage but don't accept writes. During an incident, they absorb query load. (4) WAL (Write-Ahead Log)—enable the ingester WAL so accepted log entries are persisted to local disk before acknowledgment and replayed after a crash, covering data not yet flushed to object storage. (5) Backup strategy—take periodic snapshots of Loki's index and chunks to a separate S3 bucket, enabling recovery if all instances fail simultaneously (rare but catastrophic). For incident post-mortems, retain logs for a minimum of 30 days; archive older logs but keep them searchable for 1 year. Implement a "time-travel" workflow allowing teams to query logs from past incidents. Monitor replication health; alert if any replica is behind or unavailable. Test failover procedures monthly: kill an instance, verify data is accessible from replicas, measure recovery time.
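Replication need not serialize writes: the distributor can send to all replicas in parallel and acknowledge once a quorum has persisted the batch. A sketch of that pattern, where `send` stands in for the real replica write RPC (all names here are illustrative, not Loki internals):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Quorum replication sketch: write to all replicas concurrently, ack when
# `quorum` of them confirm durability. One slow or dead replica neither
# loses data nor adds its latency to the write path.
def replicate(batch, replicas, send, quorum=2):
    acks = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(send, r, batch) for r in replicas]
        for fut in as_completed(futures):
            if fut.result():  # replica confirmed a durable write
                acks += 1
                if acks >= quorum:
                    return True
    return False  # quorum not reached: reject the write, client retries

# Simulate replica "c" being down; "a" and "b" still form a quorum.
ok = replicate(["log line"], ["a", "b", "c"],
               send=lambda replica, batch: replica != "c")
print(ok)  # True
```

With 2-of-3 quorum writes, latency tracks the second-fastest replica rather than the sum of all three, which is the standard answer to the follow-up's sequential-write bottleneck.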

Follow-up: Replication adds 3x ingestion latency (write to all 3 instances sequentially). Your log ingestion pipeline is now a bottleneck. How would you speed up replication without sacrificing durability?

Your microservices generate 2TB of logs/day. Most are debug logs with low signal. You're paying for storage/compute of logs that nobody ever queries. Design a cost-optimization strategy that reduces log volume without losing critical data for debugging.

Implement multi-level log filtering and sampling: (1) At source—use Promtail pipeline stages to drop low-value logs (debug logs outside business hours, health-check logs, repeated errors beyond 10 occurrences/minute). Configure via policy YAML, adjustable per service. (2) Sampling by severity—ingest 100% of ERROR/FATAL logs, 50% of WARN, 10% of INFO, 1% of DEBUG, using Promtail's drop/sampling stages to shed logs at ingest based on level. (3) Time-based compression—during off-peak hours, sample more aggressively (1% INFO, 0.1% DEBUG) since queries are rare. (4) Cardinality limiting—if a field has high cardinality (user_id with millions of values), hash or truncate it at ingest to reduce index size. (5) Log aggregation—replace runs of duplicate logs with counters: instead of 1000 "connection timeout" logs, store one log with "count=1000". (6) Dynamic filtering—monitor query patterns; if a log pattern goes unqueried for 7 days, reduce its sampling to 0.1% and warn the owning team that it will be dropped entirely unless they use it. Implement cost observability: per-service ingestion cost, cost per query, estimated savings from each filter. Store filtering policies in Git; run quarterly reviews to adjust sampling. Document the tradeoff: fewer logs mean cheaper storage and faster queries, but less comprehensive post-mortem data. For critical services, maintain higher sampling; for experimental services, sample aggressively.
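The severity-based scheme in (2) can be made deterministic so that every agent replica makes the same keep/drop decision for the same line. A sketch, with the keep-rates from the answer; in practice this logic would live in a Promtail pipeline stage, not application code:

```python
import hashlib

# Keep-rates per level, as proposed above. ERROR/FATAL are never sampled.
KEEP_RATE = {"ERROR": 1.0, "FATAL": 1.0, "WARN": 0.5, "INFO": 0.10, "DEBUG": 0.01}

def keep(level: str, line: str) -> bool:
    rate = KEEP_RATE.get(level, 1.0)  # unknown levels: keep everything
    if rate >= 1.0:
        return True
    # Hash the line to a stable value in [0, 1): deterministic sampling,
    # so repeated runs and replicas agree on which lines survive.
    digest = hashlib.sha256(line.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

print(keep("ERROR", "db connection refused"))  # True, errors always kept
kept = sum(keep("DEBUG", f"cache miss {i}") for i in range(100_000))
print(kept)  # roughly 1% of 100,000
```

Hash-based sampling also keeps all occurrences of an identical line together (all kept or all dropped), which makes the surviving sample easier to reason about than random sampling.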

Follow-up: You implemented aggressive sampling. During an incident, the exact error logs you need were sampled out (0.1% sampling = only 1 in 1000 errors logged). The incident team is stuck. How do you balance cost and debuggability?

Your compliance team requires all logs to be searchable for forensic investigations. You're storing logs in Loki, but Loki doesn't provide structured indexing for full-text search across all fields. When an auditor asks "find all logs mentioning 'credit_card' in the last 6 months," your query times out. Design a log compliance and auditability system.

Implement a dual-store architecture: (1) Loki (operational)—stores hot logs (recent, queryable), optimized for performance. (2) Elasticsearch or Datadog (compliance)—mirrors all logs for full-text, arbitrary-field search. Use a log processor (Logstash, Filebeat) to duplicate logs to both systems. (3) Structured extraction—at ingest, parse logs and extract sensitive fields (credit_card, ssn, api_key) as structured fields. Tag logs containing sensitive data with a "has_pii" label. (4) Retention policies—in Loki, delete logs after 30 days (cost control); in Elasticsearch, retain for 2 years for compliance. For even older logs, archive to S3 with a separate query interface (Athena). (5) Search APIs—expose two query endpoints: "fast search" (Loki, 30-day window) and "compliance search" (Elasticsearch, 2-year window, slower, audit-logged). (6) Audit logging—log all compliance searches (who searched, what query, what results were returned). (7) PII detection—implement a rule-based PII scanner (regexes for credit card patterns, phone numbers). When PII is detected, redact it in search results for users without compliance clearance. For forensic investigations, provide a secure audit-trail interface with access controls and approval workflows. Implement compliance reporting: monthly audits of searches, data classification, retention compliance. Train the team on sensitive-data handling; document in runbooks.
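For the PII scanner in (7), raw regexes over digit runs produce many false positives; pairing the regex with a Luhn checksum filters out most non-card numbers. A minimal sketch (the pattern is illustrative, not an exhaustive compliance ruleset):

```python
import re

# Candidate card numbers: 13-16 digits, optionally space/dash separated.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: doubles every second digit from the right."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_cards(line: str) -> "list[str]":
    hits = []
    for m in CARD_RE.finditer(line):
        digits = re.sub(r"\D", "", m.group())
        if 13 <= len(digits) <= 16 and luhn_ok(digits):
            hits.append(digits)
    return hits

line = "payment failed for card 4532015112830366, order 1234567890123456"
print(find_cards(line))  # ['4532015112830366']  (order id fails Luhn)
```

The same hook that finds a match can tag the log with the "has_pii" label and redact the digits before the line is stored or returned to unprivileged users.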

Follow-up: A developer accidentally searched the compliance log index for PII, exposing customer credit cards. Your audit logging didn't prevent access. How would you implement least-privilege access to sensitive logs?

Your team builds LogQL dashboards and alerts based on logs. During deployment, you change the log format slightly (rename field from "error_msg" to "error_message"). All downstream queries break silently—dashboards show no data, alerts don't fire. How would you prevent schema drift and catch incompatible log format changes?

Implement a log schema versioning system: (1) Define log schemas in code—specify expected fields, types, and value ranges for each log source in a schema file (JSON Schema or Avro). Store schemas in Git, versioned. (2) Schema validation at ingest—when Promtail receives logs, validate against the schema. If a log doesn't match (e.g., a missing field), tag it with a "schema_version_mismatch" label and route it to an error stream for investigation. (3) Schema evolution policies—define breaking vs. non-breaking changes. Non-breaking: adding a new optional field, relaxing a value constraint. Breaking: removing or renaming a field, changing a field's type. Document the policy in Git. (4) Schema registry—store all active schemas in a Git registry. Before deployment, run a schema compatibility check: "is the new schema backwards-compatible with queries that depend on the old schema?" If breaking changes are detected, require explicit approval. (5) Query validation—for LogQL queries referencing fields, validate that the field exists in the active schema at query time. Alert if a query references a deprecated field. (6) Gradual rollout—when deploying a log format change, emit both old and new field names (e.g., both "error_msg" and "error_message") for 2 weeks, giving downstream queries time to migrate. (7) Dashboard versioning—version all LogQL queries used in dashboards; when the schema changes, auto-migrate queries where possible, or alert when manual updates are needed. Implement a "schema impact analysis" report showing which dashboards/alerts a schema change would affect.
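The ingest-time check in (2) is straightforward: declare required fields and types per source, and return violations rather than silently dropping lines. A sketch, with a made-up in-memory registry (a real one would be loaded from the Git-versioned schema files):

```python
# Hypothetical schema registry: required fields and Python types per source.
SCHEMAS = {
    "api": {
        "version": 2,
        "required": {"timestamp": str, "level": str, "error_message": str},
    },
}

def validate(source: str, record: dict) -> "list[str]":
    """Return a list of violations; an empty list means the record conforms."""
    schema = SCHEMAS[source]
    problems = []
    for field, ftype in schema["required"].items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"{field}: expected {ftype.__name__}, "
                            f"got {type(record[field]).__name__}")
    return problems

good = {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR",
        "error_message": "timeout"}
bad = {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR",
       "error_msg": "timeout"}  # the renamed field from the question
print(validate("api", good))  # []
print(validate("api", bad))   # ['missing field: error_message']
```

Note the type check also catches the follow-up's null case: a JSON null arrives as None, fails `isinstance(None, str)`, and is routed to the error stream instead of silently blanking dashboards.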

Follow-up: Your schema says "error_message" is a string, but in production, some errors have error_message = null (JSON null). Your dashboards assume non-null strings. How would you handle schema violations in production data?

You're debugging a production issue. You need logs from 6 months ago, but your Loki instance only stores 30 days hot + 60 days in warm tier (S3). Querying S3 via Loki takes 5 minutes. You need results in <30 seconds. Design an efficient cold log querying system that doesn't break the bank on compute.

Implement tiered log querying with progressive disclosure: (1) Hot query (Loki, <1s)—try query against hot tier first. If data exists, return immediately. (2) Warm query (S3 + Loki, 10-30s)—if hot misses, query warm tier. Loki's S3 backend provides indexed lookup, faster than full scan. (3) Cold query (Athena/Spark, 1-5min)—if warm misses, query archived S3 data via Athena. Use Athena's partition pruning to scan only relevant date ranges. Return results with a note about latency. (4) Asynchronous cold search—allow users to submit cold queries and poll for results later, rather than blocking. (5) Indexed snapshots—for frequently queried date ranges (e.g., "major incident on March 15"), create indexed snapshots stored separately, enabling faster cold queries. (6) Query hints—when a user specifies a date range, pre-compute which tier contains the data and estimate query time. For "logs from 6 months ago," estimate 3-5 minutes and ask if they want to proceed. (7) Cost modeling—show cost of cold query to user before executing (Athena charges per GB scanned). For very large scans, suggest alternative: "bulk export to S3 for offline analysis ($X vs $Y for interactive query)." Implement query caching aggressively: if the same query is run multiple times, cache results in S3, serving from cache if request is identical. Document cold query capabilities in runbook; train team on query patterns, date ranges, and cost implications.
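The cost model in (7) is simple arithmetic worth encoding once so every tool shows the same estimate. A sketch assuming Athena's roughly $5/TB-scanned pricing and ~10:1 compression of the 2TB/day raw volume (both figures are assumptions to verify against your own bill):

```python
# Back-of-envelope Athena cost for a cold query that, thanks to date
# partition pruning, scans only the partitions for the requested days.
ATHENA_USD_PER_TB = 5.0        # assumed list price per TB scanned
DAILY_COMPRESSED_GB = 200      # 2TB/day raw at an assumed ~10:1 compression

def cold_query_cost(days_scanned: int) -> float:
    scanned_tb = days_scanned * DAILY_COMPRESSED_GB / 1024
    return scanned_tb * ATHENA_USD_PER_TB

print(f"1 day:    ${cold_query_cost(1):.2f}")    # $0.98
print(f"180 days: ${cold_query_cost(180):.2f}")  # $175.78
```

Surfacing this number before execution, as the answer suggests, lets users decide between an interactive scan and a one-time bulk export for offline analysis.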

Follow-up: Your Athena cold queries are scanning 500GB (estimated cost $2.50 per query). A user wants to query logs from every day over 6 months. At $2.50/query = $375+ total cost. Is there a better approach?
