Grafana Interview Questions

Migration from Kibana or Datadog


Your company is paying $500K/year for Datadog for monitoring (metrics, logs, APM). You want to migrate to Grafana + open-source stack (Prometheus, Loki, Tempo) to reduce costs. But 200 teams depend on Datadog. Switching breaks workflows. Design a migration strategy that minimizes disruption.

Implement a phased Datadog-to-Grafana migration: (1) Parallel ingestion—send metrics and logs to both Datadog and the Grafana stack simultaneously for 4-6 weeks, so teams can explore Grafana while still relying on Datadog. (2) Dashboard migration—recreate high-priority dashboards in Grafana manually; lower-priority dashboards can follow later. (3) Team engagement—involve team leads in the migration and train them on Grafana features. (4) Quick wins—highlight early wins ("Grafana costs 10× less," "queries run 2× faster") to build momentum. (5) Gradual cutover—cut over per team (Payments in week 1, Frontend in week 2, and so on), never all at once. (6) Runbook preparation—document Grafana equivalents: "in Datadog you do X; in Grafana you do Y." (7) Support—establish a Grafana support team that teams can contact and that resolves issues fast. For the cost case, show ROI ("migrating to Grafana saves $400K/year") and use the savings to justify the effort. Build a Datadog-to-Grafana query translator to help teams convert Datadog query syntax to PromQL/LogQL—not fully automatic (the syntaxes differ too much), but a real accelerator. Build a feature-parity checklist (Datadog feature → Grafana equivalent) to surface gaps; for unsupported Datadog features, implement workarounds or escalate. Test end-to-end workflows by simulating a team debugging with Grafana, and measure time-to-insight—it should be comparable to Datadog. Create incentives: early-adopter teams get recognition and become the in-house experts. Measure adoption (% of dashboards in Grafana, % of queries, % of teams migrated) with a target of 100% in 6 months.
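The query translator mentioned above can start as a small pattern matcher: translate the simple, common Datadog queries automatically and flag everything else for manual work. A minimal sketch, assuming Datadog's `agg:metric.name{tag:value}` form and the usual dots-to-underscores metric renaming; anything fancier (functions, arithmetic, wildcard tags) is intentionally left to humans:

```python
import re

# Minimal pattern-based Datadog-to-PromQL translator (sketch).
# Handles only "agg:metric.name{tag:value,...}"; returns None for
# anything it cannot translate, so it can be flagged for manual work.
SIMPLE_QUERY = re.compile(
    r"^(?P<agg>avg|sum|min|max):(?P<metric>[\w.]+)\{(?P<tags>[^}]*)\}$"
)

def translate(dd_query: str):
    m = SIMPLE_QUERY.match(dd_query.strip())
    if not m:
        return None  # flag for manual migration
    metric = m.group("metric").replace(".", "_")  # Datadog dots -> Prometheus underscores
    selectors = []
    for pair in filter(None, (t.strip() for t in m.group("tags").split(","))):
        if pair == "*":
            continue  # "{*}" means "no tag filter" in Datadog
        key, _, value = pair.partition(":")
        selectors.append(f'{key}="{value}"')
    label_sel = "{" + ",".join(selectors) + "}" if selectors else ""
    return f'{m.group("agg")}({metric}{label_sel})'

print(translate("avg:system.cpu.user{env:prod,host:web-01}"))
# -> avg(system_cpu_user{env="prod",host="web-01"})
print(translate("top(avg:x{y:z}, 10)"))
# -> None (unsupported pattern, route to manual migration)
```

Even a translator this small typically covers a large share of real dashboards, and the `None` path doubles as the input to the feature-gap checklist.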

Follow-up: Your migration is progressing (60% of teams on Grafana), but early-adopter teams report: "Grafana doesn't have feature X from Datadog. We need it to be productive." Feature gap is blocking migration. How would you handle missing features?

You're migrating from Kibana to Grafana Loki for log aggregation. Your team has 10K complex Kibana dashboards with Lucene queries. Manually converting each to LogQL will take 6 months. Design an automated migration tool to convert dashboards.

Implement automated dashboard/query migration: (1) Query parser—parse Kibana Lucene queries into an AST, extracting fields, filters, and aggregations. (2) LogQL generator—convert the Lucene AST to LogQL queries, handling the common patterns. (3) Unsupported patterns—if a Lucene pattern has no LogQL equivalent, flag it for manual migration. (4) Dashboard migration—parse the Kibana dashboard JSON; for each panel, convert the query and restructure the panel into Grafana's format. (5) Visualization mapping—map Kibana visualizations to Grafana panels; most map cleanly (table, time series). (6) Metadata preservation—preserve dashboard names, descriptions, and ownership tags. (7) Validation—after conversion, validate queries syntactically and functionally; for a sample, verify the results match Kibana. Build the migration tool as a Python script (or Grafana plugin): teams run it and get a Grafana dashboard JSON export. Make the migration interactive: the tool previews the converted dashboard, and users can adjust mappings before finalizing. For unsupported patterns (~10% of dashboards), the tool emits a manual-migration guide: "this Lucene pattern (X) is not supported in LogQL; the manual approach is Y." Track quality metrics: % of queries auto-converted successfully, % requiring manual work, % validated. Test extensively: convert sample Kibana dashboards and compare Kibana output to Grafana output—they should be equivalent. Document the migration: how to use the tool, common issues, and manual-migration examples.
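The query-conversion core of such a tool can be sketched as follows. This handles only conjunctive `field:value AND field:value` Lucene queries; the set of Loki stream labels is an assumption (in practice it comes from your Loki schema), and everything else (OR, wildcards, ranges) is returned as `None` for manual migration:

```python
import re

# Lucene-to-LogQL conversion sketch: conjunctive "field:value" terms only.
# Fields that exist as Loki stream labels become the stream selector;
# the rest become JSON field filters after a `| json` parse stage.
STREAM_LABELS = {"service", "namespace", "level"}  # assumption: indexed labels
TERM = re.compile(r"^(?P<field>\w+):(?P<value>[\w.-]+)$")

def lucene_to_logql(query: str):
    labels, filters = [], []
    for term in query.split(" AND "):
        m = TERM.match(term.strip())
        if not m:
            return None  # OR / wildcards / ranges: flag for manual migration
        field, value = m.group("field"), m.group("value")
        if field in STREAM_LABELS:
            labels.append(f'{field}="{value}"')
        else:
            filters.append(f'{field}="{value}"')
    if not labels:
        return None  # LogQL requires at least one stream selector
    logql = "{" + ",".join(labels) + "}"
    if filters:
        logql += " | json | " + " and ".join(filters)
    return logql

print(lucene_to_logql("service:payment AND status:500"))
# -> {service="payment"} | json | status="500"
```

The same `None`-means-manual convention feeds the quality metrics: the tool can report exactly which dashboards and which query patterns fell through.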

Follow-up: Your auto-conversion tool converted 8000 of 10K dashboards (80%). The remaining 2000 have complex Lucene queries that don't map cleanly. How would you prioritize and handle the manual 2000?

You're migrating from Datadog to Grafana. Datadog has built-in, AI-powered automatic anomaly detection on metrics. Grafana has no built-in equivalent, and teams used anomaly detection for alerting. Design a solution for anomaly detection in the Grafana stack.

Implement anomaly detection: (1) Statistical anomalies—implement z-score or IQR-based anomaly detection via Prometheus recording rules, computing each metric's mean and standard deviation over time. (2) Baseline comparison—track historical baselines (e.g., a 7-day running average) and alert when the current value deviates by more than 2 standard deviations. (3) Seasonal adjustment—for metrics with seasonal patterns (traffic higher by day, lower at night), adjust the baseline seasonally. (4) Grafana expressions—use Grafana server-side expressions to encode anomaly logic, e.g. alert when metric > baseline × 1.2. (5) External ML—integrate an external ML service (e.g., Moogsoft or Anodot) for advanced anomaly detection, invoked from Grafana. (6) Alerting—combine the anomaly score with traditional alerting: fire when metric > threshold OR anomaly_score > 0.8. (7) Learning—track false positives from anomaly detection and feed them back to improve the model over time. Build an anomaly dashboard showing the baseline, the actual metric, and the anomaly score, with visual anomaly highlighting. Let teams tune sensitivity per metric (conservative, balanced, aggressive). Create anomaly-based alert templates so new services get anomaly alerts automatically, with no manual thresholds to set. Produce a comparison report (Datadog anomalies vs. Grafana anomalies) to measure detection accuracy. Document the anomaly logic so teams understand what counts as an anomaly.
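The z-score logic of steps (1)-(2) can be prototyped outside Prometheus before being translated into recording rules. A minimal sketch, with the 7-day-style rolling baseline reduced to a fixed-size window and all numbers illustrative:

```python
from collections import deque
import statistics

# Rolling z-score anomaly detector (sketch of steps 1-2).
class AnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 2.0):
        self.history = deque(maxlen=window)  # rolling baseline window
        self.threshold = threshold           # sensitivity, in stddevs

    def observe(self, value: float) -> bool:
        """Return True if value deviates > threshold stddevs from baseline."""
        is_anomaly = False
        if len(self.history) >= 10:  # need enough samples for a baseline
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            if stdev > 0 and abs(value - mean) > self.threshold * stdev:
                is_anomaly = True
        if not is_anomaly:
            self.history.append(value)  # keep anomalies out of the baseline
        return is_anomaly

d = AnomalyDetector()
for v in [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101]:
    d.observe(v)          # steady traffic builds the baseline
print(d.observe(100.0))   # in-range value -> False
print(d.observe(250.0))   # spike -> True
```

Note the design choice of excluding flagged values from the baseline: otherwise a sustained incident would quickly "teach" the detector that the anomalous level is normal. Seasonal adjustment (step 3) would replace the single window with one baseline per hour-of-day or day-of-week bucket.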

Follow-up: Your anomaly detection works, but during a legitimate traffic ramp (deployment day), the anomaly detector fires. An expected traffic spike is not an anomaly, and the false positive wastes on-call time. How do you differentiate anomalies from expected changes?

You're migrating from Datadog APM to Grafana Tempo. Datadog APM has service dependency maps (a visual graph showing services and their interactions). Grafana Tempo has no equivalent, and teams relied on dependency maps for debugging. Design service dependency visualization for Grafana.

Implement service dependency visualization: (1) Trace dependency extraction—from Tempo traces, extract service dependencies: if a trace shows API → Database, that's a dependency. (2) Dependency aggregation—across all traces, build a dependency graph: request count per service pair, latency per hop. (3) Real-time graph—build a Grafana panel showing the live dependency graph (nodes = services, edges = requests). (4) Latency visualization—edge thickness or color encodes latency: thick = high latency, thin = low. (5) Error visualization—edge color encodes error rate: green = healthy, red = high errors. (6) Drill-down—clicking an edge shows the traces for that service pair, so specific interactions can be debugged. (7) Export—export the dependency graph as a topology diagram for architecture documentation. Discover dependencies automatically: monitor traces, detect new services and dependencies, and add them to the graph. Monitor dependency health: alert when an unexpected new dependency appears or when latency between services spikes. Detect structural changes by comparing the graph over time (removed services, new dependencies). Build a dependency dashboard showing the current topology, recent changes, and per-service health metrics, with filtering to drill into a single service's dependencies (e.g., Payment). Test the visualization with simulated microservice traces, verify the graph is correct, and compare it to the Datadog dependency map.
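Steps (1)-(2) reduce to a parent/child walk over spans. A sketch under the assumption of a simplified span shape (dicts with `span_id`, `parent_id`, `service`, `duration_ms`; real Tempo spans follow the OpenTelemetry model, but the aggregation is the same):

```python
from collections import defaultdict

# Build an aggregated service dependency graph from a batch of spans.
# An edge (A, B) exists when a span in service B has a parent in service A.
def build_dependency_graph(spans):
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "errors": 0})
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent and parent["service"] != span["service"]:
            e = edges[(parent["service"], span["service"])]
            e["count"] += 1
            e["total_ms"] += span["duration_ms"]
            e["errors"] += span.get("error", False)
    return {pair: {**e, "avg_ms": e["total_ms"] / e["count"]}
            for pair, e in edges.items()}

spans = [
    {"span_id": "a", "parent_id": None, "service": "api", "duration_ms": 120},
    {"span_id": "b", "parent_id": "a", "service": "payment", "duration_ms": 80},
    {"span_id": "c", "parent_id": "b", "service": "db", "duration_ms": 30, "error": True},
]
for (src, dst), stats in build_dependency_graph(spans).items():
    print(f"{src} -> {dst}: {stats['count']} req, "
          f"{stats['avg_ms']:.0f}ms avg, {stats['errors']} errors")
```

The per-edge `count`, `avg_ms`, and `errors` fields map directly onto the edge-thickness, latency-color, and error-color encodings in steps (4)-(5).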

Follow-up: Your dependency graph is useful, but traces from 50 microservices create a dense graph (1000+ edges). It's unreadable. How would you simplify without losing information?

You're migrating from Datadog (single vendor). Grafana stack is best-of-breed (Prometheus, Loki, Tempo, Grafana UI). But now you own operational complexity: 6 components to run, upgrade, maintain. Datadog was simpler (vendor handles it). How do you manage complexity post-migration?

Implement operational simplification: (1) Infrastructure-as-code—define the entire stack (Prometheus, Loki, Tempo, Grafana) in Terraform/Helm so deployment is one command. (2) Observability of observability—monitor the monitoring stack itself: Prometheus scraping Prometheus, Loki ingesting Loki's own logs. Self-monitoring enables quick issue detection. (3) Runbook automation—package common operations (upgrade, scale, troubleshoot) as automated runbooks to reduce manual effort. (4) Monitoring dashboards—per-component health dashboards (CPU, memory, error rate, latency), with alerts on component health. (5) Documentation—comprehensive runbooks for operating the stack, so context survives team turnover. (6) Component consolidation—consolidate where possible, e.g. a single alerting system rather than one per component. (7) Support escalation—establish an internal support team (or hire a vendor for a managed stack) to handle complex issues. Track costs: Datadog at $500K/year vs. the Grafana stack's running costs (infrastructure plus staff time), which should come in under $200K/year. Manage risk by keeping Datadog as a fallback during the transition: run it for the first 6 months of the migration so that if the Grafana stack fails catastrophically, you can fall back without losing monitoring. Build operational excellence: run chaos engineering on the Grafana stack, simulate component failures, verify recovery, and test upgrade procedures monthly. Create a competency framework: train on-call engineers on the stack's architecture, common issues, and troubleshooting.
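One caveat with self-monitoring (step 2): if Prometheus is down, Prometheus cannot tell you so. A common complement is an external watchdog that polls each component's readiness endpoint from outside the stack. A sketch, assuming typical hostnames; the endpoint paths shown (Prometheus `/-/ready`, Loki `/ready`, Grafana `/api/health`) exist in current versions but should be verified against each component's docs:

```python
import json
import urllib.request

# External watchdog for the monitoring stack itself (sketch).
# Hostnames/ports are illustrative placeholders for this example.
COMPONENTS = {
    "prometheus": "http://prometheus:9090/-/ready",
    "loki": "http://loki:3100/ready",
    "grafana": "http://grafana:3000/api/health",
}

def check_stack(components=COMPONENTS, timeout=5):
    """Poll each component's health endpoint; return name -> status string."""
    status = {}
    for name, url in components.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                status[name] = "up" if resp.status == 200 else f"http {resp.status}"
        except Exception as exc:
            status[name] = f"down ({exc.__class__.__name__})"
    return status

if __name__ == "__main__":
    # In production, run this from outside the cluster on a cron/timer
    # and page on-call when anything reports "down".
    print(json.dumps(check_stack(), indent=2))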

Follow-up: Your operational documentation is comprehensive, but only 2 people on the team understand the entire Grafana stack. They're bottlenecks, and if they leave, the team is lost. How do you prevent knowledge silos?

You've successfully migrated 80% of dashboards from Datadog to Grafana. But 20% remain in Datadog: too complex to migrate, or dependent on unsupported features. Now teams have to check both systems. How do you provide a unified view across Datadog and Grafana?

Implement unified multi-vendor observability: (1) Datadog integration—embed Datadog dashboards in Grafana via iframes for a single pane of glass. (2) API bridging—a Grafana panel that pulls from the Datadog API, displaying Datadog metrics inside Grafana. (3) Unified alerting—route Datadog alerts and Grafana alerts into a single alerting system. (4) Correlation—correlate trace IDs between Datadog traces and Grafana traces, linking across systems. (5) Search federation—a search interface that queries both Datadog and Grafana and unifies the results. (6) Metadata normalization—normalize field names between Datadog and Grafana so dashboard queries reference normalized fields. (7) Gradual consolidation—keep migrating the remaining 20%, with a milestone such as "remove Datadog entirely by end of Q3." Build a migration-status dashboard: % of dashboards migrated, top blockers, per-team status. Create a decision document: for each remaining Datadog dashboard, decide whether to migrate it, build a workaround, or keep it in Datadog—and for "keep" cases, document why. Track cost: "migrating dashboard X saves $Y/month," with ROI targets. Run a feedback loop: teams using both systems report gaps ("Grafana is missing feature Z"), which drive feature prioritization. Test the unified experience: have teams use Grafana and Datadog together, measure workflow efficiency, and identify pain points.
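The API bridge in step (2) can be sketched as a small shim that queries Datadog and reshapes the response into the `[value, timestamp]` frames a Grafana JSON-datasource panel can render. This assumes Datadog's v1 metrics query endpoint and its `DD-API-KEY`/`DD-APPLICATION-KEY` headers; verify the endpoint, site URL, and response fields against the Datadog API docs for your account:

```python
import json
import os
import time
import urllib.parse
import urllib.request

DD_SITE = "https://api.datadoghq.com"  # adjust for EU/other Datadog sites

def to_grafana_frames(payload: dict, fallback_name: str) -> list:
    """Reshape a Datadog query response into Grafana JSON-datasource frames."""
    return [
        {"target": s.get("scope") or fallback_name,
         # Datadog pointlist entries are [timestamp, value];
         # Grafana's JSON datasource expects [value, timestamp].
         "datapoints": [[v, int(ts)] for ts, v in s.get("pointlist", [])]}
        for s in payload.get("series", [])
    ]

def fetch_datadog_series(query: str, lookback_s: int = 3600) -> list:
    now = int(time.time())
    params = urllib.parse.urlencode(
        {"from": now - lookback_s, "to": now, "query": query})
    req = urllib.request.Request(
        f"{DD_SITE}/api/v1/query?{params}",
        headers={"DD-API-KEY": os.environ["DD_API_KEY"],
                 "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return to_grafana_frames(json.load(resp), query)
```

Keeping the normalization in a pure function (`to_grafana_frames`) makes the shim testable without hitting the Datadog API, and isolates the part that must change when Datadog's response format does.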

Follow-up: Your unified view across Datadog and Grafana works, but managing the integration code is becoming a burden. Every Datadog API update requires Grafana integration code changes. How would you reduce integration maintenance?

You completed the Datadog-to-Grafana migration. But 6 months later, a new team joins with a fresh Datadog contract (negotiated separately). They resist migrating to the internal Grafana stack, preferring the tool they already know. Now you have two monitoring systems in the org. How do you handle the divergence?

Implement standardization and governance: (1) Monitoring standard—define an org-wide standard: "all teams use Grafana + Prometheus/Loki/Tempo; exceptions require VP approval." (2) Datadog exceptions—document why this team needs Datadog (if justified) and set a sunset date: "Datadog until end of 2026, then migrate." (3) Cost accountability—charge the team for its Datadog usage; making the cost visible incentivizes migration. (4) Training—offer Grafana training to the new team to remove the "unfamiliar" objection. (5) Integration incentives—offer free integration work (migrating dashboards, setting up alerting) to reduce their migration effort. (6) Peer pressure—highlight savings: "team A saved $100K by migrating." (7) Compliance—if the monitoring system must hold a compliance certification (SOC 2, HIPAA), approve only systems that meet the requirement. Implement a chargeback model: each team's monitoring cost is visible and charged to its budget, creating an incentive to optimize and migrate. Build a monitoring-platform team: dedicated engineers who help teams migrate, support the Grafana deployment, and troubleshoot issues, reducing friction. Create a governance checklist: every new tool or vendor goes through an approval process, ensuring consistency. Set and document the long-term vision: "the single source of truth for monitoring is the Grafana stack; all other tools are tactical, temporary, and carry sunset dates." Build stakeholder alignment: the CFO wants cost reduction, the CTO wants simplicity, Security wants compliance; show how the Grafana migration achieves all three.

Follow-up: Your governance is reasonable, but the new team's leadership refuses to comply ("we have budget for Datadog"). They continue using Datadog. Now Grafana team is resentful (building Grafana while others use Datadog). Org has two monitoring cultures. How do you resolve?
