Grafana Interview Questions

Dashboard-as-Code and Provisioning


You're managing 200 dashboards across 15 microservices. The team manually creates/updates dashboards in the UI, causing inconsistencies and lost work when Grafana instances restart. You need to version-control all dashboards and deploy them reproducibly across dev/staging/prod. Walk through your Dashboard-as-Code strategy—how would you structure provisioning files, handle secrets in queries, and ensure dashboard drift doesn't happen?

Implement a three-tier approach: (1) Use Grafana's file-based provisioning system—YAML provider configs stored in Git alongside the dashboard JSON they load—with per-environment variable injection. (2) Structure directories by team/service with a shared library of reusable row panels and query templates. (3) Automate provisioning via CI/CD—generate dashboard JSON from templates, validate the schema, and deploy via Grafana's dashboard HTTP API (POST /api/dashboards/db) or terraform-provider-grafana. Use environment-specific values files (dev.yml, prod.yml) for datasource references, alert thresholds, and retention policies. Implement a GitOps workflow where dashboard PRs auto-deploy to staging and require manual approval for production. Track dashboard versions via Git tags, enabling rollback to known-good configs. Keep sensitive query parameters (DB passwords, API keys) in Grafana's datasource configuration, never in dashboard JSON—reference them via variables or interpolation.
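A minimal file-based provider config for step (1) might look like the following; folder names and paths are illustrative:

```yaml
# provisioning/dashboards/teams.yml — file-based dashboard provider
apiVersion: 1
providers:
  - name: team-dashboards
    folder: Services
    type: file
    disableDeletion: true        # provisioned dashboards cannot be deleted from the UI
    updateIntervalSeconds: 30    # how often Grafana rescans the path
    allowUiUpdates: false        # force all changes through Git
    options:
      path: /etc/grafana/dashboards
      foldersFromFilesStructure: true
```

With `allowUiUpdates: false`, the UI becomes read-only for these dashboards, which is what closes the drift loop described above.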

Follow-up: You notice dashboard provisioning fails on some instances but not others. Logs show "invalid JSON schema." How would you diagnose dashboard definition inconsistencies across environments, and what automated validation would you add to your CI/CD pipeline to catch these before deployment?

Your organization has 50 team-specific dashboards that share 80% of panels but have slight variations in thresholds, queries, and drill-down targets. Manually maintaining these is a nightmare. Design a templating system that lets teams override specific panels without duplicating the entire dashboard definition.

Build a meta-templating layer using Jsonnet or Helm for dashboard generation. Define base dashboard skeletons in Jsonnet with parameterized panels, variables, and alert rules. Each team gets a small configuration file specifying overrides: custom thresholds, datasource bindings, and panel queries. The build pipeline uses Jsonnet to merge base + team config, generating the final dashboard JSON. This approach enables: (1) a single source of truth for common panels, (2) team autonomy for customizations without full dashboard duplication, (3) version-controlled team configs, (4) easy bulk updates (change the base, and all derivatives auto-update). Use Grafana's dashboard provisioning API to deploy the generated JSONs. Implement a dashboard lint step that validates that panel queries reference valid datasources, variable definitions are consistent, and thresholds fall within sane ranges.
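Assuming a hypothetical base.libsonnet that exposes panels by name (the build step would flatten them into Grafana's panel array), a team override file can stay tiny:

```jsonnet
// payments-checkout.jsonnet — team config merged over the shared base.
// base.libsonnet, panelsByName, and the threshold shape are illustrative.
local base = import 'base.libsonnet';

base {
  title: 'payments / checkout',
  panelsByName+: {
    // Override only the latency panel's threshold; every other panel inherits.
    latency+: {
      thresholds: { warning: 0.100, critical: 0.250 },
    },
  },
}
```

Jsonnet's `+` merge operator is what makes this work: the team file states only the delta, and a change to base.libsonnet propagates to all 50 derivatives on the next build.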

Follow-up: A team wants their custom panel to override a query threshold without rebuilding. How would you allow runtime overrides without breaking the GitOps model, and what are the tradeoffs vs. regenerating and redeploying?

During a prod incident, a junior developer accidentally modifies a critical dashboard in the UI, deleting key alert rules and panels. The changes go unnoticed for 2 hours. After restoring from version control, you want to prevent this from happening again. Design a multi-layer solution that balances security, usability, and emergency flexibility.

Implement a three-layer defense: (1) RBAC with role-based dashboard edit permissions—only certain roles can modify prod dashboards, with read-only for junior devs. (2) Audit logging—store all dashboard changes (via API or admin panel) in an external log system (Loki, CloudWatch), capturing who changed what, when, and previous values. (3) Automated drift detection—run a periodic job comparing deployed dashboard JSON against Git, alerting on mismatches and auto-reverting unauthorized changes. Add a manual-approval gate for prod dashboard edits via a dashboard admin interface. For emergencies, implement an emergency-mode flag allowing bypassing RBAC with audit trail capture and automatic incident ticket creation. Version all dashboards automatically in Git after changes, enabling fast rollback.
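The drift-detection job in layer (3) reduces to a normalized comparison of deployed JSON against the Git copy. A minimal Python sketch, where the volatile-field list is an assumption that would need tuning to your Grafana version:

```python
import json

# Fields Grafana mutates on every save; treated as noise for drift comparison.
# This list is an assumption — extend it for your schema version.
VOLATILE_FIELDS = {"id", "version", "iteration"}

def normalize(dashboard: dict) -> str:
    """Drop volatile fields and serialize with sorted keys for a stable comparison."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_FIELDS}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))

def has_drifted(deployed: dict, git_source: dict) -> bool:
    """True when the deployed dashboard differs from the Git source of truth."""
    return normalize(deployed) != normalize(git_source)
```

A cron job would fetch each dashboard via the HTTP API, call `has_drifted`, and alert (or auto-revert) on any `True`.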

Follow-up: Your audit log shows someone modified a dashboard at 2 AM. Your RBAC says they shouldn't have access. What systematic approach would you take to investigate, and how would you prevent this vector (compromised credentials, backdoor)?

You're deploying dashboards across Grafana instances in 5 different cloud regions and on-prem. Each environment has different datasources, query authentication methods, and alert routing targets. Provisioning is failing because queries hardcode datasource UUIDs and API endpoints. Redesign the provisioning system for multi-cloud/multi-region flexibility.

Decouple datasources from dashboard definitions using logical datasource names (e.g., "prod-prometheus", "staging-loki") stored as Grafana variables. In provisioning YAML, reference datasources by name, not UUID. Use environment-specific provisioning configuration files that map logical names to actual datasources in each region. Structure: dashboards/base/ (environment-agnostic definitions), provisioning/prod-us-east/ (region-specific mappings), provisioning/on-prem/ (on-prem overrides). Implement a provisioning pre-processor that interpolates environment variables (DATASOURCE_PROMETHEUS, ALERT_WEBHOOK_URL) before deploying dashboards. For queries, use Grafana's templating syntax ($datasource variable) to dynamically bind at runtime. Store region-specific authentication (API keys, certificates) in Grafana's secure storage, referenced by environment labels. Build a multi-environment test suite that validates dashboard queries work against mock datasources in each region before production deployment.
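The provisioning pre-processor mentioned above is a straightforward placeholder substitution. This sketch assumes `${VAR}`-style placeholders in the provisioning files and fails loudly on anything unmapped:

```python
import re

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def interpolate(text: str, env: dict) -> str:
    """Replace ${VAR} placeholders with values from env; raise on missing vars
    so a misconfigured region fails at deploy time, not silently at query time."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"missing environment variable: {name}")
        return env[name]
    return PLACEHOLDER.sub(substitute, text)
```

Running this over `provisioning/prod-us-east/` with that region's variable map produces concrete configs while the Git copies stay environment-agnostic.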

Follow-up: A query works in staging but fails in prod with "authentication denied." The datasource configs look identical. How would you debug this, and what systematic differences might exist between environments?

Your dashboard provisioning system works well, but the feedback loop is slow: changes to a dashboard take 30 minutes to deploy through CI/CD. Developers want faster iterations. However, you can't allow direct UI edits because of the earlier drift problem. Design a hybrid workflow that enables fast local development while maintaining GitOps.

Implement a local development mode using Docker Compose with a Grafana instance that syncs provisioned dashboards to a local Git branch. Developers clone the repo, run docker-compose up, modify dashboards in the local Grafana UI, and the provisioning system auto-exports changes to JSON files and commits them to their branch. This provides instant feedback while keeping GitOps as the source of truth. Set up a pre-commit hook that validates dashboard JSON schema and runs query linting. After local testing, developers push to a feature branch, triggering a CI/CD pipeline that deploys to a short-lived staging environment for integration testing. Fast-track hotfixes: create an emergency deployment process (manual approval only) that bypasses standard CI for critical dashboard fixes, but with mandatory audit logging and an automatic rollback timer. Use Grafana's HTTP API to programmatically export dashboards from local dev instances back to JSON files, automating the sync.
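A local setup along these lines can be a single Compose file; the image tag and mount paths below are illustrative:

```yaml
# docker-compose.yml — throwaway local Grafana for dashboard development
services:
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin   # acceptable for local dev only
    volumes:
      # Same provisioning tree the real environments use, mounted read-only
      - ./provisioning:/etc/grafana/provisioning:ro
      - ./dashboards:/var/lib/grafana/dashboards
```

Because the container mounts the same provisioning tree as production, what a developer sees locally is what ships, minus the environment-specific variable values.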

Follow-up: Your local sync is exporting dashboard changes, but it's committing unrelated whitespace and field reordering, making diffs noisy. How would you normalize dashboard JSON to keep Git history clean?

You're onboarding a new team with 0 Grafana experience. They need their own dashboards but shouldn't have to learn provisioning YAML syntax. Design a self-service onboarding experience that reduces friction while maintaining your Dashboard-as-Code standards.

Build a dashboard scaffold/generator tool accessible via web form or CLI. The tool collects minimal inputs (team name, service name, metrics to monitor, alert thresholds, datasource type) and generates a complete dashboard definition, provisioning config, and alerting rules using templates. Output is a Git PR with the new dashboard that the team can immediately review and customize. Pair this with a simple style guide (dashboard naming conventions, panel organization, variable naming) and a library of reusable JSON snippets for common patterns (latency percentiles, error rates, database connections). Host example dashboards in the repo with extensive comments. Provide a "dashboard diff" tool that shows what changed when they update their dashboard, making it easy to review modifications. For teams struggling, offer weekly office hours or a Slack bot that answers provisioning questions. Store team dashboard ownership in a CODEOWNERS file, automating PR routing.
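The scaffold step is template expansion over a handful of inputs. A Python sketch, assuming Prometheus-style metric names as input and a deliberately simplified dashboard shape:

```python
def scaffold_dashboard(team: str, service: str, metrics: list[str]) -> dict:
    """Generate a minimal dashboard definition from scaffold-form inputs.
    The panel shape is simplified; a real generator would emit full panel JSON."""
    return {
        "uid": f"{team}-{service}",
        "title": f"{team} / {service}",
        "tags": [team, service, "generated"],
        "panels": [
            {
                "id": i + 1,
                "title": metric,
                "type": "timeseries",
                # ${DS_PROMETHEUS} is resolved per environment at provisioning time
                "targets": [{"expr": metric, "datasource": "${DS_PROMETHEUS}"}],
            }
            for i, metric in enumerate(metrics)
        ],
        "schemaVersion": 39,
    }
```

The tool would serialize this to JSON, write the matching provisioning entry, and open the PR, so the team's first contact with Dashboard-as-Code is a review, not a YAML tutorial.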

Follow-up: A team's generated dashboard has incorrect queries that are silently failing (returning no data). Your generator validated them at creation time, so these queries should have been caught. What validation gaps exist, and how would you close them in the generator?

Your Grafana provisioning system relies on Terraform for infrastructure. The TF state file drifts from actual Grafana state when team members manually edit dashboards via the UI. The terraform import workflow is cumbersome for 200 dashboards. Design a sync mechanism that keeps Terraform state accurate without blocking manual edits or requiring frequent manual imports.

Implement bidirectional sync: (1) Detect manual changes via webhooks or periodic polling of Grafana's dashboard versions API. When a dashboard is manually edited, automatically generate a Terraform resource definition (terraform-provider-grafana's grafana_dashboard resource) and commit it to a Git branch, triggering a PR for approval. (2) Conversely, when Terraform applies dashboard changes, the provisioning system validates that the deployed state matches the TF configuration. (3) Run terraform refresh as a scheduled job to re-sync state from Grafana's actual state, capturing any drift. Store TF code in a separate directory (infrastructure/grafana/) with auto-generated resources commented as "managed by sync script—do not edit." Implement a "claim" mechanism: when an unmanaged dashboard is discovered, the sync tool generates a terraform import command and captures it in CI/CD output. Use the dashboard's version field for optimistic locking (Grafana's save API rejects writes against a stale version), alerting on conflicts instead of blindly overwriting.
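An auto-generated resource might look like the sketch below; the resource and folder names are hypothetical, and argument names vary across terraform-provider-grafana versions, so check the provider docs for yours:

```hcl
# infrastructure/grafana/payments_checkout.tf
# Managed by sync script — do not edit by hand.
resource "grafana_dashboard" "payments_checkout" {
  config_json = file("${path.module}/dashboards/payments-checkout.json")
  folder      = grafana_folder.services.id
}
```

Keeping `config_json` as a `file()` reference means the dashboard JSON stays reviewable in its own right, and the TF resource is just a thin pointer the sync script can regenerate safely.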

Follow-up: A team edits a dashboard in UI, triggering Terraform to auto-import it, then a developer applies a Terraform change 30 seconds later. A race condition creates conflicting versions. How would you handle concurrent edit conflicts, and what guarantees should you make?

You're auditing dashboard provisioning security for compliance (SOC2, HIPAA). Your dashboards reference encrypted database credentials in query strings (hardcoded base64-encoded secrets). You need to eliminate secrets from dashboard JSON while maintaining backwards compatibility with dashboards that still have them.

Audit all dashboards for embedded secrets using a GitOps secret scanner (TruffleHog, detect-secrets). For each dashboard containing secrets: (1) Extract the secret, store it securely in Grafana's secure storage (AWS Secrets Manager, HashiCorp Vault), and update the datasource configuration to reference it. (2) Update dashboard queries to use Grafana's datasource variable interpolation instead of hardcoded credentials. (3) Create a provisioning validation step that rejects any dashboard JSON containing secret patterns (base64 encodings, API keys, database passwords). Implement a one-time migration job that scans all provisioned dashboards, extracts secrets, stores them, and auto-updates dashboard JSON. For backwards compatibility during transition, use a compatibility layer in the provisioning system that detects hardcoded secrets, warns in logs, and triggers an automatic remediation workflow. Document the deprecation and set a sunset date for raw secrets in dashboards. Use Grafana's audit logs to track all secret access, enabling forensic investigation if compromised.
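The validation step in (3) can start with a handful of regex patterns. These are illustrative only; a dedicated scanner such as detect-secrets or TruffleHog covers far more cases and should back this up:

```python
import re

# Illustrative patterns only — a real deployment should delegate to a
# dedicated secret scanner rather than maintain this list by hand.
SECRET_PATTERNS = [
    re.compile(r"password\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"api[_-]?key\s*[:=]\s*\S+", re.IGNORECASE),
    re.compile(r"[A-Za-z0-9+/]{40,}={0,2}"),  # long base64-like blobs
]

def contains_secret(dashboard_json: str) -> bool:
    """True when the raw dashboard JSON matches any known secret pattern;
    wire this into CI so offending dashboards fail before provisioning."""
    return any(pattern.search(dashboard_json) for pattern in SECRET_PATTERNS)
```

Wired into the provisioning pipeline as a hard gate, this turns the compliance requirement into a failing build rather than an audit finding.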

Follow-up: Your secret scanner found 47 dashboards with embedded secrets. You need to remediate them in prod without causing downtime. What's your sequencing strategy, and how would you validate the migration didn't break queries?
