Grafana Interview Questions

Variables and Dynamic Dashboard Patterns


Your team maintains 100 dashboards that show similar patterns but for different services. Each dashboard has ~30 panels, 80% of which are identical queries repeated with different service names. Updating a common query means manually editing 100 dashboards. Design a system for reusable dashboard components and templating.

Implement a component-based dashboard architecture:

(1) Dashboard templating—use Grafana's variable and templating features to create parameterized dashboard templates. A single "Service Monitoring" dashboard template accepts a service_name variable, dynamically loading queries for that service.

(2) Query templates—for common query patterns (latency percentiles, error rates), create Grafana library panels. These are reusable across dashboards; updating a library panel auto-updates all dashboards using it.

(3) Multi-value variables—implement multi-select variables: users can choose multiple services (e.g., "api, web, database") and dashboards show data for all selected services in parallel.

(4) Cascading variables—implement dependent variables: "select environment (prod/staging)" → "select region (us-east/eu)" → "select cluster (cluster-1/cluster-2)". Changing one variable auto-updates dependent variables and queries.

(5) Variable sources—define variables from multiple sources: a static list (choose from ["prod", "staging"]), a query result (run a Prometheus query to get available service names), or an API call (fetch from a service discovery API).

(6) Dashboard provisioning—store dashboard templates in Git. Use a templating engine (Jsonnet, Helm, Kustomize) to generate instance dashboards per service. Deploy via the provisioning API.

(7) Version management—track dashboard component versions. If a query template is updated, flag the dashboards that use it and provide an upgrade path.

Document templating patterns in a runbook with examples. Provide a dashboard generator tool that teams use to create new service dashboards; the tool selects the appropriate template automatically. Implement version pinning: older dashboards can pin to old component versions, preventing unintended changes.
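The variable sources in points (3)–(5) correspond to entries in the dashboard JSON's templating section. A minimal sketch in Python, assuming a simplified dashboard structure; the metric and label names (up, service, environment) are hypothetical:

```python
import json

# Sketch of a dashboard's templating section covering three variable sources:
# a static list, a query-backed variable, and a multi-select.
templating = {
    "list": [
        {   # static list: fixed choices
            "name": "environment",
            "type": "custom",
            "query": "prod,staging",
        },
        {   # query-backed: values come from a Prometheus query at load time;
            # the query filters by the parent $environment variable
            "name": "service_name",
            "type": "query",
            "query": 'label_values(up{environment="$environment"}, service)',
        },
        {   # multi-select: users can pick several services at once
            "name": "services",
            "type": "query",
            "query": "label_values(up, service)",
            "multi": True,
            "includeAll": True,
        },
    ]
}

if __name__ == "__main__":
    print(json.dumps(templating, indent=2))
```

Storing this fragment in Git and splicing it into generated dashboards (point 6) keeps one source of truth for the variable definitions.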

Follow-up: Your library panel is used by 50 dashboards. You update it to fix a query, and 45 dashboards improve, but 5 dashboards break (different query context). How would you have caught this before rolling out?

A user creates a complex dashboard with 20 variables. Setting up the variables correctly requires understanding dependencies and query details. New users find this confusing and create broken dashboards. Design a UX/tooling improvement for variables that reduces errors.

Implement variable management improvements:

(1) Visual variable editor—instead of form-based configuration, show a visual dependency graph of variables. Users can see which variables feed into which queries, making dependencies explicit.

(2) Query suggestions—when editing a query, suggest available variables. Highlight mismatches: "variable $region is selected but this query doesn't use it."

(3) Validation—provide real-time query validation as users configure variables. Run test queries to ensure variables produce valid results. Alert on unused variables, undefined variable references, and circular dependencies.

(4) Variable defaults—for common use cases (select service, select environment), provide pre-built variable templates with sensible defaults. Users insert a template, then customize it.

(5) Variable preview—before applying, show a preview of what queries will return with the selected variable values: "If service = 'api', this query returns X series."

(6) Dropdown sorting—implement smart sorting in variable dropdowns: recently used values first, then most popular, then alphabetical.

(7) Error messaging—when a query using variables fails, show a clear error: "variable $service has value 'invalid-service'. Valid values: [api, web, database]." Guide users to fix it.

Implement a dashboard validation report ("Dashboard has 3 unused variables, 2 undefined variable references") with quick-fix buttons. Set up templates for common patterns (select service from list, select time range, multi-select environments). Create a "Variables 101" tutorial with a step-by-step guide and common gotchas.
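Part of the validation in point (3)—unused variables and undefined references—needs no live datasource at all: it can be computed by linting the dashboard JSON itself. A rough sketch, assuming a simplified dashboard structure and a deliberately incomplete list of Grafana built-in variables:

```python
import re

VAR_RE = re.compile(r"\$\{?(\w+)\}?")                    # matches $var and ${var}
BUILTINS = {"__interval", "__range", "__rate_interval"}  # partial list

def lint_variables(dashboard: dict) -> dict:
    """Report unused variables and undefined variable references."""
    variables = dashboard.get("templating", {}).get("list", [])
    defined = {v["name"] for v in variables}
    used = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            used.update(VAR_RE.findall(target.get("expr", "")))
    for v in variables:  # variables may reference other variables in their queries
        used.update(VAR_RE.findall(str(v.get("query", ""))))
    return {
        "unused": sorted(defined - used),
        "undefined": sorted(used - defined - BUILTINS),
    }

if __name__ == "__main__":
    dashboard = {
        "templating": {"list": [{"name": "service", "query": ""},
                                {"name": "orphan", "query": ""}]},
        "panels": [{"targets": [
            {"expr": 'rate(errors{service="$service", region="$region"}[5m])'}]}],
    }
    # "orphan" is defined but never used; "$region" is used but never defined
    print(lint_variables(dashboard))
```

This kind of static check is also a natural CI gate for dashboards stored in Git.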

Follow-up: Your query validation suggests variables but can't validate their syntax during dashboard creation (datasources might be unavailable). How would you handle offline validation without access to live datasources?

Your dashboards have variables for time range, environment, service, and region. A user selects "prod" environment, expecting the region dropdown to show only regions with prod services. Instead, the dropdown shows all regions ever used. Users then select an invalid combination (prod + region-with-only-staging) and queries return no data. Design a solution for dependent/cascading variables that prevent invalid combinations.

Implement intelligent cascading variables:

(1) Dependency definition—in the dashboard config, define which variables depend on which. Example: region depends on environment; cluster depends on region.

(2) Dynamic query updates—when environment changes, automatically update the region variable's query to filter by the selected environment. Use templating in queries: "show regions where environment = $environment".

(3) Availability filtering—for each variable, compute available values based on the selections of parent variables. Dim/disable unavailable values in dropdowns.

(4) Validation at query time—before running a query, validate that the selected variable combination is valid. If invalid (prod + staging-only-region), alert the user: "This combination is invalid. Available regions for prod: [us-east, us-west]."

(5) Smart defaults—when a user selects a parent variable, auto-select a default child value if available.

(6) Multi-hop dependencies—support complex chains (cluster depends on region, region on environment, environment on deployment) with variables cascading properly through every level.

(7) Bidirectional sync—if a user selects an invalid combination, auto-correct: if a region has only one environment, auto-update the environment selection.

Create a dependency graph visualization showing variable relationships. Test the cascading logic by simulating various selections and verifying all queries return valid results. Document dependency assumptions in the dashboard description: "This dashboard assumes prod services are in us-east and us-west regions." Provide a migration guide for existing dashboards to add cascading validation.
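With a Prometheus datasource, points (1)–(2) are typically expressed by chaining label_values queries so each child filters by its parent, and point (4) reduces to a membership check against the known catalog. A sketch; the catalog contents and the metric name up are invented for illustration:

```python
# Cascading variable definitions: each child query filters by its parent, so
# changing "environment" re-populates "region", and so on down the chain.
CASCADING_VARIABLES = [
    {"name": "environment", "query": "label_values(up, environment)"},
    {"name": "region",
     "query": 'label_values(up{environment="$environment"}, region)'},
    {"name": "cluster",
     "query": 'label_values(up{environment="$environment",region="$region"}, cluster)'},
]

# Toy catalog standing in for what those queries would actually return.
CATALOG = {("prod", "us-east"), ("prod", "us-west"), ("staging", "eu-central")}

def available_regions(environment: str) -> list:
    return sorted(r for e, r in CATALOG if e == environment)

def validate(environment: str, region: str):
    """Return an error message for invalid combinations, or None if valid."""
    if (environment, region) not in CATALOG:
        return (f"Invalid combination: {environment} + {region}. "
                f"Available regions for {environment}: {available_regions(environment)}")
    return None

if __name__ == "__main__":
    print(validate("prod", "eu-central"))  # invalid: eu-central is staging-only
    print(validate("prod", "us-east"))     # valid: prints None
```

The same catalog lookup can drive the dimming/disabling of dropdown entries in point (3).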

Follow-up: Your cascading logic is correct, but adding a new region requires updating 10 cascading variable definitions across 50 dashboards. How would you manage cascading dependencies at scale?

A team creates a dashboard with variables. They share it with another team. The second team imports it but their datasources have different names (their Prometheus instance is "prom-v2" instead of "prom"). All variable queries break because they reference the wrong datasource names. Design a solution for datasource-agnostic dashboard sharing.

Implement datasource abstraction for dashboard portability:

(1) Logical datasource naming—instead of referencing specific datasources by name (e.g., "prom-v2"), use logical names (e.g., "default-prometheus", "logs-aggregation"). Store the mapping between logical names and actual datasource names.

(2) Datasource selection UI—when importing a dashboard, ask: "which datasource should 'default-prometheus' map to?" Teams select from their available datasources. Save the mapping locally.

(3) Query translation—when a dashboard is viewed, translate queries at runtime: replace "${default-prometheus}" with the actual datasource name.

(4) Multi-datasource support—a dashboard might use three datasources (Prometheus, Loki, Elastic). At import time, map each logical datasource to the team's actual datasource.

(5) Validation—before importing, validate that all referenced datasources exist. Alert on missing mappings: "This dashboard requires a 'logs-aggregation' datasource. Teams must provide a mapping."

(6) Sharing registry—maintain a registry of shared dashboards with their logical datasource requirements. When sharing, export the list of required datasources.

(7) Fallback logic—if a datasource mapping is missing, provide a helpful error: "Datasource 'default-prometheus' not found. Available datasources: [prom-1, prom-2]. Which should be used?" Guide the user to reconfigure the mapping.

Implement a datasource audit: verify all dashboards have valid datasource bindings. For public dashboards, support datasource templating so teams can override the datasource per view. Create sharing guidelines: document which datasources dashboards require and provide mapping examples for common environments.
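The mapping in points (2)–(3) and the validation in point (5) can be sketched as a rewrite of datasource placeholders at import time. The ${DS_*} placeholder naming mirrors Grafana's export-for-sharing convention, but the helper itself is hypothetical:

```python
import json
import re

def bind_datasources(exported: dict, mapping: dict) -> dict:
    """Resolve logical datasource placeholders like ${DS_PROMETHEUS} against a
    team-provided mapping, failing loudly on anything unmapped."""
    text = json.dumps(exported)
    placeholders = set(re.findall(r"\$\{(DS_\w+)\}", text))
    missing = sorted(placeholders - mapping.keys())
    if missing:
        raise ValueError(f"This dashboard requires datasource mappings for: "
                         f"{missing}. Provided: {sorted(mapping)}")
    for logical, actual in mapping.items():
        text = text.replace("${" + logical + "}", actual)
    return json.loads(text)

if __name__ == "__main__":
    exported = {"panels": [{"datasource": "${DS_PROMETHEUS}"}]}
    bound = bind_datasources(exported, {"DS_PROMETHEUS": "prom-v2"})
    print(bound["panels"][0]["datasource"])  # prom-v2
```

Raising on missing mappings (rather than silently leaving placeholders) is what makes the fallback dialogue in point (7) possible.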

Follow-up: Your datasource mapping works, but a dashboard uses a query that's specific to "prom-v2" (Prometheus v2 query syntax). When mapped to "prom-v1", the query fails. How would you handle version-specific query syntax?

Your dashboard has 15 variables. A user selects specific values they care about for debugging. They want to share the exact dashboard state (specific variable values) with a colleague without manually reconfiguring. Design a dashboard state sharing system.

Implement dashboard state management and sharing:

(1) State serialization—capture the current dashboard state (variable values, time range, panel order) and serialize it to a URL fragment (#state=...) or shareable URL.

(2) URL encoding—encode state compactly: use base64 or custom compression to keep URLs under 2KB. Include a state version to handle future format changes.

(3) Named snapshots—allow users to save dashboard configurations as snapshots: "prod-us-east-p95-latencies" captures variables, time range, and panel settings. Other users can load snapshots instantly.

(4) Share links—generate short links (tinyurl-style) that encode the full state. Send one to a colleague; they click the link and the dashboard opens with exactly the same configuration.

(5) State history—store recent dashboard states (last 20 views). Users can navigate back: "show me the dashboard state from 2 hours ago."

(6) Collaborative debugging—pair a state link with an annotation. When sharing, include a note: "This state shows the latency anomaly at 3pm."

(7) Permission-respecting sharing—when sharing a state, ensure the recipient has permission to access all data in the dashboard. If they don't, show a note: "This dashboard uses datasources you don't have access to. Contact an admin for permissions."

Implement state diffing: compare two states and highlight the changes. "Configuration 1: service=api, region=us-east. Configuration 2: service=web, region=eu-central. Differences: [service, region]." Create dashboard state templates: pre-save common debugging configurations (e.g., "high-latency investigation", "error spike analysis"). Users select a template and the state is auto-loaded.
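Points (1)–(2) amount to serializing the variable map, compressing it, and making it URL-safe, with a version field for forward compatibility. A minimal sketch:

```python
import base64
import json
import zlib

def encode_state(state: dict) -> str:
    """Serialize dashboard state into a compact, URL-safe fragment."""
    raw = json.dumps({"v": 1, "s": state}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode().rstrip("=")

def decode_state(fragment: str) -> dict:
    padded = fragment + "=" * (-len(fragment) % 4)  # restore base64 padding
    payload = json.loads(zlib.decompress(base64.urlsafe_b64decode(padded)))
    if payload.get("v") != 1:                       # state-format version check
        raise ValueError(f"Unsupported state version: {payload.get('v')}")
    return payload["s"]

if __name__ == "__main__":
    state = {"service": "api", "region": "us-east", "from": "now-7d", "to": "now"}
    fragment = encode_state(state)
    print(len(fragment), "chars")
    assert decode_state(fragment) == state
```

For the short links in point (4), the same encoded blob can be stored server-side under a random key instead of living in the URL itself.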

Follow-up: Your state URL is 5KB (too long for Slack messages). Compression reduces it to 2KB but it's still unwieldy. How would you make sharing easier?

You have 50 services, each with a "Status" dashboard showing key metrics. These dashboards are almost identical: same panels, same layout, only variable values differ. Manually maintaining 50 copies is unsustainable. Design a system for auto-generating and maintaining per-service dashboards.

Implement dashboard auto-generation:

(1) Template dashboard—create one "Service Status" dashboard template with variables for service_name, environment, and region.

(2) Service registry—maintain a registry of all services (name, team, environment, region) in a Git file or API.

(3) Generation pipeline—build a process that reads the service registry and generates a dashboard instance for each service by instantiating the template with the service's variables.

(4) Provisioning—use the Grafana provisioning API to deploy the generated dashboards. Store generated dashboards as JSON in a Git repo; provision on startup.

(5) Updates—when the template is updated, regenerate all service dashboards automatically so changes propagate to all 50 dashboards at once.

(6) Service ownership—tag each generated dashboard with its service owner (from the registry). Use Grafana's RBAC to give service owners edit access to their own dashboard, preventing accidental edits to other services.

(7) Dashboard discoverability—auto-tag dashboards with team, service, and environment so users can filter to find theirs. Implement dashboard search: "find all dashboards for payment service."

Create a naming convention: all generated dashboards follow the pattern "[Environment] [Service] Status", enabling sorting and discovery. Set up monitoring to validate that all 50 dashboards are healthy (queries not erroring, panels loading) and alert if any dashboard breaks due to template or service registry changes. For customization, allow service teams to override specific panels while inheriting common panels from the template.
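The pipeline in points (2)–(5) can be sketched as a pure function from template plus registry to a list of dashboard JSON documents; stable UIDs make regeneration idempotent, so re-provisioning updates dashboards in place. The registry contents and field names below are invented for illustration:

```python
import json

# Hypothetical service registry; in practice a Git-tracked file or an API.
REGISTRY = [
    {"name": "payments", "team": "billing", "environment": "prod", "region": "us-east"},
    {"name": "checkout", "team": "storefront", "environment": "prod", "region": "eu-central"},
]

TEMPLATE = {
    "templating": {"list": [
        {"name": "service_name", "type": "constant", "query": ""},
        {"name": "environment", "type": "constant", "query": ""},
        {"name": "region", "type": "constant", "query": ""},
    ]},
    "panels": [],  # shared panels elided for brevity
}

def generate_dashboards(template: dict, registry: list) -> list:
    dashboards = []
    for svc in registry:
        d = json.loads(json.dumps(template))       # deep copy of the template
        values = {"service_name": svc["name"],
                  "environment": svc["environment"],
                  "region": svc["region"]}
        for var in d["templating"]["list"]:        # pin variables per service
            var["query"] = values.get(var["name"], var["query"])
        d["title"] = f"[{svc['environment']}] {svc['name']} Status"
        d["uid"] = f"status-{svc['name']}"         # stable: updates in place
        d["tags"] = [svc["team"], svc["environment"], svc["region"]]
        dashboards.append(d)
    return dashboards

if __name__ == "__main__":
    for d in generate_dashboards(TEMPLATE, REGISTRY):
        print(d["uid"], "-", d["title"])
```

Because generation is deterministic, the output JSON can be committed to Git and diffed in review before provisioning.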

Follow-up: Service A's team wants to add custom panels to their auto-generated dashboard. If the template is updated, their custom panels are lost. How would you support customization without breaking auto-generation?

Your dashboards show performance metrics. A team member creates a dashboard with a variable that exposes sensitive internal hostnames (e.g., "select host: [db-prod-1, cache-prod-2, ...]"). They accidentally share it with a broader audience who shouldn't see internal infrastructure details. How would you prevent accidental information disclosure through variables?

Implement variable security and exposure control:

(1) Sensitive variable detection—flag variables that might expose sensitive data: internal hostnames, API keys, database credentials. Prompt creators to review them.

(2) Masking in UI—when displaying variable dropdowns, optionally mask values: show "host-1 (production)" instead of the full hostname.

(3) Query result filtering—for queries that populate variables, filter the results to exclude sensitive values. Example: "select service names" returns [api, web, database] but not internal implementation details.

(4) Variable visibility controls—restrict which users can see/edit certain variables. Set RBAC: "only team leads can see the 'internal-hosts' variable." Others see a generic option ("all").

(5) Audit logging—log all variable value changes and usage. Detect when sensitive values are exposed.

(6) Dashboard sharing security—before a dashboard is shared, scan all variables for potential sensitive data. Alert: "This dashboard has 3 variables that might expose sensitive information. Are you sure you want to share publicly?"

(7) Sandboxing—for public dashboards, run queries in an isolated context with redacted variable values to prevent exfiltration of sensitive values.

Implement a dashboard security report: a per-dashboard assessment of variable exposure risks, with remediation guidance such as "replace the hostname variable with an environment variable (dev/staging/prod)." Create a runbook, "securing dashboards for public sharing," and document best practices: never use variables that expose infrastructure details. Set up a review process: before a dashboard is marked public, the security team reviews its variables and queries.
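The pre-share scan in points (1) and (6) can start as a simple pattern check over variable values. The patterns below are illustrative heuristics that would need tuning to local naming conventions:

```python
import re

# Heuristics for values that look like internal infrastructure details.
SENSITIVE_PATTERNS = {
    "internal hostname": re.compile(r"\b(?:db|cache|etcd|vault)-\w+-\d+\b"),
    "raw IP address": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "credential-like name": re.compile(r"(?i)api[_-]?key|secret|password"),
}

def scan_variables(dashboard: dict) -> list:
    """Return (variable, value, reason) findings before a dashboard is shared."""
    findings = []
    for var in dashboard.get("templating", {}).get("list", []):
        for option in var.get("options", []):
            value = str(option.get("value", ""))
            for reason, pattern in SENSITIVE_PATTERNS.items():
                if pattern.search(value):
                    findings.append((var["name"], value, reason))
    return findings

if __name__ == "__main__":
    dashboard = {"templating": {"list": [
        {"name": "host", "options": [{"value": "db-prod-1"},
                                     {"value": "cache-prod-2"}]},
        {"name": "service", "options": [{"value": "api"}]},
    ]}}
    for finding in scan_variables(dashboard):
        print(finding)
```

A non-empty findings list would trigger the "Are you sure you want to share publicly?" confirmation rather than blocking outright, since heuristics produce false positives.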

Follow-up: Your variable masking hides sensitive values, but an attacker can still infer infrastructure details by probing: "if service='api', how many unique hosts?" You've just leaked cardinality. How do you prevent information leakage through query responses?

Your dashboard has a variable: "select time range." Users frequently select "last 7 days" but the dashboard's default is "last 30 days." Users are confused when they first load the dashboard and see 30-day data. Design a system for setting smart variable defaults based on usage patterns.

Implement intelligent variable defaults:

(1) Usage-based defaults—track which variable values users select most frequently. If 70% of users select "last 7 days", make that the default for new dashboards.

(2) Role-based defaults—different roles need different defaults. For developers, default to "last 7 days" (recent debugging). For executives, default to "last 30 days" (a broader view).

(3) Context-aware defaults—in certain contexts (post-deployment, during an incident), auto-set defaults. During incidents, default to "last 1 hour" (a tight focus on recent behavior).

(4) Time-of-day defaults—during business hours, default to "last 7 days". At 2 AM, when an on-call engineer is debugging, default to "last 1 hour".

(5) Personalization—allow users to save their preferred defaults. When they load the dashboard, their preferences are applied.

(6) Smart defaults for new users—start new users on common defaults; over time, learn their preferences and update their defaults accordingly.

(7) Dashboard-level vs. system-level—allow per-dashboard default overrides ("ServiceX's dashboard defaults to 7-day; ServiceY's defaults to 1-hour") based on monitoring needs.

Implement A/B testing: try different defaults for different user cohorts and measure which improves efficiency (fewer re-selections, faster debugging). Provide UI to change defaults ("Would you like to make this your default?") and persist them in the user profile or local browser storage. Use analytics to identify confusing defaults: if the time_range variable is changed by 80% of users within 30 seconds of loading the dashboard, the default is probably wrong.
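The usage-based rule in point (1) is just a frequency count with a threshold. A sketch; the 50% threshold is an arbitrary choice, and in practice the selection history would come from the analytics pipeline:

```python
from collections import Counter

def suggest_default(selections: list, current_default: str,
                    threshold: float = 0.5) -> str:
    """Suggest a new default when a majority of recent selections disagree
    with the current one; otherwise keep the current default."""
    if not selections:
        return current_default
    value, count = Counter(selections).most_common(1)[0]
    if value != current_default and count / len(selections) >= threshold:
        return value
    return current_default

if __name__ == "__main__":
    history = ["last 7 days"] * 7 + ["last 30 days"] * 2 + ["last 1 hour"]
    print(suggest_default(history, "last 30 days"))  # suggests "last 7 days"
```

The same function can be run per role or per dashboard by segmenting the selection history before counting, which covers points (2) and (7).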

Follow-up: Your personalized defaults are set per-user, but on-call engineers rotate dashboard access across teams. Each team has different time-range preferences. How would you handle shared dashboard defaults across rotating users?
