Grafana Interview — Embedding, Security, and Public Dashboards

Your company wants to embed Grafana dashboards in a customer-facing web app. Customers should only see their own data (customer A shouldn't see customer B's dashboard). Your Grafana instance contains data for all customers. Design a multi-tenant embedding system that enforces data isolation.

Implement secure multi-tenant embedding: (1) Tenant-aware datasources: configure datasources so every query is filtered by tenant; for Prometheus, each query carries a tenant label matcher such as {tenant_id="customer-a"}. (2) Proxy authentication: use a proxy service that sits between the customer app and Grafana. Customer logs into app → app generates short-lived proxy token → proxy token is exchanged for a Grafana session token valid only for that customer's data. (3) Dynamic dashboards: dashboards use variables to filter by tenant_id. Variables are populated from proxy context (the app passes the tenant ID), never from user-editable input. (4) Embedded iframe: embed the dashboard in an iframe with tenant context: pass tenant_id in the iframe URL or auth token, set by the proxy and not taken from the browser. (5) Row-level security (RLS): at the database level, enforce RLS so queries are automatically filtered by tenant_id. (6) Audit logging: log all embedded dashboard views: who (customer ID), what (dashboard), when. Detect unauthorized access attempts. (7) Token lifecycle: embed tokens are short-lived (15 min expiry) and bound to the embedded context (for example, a specific dashboard and origin), so a leaked token is of little use elsewhere. Implement iframe sandboxing: Grafana embedded iframes have restricted capabilities (no navigation, no access to parent context). Prevent security breaches in the embedded context from affecting the parent app. Use the CSP frame-ancestors directive to restrict where dashboards can be embedded, and CORS to restrict which origins can call the API. Create a compliance report: audit all embedded views, verify no cross-tenant data access. Test multi-tenancy: create 2 test customers, verify customer A can't see customer B's data even in embedded dashboards. Implement honeypots: decoy tenant_ids in dashboards, alert if accessed (indicates an unauthorized access attempt).
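A minimal sketch of the proxy-side step in (3) and (4), assuming the proxy already knows the tenant from the authenticated session. The variable name var-tenant_id, the orgId value, the allowed parameter list, and the function names are illustrative assumptions; the /d-solo/ path and var- prefix follow Grafana's usual panel-embed URL shape. The variable only selects what is displayed; the datasource filter or RLS layer remains the enforcement point.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

TENANT_VAR = "var-tenant_id"  # assumed dashboard variable name
ALLOWED_CLIENT_PARAMS = {"from", "to", "theme"}  # assumed safe pass-through params


def build_embed_url(base_url, dashboard_uid, slug, panel_id,
                    session_tenant_id, client_params=None):
    """Build a panel embed URL whose tenant comes only from the server session."""
    # Drop everything the client sent except an explicit allowlist, so a
    # client-supplied var-tenant_id can never reach Grafana.
    params = {k: v for k, v in (client_params or {}).items()
              if k in ALLOWED_CLIENT_PARAMS}
    params["orgId"] = "1"  # assumption: single Grafana organization
    params["panelId"] = str(panel_id)
    params[TENANT_VAR] = session_tenant_id
    return f"{base_url.rstrip('/')}/d-solo/{dashboard_uid}/{slug}?{urlencode(params)}"


def tenant_in_url(url):
    """Return the tenant variable carried by an embed URL, if any."""
    return dict(parse_qsl(urlsplit(url).query)).get(TENANT_VAR)
```

The same function can back the two-customer isolation test: request customer A's embed URL while passing customer B's ID as a client parameter and confirm that B's ID never appears in the result.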

Follow-up: Your proxy token exchange is working, but a sophisticated attacker crafts a fake proxy token with tenant_id = "all_customers". Grafana doesn't validate the tenant_id source. How do you prevent token forgery?

You're embedding dashboards in a public website (customer-facing status page). The embedded dashboard should show public metrics (uptime, latency) but nothing sensitive. A developer accidentally puts a dashboard with internal metrics (database query logs) on the public page. Design a system that prevents accidental exposure of sensitive data.

Implement sensitive data protection for public dashboards: (1) Dashboard tagging—tag dashboards as public/internal/confidential. Public dashboards can be embedded; others cannot. (2) Content scanning—before marking a dashboard public, scan for potentially sensitive fields: password_hash, secret_key, api_key, user_id, customer_data. Alert if found. (3) Redaction rules—define redaction rules: sensitive metrics (database query logs) are redacted in public view. (4) Query filtering—public dashboards can only query whitelisted metrics/datasources. Internal datasources are off-limits. (5) Access control enforcement—public dashboard embedding enforces RBAC: viewer can only see panels they have permission to view. (6) URL restrictions—public sharing generates tokens with restrictions: valid for 30 days, read-only, specific to that dashboard. (7) CSP headers—implement strict Content Security Policy headers to prevent clickjacking attacks against embedded dashboards. Implement a dashboard safety review process: before marking public, a security person reviews queries and data. Checklist: "no PII exposed? No credentials? No internal infrastructure details?" Create separate "public dashboard" dashboard type with restricted features: no variable editing, no query editing, no data export. This prevents users from pivoting to sensitive data. Implement monitoring: detect if public dashboard is accessed with internal metrics (indicates misconfiguration). Alert security team. Test public dashboard security: run automated scanner checking for PII patterns. Create runbook: "publishing dashboard to public." Steps include safety review checklist, approval process.
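The content scan in (2) could look roughly like the sketch below. It assumes the dashboard is available as the exported JSON model (a dict with a panels list whose entries carry targets), and that query text lives in expr, rawSql, or query fields; other datasources may store queries elsewhere. The pattern list is the one named above; the function names are illustrative.

```python
import re

# Field names called out as potentially sensitive in the checklist above.
SENSITIVE = re.compile(
    r"password_hash|secret_key|api_key|user_id|customer_data", re.IGNORECASE
)
QUERY_FIELDS = ("expr", "rawSql", "query")  # assumed locations of query text


def iter_panels(panels):
    """Yield every panel, including panels nested inside row panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def find_sensitive_targets(dashboard):
    """Return (panel title, matched pattern) pairs for suspicious queries."""
    findings = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            for field in QUERY_FIELDS:
                text = target.get(field)
                if isinstance(text, str):
                    for match in SENSITIVE.finditer(text):
                        findings.append((panel.get("title", ""), match.group(0).lower()))
    return findings
```

A non-empty result would block the "public" tag until the safety review signs off. A name-based scan like this only catches obvious cases, so it supplements the human review and does not replace it.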

Follow-up: Your redaction is working, but a panel uses an aggregation query (SUM) that still leaks cardinality: "5 internal API keys" is inferred from aggregated counts. How do you prevent cardinality leakage through aggregations?

You're embedding a Grafana dashboard in your app. To reduce latency, you want to cache the rendered dashboard. But the dashboard shows live metrics that change every minute. Caching for 1 hour makes dashboards stale. Design a smart caching strategy that balances performance and freshness.

Implement intelligent embedding cache: (1) Query-level caching: cache individual query results with TTL based on query type. Rapidly-changing metrics (request rate) cached 30s; slowly-changing metrics (deployment status) cached 5min. (2) Panel caching: cache rendered panels separately. If a dashboard has 10 panels with different cache TTLs, serve some from cache, refresh others. (3) Partial refresh: on each view, background-refresh queries approaching cache expiry. User sees mostly-fresh data without waiting. (4) Client-side hints: dashboard sends cache hint headers: "max-age=30s for these queries, 5min for those." Client can adjust. (5) Conditional refresh: refresh cache only if underlying data changed. Use Prometheus' remote read API to check metric values; if unchanged, serve from cache. (6) Adaptive caching: cache aggressively during peak load (reduce backend pressure). Reduce cache TTL during off-peak (improve freshness). (7) User preferences: allow users to choose cache strategy: "fast mode (aggressive cache)" vs. "live mode (no cache)". Implement WebSocket subscriptions: for live dashboards, maintain a WebSocket connection to Grafana. Push new data to the client automatically instead of polling. This provides live updates without caching concerns. Build cache observability: track cache hit rate, average staleness. Alert if cache is too stale (e.g., >5min old) or cache hit rate is too low (cache not helping). Create a cache policy document: which metrics/dashboards can be cached, how long. Review quarterly as usage patterns change.
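As a rough illustration of query-level caching with per-query TTLs and a "live mode" bypass, here is an in-memory sketch. The class and parameter names are hypothetical, and a real deployment would more likely use a shared cache than a per-process dict.

```python
import time


class QueryCache:
    """In-memory cache where each lookup supplies its own TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (value, stored_at)

    def get_or_fetch(self, key, ttl_seconds, fetch, bypass=False):
        """Return a cached value younger than ttl_seconds, else call fetch().

        bypass=True always refetches (the "live mode" option) and still
        stores the fresh value for later cached readers.
        """
        now = self._clock()
        entry = self._entries.get(key)
        if not bypass and entry is not None and now - entry[1] < ttl_seconds:
            return entry[0]
        value = fetch()
        self._entries[key] = (value, now)
        return value
```

A request-rate panel would call get_or_fetch with a TTL of 30 and a deployment-status panel with a TTL of 300, matching the values in (1).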

Follow-up: Your cache hit rate is 85%, but during a major incident, the cached dashboard shows pre-incident metrics for 5 minutes while the incident is happening. Teams make wrong decisions based on stale data. How do you provide emergency bypass?

A customer embeds Grafana dashboards in their internal web app. An attacker performs a clickjacking attack: they overlay an invisible iframe over a Grafana panel, trick users into clicking it, and execute actions (delete dashboard, modify queries) on behalf of the user. Design defenses against clickjacking.

Implement clickjacking protection: (1) X-Frame-Options header: set X-Frame-Options: DENY or SAMEORIGIN to prevent embedding in iframes. For intentional embedding, use SAMEORIGIN (only same domain). (2) Content Security Policy (CSP): strict CSP headers: the frame-ancestors directive restricts who can embed Grafana. (3) Frame busting JavaScript: Grafana can include JS that breaks out of frames if embedded in an unexpected context. (4) SameSite cookies: set SameSite=Strict on Grafana session cookies. This prevents cookies from being sent in cross-site requests (and therefore also blocks cookie-authenticated cross-site embeds, which then need token-based auth). (5) User confirmation: for sensitive actions (delete dashboard, modify datasource), require explicit user confirmation. Don't auto-execute actions. (6) Transparent pixel detection: detect when the page contains overlaid transparent elements. Block interactions if suspicious. (7) CSRF tokens: for state-changing operations, require a CSRF token generated per-session. The attacker can't forge a valid CSRF token. Implement iframe sandboxing: when embedding Grafana, set the iframe's sandbox attribute and grant only the permissions the dashboard needs. This restricts iframe capabilities. Build attack surface reduction: disable unnecessary features in the embedded context. Embedded dashboards are read-only; editing disabled. Implement monitoring: detect suspicious interaction patterns (lots of failed actions, access from unusual referrers). Alert security team. Test clickjacking scenarios: create a test attacker page, verify Grafana is protected. Use automated security scanners (OWASP ZAP) to check for missing headers. Create a runbook for embedding securely: security checklist, header configuration, testing steps.
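The header choices in (1) and (2) can be sketched as a small helper that a reverse proxy in front of Grafana might use. The function name and the policy of rejecting non-HTTPS or wildcard origins are assumptions made for illustration. X-Frame-Options has no way to express a cross-origin allowlist, so the sketch relies on frame-ancestors alone in that case.

```python
def framing_headers(allowed_ancestors):
    """Return response headers controlling who may frame the dashboard.

    allowed_ancestors: exact https origins permitted to embed; an empty
    list means framing is forbidden everywhere.
    """
    if not allowed_ancestors:
        return {
            "Content-Security-Policy": "frame-ancestors 'none'",
            "X-Frame-Options": "DENY",
        }
    for origin in allowed_ancestors:
        # Policy choice for this sketch: exact HTTPS origins only, no wildcards.
        if not origin.startswith("https://") or "*" in origin:
            raise ValueError(f"refusing to allow framing from {origin!r}")
    return {
        "Content-Security-Policy": "frame-ancestors " + " ".join(allowed_ancestors),
    }
```

For example, framing_headers(["https://app.example.com"]) yields a single Content-Security-Policy header naming that origin, which a header scanner in the test step can then check for.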

Follow-up: Your X-Frame-Options: SAMEORIGIN allows embedding in same-domain iframes. An attacker compromises a same-domain web app and injects an attacker-controlled iframe. SAMEORIGIN doesn't help. How do you defend against same-origin attackers?

You've embedded Grafana dashboards in your public status page. The dashboard shows uptime metrics. An attacker modifies the embedded dashboard (via a compromised Grafana instance) to show 100% uptime when the system is actually down. Customers see false status. Design an integrity verification system for embedded dashboards.

Implement integrity verification for embedded dashboards: (1) Dashboard signing—when embedding, generate a cryptographic signature of the dashboard config using a private key only your server knows. (2) Signature verification—in the embedded context, verify signature before rendering. If signature is invalid, display error: "Dashboard integrity check failed." (3) Version pinning—embed a specific dashboard version (hash), not latest. Changes to dashboard require updating the pinned version. (4) Content hash verification—compute SHA256 of rendered dashboard content. Compare to known-good hash. If mismatch, alert. (5) Delegation tokens—use short-lived delegation tokens (signed by your server) that embed dashboard identity. Grafana validates token before serving. (6) Audit trail—log all changes to embedded dashboards. If dashboard is modified unexpectedly, alert ops. (7) Canary checking—before embedding a dashboard, your server fetches it and validates all data looks correct (uptime metrics in expected range). If anomaly, don't embed. Implement a safe dashboard registry: curate list of approved dashboards for embedding with their expected configurations. Any deviations are flagged. Use a CDN to cache dashboard configurations. Store cache signatures. If Grafana is compromised, cache provides uncompromised version for some time. For critical dashboards, implement human review: every 24 hours, a person reviews the embedded dashboard visually to spot tampering. Create a dashboard integrity monitoring dashboard: shows signature validation results, audit trail. Alert on integrity failures. Test tampering scenarios: modify Grafana dashboard, verify integrity checks catch it.
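Version pinning and the safe dashboard registry can be approximated by hashing a canonical form of the dashboard JSON and comparing it with a pinned digest. The sketch below assumes the registry is a plain mapping from dashboard uid to an approved SHA-256 hex digest; the function names are illustrative.

```python
import hashlib
import hmac
import json


def dashboard_digest(config):
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_approved(config, registry):
    """True only if the dashboard's digest matches the digest pinned for its uid."""
    expected = registry.get(config.get("uid"))
    if expected is None:
        return False  # unknown dashboards are rejected, not trusted
    return hmac.compare_digest(dashboard_digest(config), expected)
```

Because the digest covers the whole configuration, any legitimate edit also changes it, so the registry entry has to be updated as part of the approved change process.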

Follow-up: Your signature verification is working, but re-signing the dashboard every time it's updated is cumbersome. A legitimate dashboard update takes 2 hours because of signing delays. How would you automate legitimate updates while preventing tampering?

You have a public dashboard embedded on your website showing system status. Traffic is 10K requests/second during peak. Each request fetches the dashboard from Grafana (query metrics, render panels). Grafana is now the bottleneck. Design a scalable public dashboard architecture.

Implement scalable public dashboard delivery: (1) CDN caching: cache rendered dashboard HTML on a CDN (Cloudflare, CloudFront). Cache expiry 30-60 seconds for live dashboard feel. (2) Static pre-rendering: generate static HTML of the dashboard periodically (every 10s). Serve from CDN as a static file. No Grafana load. (3) Client-side refresh: use JavaScript to refresh specific panels on a cadence (WebSocket or polling). Brings data freshness while leveraging static HTML. (4) Query optimization: aggregate queries: instead of fetching current metrics for each request, pre-aggregate metrics on the backend and return a single query result. (5) Edge computing: use edge functions (Cloudflare Workers) to render the dashboard at the edge, closer to users. Reduced latency. (6) Asset optimization: minimize dashboard assets: minify JS, optimize images. Reduce payload. (7) Progressive enhancement: send minimal HTML immediately, JavaScript progressively enriches with data. Page feels fast. Implement a dashboard API endpoint: client calls /api/dashboard/{id}/data, gets JSON of current metric values. CDN caches HTML; JS fetches fresh data from the API (lightweight). Separate concerns: reduce Grafana load. Use Grafana's query caching where available (it is a Grafana Enterprise and Grafana Cloud feature): Grafana caches query results internally, serving repeated queries faster. Build a metrics dashboard: embedding request latency, cache hit rate, origin requests to Grafana. Alert if cache hit rate drops below 80%. Test scalability: load test with 10K req/sec, measure latency distribution, identify bottlenecks. For critical dashboards, implement redundancy: multiple Grafana instances behind a load balancer; if one fails, others handle traffic.
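A sketch of what the /api/dashboard/{id}/data handler might return so that a CDN can absorb most of the traffic. The function name, the default lifetimes, and the response tuple shape are assumptions; s-maxage and stale-while-revalidate are standard Cache-Control directives, though CDN support for the latter varies.

```python
import hashlib
import json


def build_data_response(metrics, if_none_match=None, s_maxage=10, swr=30):
    """Return (status, headers, body) for a cacheable JSON metrics payload."""
    # Canonical encoding so identical metrics always produce the same ETag.
    body = json.dumps(metrics, sort_keys=True, separators=(",", ":"))
    etag = '"' + hashlib.sha256(body.encode("utf-8")).hexdigest()[:16] + '"'
    headers = {
        "ETag": etag,
        # Shared caches may serve this for s_maxage seconds, then serve stale
        # for up to swr seconds while revalidating in the background.
        "Cache-Control": f"public, s-maxage={s_maxage}, stale-while-revalidate={swr}",
    }
    if if_none_match == etag:
        return 304, headers, ""  # unchanged: no body needed
    headers["Content-Type"] = "application/json"
    return 200, headers, body
```

With a shared-cache lifetime of 10 seconds, each CDN cache node revalidates at most about once every 10 seconds, so origin load depends on the number of cache nodes and not on the 10K requests per second reaching the edge.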

Follow-up: Your CDN cache is 30s, but during an incident, status shows outdated info (system went down 20s ago, cached dashboard still shows "up"). Customers miss the incident notification. How would you reduce cache time without overloading Grafana?

Your company wants to share Grafana dashboards with external partners (vendors, customers). But sensitive internal metrics (infrastructure details, team capacity) shouldn't be exposed. You can't simply share the dashboard as-is. Design a dashboard sharing workflow that sanitizes sensitive information.

Implement secure partner dashboard sharing: (1) Dashboard versioning—create partner-safe versions of dashboards with sensitive data removed. Store as separate dashboard. (2) Panel filtering—mark panels as "internal-only" or "external-safe". Export only external-safe panels for sharing. (3) Query rewriting—rewrite queries to aggregate sensitive data: instead of "queries per database", export only "total queries (aggregated)". (4) Metric filtering—whitelist metrics allowed for external sharing. Queries referencing non-whitelisted metrics are blocked. (5) Access control—partner dashboards are read-only, no export, no variable modification. (6) Audit logging—track all partner dashboard access: who viewed, when, from where. (7) Data classification—tag metrics as public/internal/confidential. Enforce classification during share. Implement a dashboard review workflow: before sharing externally, a data steward reviews the dashboard. Checklist: "no internal metrics exposed? No PII? No infrastructure details?" Create a partner dashboard template: pre-configured dashboard with only appropriate panels. Partners use template as starting point. Provide a "sanitize dashboard" tool: automatically remove sensitive panels, redact internal metrics. Partners can use output as basis. Set up partner portal: curated list of dashboards available for sharing. Partners browse, request access, receive sanitized version. Implement time-limited sharing: partner access expires after 30 days. Renewal requires new approval. Create audit reports: monthly summary of partner dashboard access, detect anomalies (excessive downloads, access outside business hours).
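The "sanitize dashboard" tool and panel filtering could start from something like the sketch below. Grafana panels have no built-in internal/external flag, so the sketch assumes a team convention of putting the marker [external-safe] in a panel's description; that marker and the function name are hypothetical. Panels without the marker are dropped, so an unreviewed panel is never shared by default.

```python
import copy

EXTERNAL_MARKER = "[external-safe]"  # hypothetical team convention


def sanitize_for_partner(dashboard):
    """Return a copy containing only panels explicitly marked external-safe."""
    sanitized = copy.deepcopy(dashboard)  # never mutate the internal dashboard
    sanitized["panels"] = [
        panel for panel in sanitized.get("panels", [])
        if EXTERNAL_MARKER in (panel.get("description") or "")
    ]
    sanitized["editable"] = False  # partner copies are read-only
    return sanitized
```

This only handles top-level panels; panels nested inside collapsed rows would need the same check applied recursively, and the output should still go through the data steward review.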

Follow-up: Your partner dashboard hides infrastructure metrics, but aggregated query volume still leaks capacity info: "high query volume" implies high traffic. Partner infers your scale. How do you prevent inference attacks?

You're supporting a SaaS product. Customers want to embed Grafana dashboards in their own applications (showing their own data in your SaaS UI). You need to provide a programmatic API for customer app to request embedded dashboards dynamically. Design a secure customer embedding API.

Implement secure customer embedding API: (1) OAuth 2.0 authorization: customers authenticate via OAuth, receive access tokens scoped to their account/dashboards. (2) Scope-based access: tokens have granular scopes: "read:dashboard_id_123", "read:all_dashboards", "write:dashboards". (3) Customer context passing: the customer's app calls your API with customer_id + oauth_token. Your backend generates an embed URL with tenant context. (4) Short-lived embed tokens: generate temporary embed tokens (15min expiry) that can only be used once, from a specific IP, for a specific dashboard. (5) Rate limiting: enforce rate limits per customer: "1000 API calls/hour". Prevent abuse. (6) Audit trail: log all embedding requests: customer_id, dashboard_id, IP, timestamp. (7) Signature verification: embed tokens are signed (HMAC-SHA256 with a secret key held only by your server). Your embed endpoint validates the signature before serving the dashboard to the iframe. Implement an API gateway: customer apps call /api/embed/{dashboard_id}. Gateway validates auth, generates embed token, returns iframe URL. Document the API thoroughly: authentication, scopes, rate limits, error codes. Provide SDKs in popular languages (Python, JavaScript, Go) to simplify integration. Implement billing/metering: track embedding API usage per customer. Bill based on embeddings/month. Implement security best practices: CORS headers, rate limiting, request logging, anomaly detection. Alert on suspicious patterns: "customer suddenly requesting 100K embeddings: investigate". Test security: try to forge embed tokens, access other customer dashboards, exceed rate limits. Verify all fail gracefully.

Follow-up: A malicious customer modifies the embed token (changes dashboard_id before sending it to the iframe). The iframe loads a different customer's dashboard, so token validation failed to catch the change. How would you prevent token tampering?