Grafana Interview Questions

RBAC and Team Management


Your organization is growing: 200 people across 15 teams. Everyone has admin access to Grafana (early-stage privilege creep). A junior developer accidentally deletes a critical alert dashboard used by on-call. You need fine-grained access control. Design an RBAC system that allows teams autonomy while preventing accidental/malicious damage.

Implement comprehensive RBAC:

1. Role hierarchy: define roles aligned to responsibility. Viewer (read-only dashboards), Editor (create/edit dashboards), Admin (manage users/teams), SuperAdmin (manage the entire instance).
2. Team structure: organize users into teams (Platform, Payment, Frontend). Each team owns its dashboards and datasources.
3. Dashboard permissions: set per-dashboard access. Team-A dashboards are readable by Team-A and editable by Team-A leads; other teams get read-only or no access.
4. Datasource permissions: restrict datasource access by role/team. Production datasources (prod-db) are editable only by leads; other users get read-only.
5. Folder organization: organize dashboards into folders by team/service and apply permissions at the folder level, e.g., the Team-A folder is writable by Team-A.
6. API token scoping: when generating API tokens for automation, scope permissions narrowly: "this token can only read dashboards" or "only write metrics".
7. Audit logging: log all permission changes and access, tracking who granted which access and when.

Beyond the basics, implement permission escalation workflows: a user without write permission can request access and a manager approves, which prevents accidental admin assignments. Use a CODEOWNERS pattern: in a Git repo tracking dashboard configs, require approval from team leads before merging changes. Protect against deletion: critical dashboards require dual approval (two admins) before they can be deleted. Alert on unusual access patterns ("user normally views Dashboard A, today viewed 50 dashboards") and investigate. Document roles clearly: a job description and permission matrix for each role. Conduct quarterly access audits to confirm everyone's access is still appropriate, and remove stale permissions.
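
The folder-level permissions in step 5 can be sketched as a small helper that builds the payload accepted by Grafana's folder permissions HTTP API (`POST /api/folders/:uid/permissions`); the team IDs in the example are illustrative:

```python
# Grafana's dashboard/folder permission levels: 1 = View, 2 = Edit, 4 = Admin.
VIEW, EDIT, ADMIN = 1, 2, 4

def folder_permissions(owner_team_id, leads_team_id, readonly_team_ids):
    """Build a permissions payload: the owning team can edit, its leads
    administer, and every other listed team is read-only.
    (Payload shape follows POST /api/folders/:uid/permissions.)"""
    items = [
        {"teamId": owner_team_id, "permission": EDIT},
        {"teamId": leads_team_id, "permission": ADMIN},
    ]
    items += [{"teamId": t, "permission": VIEW} for t in readonly_team_ids]
    return {"items": items}

# Example: Team-A (id 1) owns the folder, its leads team is id 2,
# Teams B and C (ids 3, 4) get read-only access.
payload = folder_permissions(1, 2, [3, 4])
```

Applying the same payload to every folder a team owns keeps permissions consistent and scriptable instead of hand-edited per dashboard.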

Follow-up: Your RBAC is working, but a team lead's account is compromised. They use their admin credentials to delete 20 dashboards. You detect it 2 hours later. How would you recover and prevent account compromise damage?

Your Grafana instance serves 50 teams, each with different datasources (team-specific Prometheus, Elasticsearch). You want teams to manage their own datasources and grant/revoke access, but you can't give all users admin access. Design a delegation system for datasource management.

Implement delegated datasource management:

1. Datasource admins: per team or per datasource, designate admins who can manage access (grant/revoke, modify credentials, delete).
2. Admin scope: a datasource admin can only manage their assigned datasources, not global Grafana settings. This limits the blast radius.
3. Approval workflows: for sensitive operations (deleting a datasource, granting access to a production datasource), require approval from another admin.
4. Credential rotation: datasource admins can rotate credentials (database passwords, API keys) stored in Grafana. Track rotation history.
5. Access templates: pre-define permission templates such as "read-only developer" or "on-call engineer" (read + query). Admins apply templates instead of granting permissions manually.
6. Audit trail: log all datasource admin actions: who modified what, when, and for what reason.
7. Escalation: if a datasource admin is unavailable, provide an escalation path (a backup admin) to prevent access lock-outs.

Group datasources (prod, staging, dev) so admins can grant or revoke access at the group level. Provide a UI for datasource admins showing their datasources, access lists, and admin actions; make it easy to operate. Create a runbook covering common tasks (adding a new developer, revoking access, rotating credentials) and provide training (a one-hour onboarding). Use this to scale governance: teams self-manage datasource access instead of a central admin handling every request.
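
Step 5 and the datasource-group idea combine naturally; in this sketch the template names, group names, and permission keys are assumptions for illustration, not a Grafana API:

```python
# Pre-approved permission templates a datasource admin can apply
# (names and permission keys are illustrative).
ACCESS_TEMPLATES = {
    "read-only-developer": {"read": True, "query": False},
    "on-call-engineer":    {"read": True, "query": True},
}

# Datasources grouped so access can be granted/revoked per group.
DATASOURCE_GROUPS = {
    "prod":    ["prod-db", "prod-prometheus"],
    "staging": ["staging-db"],
}

def grant_group(user, group, template):
    """Expand a group-level grant into one grant record per datasource,
    so the admin applies a template once instead of per-datasource."""
    perms = ACCESS_TEMPLATES[template]
    return [
        {"user": user, "datasource": ds, **perms}
        for ds in DATASOURCE_GROUPS[group]
    ]
```

Because the grant expands mechanically, each resulting record can also be written to the audit trail with the template name as the "reason".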

Follow-up: A team's datasource admin delegated access to a junior developer, who accidentally exposed the datasource credentials in a dashboard variable. Credentials are now compromised. Who's responsible, and how do you prevent this?

Your organization uses SAML 2.0 (via Okta) for authentication. When employees leave, they lose Okta access, but their Grafana access persists: they can still log in with cached sessions and tokens. Design an authentication system that immediately revokes access on employee termination.

Implement real-time access revocation:

1. SAML session lifetime: set the session lifetime to 1 hour. After 1 hour, users must re-authenticate via SAML.
2. Immediate SAML invalidation: when an employee is disabled in Okta, their SAML assertion becomes invalid and the next Grafana login fails.
3. Periodic sync: schedule a background job to sync the Okta user list to Grafana every 15 minutes, deleting Grafana users no longer in Okta.
4. Token revocation: track all active sessions and tokens. When a user is disabled in Okta, invalidate all of their tokens in Grafana.
5. Grace period: allow a short grace period (5 minutes) for the employee's outstanding requests to complete, then block.
6. Just-in-time (JIT) provisioning: on SAML login, pull user details from Okta (groups, attributes). Disable the user if they are not in the correct groups.
7. Logout on revocation: if a user is revoked while logged in, terminate their session within 1 minute and redirect to an "access revoked" page.

Implement emergency access for break-glass scenarios (when Okta is down): allow local admin login, tracked and requiring escalation. Alert security if a revoked user attempts to log in. Build metrics: user sync frequency, session lifetime, revocation latency. Create a termination runbook: an engineer notifies the admin that "employee X is leaving today"; the admin disables the Okta user; Grafana auto-disables the account within 15 minutes. Test the process quarterly: simulate an employee termination and verify access is revoked immediately. For compliance, maintain an audit log of all access revocations with timestamps.
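
The periodic sync in step 3 reduces to a set difference between the two user lists; a minimal sketch (function name assumed):

```python
def revocation_plan(okta_active, grafana_users):
    """Return the logins that exist in Grafana but are no longer active
    in Okta; the 15-minute sync job disables exactly these accounts."""
    return sorted(set(grafana_users) - set(okta_active))
```

The real job would then call Grafana's admin API for each login in the plan, and emit a revocation-latency metric per account.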

Follow-up: An employee was terminated yesterday and their Okta user was disabled. But access logs show they're still accessing Grafana via an API token they created a month ago. How do you revoke old API tokens on termination?

Your teams are distributed globally (US, Europe, India, Singapore). Access control policies should reflect time zones: a dashboard should only be accessible during business hours in the user's timezone. Design a time-aware RBAC system.

Implement time-aware access control:

1. Business hours enforcement: define business hours per team or per region; sensitive dashboards are accessible only during those hours.
2. Timezone-aware checks: when an access decision is made, look up the user's timezone (from their profile) and check whether the current time in that timezone falls within the allowed hours.
3. Schedule definition: allow flexibility: "Team-A's normal business hours are 9-5 IST; the US team can access 9-5 EST." Store this in the RBAC config.
4. Emergency override: provide break-glass access: a user can request emergency access outside hours, requiring manager approval and logged for audit.
5. Timezone detection: automatically detect the user's timezone from IP or profile, with manual override allowed.
6. Grace period: add a buffer: access is allowed 15 minutes before business hours (morning prep) and 30 minutes after (wrap-up).
7. Alerting: log access outside business hours ("user X accessed a prod dashboard at 2 AM, outside business hours") and alert security on suspicious patterns.

Display dashboards in the user's timezone ("all times displayed in IST") to prevent confusion. Give teams a policy configuration UI to define access hours easily, stored in Git for versioning and audit. For time-sensitive needs (on-call), set separate rules allowing 24x7 access to specific dashboards for designated users: "on-call engineers can access incident dashboards 24x7." Test the enforcement: create a test user in a different timezone, verify access is allowed during their hours and denied outside them, and check for off-by-one errors.
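
The timezone-aware check in steps 2 and 6 is straightforward with Python's stdlib `zoneinfo`, which applies IANA timezone rules (including DST shifts) rather than fixed offsets; the 9-5 defaults and grace buffers mirror the text:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; uses the IANA tz database

def within_business_hours(now_utc, tz_name, start=time(9), end=time(17),
                          pre_grace=timedelta(minutes=15),
                          post_grace=timedelta(minutes=30)):
    """Check whether a UTC timestamp falls inside the user's local
    business hours, including the pre/post grace buffers."""
    tz = ZoneInfo(tz_name)
    local = now_utc.astimezone(tz)
    opens = datetime.combine(local.date(), start, tzinfo=tz) - pre_grace
    closes = datetime.combine(local.date(), end, tzinfo=tz) + post_grace
    return opens <= local <= closes
```

Doing the comparison in the user's local timezone (instead of converting the schedule to UTC once) is what keeps the policy correct across DST transitions.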

Follow-up: Timezone-aware access works, but daylight saving time (DST) causes issues. US team transitions to daylight time; India doesn't. Access is suddenly allowed at wrong times. How do you handle DST transitions?

Your organization has strict compliance requirements (HIPAA, PCI DSS, SOC2). All access to production dashboards must be justified: "why are you accessing this?" Auditors need proof that access was reviewed and approved. Design an access justification and audit system.

Implement justification and audit-ready RBAC:

1. Access request workflow: users requesting access to sensitive dashboards submit a form: reason for access, duration, project.
2. Approval chain: the request goes to the manager and the dashboard owner for approval. They review the reason and approve or deny.
3. Time-bounded access: approved access is temporary (e.g., 1 week for a project) and auto-revokes on expiry.
4. Audit trail: store the full history: who requested access, the reason, who approved, when it was granted, when it was revoked, and what was accessed.
5. Session recording: for sensitive dashboards, record user sessions (queries run, dashboards viewed, data exported) in an immutable log.
6. Least privilege: grant the minimum necessary access. If a user needs to debug service X, give access to the Service X dashboard only, not the entire prod dashboard suite.
7. Compliance export: generate monthly/quarterly audit reports (all access grants/revokes, justifications, reviews) and export them to the compliance system.

Detect sensitive data: flag dashboards and queries returning PII or financial data and require additional approvals. Build a compliance dashboard: an overview of access justifications, approval SLA compliance, and outliers (e.g., "access granted but never used"). Schedule quarterly reviews to re-certify access: users must justify continued access, and unjustified access is auto-revoked. Maintain comprehensive audit logs with 7+ years of retention in immutable storage (append-only logs in S3). For incidents, automate the response: when a security incident occurs, audit logs can quickly show who accessed what and when.
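
Steps 3 and 4 can be sketched as plain grant records with an embedded expiry; the field names are illustrative, not a real schema:

```python
from datetime import datetime, timedelta, timezone

def record_grant(user, dashboard, reason, approvers, now, days=7):
    """Create an auditable, time-bounded access grant record.
    Everything an auditor needs travels with the grant itself."""
    return {
        "user": user, "dashboard": dashboard, "reason": reason,
        "approvers": approvers,
        "granted_at": now,
        "expires_at": now + timedelta(days=days),
    }

def active_grants(grants, now):
    """Grants past their expiry are treated as revoked; a background
    job would use this to enforce auto-revocation."""
    return [g for g in grants if g["expires_at"] > now]
```

Because revocation is derived from the record rather than a separate delete action, the audit trail never has a "granted but silently never revoked" gap.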

Follow-up: Your access request workflow requires approval from 3 people (manager, dashboard owner, security). A developer needs urgent access during an incident (now). The approval takes 2 hours. How do you balance security with incident response?

Your RBAC is complex: 50 roles, hundreds of permissions, nested teams. When you need to modify access (e.g., "engineers can now read finance dashboard"), you must update multiple places. It's error-prone. Design a declarative RBAC system that's maintainable.

Implement declarative RBAC with versioning:

1. RBAC-as-code: store all RBAC policies in Git as YAML/JSON files. Version everything.
2. Policy files: organize by concept: roles.yaml (define roles), permissions.yaml (define permissions), assignments.yaml (who has which role/permission).
3. Templating: use templating to reduce duplication: "for each team, create viewer, editor, and admin roles."
4. Role inheritance: define a role hierarchy: engineer inherits from viewer; lead inherits from engineer. Permissions cascade.
5. Policy validation: before applying, validate the RBAC policies: check for cycles, missing permissions, and orphaned roles. Run this in CI/CD.
6. Policy diffing: when reviewing RBAC changes in a Git PR, show a clear diff: "role 'engineer' now has the 'read_finance_dashboard' permission."
7. Audit trail: all RBAC changes go through Git commits, giving a full history of who changed what, when, and why (the commit message).

Apply policies automatically: merging a PR triggers CI/CD to apply the RBAC changes to Grafana, with no manual intervention. Test changes by simulating access checks against the new policy and verifying expected behavior. For debugging, provide policy simulation: "if user X is in group Y, what can they access?" Write a runbook for the RBAC change procedure. For governance, breaking changes (removing permissions) require explicit approval, while non-breaking changes (adding permissions) can auto-merge if tests pass. Keep an emergency policy: a read-only override that works even if the RBAC policy is broken, preventing total lockout. For compliance audits, export the RBAC state as of any point in time from Git history.
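
The CI validation in step 5 can be sketched as a resolver that walks role inheritance and fails on cycles; the dict shape stands in for a loaded roles.yaml and the field names are assumptions:

```python
def effective_permissions(role, roles):
    """Resolve a role's full permission set through its inheritance
    chain. `roles` maps name -> {"inherits": [...], "permissions": [...]}.
    Raises ValueError on an inheritance cycle, as the CI validation
    step should, so a broken policy never reaches Grafana."""
    perms = set()

    def walk(name, path):
        if name in path:
            raise ValueError(f"inheritance cycle involving {name!r}")
        entry = roles[name]
        perms.update(entry.get("permissions", []))
        for parent in entry.get("inherits", []):
            walk(parent, path | {name})

    walk(role, frozenset())
    return perms
```

The same resolver doubles as the policy-simulation tool: run it for a user's roles to answer "what can they access?" before merging a change.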

Follow-up: Your declarative RBAC looks clean on paper, but during a complex refactor, a Git merge conflict occurs in roles.yaml. A junior engineer resolves it incorrectly. Several teams lose write access. How do you prevent policy merge errors?

A team member has read access to a dashboard containing sensitive metrics. They share the dashboard with a broader audience via a public link, exposing the data. You need RBAC to control who can share dashboards and with whom. Design a sharing control system.

Implement sharing restrictions and audit:

1. Share permissions: separate the "view dashboard" permission from the "share dashboard" permission. Not all viewers can share.
2. Share approval: require permission or approval before sharing outside the team.
3. Share scope: control who a dashboard can be shared with: team-only, organization-wide, or public (with restrictions).
4. Public link security: public links include an access token with a limited lifetime (expires in 24 hours, extendable with approval).
5. Link auditing: log all sharing actions: who shared with whom, what data was shared, and when.
6. Data classification: tag dashboards by sensitivity (public, internal-only, confidential). Disable public links for confidential dashboards.
7. Recipient tracking: know who has access via shared links, and revoke sharing if a recipient leaves the org or violates policy.

Add a sharing review process: before allowing public shares, flag potentially sensitive dashboards (PII, financial data, internal infrastructure details) and require lead approval. Encode sharing policy in RBAC: "financial dashboards can be shared with the finance team only." Provide sharing templates for pre-approved scenarios (team-internal share, executive dashboard share) to simplify sharing. For sensitive data shared accidentally, implement rapid revocation: invalidate all sharing tokens immediately, notify recipients that the link is no longer valid, and log the incident. Track sharing metrics (what's shared, with whom, how often) and alert on anomalies: a dashboard shared with 100 new people in one minute is suspicious.
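
The expiring public links in step 4 can be sketched with stdlib HMAC-signed tokens that carry their own expiry; the secret and token format here are assumptions, not Grafana's actual share-link scheme:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # assumption: a real deployment uses a managed secret

def make_share_token(dashboard_uid, ttl_seconds=24 * 3600, now=None):
    """Mint a signed share token embedding the dashboard uid and expiry."""
    exp = int(now if now is not None else time.time()) + ttl_seconds
    body = base64.urlsafe_b64encode(
        json.dumps({"uid": dashboard_uid, "exp": exp}).encode()
    ).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify_share_token(token, now=None):
    """Return the dashboard uid if the token is genuine and unexpired,
    else None. Rotating SECRET revokes every outstanding link at once."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < (now if now is not None else time.time()):
        return None
    return payload["uid"]
```

Because the expiry is inside the signed payload, extending a link means minting a new token (a new auditable event) rather than silently stretching an old one.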

Follow-up: Your sharing token expires in 24 hours, but a shared dashboard is still being viewed 3 days later (recipient extended the token). You can't tell when sharing ended. How do you maintain audit clarity on extended access?
