Grafana Interview Questions

Grafana OnCall and Incident Management


Your on-call team uses PagerDuty for escalation but Grafana OnCall for incident context. Context lives in two places, so engineers lose time switching between systems. A critical incident requires quick decision-making; every second matters. Design unified incident management that consolidates context.

Implement unified incident management:

1. PagerDuty integration: OnCall syncs incidents with PagerDuty. When an incident is created in OnCall, a PagerDuty ticket is auto-created.
2. Unified timeline: OnCall stores a single timeline for each incident: alert fired, on-call notified, acknowledged, root cause identified, resolved. All actions live in one place.
3. Context injection: when an incident is created, auto-populate context: the affected service, relevant dashboards, related runbooks, and recent code changes.
4. Alert context: include the original alert details in the incident: the query that fired, metric values, and thresholds.
5. Communication: OnCall provides Slack integration; incident updates auto-post to the team Slack channel, so the team stays informed without leaving Slack.
6. Decision log: store all decisions and their reasoning ("why did we restart the service?"). This enables post-mortem learning.
7. Automation: OnCall can auto-execute remediation actions (restart a service, trigger a runbook, page a senior engineer), reducing MTTR.

Implement an incident timeline UI that shows alert → context gathered → investigation → fix applied → resolved: a clear narrative. Build incident metrics: MTTD (mean time to detect), MTTA (mean time to acknowledge), and MTTR (mean time to resolve), tracked by team and over time. Create templates for common incident types (database overload, network partition, memory leak) that pre-populate context and automated responses. Test with incident simulations: trigger fake alerts, walk through the OnCall workflow, and measure time to resolution. Create a runbook: "using OnCall for incident response."
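The incident metrics above are simple differences between timeline timestamps. A minimal sketch, assuming hypothetical incident records with `started`/`detected`/`acknowledged`/`resolved` fields (illustrative names, not the OnCall API schema):

```python
from datetime import datetime

# Hypothetical incident records; the field names are illustrative.
incidents = [
    {"started": datetime(2024, 1, 1, 2, 0),
     "detected": datetime(2024, 1, 1, 2, 3),
     "acknowledged": datetime(2024, 1, 1, 2, 8),
     "resolved": datetime(2024, 1, 1, 2, 45)},
    {"started": datetime(2024, 1, 2, 14, 0),
     "detected": datetime(2024, 1, 2, 14, 1),
     "acknowledged": datetime(2024, 1, 2, 14, 4),
     "resolved": datetime(2024, 1, 2, 14, 30)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTD: start -> detection; MTTA: detection -> ack; MTTR: start -> resolution.
mttd = mean_minutes([i["detected"] - i["started"] for i in incidents])
mtta = mean_minutes([i["acknowledged"] - i["detected"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["started"] for i in incidents])

print(f"MTTD={mttd:.1f}min MTTA={mtta:.1f}min MTTR={mttr:.1f}min")
# → MTTD=2.0min MTTA=4.0min MTTR=37.5min
```

Grouping the same computation by team or by week gives the "track by team and over time" view.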

Follow-up: Your OnCall instance integrates with PagerDuty, but a critical incident fires in OnCall at 2 AM. The on-call engineer is asleep, and the PagerDuty notification is still being created (a 5-second delay). By the time the engineer is notified, the issue is 10 minutes old. How do you reduce notification latency?

An incident occurs: customer data is being lost (database corruption). OnCall correctly escalates, an engineer investigates and finds a bug that was deployed 2 hours ago. The engineer rolls back the deployment, but 10K customer records are already corrupted. OnCall helped with incident detection and response, but it didn't prevent the bug. Design a system for preventing incidents through better pre-deployment testing.

Implement incident prevention via observability:

1. Pre-deployment validation: before deployment, run automated tests: unit tests, integration tests, k6 load tests.
2. Canary deployments: deploy to 5% of servers and monitor in OnCall for 5 minutes. If the error rate spikes, auto-rollback.
3. Staged rollouts: roll out gradually (5%, 25%, 50%, 100%), monitoring metrics between stages.
4. Smoke tests: after deployment, run synthetic transactions (e.g., a test user signup). On failure, alert and roll back.
5. Baseline comparison: compare post-deployment metrics to the baseline ("error rate is 5x higher after deployment"). Alert and roll back.
6. Change correlation: OnCall stores all deployments as events. When an incident occurs, OnCall can check "any deployments in the last hour?" and suggest a likely cause.
7. Preventive alerting: alert before an issue becomes an incident: "error rate trending up 10% over the past 5 minutes; investigate before it becomes critical."

Implement an incident prevention dashboard showing near-miss incidents (error rate spiked, then recovered automatically), categorized by root cause (deployment, config, external). For repeated near-misses, investigate and implement a fix. Create post-incident reviews: after every incident, capture lessons learned and update runbooks and automation. Build incident trend analysis: is the incident rate increasing or decreasing? Are incidents being handled better? Create prevention metrics: "X incidents prevented through canary deployment rollback."
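The baseline-comparison step reduces to a small decision function. A sketch, assuming hypothetical error rates expressed as fractions (the 5x multiplier mirrors the "5x higher" example above; the `min_rate` floor is an assumption to avoid rolling back on statistically meaningless error counts):

```python
def should_rollback(baseline_error_rate, canary_error_rate,
                    multiplier=5.0, min_rate=0.001):
    """Roll back the canary if its error rate is `multiplier` times
    the pre-deploy baseline. Rates below `min_rate` (an assumed floor)
    are treated as noise and never trigger a rollback."""
    if canary_error_rate < min_rate:
        return False  # too few errors to be meaningful
    return canary_error_rate >= multiplier * baseline_error_rate

# Baseline 0.2% errors; canary at 1.2% is 6x worse -> roll back.
print(should_rollback(0.002, 0.012))  # → True
# Canary at 0.4% is only 2x worse -> keep watching.
print(should_rollback(0.002, 0.004))  # → False
```

In practice the two rates would come from the same PromQL query evaluated over the baseline window and the canary window, so the comparison is apples-to-apples.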

Follow-up: Your canary deployment caught the bug (rolled back after 5 minutes), but the 5-minute window was enough to corrupt data that is now stuck in replicas. The rollback prevented a user-facing issue, but the data is still corrupted. How do you handle data consistency during an incident?

Your OnCall schedule has 50 people on-call across 3 time zones. Determining who's on-call for which service at any given time is complex. An incident fires at 3 AM and the wrong person is paged (someone not qualified for that service). They escalate, and time is wasted. Design an intelligent on-call scheduling system.

Implement smart on-call scheduling:

1. Service-based schedules: each service has its own on-call rotation; the Platform service has one rotation, the Payment service another.
2. Qualification tracking: track which team members are qualified for each service. Only qualified members go into the rotation.
3. Timezone-aware scheduling: on-call schedules respect time zones; the London team covers London hours, the US team covers US hours.
4. Handoff automation: at handoff time (e.g., 9 AM London, when the London team takes over from the US), OnCall updates automatically.
5. Escalation policies: define escalation chains: page the on-call engineer; if no ack in 5 minutes, page the manager; if still no ack, page the director.
6. Backfill capability: if the on-call engineer is unavailable (sick, on vacation), the system auto-finds a backfill.
7. Override system: if the normal on-call engineer should not be paged (e.g., they are in a meeting), allow an override: "page the backup engineer instead."

Implement an on-call dashboard showing the current on-call person per service, their qualifications, and recent incidents they handled. Build notifications ("you're on-call tomorrow for the Payment service") to remind engineers before their shift. For burnout prevention, implement on-call fairness: track who has been on-call most recently and balance load across the team; no one should be on-call too frequently. Create an on-call runbook ("you're on-call, here's what to do") covering the escalation policy, common issues, and relevant dashboards. Test the scheduling: simulate multiple incidents and verify the correct people are paged.
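Qualification tracking plus fairness can be combined in one selection rule: among engineers qualified for the service, pick whoever has gone longest without a shift. A minimal sketch with a hypothetical roster (names, services, and dates are invented for illustration):

```python
from datetime import date

# Hypothetical roster: who is qualified for which service and when
# they last held the pager.
roster = [
    {"name": "alice", "services": {"payment", "platform"},
     "last_on_call": date(2024, 3, 1)},
    {"name": "bob", "services": {"payment"},
     "last_on_call": date(2024, 3, 10)},
    {"name": "carol", "services": {"platform"},
     "last_on_call": date(2024, 2, 20)},
]

def next_on_call(service, roster):
    """Pick the qualified engineer whose last shift was longest ago."""
    qualified = [p for p in roster if service in p["services"]]
    if not qualified:
        raise ValueError(f"no one is qualified for {service}")
    return min(qualified, key=lambda p: p["last_on_call"])["name"]

print(next_on_call("payment", roster))   # → alice
print(next_on_call("platform", roster))  # → carol
```

The `ValueError` branch is exactly the knowledge-silo risk the follow-up raises: if only a handful of people qualify for a service, the same names keep coming back.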

Follow-up: Your qualification-based scheduling works, but only 3 people are qualified for the critical Database service. They rotate frequently and get burned out. How do you handle knowledge silos?

An incident occurs and escalates through the on-call rotation: the primary doesn't respond (asleep), so it escalates to the secondary (same time zone, also asleep), then to the tertiary (different time zone, awake). But the tertiary engineer isn't qualified for this service. Decision-making is slow and the issue propagates. Design an incident response automation system that handles decision-making.

Implement automated incident response:

1. Automatic remediation: OnCall can execute remediation actions without a human decision: restart a service, increase replica count, drain traffic. Define policies such as "if error rate >10%, trigger auto-remediation."
2. Chatbot integration: OnCall slash commands in Slack ("/page-on-call", "/rollback-last-deployment", "/scale-up-database") reduce friction.
3. Runbook automation: OnCall stores runbooks (shell scripts, Kubernetes commands) and can auto-execute runbook steps.
4. Approval workflow: dangerous actions (deleting data) require approval; non-dangerous actions (restarts) execute automatically.
5. Decision trees: OnCall can follow decision trees ("if CPU >90%, restart the service; if the problem persists, scale up"), automating complex logic.
6. Feedback loops: after automatic remediation, monitor metrics. If the remediation worked (error rate dropped), declare success; if not, escalate.
7. Learning: over time, OnCall learns which remediations work best for which issues, improving automation.

Implement guardrails: automatic remediation has limits (max 3 auto-restarts, then human approval is required), preventing runaway automation. Build an audit trail: every automated action is logged with context, enabling investigations like "why did it auto-restart 5 times?" Create an approval dashboard showing pending approvals, who approved, and the decision reasoning. Test the automation: simulate incidents and verify OnCall executes the correct remediation. Create a runbook: "what remediations can OnCall auto-execute?"
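The "max 3 auto-restarts" guardrail is essentially a rate limiter over a sliding window. A sketch, assuming a hypothetical `RestartGuard` wrapper around whatever actually performs the restart (the class and its parameters are invented for illustration):

```python
import time

class RestartGuard:
    """Allow at most `max_restarts` automatic restarts per `window`
    seconds; beyond that, refuse and let a human take over."""

    def __init__(self, max_restarts=3, window=3600):
        self.max_restarts = max_restarts
        self.window = window
        self.history = []  # timestamps of restarts we permitted

    def allow_restart(self, now=None):
        """Return True if an automatic restart is still permitted."""
        now = time.time() if now is None else now
        # Drop restarts that have aged out of the sliding window.
        self.history = [t for t in self.history if now - t < self.window]
        if len(self.history) >= self.max_restarts:
            return False  # budget exhausted: escalate to a human
        self.history.append(now)
        return True

guard = RestartGuard(max_restarts=3, window=3600)
results = [guard.allow_restart(now=t) for t in (0, 60, 120, 180)]
print(results)  # → [True, True, True, False]
```

The fourth attempt within the hour is refused, which is where the escalation policy (page an engineer with the audit trail attached) should kick in.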

Follow-up: Your auto-remediation restarted the service 10 times automatically. The underlying issue was a memory leak, so restarting doesn't fix it; each restart is only a temporary reprieve. OnCall kept auto-restarting, wasting time. How do you prevent infinite restart loops?

A critical incident occurs affecting 100K customers. The incident escalates to you, a senior engineer. Your incident context is scattered across OnCall, Grafana dashboards, Slack messages, and runbooks. You need a unified command center to make critical decisions fast. Design an incident command center for emergencies.

Implement an incident command center:

1. War room integration: OnCall provides a video conference link; all responders join a unified war room.
2. Unified timeline: OnCall displays the incident timeline in the war room (alert → context → investigation → decisions), so all participants see the same timeline.
3. Metric dashboard: display the key metrics in the war room, with live updates: error rate, latency, customer impact.
4. Role assignments: incident commander, communications lead, tech lead, on-call engineer. OnCall manages roles and ensures clear responsibilities.
5. Decision recording: all decisions made in the war room are recorded and logged: "Decision: roll back to v1.2.3 at 3:15 PM."
6. Communication discipline: only the incident commander speaks to customers and press; the tech lead focuses on the technical response.
7. Status updates: OnCall auto-generates status page updates ("we're investigating the issue; current impact: X customers; ETA: Y minutes"), reducing customer anxiety.

Implement an escalation decision tree: if the incident affects >1M customers, escalate to a VP; if <1K, the incident commander handles it. Create incident command training covering roles, responsibilities, and communication protocols, and simulate major incidents quarterly. Build a post-incident review process: after each incident, capture lessons (what worked, what didn't, how to prevent a recurrence). Create metrics: incident severity distribution, MTTR by severity, customer impact. Test the war room: simulate a critical incident and verify all systems work (video, metrics, communication).
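The escalation decision tree above can be written down directly. A sketch: the >1M and <1K thresholds come from the text, while the middle tier (commander handles it but keeps management informed) is an assumption filled in for completeness:

```python
def escalation_level(customers_affected):
    """Map customer impact to incident ownership. The >1M and <1K
    thresholds come from the decision tree in the text; the middle
    tier is an assumed policy."""
    if customers_affected > 1_000_000:
        return "escalate to VP"
    if customers_affected < 1_000:
        return "incident commander handles"
    return "incident commander, manager informed"  # assumed middle tier

print(escalation_level(2_000_000))  # → escalate to VP
print(escalation_level(100_000))    # → incident commander, manager informed
print(escalation_level(500))        # → incident commander handles
```

Encoding the tree as code (rather than tribal knowledge) means the war room tooling can display the required escalation automatically as the impact estimate changes.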

Follow-up: Your war room is well organized, but the technical issue is complex, and the incident commander (a manager without deep technical knowledge) makes a wrong decision (rolling back to the wrong version). The damage compounds. How do you ensure good technical decisions in high-pressure situations?

Incidents are common on your team: an average of 3 per week. Response is consistent, but root causes vary: sometimes a deployment, sometimes a third-party outage, sometimes a capacity issue. You want to consolidate learning so similar issues don't recur. Design a knowledge management system within OnCall.

Implement incident knowledge management:

1. Post-incident reviews: after every incident, capture the root cause, what worked, what didn't, and how to prevent a recurrence. Store this in the OnCall knowledge base.
2. Pattern recognition: OnCall analyzes incident patterns ("third-party API timeouts happen every Tuesday at 8 PM; likely a rate limit") and alerts the team proactively.
3. Solution catalog: maintain runbooks for known issues ("database connection pool exhausted → restart the application → scale up DB connections") and link from incidents to solutions.
4. Timeline templates: for recurring incidents, OnCall pre-populates the timeline with likely events (alert fired, diagnosis, remediation), speeding up response.
5. Escalation learning: track which escalations helped ("paging the database expert reduced MTTR by 50%") and improve escalation policies accordingly.
6. Metrics tracking: track incident metrics over time. Is MTTD improving? Is MTTR stable? Is the error rate decreasing? This shows whether improvements are working.
7. Knowledge decay: old solutions go stale as code changes and the system evolves; OnCall periodically flags old runbooks for review and update.

Implement incident classification: categorize by root cause, severity, and service, and make incidents queryable ("show all database incidents in 2026"). Build analytics: what is the most common root cause? Which incidents were slowest? Which team has the highest incident rate? Create AI-powered suggestions: when an incident occurs, OnCall suggests similar past incidents and their resolutions. For critical incidents, trigger a mandatory post-incident review and publish the findings to the team. Create knowledge sharing: a monthly incident retrospective where the team discusses patterns and prevention strategies.
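The "suggest similar past incidents" idea doesn't have to start with an LLM: even crude keyword overlap between incident summaries surfaces matches from the solution catalog. A sketch using Jaccard similarity over word sets (the summaries, fixes, and threshold are invented for illustration, and a real system would use embeddings or search):

```python
def similarity(a, b):
    """Jaccard overlap of word sets: a crude stand-in for the
    AI-powered matching described above."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Hypothetical knowledge base of (summary, resolution) pairs.
past_incidents = [
    ("db connection pool exhausted", "restart app, scale up db connections"),
    ("third-party api timeout", "enable circuit breaker, retry with backoff"),
]

def suggest(new_summary, past, threshold=0.2):
    """Return the resolution of the most similar past incident,
    or None if nothing clears the (assumed) threshold."""
    scored = [(similarity(new_summary, s), s, fix) for s, fix in past]
    best = max(scored)
    return best[2] if best[0] >= threshold else None

print(suggest("db connection pool exhausted again", past_incidents))
# → restart app, scale up db connections
```

The threshold keeps the tool quiet when nothing genuinely similar exists, which matters: a bad suggestion during an incident is worse than none.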

Follow-up: Your post-incident reviews identify root causes, but recommendations are often not implemented. "Fix the race condition in service X" is logged, but the fix is never prioritized. Incidents repeat. How do you ensure accountability for prevention?
