You're deploying Grafana in a remote field location with unreliable internet. A managed PostgreSQL database in the cloud won't work (too much latency, frequent disconnections). You're considering using SQLite (local file-based database) for Grafana's metadata. Compare SQLite vs. PostgreSQL tradeoffs for this scenario.
Evaluate SQLite vs PostgreSQL:
(1) SQLite advantages: no network dependency (works offline), zero configuration, small footprint (single file), well suited to edge deployments.
(2) SQLite disadvantages: single-writer model (limited concurrency), no clustering or replication, file-based (vulnerable to corruption if the server crashes mid-write).
(3) PostgreSQL advantages: robust replication, clustering, concurrent writes; network access enables offsite backups and high availability.
(4) PostgreSQL disadvantages: network dependent (a problem at an edge location), more complex setup, higher resource overhead.
For this edge deployment, SQLite is the appropriate choice. Setup:
(1) Run Grafana with the SQLite backend on the edge server.
(2) Enable WAL (write-ahead logging) mode to protect against crashes.
(3) Back up the SQLite database file locally every hour, using SQLite's online backup API rather than a plain file copy (copying a live database can capture a corrupt snapshot).
(4) Sync to the cloud when the network is available: upload backups to S3.
(5) On network recovery, sync any accumulated changes.
Data resilience: regular backups plus WAL protect against data loss; if Grafana crashes, WAL recovery rebuilds a consistent database. Test the failure scenarios (sudden power loss, network disconnection, disk full) and verify SQLite handles them gracefully. On scalability: if the edge location grows (say, 50 users) and concurrent access increases, SQLite's single-writer model becomes a bottleneck. Plan a migration to PostgreSQL when that happens.
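The WAL and hourly-backup steps above can be sketched with Python's built-in sqlite3 module. This is an illustrative helper, not Grafana internals; the paths and function names are assumptions. The key point is enabling WAL mode and using the online backup API instead of copying the live file:

```python
import sqlite3

def enable_wal(db_path: str) -> str:
    """Switch the database to write-ahead logging for crash resilience."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.close()
    return mode  # "wal" on success

def backup_grafana_db(src_path: str, dest_path: str) -> None:
    """Take a consistent backup of a live SQLite database.

    Uses sqlite3's online backup API, which is safe while Grafana is
    writing; a plain file copy of a live database can capture a
    corrupt snapshot.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    src.backup(dest)  # page-by-page online copy
    dest.close()
    src.close()
```

Scheduling this hourly (cron, systemd timer) and pointing the S3 sync at the backup file, never the live database, gives the resilience described above.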
Follow-up: Your SQLite deployment is working, but a network outage lasts 2 weeks. The edge server can't sync backups. The SQLite database fills to 10GB. When network recovers, sync catches up but takes 6 hours. Teams are blocked. How would you handle extended offline periods?
Your Grafana deployment uses PostgreSQL, but you're hitting PostgreSQL limits: 100K dashboards, 1M saved searches, database queries take 5+ seconds. PostgreSQL can't scale further (query optimizer struggles). Design a migration strategy to scale Grafana metadata storage.
Implement scalable metadata storage:
(1) Sharding: partition dashboards by org/team, each partition in a separate database; Grafana routes queries to the correct shard.
(2) Read replicas: queries go to read-only replicas, writes go to the primary; replicas scale independently.
(3) Cache layer: frequently accessed dashboards (popular, recently used) are cached in Redis; queries hit the cache first.
(4) Archive old data: dashboards unused for a year are archived to cold storage (S3), shrinking the active database.
(5) Search optimization: use Elasticsearch for full-text search (dashboard names, descriptions); PostgreSQL handles exact-match queries only.
(6) Denormalization: pre-compute common queries (dashboard count by service, most-popular dashboards); store the results and refresh them daily.
(7) Incremental updates: when a dashboard changes, write only the deltas rather than rewriting the full record, reducing database load.
Test sharding: query dashboards distributed across shards and verify latency doesn't increase. Implement query routing that hides sharding from clients. For migration, run the systems in parallel (PostgreSQL plus the new storage) temporarily, sync data, then cut over. Measure the improvement: latency before vs. after. Monitor database query latency, cache hit rate, and replica replication lag.
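The shard-routing idea, hiding partitioning from clients, can be sketched as follows. This is a hypothetical router: the DSN list and hashing scheme are assumptions, and a production version would also handle resharding and replicas. The essential property is a stable hash so the same org always lands on the same shard:

```python
import zlib

class ShardRouter:
    """Map an org id to the database shard holding its dashboards."""

    def __init__(self, shard_dsns: list):
        self.shard_dsns = shard_dsns

    def shard_for_org(self, org_id: int) -> str:
        # zlib.crc32 is stable across processes and runs, unlike
        # Python's built-in hash(), which is randomized per process.
        idx = zlib.crc32(str(org_id).encode()) % len(self.shard_dsns)
        return self.shard_dsns[idx]
```

Clients call `shard_for_org` (or a thin data-access layer wrapping it) and never see the physical layout, which is what makes later resharding possible.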
Follow-up: Your sharding splits databases, but cross-shard queries (show dashboards from all teams) now require querying all shards, then merging results. This is slow. How do you handle queries spanning shards?
You're migrating Grafana from SQLite to PostgreSQL for production scale. 1M dashboards need to be migrated. The migration takes 6 hours (read from SQLite, write to PostgreSQL). During those 6 hours, Grafana must keep serving (read-only). Design a zero-downtime migration.
Implement a zero-downtime migration:
(1) Dual-write: new Grafana instances write to both SQLite and PostgreSQL for an overlap period (one week), keeping both systems in sync.
(2) Read from primary: during dual-write, reads come from SQLite (the primary) while PostgreSQL warms up.
(3) Validation phase: before cutover, validate that PostgreSQL has all the data; run reconciliation (record counts, sampled spot-checks).
(4) Cutover window: during a planned maintenance window, perform a final sync of the latest changes, then switch reads to PostgreSQL.
(5) Fallback plan: if PostgreSQL fails after cutover, roll back to SQLite within one minute.
(6) Verify correctness: after cutover, confirm dashboards load correctly; sample 10% of them.
(7) Archive: once PostgreSQL has been primary and stable for a week, archive the SQLite file as a backup.
For high-traffic services, stagger the migration: migrate 10% of dashboards first, validate, then migrate the next 10%. The total window stretches to seven days, but traffic is never fully blocked. Run continuous validation during migration: spot-check migrated dashboards and alert on discrepancies. Write a migration runbook: step-by-step procedure, rollback procedure, and what to monitor during migration.
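The dual-write and reconciliation phases can be sketched as follows, using two in-memory SQLite connections as stand-ins for the old and new backends; the table and column names are assumptions, not Grafana's schema:

```python
import sqlite3

class DualWriter:
    """Write each dashboard change to both stores during the overlap
    period; reads still come from the primary (old) store."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        for db in (self.primary, self.secondary):
            db.execute(
                "CREATE TABLE IF NOT EXISTS dashboard (uid TEXT PRIMARY KEY, title TEXT)"
            )

    def save_dashboard(self, uid: str, title: str) -> None:
        # Apply the same upsert to both backends.
        for db in (self.primary, self.secondary):
            db.execute("INSERT OR REPLACE INTO dashboard VALUES (?, ?)", (uid, title))
            db.commit()

    def reconcile(self) -> bool:
        """Validation phase: compare row counts before cutover.

        A real reconciliation would also spot-check sampled rows.
        """
        counts = [
            db.execute("SELECT COUNT(*) FROM dashboard").fetchone()[0]
            for db in (self.primary, self.secondary)
        ]
        return counts[0] == counts[1]
```

Cutover then amounts to flipping which connection serves reads once `reconcile()` (plus spot-checks) passes.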
Follow-up: During your cutover, a dashboard query in PostgreSQL returns different results than SQLite (schema mismatch). You realize migration didn't preserve some field types. Rollback to SQLite. Now you're stuck. How do you fix schema issues during migration?
Your PostgreSQL Grafana backend is deployed in a single region. If the database goes down, all Grafana instances lose connectivity (can't load dashboards). You need HA for the database. Should you use self-managed PostgreSQL with manual failover, managed PostgreSQL with automatic failover, or something else? Evaluate the options.
Evaluate HA options:
(1) Manual PostgreSQL failover: primary plus replica; if the primary fails, manually promote the replica. Downtime: 5-10 minutes (detection plus promotion).
(2) Automated failover (Patroni, with etcd as the consensus store): primary/replicas with Patroni handling health monitoring and automatic promotion. Downtime: under a minute.
(3) Managed database (RDS, Cloud SQL): the cloud provider handles failover automatically. Downtime: under 30 seconds (per the cloud SLA).
(4) Multi-primary across regions (active-active): both regions serve writes. Complex, with eventual-consistency pitfalls.
For a critical Grafana deployment, use a managed database or Patroni; automated failover reduces MTTR significantly. On cost, Patroni is cheaper (DIY on VMs) while managed is more expensive but operationally simpler. Implementation:
(1) Use a managed database with automatic failover.
(2) Add read replicas in a different AZ to distribute read load.
(3) Replicate backups to a different region (for point-in-time recovery).
(4) Monitor database health (replication lag, failover events) and alert on issues.
(5) Test failover monthly: fail the primary, verify the replica promotes and applications reconnect.
(6) Disaster recovery: if an entire region fails, restore from the backup in another region. Recovery time: 1-2 hours.
Write runbooks: "database failover procedure" and "disaster recovery procedure."
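The "applications reconnect" part of the monthly failover test depends on client-side retry logic. A minimal reconnect-with-backoff wrapper (a hypothetical helper, not a Grafana feature) shows how application-side retries ride out a brief failover window:

```python
import random
import time

def with_reconnect(op, connect, max_attempts=5, base_delay=0.5):
    """Run a database operation, reconnecting with exponential backoff
    if the connection drops (e.g., during a failover).

    `op` takes a connection and returns a result; `connect` returns a
    fresh connection. ConnectionError stands in for whatever error the
    real database driver raises on a dropped connection.
    """
    conn = connect()
    for attempt in range(max_attempts):
        try:
            return op(conn)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; surface the failure
            # Jittered backoff so many instances don't hammer the
            # newly promoted primary in lockstep.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random() / 2))
            conn = connect()
```

With roughly 30-second failovers, five attempts starting at a 0.5 s delay comfortably outlasts the outage without the caller ever seeing an error.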
Follow-up: Your managed database failover works, but during failover, connections fail for 30 seconds. Grafana instances can't reconnect. You need more elegant connection handling. How would you implement transparent reconnection?
Your Grafana instance stores sensitive data in PostgreSQL: datasource credentials, API tokens, user emails. A database administrator should not have access to this sensitive data. But they need to manage database schema, backups, etc. Design a system for protecting sensitive data in the database.
Implement sensitive-data protection:
(1) Encryption at rest: encrypt at the storage layer (cloud-managed disk encryption, LUKS). Note that PostgreSQL has no built-in transparent data encryption; pgcrypto provides column-level encryption functions, not at-rest encryption. Stolen database files are then unreadable.
(2) Field-level encryption: sensitive fields (secrets, tokens) are encrypted before being stored; only the Grafana application, which holds the keys, can decrypt them.
(3) Role-based access: database roles with limited permissions; the DBA role can manage schema but cannot read sensitive columns.
(4) Audit logging: all access to sensitive fields is logged; a DBA reading secrets triggers an alert.
(5) Key management: encryption keys live outside the database (AWS KMS, HashiCorp Vault).
(6) Backup encryption: backups are encrypted, so a stolen backup is unreadable.
(7) Redaction: logs and exports redact sensitive data; DBAs see "***" instead of actual secrets.
Implement zero-knowledge backups: backups are encrypted with a key unavailable to DBAs, so only Grafana can decrypt them. Write a data-classification policy: which fields are sensitive, and what encryption each requires. Build an audit dashboard tracking who accessed sensitive fields and when; alert on unauthorized access. Test the controls: simulate a DBA trying to read sensitive data and verify the encryption prevents it. Write a runbook: "managing sensitive data in the Grafana database."
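Item (2) leaves one practical gap: equality queries against encrypted fields. A common answer is a blind index: store a keyed HMAC of the plaintext next to the ciphertext and filter on that indexed column, so the database never decrypts anything at query time. A minimal sketch with the standard library follows; the actual ciphertext would come from a real cipher (e.g., via a library such as `cryptography`, omitted here), and the key handling shown is an assumption for illustration:

```python
import hashlib
import hmac

# Assumption: in production this key lives in KMS/Vault and is held
# only by the application, never stored in the database.
INDEX_KEY = b"separate-key-held-by-the-app"

def blind_index(value: str) -> str:
    """Deterministic keyed hash stored alongside the ciphertext.

    Equality lookups filter on this column (which can be indexed),
    so rows are never decrypted during the query and a DBA sees only
    opaque digests.
    """
    return hmac.new(INDEX_KEY, value.encode(), hashlib.sha256).hexdigest()

# At write time: store (ciphertext, blind_index(plaintext)).
# At query time: ... WHERE token_idx = blind_index(searched_value)
```

The tradeoff: a blind index supports only exact-match lookups (identical plaintexts produce identical digests), not range or substring queries, which is usually acceptable for tokens and credentials.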
Follow-up: Your field-level encryption works, but querying encrypted fields is slow (decryption for every row). A query filtering by encrypted field takes 1 minute. How do you support queries on encrypted fields efficiently?