Grafana Interview Questions

Transformations and Data Manipulation


Your dashboard shows raw database query latencies (microseconds). Users want to see latencies in milliseconds, percentiles (P50, P95, P99), and aggregated by endpoint. Creating separate queries for each transformation is cumbersome. Design a transformation system that lets teams transform query results flexibly.

Implement rich transformation pipelines:

1. Unit conversion: convert microseconds to milliseconds, bytes to MB, and so on.
2. Aggregation: group by endpoint; compute sum/avg/min/max across queries.
3. Percentile computation: calculate P50, P95, P99 from results.
4. Time-series math: divide one series by another, multiply, add constants.
5. Filtering: filter rows, e.g. "only show results where latency > 100ms."
6. Value mapping: map values to states, e.g. latency > 1s = "red", < 100ms = "green".
7. Ordering: sort results by metric, latency, etc.

Build a transformation UI with visual blocks representing each transformation; users drag and drop to reorder steps and preview output, which makes transformations discoverable. Create a transformation library: pre-built transformations for common patterns (percentile calculation, unit conversion, aggregation) that users pull in and customize with parameters. For complex transformations, allow scripting (custom JavaScript or Go expressions). Implement live preview: as users build the pipeline, show a sample of the transformed output so errors are caught early. Build a transformation debugging view that shows input → transformation 1 → transformation 2 → ... → output, so users can trace where data is transformed. Store transformation pipelines in Git: export as JSON, version control them, and reuse them across dashboards. Create documentation with examples ("how to calculate P95 latency", "how to group by service") and provide copy-pasteable transformation configs.
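A minimal sketch of how such a pipeline could compose, with unit conversion and percentile computation as ordered steps. The function names (`to_milliseconds`, `summarize`, `run_pipeline`) are hypothetical illustrations, not Grafana APIs:

```python
def to_milliseconds(samples_us):
    """Unit-conversion step: microseconds -> milliseconds."""
    return [v / 1000.0 for v in samples_us]

def percentile(samples, p):
    """Nearest-rank percentile of a sample list."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[idx]

def summarize(samples_ms):
    """Percentile-computation step: P50/P95/P99 over the samples."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}

def run_pipeline(data, steps):
    """Apply transformation steps left to right, like UI blocks."""
    for step in steps:
        data = step(data)
    return data

# Raw query latencies in microseconds, transformed in one pass.
latencies_us = [1200, 850, 4300, 950, 120000, 700, 980, 1100, 2500, 640]
result = run_pipeline(latencies_us, [to_milliseconds, summarize])
```

Because each step is a plain function over the previous step's output, reordering the UI blocks maps directly to reordering the `steps` list.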

Follow-up: Your transformation pipeline is complex (5 transformations, custom scripts). A junior developer copies it without understanding it. When the underlying data changes, the transformation breaks. How do you make transformations robust?

A dashboard query returns 10K time series. Your transformation needs to keep only top-10 series by median value, then aggregate the rest as "others". Currently, this requires manual post-processing outside Grafana. Design a transformation system that handles high-cardinality data reduction.

Implement cardinality reduction transformations:

1. Top-K filtering: keep the top K series by a metric (e.g. sort by median, limit 10).
2. Tail aggregation: aggregate the remaining series into "others".
3. Percentile filtering: keep only series whose values fall in a top percentile (P90+).
4. Bucketing: group series into histogram-like buckets and aggregate each bucket.
5. Sampling: uniformly sample K series from all results.
6. Thresholding: drop series below a threshold value.
7. Cardinality limiting: enforce a maximum cardinality; if there are more than N series, auto-truncate to the top N.

Implement a before/after preview for cardinality reduction ("Original: 10K series. After filtering: 15 series.") so users understand the impact. Create smart defaults: analyze incoming data cardinality and auto-suggest appropriate filtering (e.g., "your query returns 5K series; suggest keeping the top 50"). Optimize for performance: transformations that reduce cardinality early in the pipeline speed up everything downstream, so run them first. For debugging, log which series were filtered and why: "Series 'host-5' filtered: median=5ms (rank #500 of 10K)." Implement drill-down into filtered data: users can click the "others" aggregation to see which series it contains. Test cardinality scenarios at 10K, 100K, and 1M series, and verify transformations handle them without crashing. Provide a runbook: "my dashboard shows only 15 series, but I need to see more; how do I adjust filtering?"
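A sketch of top-K filtering plus tail aggregation under one simplifying assumption: all series share aligned timestamps, so the "others" series can be summed point-wise. `top_k_with_others` is a hypothetical helper, not a Grafana function:

```python
from statistics import median

def top_k_with_others(series, k):
    """Keep the top-k series by median value; merge the rest into
    a single 'others' series by summing point-wise."""
    ranked = sorted(series.items(), key=lambda kv: median(kv[1]), reverse=True)
    kept = dict(ranked[:k])
    rest = [values for _, values in ranked[k:]]
    if rest:
        # Assumes aligned timestamps: sum the i-th point of every tail series.
        kept["others"] = [sum(points) for points in zip(*rest)]
    return kept

series = {
    "api":    [120, 130, 125],
    "web":    [40, 45, 42],
    "batch":  [5, 6, 7],
    "worker": [9, 8, 10],
}
reduced = top_k_with_others(series, 2)
# "api" and "web" survive; "batch" and "worker" collapse into "others".
```

Keeping the per-series rank (here, the median used for sorting) alongside the result is what makes the "Series 'host-5' filtered: rank #500" style of logging possible.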

Follow-up: Your top-10 filtering keeps the same 10 series consistently, even if rank changes. Over time, a new series rises to top-3 but is still filtered out (not in original top-10). How would you handle dynamic top-K updates?

Your dashboard joins data from two different datasources (Prometheus metrics + Elasticsearch logs). A transformation needs to correlate them: match logs to metrics by request_id, combine into single result set. Grafana doesn't natively support cross-datasource joins. Design a transformation system for data integration.

Implement cross-datasource transformation and joining:

1. Multi-query support: allow a panel to combine results from multiple queries (Prometheus + Elasticsearch).
2. Join condition: specify the join key, e.g. correlate by request_id or trace_id.
3. Join types: support inner join (only matching records) plus left, right, and full outer joins.
4. Deduplication: combine duplicate records that share a join key.
5. Field mapping: map fields from different datasources onto a common schema (rename, reorder).
6. Aggregation after join: group by request_id, count matching logs, compute derived metrics.
7. Error handling: handle mismatches gracefully; if Prometheus has request X but Elasticsearch doesn't, decide how to represent it.

Implement a join builder UI: a visual interface to drag and drop queries, define join conditions, and preview joined results. Create join templates: pre-built joins for common patterns (metrics + logs by trace_id, metrics + events by timestamp). For performance, implement smart join strategies: if one dataset is much larger, treat it as the fact table and build an index on the smaller dataset's join key for fast lookups. Build debugging tools that show join statistics ("X records from Prometheus, Y from Elasticsearch, Z matched") and highlight unmatched records. Test matching joins, non-matching joins, and skewed datasets; verify correctness and performance. For production use, document join patterns and performance expectations, and provide a runbook with examples: "joining metrics and logs by trace_id."

Follow-up: Your join works, but combining 1M Prometheus metrics with 10M Elasticsearch logs degenerates toward a Cartesian product and causes a memory explosion; the join crashes. How would you optimize large-scale joins?

Your transformation creates a new calculated field: (CPU_usage + Memory_usage) / 2. But if either metric is missing (null), the calculation fails silently, returning no data. You can't debug which metrics are missing. Design error handling and observability for transformations.

Implement transformation debugging and error handling:

1. Null handling: make null handling explicit; if CPU is null, use 0, skip the row, or mark it as an error.
2. Error messages: show clear errors, e.g. "CPU_usage is missing for host-1; skipping calculation."
3. Error tracking: count errors per transformation, e.g. "50 rows failed calculation due to missing CPU."
4. Partial results: return partial results with error metadata, so users see what worked and what failed.
5. Fallback values: specify fallbacks; if CPU is null, use the previous value or default to 0.
6. Validation rules: define validations such as "CPU should be 0-100%" and flag out-of-range values.
7. Debug mode: enable verbose logging that shows intermediate results after each transformation step.

Implement error rate monitoring: a dashboard showing the error rate by transformation, with an alert if it exceeds a threshold (e.g., >10% of rows failed). Build drill-down: users can click an error and see a sample of the failed rows to diagnose the issue. Create an error recovery UI: when a transformation fails, suggest fixes, e.g. "Missing CPU values for 5 hosts. Show as 0? Skip row? Use previous value?" For compliance, maintain an audit log of all transformation errors with context. Implement transformation unit tests: teams write input → expected output tests for their transformations and validate them before deployment. Test failure scenarios (missing values, out-of-range values, type mismatches) and verify graceful handling. Create a runbook, "debugging transformation errors," with steps for investigating failures.
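A sketch of the calculated field from the question, (CPU + memory) / 2, with explicit null handling instead of silent failure. The `avg_usage` helper and its skip/fallback strategies are illustrative, not a Grafana API:

```python
def avg_usage(rows, fallback=None):
    """Calculated field (cpu + mem) / 2 with explicit null handling:
    rows missing a field are skipped and counted, never dropped silently."""
    results, errors = [], []
    for row in rows:
        cpu, mem = row.get("cpu"), row.get("mem")
        if cpu is None or mem is None:
            if fallback is not None:           # fallback strategy: substitute
                cpu = fallback if cpu is None else cpu
                mem = fallback if mem is None else mem
            else:                              # skip-row strategy: record error
                missing = "cpu" if cpu is None else "mem"
                errors.append(f"{missing} missing for {row['host']}; skipping")
                continue
        results.append({"host": row["host"], "avg": (cpu + mem) / 2})
    return results, errors                     # partial results + error metadata

rows = [{"host": "h1", "cpu": 40, "mem": 60},
        {"host": "h2", "cpu": None, "mem": 50}]
ok, errs = avg_usage(rows)
```

Returning `(results, errors)` together is what enables partial results with error metadata: the panel can render the good rows while surfacing "1 row failed: cpu missing for h2."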

Follow-up: Your error handling is verbose, and the on-call engineer is overwhelmed by 100K logged error rows; they can't see the signal in the noise. How would you summarize errors intelligently?

Your dashboard has complex transformations. When underlying data changes (new metrics available, old metrics removed), transformations break. Users don't know what changed. Design a transformation system that's resilient to schema changes.

Implement schema-aware transformations:

1. Schema versioning: track the schema version, i.e. which fields exist and their data types.
2. Field mapping: define alternative field names; if 'request_latency' is missing, try 'latency' or 'response_time'.
3. Type coercion: handle type mismatches, e.g. string "123" → number 123.
4. Missing field handling: choose an explicit strategy when a field is missing: skip the transformation, use a default, or error.
5. Schema validation: before executing a transformation, validate that all required fields exist, and alert if not.
6. Migration guides: when the schema changes, publish a migration guide for transformations: "field 'error_count' was renamed to 'error_rate'; update your transformations."
7. Versioned transformations: allow transformations to target a specific schema version, so old transformations keep working as the schema evolves.

Implement automatic field detection: as users build transformations, the system suggests available fields and offers alternatives when a field is missing. Build schema change alerts: when new fields appear in query results, or existing fields disappear, alert and offer a transformation update. Create transformation linting: scan transformations for potential schema issues, e.g. "Using field 'request_latency', but only 'latency' is available. Did you mean 'latency'?" For rollout, provide a dry run: execute the transformation against the new schema in a sandbox and show what would happen. Implement a field mapping UI: visually map old fields to new fields during schema evolution. Test schema evolution by simulating added, removed, and renamed fields, and verify transformations adapt gracefully. Create a runbook: "the schema changed; how do I update my transformations?"
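A sketch of field mapping with alternative names plus type coercion, points (2) and (3) above. The alias table and `resolve_field` helper are hypothetical, not part of Grafana:

```python
# Canonical field name -> accepted aliases, tried in order.
FIELD_ALIASES = {
    "request_latency": ["request_latency", "latency", "response_time"],
}

def resolve_field(row, canonical):
    """Try the canonical field name, then known aliases; coerce
    numeric strings to float. Returns (value, resolved_name)."""
    for name in FIELD_ALIASES.get(canonical, [canonical]):
        if name in row:
            value = row[name]
            if isinstance(value, str):
                value = float(value)       # type coercion: "87.5" -> 87.5
            return value, name
    # Missing-field strategy here is "error"; callers could instead
    # return a default or skip the row.
    raise KeyError(f"none of {FIELD_ALIASES.get(canonical, [canonical])} present")

# Schema renamed the field, but the alias table absorbs the change.
row = {"response_time": "87.5", "host": "h1"}
value, resolved = resolve_field(row, "request_latency")
```

Returning the resolved name alongside the value gives linting the information it needs to warn: "transformation asked for 'request_latency' but matched 'response_time'."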

Follow-up: Your schema migration guide says "rename error_count to error_rate", but teams misinterpret: some use error_count/total_requests (rate), others use raw count. Results diverge. How do you enforce semantic correctness?

Your team uses transformations for complex business logic: calculate customer lifetime value, segment by engagement tier, score for risk. This logic is duplicated across 50 dashboards. Maintaining consistency is painful: a logic bug in one place needs fixing in 50 places. Design a system for reusable, version-controlled transformation logic.

Implement reusable transformation libraries:

1. Transformation functions: define reusable functions such as ltv(revenue, months), engagement_tier(activity_score), risk_score(churn_rate).
2. Function registry: store all functions centrally (Git-tracked) and version each one.
3. Library dependencies: transformations can use other transformations as building blocks.
4. Versioned imports: dashboards pin the function version they use ("use ltv v2.1"), enabling evolution without breaking old dashboards.
5. Function testing: unit tests for each function (test inputs → verified outputs) catch logic bugs before deployment.
6. Documentation: each function documents its parameters, output, and examples, so teams know how to use it.
7. Permission control: sensitive functions (e.g. risk scoring) require approval before deployment; non-sensitive ones can auto-deploy.

Implement IDE-like editing: an editor for creating and testing transformation functions, with syntax highlighting, auto-complete, and inline documentation. Build function versioning: when function logic changes, increment the version; old dashboards stay on v2 while new dashboards use v3. Create a central transformation library UI: list all available functions with search and drill-down into docs, so users browse existing options instead of re-implementing them. Test transformation functions against sample data, verify outputs, and enforce timeouts for expensive operations. For complex functions, roll out gradually: deploy as v1-beta, let teams opt in, collect feedback, then promote to v1-stable. Create a runbook, "adding a new transformation function," covering design, implementation, testing, and deployment. Document best practices: keep functions small and composable, avoid side effects, and ensure deterministic behavior.
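A sketch of a versioned function registry, points (2) and (4) above. The `register`/`use` decorator pattern and the ltv formulas are illustrative assumptions, not real business logic:

```python
REGISTRY = {}

def register(name, version):
    """Decorator: store a transformation function under (name, version)."""
    def wrap(fn):
        REGISTRY[(name, version)] = fn
        return fn
    return wrap

def use(name, version):
    """Dashboards pin an exact version; old versions keep resolving."""
    return REGISTRY[(name, version)]

@register("ltv", "2.1")
def ltv_v2(revenue, months):
    return revenue * months                  # v2 logic (hypothetical)

@register("ltv", "3.0")
def ltv_v3(revenue, months, churn=0.05):
    return revenue * months * (1 - churn)    # v3 refines with a churn factor

old = use("ltv", "2.1")(100, 12)   # dashboard pinned to v2.1: result unchanged
new = use("ltv", "3.0")(100, 12)   # new dashboards opt in to v3.0
```

Because both versions live in the registry side by side, upgrading a dashboard is an explicit edit to its pinned version string, not an implicit behavior change.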

Follow-up: Your ltv v3 function has better logic than v2. You want all dashboards to upgrade to v3. But some dashboards have hardcoded dependencies on v2 behavior. Forcing upgrade breaks them. How do you migrate functions without breaking dependents?
