Terraform Interview Questions

Drift Detection and Remediation

You discover that 15% of your infrastructure has drifted from Terraform state. A security team manually patched 40 security group rules, and someone resized RDS instances directly in AWS. You need to detect this systematically and provide a remediation report. What's your approach?

Implement a phased drift-detection workflow: 1) Run `terraform plan -refresh-only -out=drift.plan` to detect changes without applying anything. 2) Parse the plan file: `terraform show -json drift.plan | jq '.resource_drift[]' > drift_report.json` (refresh-only drift appears under the `resource_drift` key in the plan JSON, not `resource_changes`). 3) Categorize the drift: manual changes vs. auto-scaling vs. age-related. 4) For security group rules, compare the AWS API response from `aws ec2 describe-security-groups` with state. 5) For RDS, compare `aws rds describe-db-instances` with state. 6) Create a remediation decision tree: for legitimate changes, accept the drift by updating the HCL to match reality and syncing state with `terraform apply -refresh-only`; for unauthorized changes, run `terraform apply` to revert them. 7) Document the decision per resource. 8) Set up a weekly drift-detection Lambda that posts a summary to Slack.
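The parse-and-categorize steps above can be sketched in Python. The input is the JSON printed by `terraform show -json drift.plan`; the critical-type list is an illustrative assumption to tune for your estate:

```python
import json

# Resource types treated as critical is an assumption for illustration.
CRITICAL_TYPES = {"aws_security_group", "aws_security_group_rule", "aws_db_instance"}

def categorize_drift(plan_json):
    """Group drifted resource addresses from plan JSON into critical vs. routine."""
    plan = json.loads(plan_json)
    report = {"critical": [], "routine": []}
    for drift in plan.get("resource_drift", []):
        bucket = "critical" if drift["type"] in CRITICAL_TYPES else "routine"
        report[bucket].append(drift["address"])
    return report
```

Feed it the file written in step 2 and attach the resulting report to the weekly Slack summary.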

Follow-up: How would you distinguish between acceptable manual changes and problematic drift?

A production ALB had its target group manually updated to include a misconfigured instance. This drifted from Terraform state. Running `terraform apply` would revert the change and cause production traffic to drop. How do you safely remediate this?

Use import-based remediation with zero downtime: 1) Identify the drift: `terraform plan -refresh-only | grep target_group_attachment`. 2) Understand the change: why was the instance added, and is it legitimate? 3) Update the HCL to match current AWS state: add an `aws_lb_target_group_attachment` for the instance in code. 4) Bring the attachment under management without touching traffic: if your provider version supports importing `aws_lb_target_group_attachment`, run `terraform import aws_lb_target_group_attachment.added <target-group-arn>/<instance-id>`; otherwise apply the now-matching HCL, since registering an already-registered target is idempotent. 5) Verify with `terraform state show aws_lb_target_group_attachment.added`. 6) Run `terraform plan` to confirm zero changes. 7) If the manual change was wrong, revert it: update the HCL back and run `terraform apply` inside a maintenance window. 8) Add prevention: manage target group membership in code via `target_group_arns` on `aws_autoscaling_group` so instances are attached through Terraform, not the console.
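Step 6's zero-change verification is automatable with `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 2 when changes are pending, and 1 on error. A small Python wrapper (the function names are mine):

```python
import subprocess

def interpret_plan_exit(code):
    """Map `terraform plan -detailed-exitcode` exit codes to a verdict."""
    return {0: "clean", 2: "changes-pending"}.get(code, "error")

def verify_import(address):
    """Return True only if a targeted plan shows no pending changes."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false",
         f"-target={address}"],
        capture_output=True,
    )
    return interpret_plan_exit(result.returncode) == "clean"
```

Running this in CI after every import gives a machine-checkable "drift fully absorbed" signal instead of eyeballing plan output.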

Follow-up: How would you prevent manual ALB changes in the first place with IAM policies?

You run `terraform plan -refresh-only` and discover 300 drifted resources. Many are false positives (e.g., CloudWatch logs retention set to 30 days, state shows default). How do you filter signal from noise?

Implement smart drift filtering: 1) Use `terraform apply -refresh-only -auto-approve` to sync state with reality for harmless drift, then re-run the plan. 2) Create a drift-filter script: parse the JSON plan and ignore known drift patterns such as computed fields, provider defaults, and provider-managed attributes. 3) For specific resources, use `ignore_changes` inside a `lifecycle` block: `resource "aws_cloudwatch_log_group" "main" { lifecycle { ignore_changes = [retention_in_days] } }` if retention is managed separately. 4) Use data sources instead of resources where appropriate: `data "aws_cloudwatch_log_group" "logs"` instead of a managed resource. 5) Refresh selectively: `terraform apply -refresh-only -target=aws_instance.main` for specific resources (the standalone `terraform refresh` command is deprecated). 6) Bucket the drift report: expected (computed fields), acceptable (auto-scaling), critical (security, config). 7) Alert only on critical drift: the script checks for specific resource types and flags them for action.
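The filter script from step 2 can be sketched like this; the noisy-attribute allowlist is illustrative and should be tuned per provider:

```python
import json

# Attributes whose drift is known noise; an assumption to adjust per provider.
NOISY_ATTRS = {"retention_in_days", "tags_all", "last_modified"}

def significant_drift(plan_json):
    """Return addresses whose drift touches attributes outside the allowlist."""
    plan = json.loads(plan_json)
    flagged = []
    for drift in plan.get("resource_drift", []):
        before = drift["change"].get("before") or {}
        after = drift["change"].get("after") or {}
        changed = {k for k in set(before) | set(after)
                   if before.get(k) != after.get(k)}
        if changed - NOISY_ATTRS:  # anything meaningful left after filtering?
            flagged.append(drift["address"])
    return flagged
```

Resources whose only drift is in `NOISY_ATTRS` drop out of the report entirely, which is what turns 300 raw hits into an actionable list.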

Follow-up: How would you automate categorization of false positives across hundreds of resources?

Your database parameter group drifted: a DBA manually changed `max_connections` via AWS Console for performance tuning. Three hours later, Terraform resets it during a deploy. How do you coordinate manual changes with infrastructure-as-code?

Implement a change-negotiation workflow: 1) Before allowing manual changes, require documentation: a ticket, an approval, and whether the change is temporary or permanent. 2) For temporary tuning, use a Lambda to assist remediation: detect the drift and post to Slack for approval before acting. 3) For permanent changes: a) the DBA submits a PR updating parameter group values in HCL; b) code review approves; c) `terraform plan` validates; d) deploy. 4) Make manual changes harder: an IAM policy allows RDS parameter changes only via the approved process. 5) Use a feature flag in Terraform, e.g. `variable "allow_manual_overrides" { default = false }`, to temporarily permit manual changes during incidents. 6) Set up drift-detection alerts: when drift is detected in critical resources, page on-call immediately. 7) Document the policy: parameter changes are Terraform-first except during emergency incidents.
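The Slack notification in steps 2 and 6 needs little more than a payload builder plus a webhook POST; the message format here is an assumption, and the webhook URL comes from your Slack app configuration:

```python
import json
from urllib import request

def format_drift_alert(resources):
    """Build a Slack incoming-webhook payload summarizing critical drift."""
    lines = "\n".join(f"- `{r}`" for r in sorted(resources))
    return {"text": f"Drift detected in {len(resources)} critical resource(s):\n{lines}"}

def post_alert(webhook_url, resources):
    """POST the payload to a Slack incoming webhook (network call)."""
    body = json.dumps(format_drift_alert(resources)).encode()
    req = request.Request(webhook_url, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)
```

The Lambda detecting drift would call `post_alert` and then wait for a human reaction before remediating anything.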

Follow-up: How would you implement automated approval for emergency manual changes with audit logging?

You're designing drift remediation for a 1000-resource infrastructure. Auto-remediation via `terraform apply` is too risky. But manual review of every drift is unsustainable. Design a tiered remediation system.

Implement three-tier remediation. Tier 1, automatic safe remediation: `ignore_changes` on computed fields, plus `terraform apply -refresh-only` to sync state for read-only drift. Tier 2, approval workflow: non-critical drift (tags, metadata) requires a single approval via a GitOps PR: run the plan, open a PR with the changes, on-call reviews and merges. Tier 3, emergency review: critical drift (security groups, databases, IAM) requires two approvals plus an incident ticket; send it to a Slack channel for visibility. Implement categorization: tag each resource with a risk level in Terraform; a script reads the tags and routes drift to the appropriate tier. Use Terraform Cloud policy enforcement: Sentinel policies auto-reject high-risk plans. Maintain a remediation queue: a DynamoDB table tracks each drift item's status, approvals, and remediation time.
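The tag-based routing can be sketched as follows; the `risk_level` tag name and the default-to-strictest policy are assumptions:

```python
TIER_BY_RISK = {"low": 1, "medium": 2, "high": 3}

def route_drift(drifted):
    """Route (address, tags) pairs into tier queues.
    Untagged resources default to tier 3, the strictest review path."""
    queues = {1: [], 2: [], 3: []}
    for address, tags in drifted:
        tier = TIER_BY_RISK.get(tags.get("risk_level"), 3)
        queues[tier].append(address)
    return queues
```

Defaulting unknowns to tier 3 is deliberate: a missing tag is itself a signal that the resource never went through categorization.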

Follow-up: How would you measure remediation effectiveness and track drift reduction over time?

A network engineer manually added 5 security group rules to handle a one-time incident. Three months later, you don't know if these are still needed. Running `terraform apply` could break production. How do you safely retire these rules?

Implement a safe-retirement workflow: 1) Find the drift: `terraform plan -refresh-only | grep security_group_rule`. 2) Investigate: when was the rule added, what does it do, and is the incident still active? 3) Add the rule to HCL with a comment: `# Temporary incident rule from 2024-03-15 - ticket INC-12345`. 4) Import it; the import ID for `aws_security_group_rule` is underscore-separated, e.g. `terraform import aws_security_group_rule.temp sg-abc123_ingress_tcp_443_443_0.0.0.0/0`. 5) Run `terraform plan` to verify zero changes. 6) Set a scheduled retirement: add a sunset date in the comment and create a calendar reminder. 7) Before the sunset date, confirm with the ticket's stakeholders whether the rule is still needed. 8) If retired: remove it from HCL and run `terraform apply`. 9) If still needed: update the comment with a new sunset date and ticket reference.
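Step 6's sunset dates become automatable once the comment follows a fixed convention; this sketch assumes annotations like `# sunset: 2024-06-15` in the HCL source:

```python
import re
from datetime import date

# Comment convention "# sunset: YYYY-MM-DD" is an assumption, not a Terraform feature.
SUNSET_RE = re.compile(r"#\s*sunset:\s*(\d{4}-\d{2}-\d{2})")

def expired_sunsets(hcl_text, today=None):
    """Return sunset dates in the HCL source that are already past."""
    today = today or date.today()
    return [m.group(1) for m in SUNSET_RE.finditer(hcl_text)
            if date.fromisoformat(m.group(1)) < today]
```

A scheduled job can run this over the repository and open a ticket for every expired date it finds.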

Follow-up: How would you automate detection of drifted rules past their sunset date?

You inherit infrastructure where Terraform state shows 100 resources but AWS has 150. You find `terraform import` was never run for 50 resources that are actually managed by Terraform code. How do you reconcile?

Use staged reconciliation: 1) Audit the gap: diff state against AWS. For each resource missing from state, check the HCL: is it defined in code? 2) If in code but not state: `terraform import aws_instance.main i-12345`, then run `terraform plan` after each import to verify zero changes. 3) If in neither code nor state, decide: is it orphaned (remove from AWS)? Managed by another tool? Or should Terraform adopt it? 4) Bulk import with a script, assuming the code uses `for_each` keyed by instance ID: `for id in $(aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' --output text); do terraform import "aws_instance.prod[\"$id\"]" "$id"; done`. 5) Validate after each import: `terraform plan -target='aws_instance.prod'` should show no planned changes. 6) Document the work: open a PR showing before/after state, the import commands, and the validation results.
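The audit in step 1 reduces to set arithmetic once you have the two ID lists (from `terraform show -json` on one side and `aws ec2 describe-instances` on the other):

```python
def reconcile(state_ids, aws_ids):
    """Split the state/reality gap into resources AWS has that state lacks
    ("unmanaged") and state entries with no backing AWS resource ("ghost")."""
    state_ids, aws_ids = set(state_ids), set(aws_ids)
    return {
        "unmanaged": sorted(aws_ids - state_ids),
        "ghost": sorted(state_ids - aws_ids),
    }
```

"Unmanaged" entries are import candidates; "ghost" entries usually mean something was deleted out-of-band and should be removed from state with `terraform state rm`.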

Follow-up: How would you automate detection of this state-reality gap continuously?

Your infrastructure relies on a third-party service that modifies resources after Terraform creates them (e.g., monitoring agent installs, auto-scaling policies). These show as drift but are legitimate. How do you manage this?

Use `ignore_changes` for attributes a third party manages: `resource "aws_instance" "main" { lifecycle { ignore_changes = [user_data, ami] } }` to ignore drift from agent installations. For auto-scaling drift: `resource "aws_autoscaling_group" "main" { lifecycle { ignore_changes = [desired_capacity] } }` if auto-scaling adjusts capacity dynamically. Separate concerns: use `data` sources for read-only attributes, e.g. `data "aws_instance" "info" { instance_id = aws_instance.main.id }` to query current state without managing it. Create an integration checklist documenting which third-party service manages which attributes. Use tagging: tag resources with `managed_by = "third-party-x"` and configure drift alerts to skip resources carrying those tags. Communicate in code: add comments explaining why attributes are ignored. Validate: run tests confirming third-party modifications don't break Terraform deployments.
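The tag-driven alert suppression can be sketched as a filter; the `managed_by` value prefix follows the tagging convention assumed above:

```python
def alertable_drift(drifted):
    """Keep only drifted resources not owned by a known third party.
    Each entry is a dict with an 'address' and an optional 'tags' map."""
    return [d["address"] for d in drifted
            if not d.get("tags", {}).get("managed_by", "").startswith("third-party")]
```

Run it over the categorized drift report before alerting, so expected third-party churn never pages anyone.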

Follow-up: How would you alert if third-party services fail to apply their changes (absence of expected drift)?
