You're rolling out a small change to production: updating a Lambda function's timeout. A full `terraform apply` run evaluates 500+ resources, and a stray error elsewhere in the config could destroy databases. How do you safely apply just the Lambda change?
Use targeted apply to limit blast radius:
1) Plan with a target: `terraform plan -target=aws_lambda_function.handler -out=plan.tfplan`. Only the Lambda's changes appear in the plan.
2) Verify the plan shows only the intended change (the timeout) and that no other resources are affected.
3) Apply the saved plan: `terraform apply plan.tfplan`. This applies only the Lambda change.
4) Verify: check the Lambda in the AWS Console. The timeout is updated; other resources are untouched.
5) Full validation: after the targeted apply, run a full `terraform plan` to confirm there were no side effects. It should show zero changes.
6) Safety: `-target` is risky. Use it only for specific scenarios: urgent fixes, isolated changes.
7) Document why a targeted apply was used.
8) Communicate: notify the team that a partial apply is happening. If something breaks later (say, the Lambda fails), everyone knows a targeted apply was recent.
9) Automate detection: a script checks whether a targeted apply was used in production and alerts the ops team.
10) Fallback: if the targeted apply causes issues, a later full apply might conflict. Back up state before the targeted apply: `terraform state pull > backup.tfstate`.
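The verification in steps 2 and 5 can be automated. A minimal Python sketch, assuming the plan was exported with `terraform show -json plan.tfplan > plan.json` and following Terraform's JSON plan format (`resource_changes`, `change.actions`); the allow-list address comes from this scenario:

```python
# Allow-list of the only addresses this change is supposed to touch.
ALLOWED = {"aws_lambda_function.handler"}

def unexpected_changes(plan: dict) -> list:
    """Return addresses the plan changes that fall outside the allow-list."""
    bad = []
    for rc in plan.get("resource_changes", []):
        if rc["change"]["actions"] == ["no-op"]:
            continue  # nothing actually happens to this resource
        if rc["address"] not in ALLOWED:
            bad.append(rc["address"])
    return bad

# Minimal example of the JSON plan shape (real plans carry many more fields):
plan = {"resource_changes": [
    {"address": "aws_lambda_function.handler", "change": {"actions": ["update"]}},
]}
print(unexpected_changes(plan))  # -> []
```

An empty list means the targeted plan touches only what was intended; anything else should abort the pipeline before apply.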
Follow-up: How would you ensure that a targeted apply doesn't create inconsistencies with dependent resources?
Your prod deployment has 200 resources. You need to change a VPC's CIDR, and 50+ resources (security groups, subnets, route tables) depend on that VPC. Applying just the VPC change would break those 50. How do you handle dependent resources in a targeted apply?
Include dependencies in the targeted apply:
1) Identify the dependency chain: the VPC CIDR change means subnets need updating, and security groups referencing those subnets need updating.
2) Target multiple resources: `terraform plan -target=aws_vpc.main -target='aws_subnet.private[0]' -target='aws_subnet.private[1]' -target='aws_security_group.app' -out=plan.tfplan`. Include the VPC and all dependent resources.
3) Script the targeting: `terraform graph` prints the dependency graph; extract the addresses of resources that reference `aws_vpc.main` and add them as targets.
4) No auto-targeting of dependents: `terraform plan -target=aws_vpc.main` pulls in what the VPC depends on, but not the resources that depend on it. You must add dependents manually.
5) Safer alternative: split into modules. The prod VPC lives in `terraform/prod-network/`, apps in `terraform/prod-compute/`. Change the VPC in the network module only; the compute module reads it via remote state.
6) Validate: after the targeted apply, run a full `terraform plan`. It should show zero changes if all dependencies were handled.
7) Test first: do the targeted apply in staging and verify there are no side effects.
8) For risky changes: consider a maintenance window where a full apply is safer than a partial one.
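Step 3 can be sketched in Python rather than grep: `terraform graph` emits DOT, where an edge `"A" -> "B"` means A depends on B, so walking edges backwards from the VPC yields every dependent to add as a `-target`. (Real graph output decorates node names with prefixes like `[root]`, which this simplified sketch ignores.)

```python
import re

def dependents(dot: str, target: str) -> set:
    """Collect every resource that transitively depends on `target`."""
    edges = re.findall(r'"([^"]+)"\s*->\s*"([^"]+)"', dot)
    result, frontier = set(), {target}
    while frontier:
        # An edge A -> B means A depends on B: follow edges backwards.
        frontier = {a for a, b in edges if b in frontier} - result
        result |= frontier
    return result

# Simplified DOT edges as 'terraform graph' might describe this scenario:
dot = '''
"aws_subnet.private[0]" -> "aws_vpc.main"
"aws_security_group.app" -> "aws_subnet.private[0]"
'''
print(sorted(dependents(dot, "aws_vpc.main")))
# -> ['aws_security_group.app', 'aws_subnet.private[0]']
```

The resulting set is exactly the list of extra `-target` flags the VPC change needs.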
Follow-up: How would you automate detection of missing dependencies in a targeted apply?
You use targeted apply for quick fixes, but this leads to state drift: some resources updated via apply, others not. State becomes unpredictable. A later full apply surprises everyone with massive changes. How do you prevent this?
Minimize blast radius without creating drift:
1) Avoid targeted apply as the default; use it only in emergencies. Establish a policy: targeted apply requires VP approval.
2) For routine changes: full apply with change validation. Parse the plan and verify only the expected resources change: `terraform show -json plan.tfplan | jq '.resource_changes | length'` gives the count; validate that it matches expectations.
3) High-risk resource protection: mark critical resources with `lifecycle { prevent_destroy = true }` and add them to an approval gate. If a plan includes a destroy on a protected resource, require extra approval.
4) Smaller state files: instead of one 200-resource state, split into modules of 30-50 resources each. Each apply is less risky.
5) Policy enforcement: a Sentinel policy can detect targeted apply usage (pseudocode: `if terraform.command == "apply" and resource_targets != null { fail() }`) and disallow it in prod.
6) Documentation: define when targeted apply is acceptable (emergency bug fix such as a connection issue, env-specific config for a dev-only resource) and when it is not (structural changes, cross-module updates).
7) Monitoring: log all applies, targeted or full. A dashboard shows the usage pattern; alert if targeted applies become too frequent.
8) Training: show the team that full plan review is safer than partial applies.
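The approval gate in step 3 can be enforced mechanically in CI. A sketch (the protected addresses are hypothetical examples) that scans the JSON plan for destroy actions on protected resources; a replace shows up as a `delete` plus a `create` action, so it is caught too:

```python
# Hypothetical addresses of resources that must never be destroyed casually.
PROTECTED = {"aws_db_instance.prod", "aws_kms_key.main"}

def needs_extra_approval(plan: dict) -> bool:
    """True if the plan deletes or replaces any protected resource."""
    for rc in plan.get("resource_changes", []):
        if rc["address"] in PROTECTED and "delete" in rc["change"]["actions"]:
            return True
    return False

# A replace of the prod database: ["delete", "create"] includes "delete".
plan = {"resource_changes": [
    {"address": "aws_db_instance.prod", "change": {"actions": ["delete", "create"]}},
]}
print(needs_extra_approval(plan))  # -> True
```

A CI job would run this against every plan and block merge until the extra approval is recorded.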
Follow-up: How would you roll back if a targeted apply created unexpected drift?
A service team needs to fix a bug in their Lambda function. They can't wait for a full Terraform deploy (it takes 30 minutes with testing). They want to update the Lambda directly, outside Terraform, but that breaks IaC. Design an emergency change process.
Design an emergency change process with reconciliation:
1) Emergency protocol: the service team may update the Lambda directly via the AWS Console, with a ticket, but must reconcile within 24 hours.
2) Update the ticket: record what changed (Lambda handler, timeout, env vars).
3) Update the HCL: the service team or ops team updates Terraform to match AWS reality.
4) Refresh: run `terraform apply -refresh-only` so state picks up the out-of-band change (the Lambda is already managed, so `terraform import` is not needed and would fail).
5) Validate: run `terraform plan`. It should show zero changes if the HCL matches AWS.
6) Apply: `terraform apply`. This restores IaC management.
7) Monitoring: a CloudWatch alert fires if the Lambda changes outside Terraform. Ops reviews alerts weekly.
8) Preventive: allow targeted apply for Lambda updates. It is less risky than out-of-band emergency changes.
9) Process: instead of a direct update, the service team creates a PR with the Lambda change and ops approves a targeted apply.
10) SLA: emergency changes must reconcile within 24 hours or be reverted: running `terraform apply` against the unchanged HCL undoes the console change.
11) Root cause: after the emergency is resolved, investigate why the service team needed it. Add Lambda targeted apply to the normal process? Add more test automation to speed up deploys?
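The out-of-band detection in step 7 can also lean on Terraform itself: the JSON form of a refresh-only plan (exported via `terraform show -json` on the saved plan) carries a `resource_drift` list. A minimal sketch of reading it:

```python
def drifted_resources(plan: dict) -> list:
    """Addresses Terraform detected as changed outside of IaC."""
    return [d["address"] for d in plan.get("resource_drift", [])]

# Example of the relevant shape (real entries also include before/after values):
plan = {"resource_drift": [{"address": "aws_lambda_function.handler"}]}
print(drifted_resources(plan))  # -> ['aws_lambda_function.handler']
```

A nightly job that runs a refresh-only plan and pages ops on a non-empty list catches unreconciled emergency changes before they accumulate.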
Follow-up: How would you prevent emergency changes from accumulating over time?
You're evaluating blast radius before applying changes. Plan shows 150 resource changes. Most are benign (tag updates), but 3 are concerning (database encryption enable, IAM policy change, security group removal). How do you analyze and categorize blast radius?
Implement blast radius analysis:
1) Parse the plan JSON: `terraform show -json plan.tfplan | jq '.resource_changes[]' > changes.json`.
2) Categorize by impact: a) Benign: tag updates, metadata changes; no service impact. b) Minor: compute resource changes, autoscaling adjustments; brief disruption possible. c) Major: networking, security group, and IAM changes; potential outage. d) Critical: database changes, encryption, deletions; high risk.
3) Script the categorization with rules per resource type, e.g. pseudocode: `if resource_type == "aws_security_group" and action == "destroy" { critical }`.
4) Generate a report with a summary by category: "150 total changes: 100 benign, 30 minor, 15 major, 5 critical".
5) Flag high-impact changes: each critical change requires extra approval.
6) Cost analysis: Terraform Cloud estimates cost; alert if cost increases more than 10%.
7) Security analysis: check for security group changes that open inbound access. Flag anything overly permissive.
8) Approval gate: 5+ critical changes require 2 approvals; 1-4 critical require 1; 0 critical requires only code review.
9) Runbook: show the ops team which critical changes are planned and expected (e.g., enabling encryption is scheduled for this release).
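Steps 2-4 can be sketched as a small categorizer. The type-to-category tables below are illustrative, not exhaustive, and a real rule set would also diff before/after values to distinguish tag-only updates (benign) from other updates:

```python
from collections import Counter

CRITICAL_TYPES = {"aws_db_instance", "aws_kms_key"}                        # data, encryption
MAJOR_TYPES = {"aws_security_group", "aws_iam_policy", "aws_route_table"}  # network, IAM

def categorize(rc: dict) -> str:
    """Assign one blast-radius category to a single resource change."""
    actions = set(rc["change"]["actions"])
    if actions == {"no-op"}:
        return "none"
    if rc["type"] in CRITICAL_TYPES or "delete" in actions:
        return "critical"
    if rc["type"] in MAJOR_TYPES:
        return "major"
    return "minor"  # placeholder; refine into benign vs minor with attribute diffs

def report(plan: dict) -> Counter:
    """Summarize a plan's changes by category, skipping no-ops."""
    cats = [categorize(rc) for rc in plan["resource_changes"]]
    return Counter(c for c in cats if c != "none")

plan = {"resource_changes": [
    {"type": "aws_db_instance", "change": {"actions": ["update"]}},
    {"type": "aws_instance", "change": {"actions": ["update"]}},
    {"type": "aws_security_group", "change": {"actions": ["delete"]}},
]}
print(report(plan))  # -> Counter({'critical': 2, 'minor': 1})
```

The counts feed the approval gate in step 8 directly: two or more approvals when the critical count is 5+, one for 1-4, plain code review at zero.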
Follow-up: How would you automate blast radius assessment and prevent high-risk changes from merging?