You're designing CI/CD for Terraform that prevents two engineers from applying conflicting changes simultaneously. Currently both PRs merge and both run apply, causing state conflicts. Design a safe plan/apply workflow.
Implement a sequential plan/apply workflow with a lock: 1) On PR: run `terraform plan -out=plan.tfplan` in CI and save the plan as an artifact. 2) Comment the plan output on the PR for review. 3) On approval: trigger the apply job. Before applying, acquire a lock with a conditional write so a concurrent apply fails fast: `aws dynamodb put-item --table-name terraform-locks --item '{"LockID":{"S":"prod"},"PullRequestID":{"N":"12345"}}' --condition-expression 'attribute_not_exists(LockID)'`. 4) Re-run `terraform plan -out=latest.plan` and compare it with the saved plan. If they differ (another PR merged in between), reject: "State changed, please rebase and re-plan". 5) If the plans match: `terraform apply plan.tfplan`. 6) Release the lock after apply (delete the item), including on failure. 7) Alternatively, use Terraform Cloud: its native workflow prevents race conditions via workspace-level apply locks. 8) Force serial applies: GitHub branch protection with "require branches to be up to date before merging" makes each PR incorporate the latest main, serializing merges. 9) Document the process: all changes go through PR, the plan is visible before merge, and apply runs automatically after merge once all checks pass.
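Step 4's plan comparison is the heart of the safety check. A minimal sketch, assuming both plans were exported with `terraform show -json`; the `plans_match` helper name is hypothetical:

```python
import json

def plans_match(saved_plan_json, fresh_plan_json):
    """Compare two `terraform show -json` outputs by their resource changes.

    Returns True only when both plans propose the same actions on the
    same resource addresses, ignoring ordering.
    """
    def change_set(plan_json):
        plan = json.loads(plan_json)
        return {
            (rc["address"], tuple(rc["change"]["actions"]))
            for rc in plan.get("resource_changes", [])
        }
    return change_set(saved_plan_json) == change_set(fresh_plan_json)
```

If this returns False in the apply job, fail the pipeline with the "please rebase and re-plan" message instead of applying.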
Follow-up: How would you handle emergency changes that need to bypass review?
A production `terraform apply` fails halfway through: 50 resources created, then provider error prevents completing the remaining 100. Manual cleanup is hours of work. How do you design apply to be resumable?
Rely on Terraform's incremental state tracking and re-planning: 1) Before applying, save a plan file: `terraform plan -out=prod.plan`, then apply from it: `terraform apply prod.plan`. 2) If apply fails mid-way, do not re-apply the saved plan - Terraform rejects it as stale because the state has changed. Instead, re-run `terraform plan` and apply the fresh plan. 3) Terraform tracks partial state: if 50 resources were created, the state records those 50, so the new plan only needs to create the remaining 100. 4) For non-idempotent operations (scripts), use provisioners carefully or external data sources. 5) Implement apply retry: in CI/CD, if apply fails, automatically re-plan and retry up to 3 times. 6) Add an apply timeout: `timeout 30m terraform apply` to detect hung applies. 7) Use -target to resume specific resources: `terraform apply -target=aws_instance.main` applies just that resource if an earlier apply failed. 8) Document the failure procedure: if apply fails, never manually create the remaining resources - let Terraform create them on the next run. This prevents state divergence.
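Step 5's retry loop can be sketched as a small wrapper. Note that each attempt should re-plan before applying, since a saved plan file goes stale once the state changes. The injectable `run` parameter is an assumption of this sketch for testability, not a Terraform feature:

```python
import subprocess
import time

def apply_with_retry(cmd, max_attempts=3, delay_seconds=30, run=subprocess.run):
    """Re-run an apply command until it succeeds or attempts run out.

    Terraform records each created resource in state as it goes, so a
    re-run after a partial failure only creates what is still missing.
    `cmd` should re-plan then apply, e.g.
    ["sh", "-c", "terraform plan -out=retry.plan && terraform apply retry.plan"].
    """
    for attempt in range(1, max_attempts + 1):
        if run(cmd).returncode == 0:
            return attempt  # how many attempts the apply took
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"apply failed after {max_attempts} attempts")
```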
Follow-up: How would you notify the team immediately when apply fails and needs manual investigation?
Your PR shows a safe plan (just tags changing). But between PR approval and merge, another PR merged and changed the resource this PR was about to modify. Now the apply would fail. How do you detect and handle this?
Detect plan staleness before applying: 1) When applying, re-run plan: `terraform plan -out=current.plan` just before apply. 2) Compare plans: `terraform show current.plan | diff - old_plan.txt`. If different, block apply. 3) Automatically rebase: In CI, detect conflict, comment on PR: "Plan is stale due to merged PRs. Rebasing..." 4) Rebase script: `git fetch origin && git rebase origin/main && terraform plan`. 5) Update PR with new plan, auto-approve if still safe (e.g., just tags). 6) Re-trigger apply. 7) Terraform Cloud does this automatically: VCS-driven workflow updates plan when new commits land. 8) Implement stale plan protection: max age of plan before replan required. 9) Use branch protection: require latest commit to be tested before merge, preventing race conditions.
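Step 8's max-age gate might look like the following; `MAX_PLAN_AGE_SECONDS` and `plan_is_stale` are hypothetical names for this sketch:

```python
import time

MAX_PLAN_AGE_SECONDS = 3600  # example policy: force a replan after 1 hour

def plan_is_stale(plan_created_at, now=None, max_age=MAX_PLAN_AGE_SECONDS):
    """Return True when a saved plan is too old to apply safely.

    plan_created_at and now are Unix timestamps. Intervening merges may
    have changed state, so an old plan must be regenerated before apply.
    """
    if now is None:
        now = time.time()
    return (now - plan_created_at) > max_age
```

The CI apply job would record the plan's creation time next to the artifact and call this check before applying.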
Follow-up: How would you automatically determine if rebased plan is still safe vs requires manual review?
You want to allow certain team members to plan but not apply (read-only). Others can both plan and apply. Design role-based access in your CI/CD workflow.
Implement a role-based workflow: 1) In GitHub/GitLab, create teams: `terraform-planners` (read-only), `terraform-appliers` (can approve). 2) Require approval from the `terraform-appliers` team for PR merge (e.g. via CODEOWNERS). 3) In the CI/CD pipeline: separate jobs for plan and apply. 4) The plan job runs for all team members. 5) Gate the apply job: the cleanest GitHub Actions mechanism is a protected environment with required reviewers; an explicit allow-list also works, e.g. `if: contains(fromJSON('["alice","bob"]'), github.actor)`. There is no built-in team context on the actor, so team membership must be checked via the GitHub API. 6) Add an approval workflow: a comment like `/terraform apply` triggers the apply job only if the commenter is in the appliers team. 7) Use AWS IAM: the CI role has different permissions per environment. The dev role can plan/apply anywhere; the prod role requires assume-role with MFA. 8) Log who triggered apply: write to an audit log showing user, time, and resource count. 9) Terraform Cloud: workspace permissions control who can apply. Set specific workspace roles.
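Steps 5 and 6 together reduce to a small gate on the comment event. The hardcoded roster below is an assumption of this sketch; in practice membership would come from the GitHub Teams API:

```python
APPLIERS = {"alice", "bob"}  # hypothetical terraform-appliers roster

def may_trigger_apply(comment_body, commenter, appliers=APPLIERS):
    """Allow an apply only for a `/terraform apply` comment from an applier.

    Anyone may plan; only members of the appliers set may start an apply.
    """
    if comment_body.strip() != "/terraform apply":
        return False
    return commenter in appliers
```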
Follow-up: How would you handle an on-call engineer needing emergency apply access?
A team member accidentally approved a risky PR that deletes a production database. It merges before anyone notices. The deploy pipeline hasn't run yet. Can you prevent the apply?
Implement policy and circuit-breaker safety: 1) Add a Sentinel or OPA policy: apply is automatically blocked if the plan contains a destroy on critical resources, e.g. (pseudocode): `if rs.type == "aws_db_instance" and operation == "delete" { deny() }`. 2) Add a manual approval gate: even if CI passes, prod applies require a second human approval, e.g. a Slack command like `/terraform approve main-db-deletion --force` that requires a manager's sign-off. 3) Add resource protections: `lifecycle { prevent_destroy = true }` on critical resources makes Terraform refuse to plan their destruction. 4) Add a pre-apply alert: a script checks for destroy operations and posts to a Slack alerts channel. 5) Delay applies: prod applies run 1 hour after merge, allowing time to catch issues. 6) Implement rollback: if the delete happens anyway, restore from the most recent snapshot (keep automated backups and final snapshots enabled). 7) Add monitoring: a CloudWatch alert on RDS deletion pages on-call immediately.
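Step 1's destroy check can also run as a plain CI script over `terraform show -json` output rather than Sentinel/OPA. The `CRITICAL_TYPES` set below is an illustrative assumption, not an exhaustive list:

```python
import json

CRITICAL_TYPES = {"aws_db_instance", "aws_rds_cluster", "aws_s3_bucket"}

def blocked_destroys(plan_json, critical=CRITICAL_TYPES):
    """Return addresses of critical resources the plan would destroy.

    A non-empty result should fail the pipeline before apply runs.
    """
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in critical and "delete" in rc["change"]["actions"]
    ]
```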
Follow-up: How would you ensure this safety layer doesn't block legitimate database migrations?
You're migrating from CI/CD managed Terraform to Terraform Cloud. Old workflow: GitHub Actions runs plan/apply. New: Terraform Cloud is source of truth. How do you transition safely?
Gradual migration with parallel validation: 1) Set up a Terraform Cloud workspace for non-prod first. 2) Run plans in both the old CI/CD and TFC, and compare outputs. Script a diff of `terraform-old-plan.json` vs `tfc-plan.json`; require zero differences. 3) Once validated, switch prod to TFC. 4) Reconfigure the backend: add a `cloud {}` block to the configuration and run `terraform init`, which offers to migrate the existing state. 5) If needed, push state manually with `terraform state push local-state.tfstate` once init points at TFC. 6) Archive the old CI/CD pipeline; keep it for rollback. 7) Test: plan/apply in TFC matches old behavior. 8) Remove the old pipeline: delete the GitHub Actions workflow once TFC has been stable for 2 weeks. 9) Document: show the team how to approve/apply via the TFC UI instead of CI. 10) For emergency rollback: keep a local terraform binary + state backup to revert to the old workflow if TFC fails.
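Step 2's parallel validation can be sketched as a drift report between the legacy plan and the TFC plan, both exported with `terraform show -json`; `migration_drift` is a hypothetical helper name:

```python
import json

def migration_drift(old_plan_json, tfc_plan_json):
    """Report where the TFC plan diverges from the legacy CI plan.

    An all-empty report means both pipelines propose identical changes
    and the cutover is safe to proceed.
    """
    def actions_by_address(plan_json):
        plan = json.loads(plan_json)
        return {
            rc["address"]: tuple(rc["change"]["actions"])
            for rc in plan.get("resource_changes", [])
        }
    old = actions_by_address(old_plan_json)
    new = actions_by_address(tfc_plan_json)
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_tfc": sorted(new.keys() - old.keys()),
        "different_actions": sorted(
            addr for addr in old.keys() & new.keys() if old[addr] != new[addr]
        ),
    }
```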
Follow-up: How would you handle team members that prefer the old CLI-based workflow?
Your plan shows 1,500 resource changes. Reviewing each one is impossible. You need confidence this apply is safe before running it. What checks do you implement?
Implement automated safety checks on large plans: 1) Parse plan: `terraform show -json plan.tfplan | jq '.resource_changes[] | {address, actions}' > changes.json`. 2) Categorize changes: creates vs updates vs deletes vs read-only. 3) Alert on deletes: if plan contains destroy operations, require extra approval. 4) Check for risky updates: changes to security groups, IAM, databases trigger additional review. 5) Validate no unexpected resources: check resource count isn't anomalously high. If 1500 when expecting 50, require investigation. 6) Use cost estimation: show estimated cost delta. If huge increase, investigate. 7) Implement targeting: instead of applying all 1500, apply by service/module: `terraform apply -target=module.auth` then `terraform apply -target=module.api`, etc. 8) Use Terraform Cloud cost estimation dashboard. 9) Set approval threshold: plans with > 100 changes require manager approval.
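Steps 1 through 5 can be combined into a single triage pass over the JSON plan; `summarize_plan` and its default threshold are assumptions of this sketch, to be tuned per team:

```python
import json
from collections import Counter

def summarize_plan(plan_json, expected_max=100):
    """Bucket a plan's changes and flag conditions needing extra review.

    Flags the plan when it contains any destroy, or when the number of
    real changes (creates, updates, deletes) exceeds expected_max.
    """
    plan = json.loads(plan_json)
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        for action in rc["change"]["actions"]:
            counts[action] += 1
    total = counts["create"] + counts["update"] + counts["delete"]
    return {
        "counts": dict(counts),
        "needs_extra_approval": counts["delete"] > 0 or total > expected_max,
    }
```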
Follow-up: How would you identify which changes are safe to batch vs need sequential applies?
A team member withdraws their PR before it merges, but the CI pipeline has already started the apply. The running apply will make the now-unwanted changes. How do you prevent this?
Implement apply guards against stale plans: 1) Detect PR closure: a webhook fires when a PR is closed without merging. 2) If an apply is in progress, cancel it. In GitHub Actions, a `concurrency` group with `cancel-in-progress: true` cancels superseded runs, and a `pull_request: closed` trigger can cancel in-flight workflows via the API. 3) Require plan freshness: before applying, check the plan's age. If it is over 1 hour old or the source changed since the plan, re-plan. 4) Use a VCS-driven workflow: Terraform Cloud only applies on merge to main, not on branch updates. 5) Before apply, verify the branch still has approvals (if approvals were revoked, block apply). 6) Notify the team on Slack when an apply is cancelled due to PR closure. 7) Pin the commit SHA: the plan records the commit it was built from. Before applying, verify the current commit matches; if not, re-plan. 8) Separate plan and apply into different jobs linked by the plan artifact; if the plan job re-runs (due to a PR update), invalidate the artifact the apply job would use.
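Step 7's SHA pinning can be sketched as follows; writing a `{"commit_sha": ...}` metadata file next to the plan artifact is a CI convention this sketch assumes, not something Terraform emits:

```python
def plan_matches_head(plan_metadata, head_sha):
    """Refuse to apply a plan artifact built from a different commit.

    plan_metadata is a small dict the plan job writes alongside the
    artifact, e.g. {"commit_sha": "<sha>"}. If the SHAs differ, the
    apply job should discard the artifact and re-plan.
    """
    return plan_metadata.get("commit_sha") == head_sha
```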
Follow-up: How would you handle a case where a developer legitimately needs to roll back part of an already-applied state?