Your infrastructure team discovered 40 EC2 instances created manually in AWS that should have been created by Terraform. Writing HCL from scratch is tedious. How do you bulk import these instances into Terraform state and create the HCL?
Use bulk import with code generation: 1) List instances: ``aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`].Value|[0]]' --output text``. 2) For each instance, write a placeholder resource block: `resource "aws_instance" "imported_i_12345" { ami = "..." instance_type = "t3.medium" ... }`. 3) Import each one: `terraform import aws_instance.imported_i_12345 i-12345`. 4) Verify the count: `terraform state list | wc -l` should show 40 items. 5) Export state for inspection: `terraform state pull | jq .resources > imported.json`. 6) Generate HCL: on Terraform 1.5+, `import` blocks plus `terraform plan -generate-config-out=generated.tf` produce the HCL automatically; on older versions, use a script or a tool such as `terraformer`. 7) Refactor the generated HCL: consolidate into modules for reusability. 8) Review for missing attributes: add tags, security groups, and EBS settings manually where generation was incomplete. 9) Run `terraform plan` and confirm it reports zero changes.
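On Terraform 1.5+, steps 2-3 can be replaced by generated `import` blocks. A sketch that turns the instance list from step 1 into import blocks; the file names and the identifier-sanitization rule are illustrative, not part of any standard tooling:

```shell
# Sketch: turn a two-column list (instance-id <TAB> Name tag), such as
# the describe-instances output above, into Terraform 1.5+ import blocks.
gen_import_blocks() {
  while IFS="$(printf '\t')" read -r id name; do
    # Fall back to the instance ID when the Name tag is empty, and
    # replace anything that isn't a legal identifier character
    label=$(printf '%s' "${name:-$id}" | tr -c 'a-zA-Z0-9_' '_')
    printf 'import {\n  to = aws_instance.%s\n  id = "%s"\n}\n\n' "$label" "$id"
  done
}

# Usage (illustrative):
#   aws ec2 describe-instances ... | gen_import_blocks > imports.tf
#   terraform plan -generate-config-out=generated.tf
```

After that, `terraform plan -generate-config-out` writes the resource blocks for you, and `terraform apply` performs all 40 imports in one run.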
Follow-up: How would you handle instances with complex security group configurations in the bulk import?
You imported a module `old_app` into state, but now need to rename it to `new_app_v2` across all references. The module spawns 50+ child resources. Using `terraform state mv` manually for all 50 would be error-prone. How do you safely rename the module?
Use Terraform 1.1+ `moved` blocks, or a batch state move with validation: 1) List all resources in the module: `terraform state list module.old_app > resources.txt`. 2) Create a move script: for each line, run `terraform state mv module.old_app.TYPE.NAME module.new_app_v2.TYPE.NAME`. 3) Or, on Terraform 1.1+, add a `moved` block in HCL: `moved { from = module.old_app to = module.new_app_v2 }`; one block covers the whole module, and `terraform apply` records all the moves. 4) Update the HCL: rename `module "old_app"` to `module "new_app_v2"` in code. 5) Run `terraform plan` to verify zero planned changes after the moves. 6) Verify with `terraform state list | grep new_app_v2 | wc -l` that all 50+ resources moved. 7) Individual resources work the same way: `moved { from = aws_instance.old to = aws_instance.new }`. This is cleaner than manual `state mv` commands because the rename is recorded in version control and reviewed like any other change.
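The `moved` block from step 3 written out in full; the module names come from the question:

```hcl
# Declared anywhere in the root module's configuration.
# A single block covers module.old_app and every resource beneath it.
moved {
  from = module.old_app
  to   = module.new_app_v2
}
```

Once `terraform apply` has recorded the moves and `terraform plan` shows no changes, the block can be kept as history or deleted.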
Follow-up: How would you handle outputs and variable references that still point to the old module name?
You have 200 security group rules stored in state. A refactoring changes how rules are defined: old: `aws_security_group_rule.ingress_80`, new: `aws_security_group_rule.ingress["http"]`. How do you migrate state without losing rules?
Use state manipulation with careful validation: 1) Export the current state as a backup: `terraform state pull > before.json`. 2) Extract the rules: `jq '.resources[] | select(.type == "aws_security_group_rule")' before.json > rules.json`. 3) Plan the transition: from name-suffixed resources (`ingress_80`, `ingress_443`) to map keys (`["http"]`, `["https"]`). 4) In HCL, change `resource "aws_security_group_rule" "ingress_80" { ... }` to `resource "aws_security_group_rule" "ingress" { for_each = var.ingress_ports ... from_port = each.value ... }`. 5) Define the mapping: `variable "ingress_ports" { default = { http = 80 https = 443 } }`. 6) Move each resource into its map slot: `terraform state mv aws_security_group_rule.ingress_80 'aws_security_group_rule.ingress["http"]'` (the quotes protect the brackets from the shell). 7) Run `terraform plan` to verify zero changes. 8) If any rules are lost, restore from the backup and retry.
Follow-up: How would you automate this refactoring for 15 different security groups?
A junior engineer accidentally imported the same RDS instance twice with different resource names: `aws_db_instance.prod_db` and `aws_db_instance.prod_db_dup`. State now has conflicting references. How do you fix this?
Use state removal to fix it, since both addresses track the same AWS resource: 1) Identify the issue: `terraform state list | grep db` shows both entries. 2) Decide which is authoritative: compare `terraform state show aws_db_instance.prod_db` vs `terraform state show aws_db_instance.prod_db_dup`. 3) Both should point at the same AWS resource ID, confirming a double import. 4) Remove the duplicate from state only (the real database is untouched): `terraform state rm aws_db_instance.prod_db_dup`. 5) Run `terraform plan` to verify one reference remains and no changes are planned; any leftover HCL for the removed name must also be deleted, or the plan will try to create a second database. 6) If the wrong one was removed, restore from a state backup (`terraform state push backup.tfstate`), then remove the other. 7) Verify: `aws rds describe-db-instances` shows one instance, and state shows one resource. 8) Prevent recurrence with a CI check: resource addresses are always unique within a state file, so `terraform state list | sort | uniq -d` will never catch this; instead, compare the underlying remote IDs from `terraform state pull` and alert when two addresses share one ID.
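The CI check in step 8 can be sketched as a `jq` filter over the pulled state. This assumes `jq` is available and that each resource records its remote ID under `instances[].attributes.id`, which holds for most AWS resources:

```shell
# Sketch: flag resource addresses that share the same remote ID,
# the signature of a double import. Pipe `terraform state pull` in.
find_duplicate_imports() {
  jq -r '
    [ .resources[]?
      | select(.mode == "managed")
      | {type: .type, name: .name, id: .instances[].attributes.id} ]
    | group_by([.type, .id])        # same type + same remote ID
    | map(select(length > 1))       # keep only duplicated groups
    | .[][]
    | "\(.type).\(.name) -> \(.id)"'
}

# Usage (illustrative): fail the CI job if anything is printed
#   terraform state pull | find_duplicate_imports
```

An empty output means no two addresses claim the same remote object.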
Follow-up: How would you automate detection of duplicate imports across your entire state?
You need to split a large monolithic state file into smaller logical states: all networking into one, all compute into another, all data into a third. Resources have cross-references. How do you safely refactor?
Use `terraform_remote_state` for decoupling: 1) Create three new root modules: `terraform/networking/`, `terraform/compute/`, `terraform/data/`. 2) Move networking resources one by one: `terraform state mv -state-out=../networking/terraform.tfstate ADDRESS ADDRESS` (with a remote backend, `terraform state pull` to a local file first, then push the results). 3) Repeat for compute and data resources. 4) Replace cross-references with data sources: in the compute module, instead of referencing `aws_vpc.main` directly, declare `data "terraform_remote_state" "networking" { backend = "s3" config = { bucket = "tf-state" key = "networking/terraform.tfstate" } }` and read `data.terraform_remote_state.networking.outputs.vpc_id`. 5) Export outputs from each state: networking exports `output "vpc_id"`; compute consumes it. 6) Test each module independently: `cd terraform/networking && terraform plan` should show zero changes. 7) Test the integration with all three modules deployed together. 8) Work sequentially, validating at each step.
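The producer/consumer pair from steps 4-5 as a minimal sketch; the bucket, key, and region values are illustrative:

```hcl
# terraform/networking/outputs.tf -- expose the IDs other states need
output "vpc_id" {
  value = aws_vpc.main.id
}

# terraform/compute/main.tf -- consume them without owning the VPC
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "tf-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1" # illustrative
  }
}

resource "aws_security_group" "app" {
  # attach compute resources to the networking-owned VPC
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```

Only values declared as `output` in the networking state are visible to consumers, which makes the contract between the split states explicit.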
Follow-up: How would you handle circular dependencies between split state files?
You accidentally run `terraform destroy` and delete 30 resources from state. The infrastructure still exists in AWS. You have the previous state file from 2 hours ago. How do you recover?
Use state push to restore: 1) Stop everything; don't run further terraform commands. 2) Restore from backup: `terraform state push old-backup.tfstate` (add `-force` if the backup's serial is lower than the current state's). 3) Run `terraform plan -refresh-only` to validate that the restored state matches AWS reality; it should show zero changes. 4) If validation fails, something in the backup no longer matches AWS; debug by comparing `terraform state show RESOURCE` output against the corresponding `aws ... describe-*` call. 5) Once validated, resume normal operations. 6) Preventive: add `lifecycle { prevent_destroy = true }` on critical resources. 7) Automate backups: a daily job such as `terraform state pull | aws s3 cp - s3://state-backups/$(date +%Y%m%d).json`, or enable S3 versioning on the backend bucket itself. 8) Set up CloudTrail alerts on state object deletion. 9) Test disaster recovery quarterly: intentionally delete a non-prod state and verify the recovery process works.
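The backup job from step 7 as a cron-friendly sketch; the bucket name is illustrative, and the key naming is factored into its own function so it can be checked without AWS access:

```shell
# Sketch: daily Terraform state backup to S3 (run from the root module).
backup_key() {
  # e.g. state-backups/20240115.json (UTC date, one object per day)
  printf 'state-backups/%s.json' "$(date -u +%Y%m%d)"
}

backup_state() {
  # Bucket name is illustrative; fails loudly if either CLI call fails
  terraform state pull | aws s3 cp - "s3://my-tf-backups/$(backup_key)" || exit 1
}

# crontab entry (illustrative):
#   0 3 * * * cd /opt/terraform/prod && /opt/scripts/backup_state.sh
```

Date-stamped keys keep a rolling history, so the "state from 2 hours ago" scenario always has a recent restore point.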
Follow-up: How would you verify the restored state is truly correct before resuming operations?
A data structure in state changed format due to a provider version update. The state file now has 100 resources with a mismatched schema, and Terraform refuses to work. How do you fix this state corruption?
Use state editing with a schema migration: 1) Extract the state: `terraform state pull > corrupted.json`. 2) Identify the mismatch: the `terraform plan` error names the resource with the wrong schema. 3) Compare against the provider's changelog and documentation to understand which fields changed. 4) Edit the JSON to match the new schema (for example, an attribute stored as a nested object that the new schema expects as a flattened string) and save the result as `corrected.json`. 5) Validate the JSON syntax: `jq . corrected.json`. 6) Increment the `serial` field, then push: `terraform state push corrected.json` (push is refused if the serial isn't higher than the remote state's). 7) Run `terraform plan` to validate; if errors remain, repeat the steps. 8) If the edits are too complex, revert the provider version temporarily so the old provider can read the state, then upgrade through intermediate versions so the provider's own state-upgrade code runs. 9) Test in non-prod first and document the fix for similar future issues.
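Steps 5-6 benefit from a pre-push sanity check on the hand-edited file. A sketch, assuming `jq` is available; it checks only structural fields, not provider schemas:

```shell
# Sketch: verify a hand-edited state file still has the top-level
# structure `terraform state push` expects before attempting the push.
check_state_file() {
  # jq -e exits nonzero when the expression is false or null
  jq -e 'has("serial") and has("version") and (.resources | type == "array")' \
    "$1" > /dev/null
}

# Usage (illustrative):
#   check_state_file corrected.json && terraform state push corrected.json
```

This catches truncated or syntactically broken edits cheaply; the authoritative check is still the `terraform plan` in step 7.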
Follow-up: How would you automate schema validation to catch these before they break state?
You're consolidating two company infrastructures after acquisition. Each has separate Terraform states with overlapping resources (same VPC, same databases, different names). Merging them would create conflicts. How do you unify?
Use a careful import strategy for the consolidation: 1) Audit both states: list every resource in each and identify the overlaps. 2) For each overlapping resource, decide which side is authoritative. 3) Build a new canonical root module with `module "legacy_app_a"` and `module "legacy_app_b"`. 4) Claim ownership of shared resources once, in the authoritative module: `terraform import 'module.legacy_app_a.aws_vpc.shared' vpc-abc123`. 5) In legacy_app_b, reference rather than manage: `data "aws_vpc" "shared" { id = var.shared_vpc_id }`, so legacy B never owns the resource. 6) Refactor both HCL codebases into the new module structure, separating shared and unique resources. 7) Bring legacy B's unique resources into the unified state via `terraform import`, or `terraform state mv -state=legacy-b.tfstate -state-out=unified.tfstate`; note that `terraform state push` replaces an entire state file and cannot merge two states. 8) Validate comprehensively: `terraform plan` must show zero changes. 9) Add governance: Sentinel policies to prevent resource conflicts, and a coordinated cutover scheduled with both teams.
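The ownership split from steps 4-5 as a minimal sketch; the resource names beyond those in the answer are illustrative:

```hcl
# legacy_app_a keeps managing the shared VPC (imported in step 4).
# legacy_app_b only reads it, so no resource is owned by two states.
data "aws_vpc" "shared" {
  id = var.shared_vpc_id
}

resource "aws_security_group" "b_app" {
  # attach legacy B's resources to the shared VPC without managing it
  vpc_id = data.aws_vpc.shared.id
}
```

The rule generalizes: every shared object gets exactly one managing module, and every other module consumes it through a data source or a remote-state output.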
Follow-up: How would you handle this merge if the two companies still need to operate independently post-acquisition?