Terraform Interview Questions

State Management, Backend, and Locking


You inherit a Terraform project where two team members simultaneously run `terraform apply` against the same S3 backend. State corruption occurs. The team now uses local state on their machines. How do you migrate to remote state with locking without downtime?

Set up an S3 backend with DynamoDB locking: create an S3 bucket with versioning enabled and a DynamoDB table whose partition key is a string attribute named `LockID`. First reconcile the team's local copies and pick the authoritative state (the one with the highest `serial` that matches the deployed infrastructure). Add a `backend "s3"` block pointing at the bucket, key, and region, referencing the lock table via `dynamodb_table` and setting `encrypt = true`. Then run `terraform init -migrate-state`; Terraform detects the existing local state and prompts to copy it to S3. Verify with `terraform state list` and compare the resource count against the local copy. The lock table prevents concurrent applies; versioning enables rollback to any earlier state version.
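A minimal sketch of the backend and lock table described above (bucket and table names are placeholders; in practice the lock table lives in a separate bootstrap configuration, since the backend must exist before this project can init against it):

```hcl
# Bootstrap configuration: the DynamoDB lock table
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S" # Terraform writes the lock ID as a string
  }
}

# Project configuration: the backend block
terraform {
  backend "s3" {
    bucket         = "tf-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```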

Follow-up: How would you implement state locking for a team of 15 engineers across 3 regions without creating per-region backends?

A `terraform apply` hangs, and you find a stale lock entry in DynamoDB from a crashed CI/CD process 6 hours ago. The team waits. What's your approach to safely unlock state and resume operations?

Use `terraform force-unlock` with the lock ID shown in the error message: `terraform force-unlock 123456789-abcdef`. Before doing this, verify the lock is truly stale: confirm in the CI/CD system (and its CloudWatch Logs) that the job holding the lock crashed and is not still mid-apply. Inspect the lock item directly: `aws dynamodb get-item --table-name terraform-locks --key '{"LockID":{"S":"tf-state/prod/terraform.tfstate"}}'` — the item's `Info` attribute records who acquired the lock, when, and for which operation. After unlocking, run `terraform plan` first to confirm the state is intact before resuming applies. Longer term, implement automatic cleanup, e.g. a scheduled Lambda that alerts on (or removes) locks older than a threshold such as 2 hours.

Follow-up: How do you prevent this scenario and add observability to detect stale locks automatically?

Your infrastructure spans dev, staging, and production. Currently all use the same backend bucket but different keys. An engineer deletes the wrong key in dev, affecting the state. How do you isolate backends by environment?

Use separate backends per environment, ideally in isolated AWS accounts or at minimum separate S3 buckets and lock tables. Note that Terraform does not allow variables or interpolation inside `backend` blocks — `bucket = "tf-state-${var.environment}"` fails validation — so declare a partial `backend "s3" {}` block and supply per-environment settings at init time: `terraform init -backend-config="./backends/${ENVIRONMENT}.hcl"`. Add IAM policies limiting access so dev engineers can read and write only dev state, never production. Keep root modules in separate directories (`terraform/dev/`, `terraform/prod/`) so applying to the wrong environment requires deliberately changing directories.
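Because interpolation is forbidden in `backend` blocks, the partial-configuration pattern might look like this (file names and values are illustrative):

```hcl
# --- main.tf: backend intentionally left partial ---
terraform {
  backend "s3" {}
}

# --- backends/dev.hcl: one file per environment, passed via ---
# ---   terraform init -backend-config=./backends/dev.hcl    ---
# bucket         = "tf-state-dev"
# key            = "dev/terraform.tfstate"
# region         = "us-east-1"
# dynamodb_table = "terraform-locks-dev"
# encrypt        = true
```

Each environment gets its own `.hcl` file; CI selects the file from the target environment, so the HCL itself never branches on environment.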

Follow-up: What monitoring and audit logging would you add to detect unauthorized state access attempts?

You need to migrate state from local backend to Terraform Cloud. Your state file is 150MB with 2000+ resources. How do you handle this without breaking production?

Replace the local backend with a `cloud` block (`cloud { organization = "my-org" workspaces { name = "production" } }`) and run `terraform init`; Terraform detects the existing local state and prompts to migrate it to the Terraform Cloud workspace. Rehearse the migration against a non-production workspace first. After migrating, run `terraform plan` and confirm it shows no changes before allowing any apply. Use `terraform state replace-provider` if provider source addresses differ. For a 150 MB state with 2000+ resources, expect the upload and first remote plan to be slow, so schedule a change freeze for the cutover window. Verify with `terraform state list | wc -l` and compare the count against the original state.
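A sketch of the cutover block (organization and workspace names are placeholders):

```hcl
# Replacing the previous backend block with a cloud block;
# the next `terraform init` offers to migrate the existing state.
terraform {
  cloud {
    organization = "my-org" # placeholder

    workspaces {
      name = "production" # placeholder
    }
  }
}
```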

Follow-up: How would you handle a migration where remote state validation fails midway through production apply?

Your organization requires state encryption at rest, audit logging of all state access, and automatic backup. Current setup uses unencrypted S3. What's your migration strategy?

Implement security in layers: 1) Enable default encryption with an `aws_s3_bucket_server_side_encryption_configuration` resource using `sse_algorithm = "aws:kms"` and a customer-managed key. 2) Enable S3 versioning and MFA delete on the state bucket. 3) Enable CloudTrail data events for the bucket so `s3:GetObject` and `s3:PutObject` calls are logged. 4) Automate backups with S3 lifecycle policies and cross-region replication. 5) Block public access with `aws_s3_bucket_public_access_block` and add a bucket policy denying unencrypted uploads. 6) Send S3 server access logs to a separate, locked-down bucket. 7) Run `terraform state pull > backup.tfstate` in a daily scheduled Lambda as an independent backup.
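The encryption, versioning, and public-access pieces of the list above might be sketched as follows (the `state` bucket and KMS key references are assumed to exist elsewhere in the configuration):

```hcl
# Default KMS encryption for every object written to the state bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.state.arn # assumed customer-managed key
    }
  }
}

# Versioning so every state write is retained and recoverable
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Block all public access to the state bucket
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```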

Follow-up: How would you audit which team member modified a specific resource in the state, and when?

You're implementing a disaster recovery plan. You need to replicate state to a secondary region automatically, verify integrity hourly, and be able to failover within 5 minutes. Design this system.

Use S3 cross-region replication (CRR) from the primary state bucket to a bucket in the secondary region, with DynamoDB global tables (or point-in-time recovery backups) for the lock table. Implement: 1) S3 CRR with Replication Time Control (RTC), which carries a 15-minute replication SLA. 2) An hourly Lambda that validates state integrity: fetch the state object from both regions, compare checksums (e.g. a SHA-256 of the body), and alert on mismatch. 3) A standby DynamoDB lock table in the secondary region via global tables or on-demand provisioning. 4) Route 53 health checks to drive the failover decision. 5) A documented failover runbook: point the backend config at the secondary bucket and run `terraform init -reconfigure`. 6) Quarterly tests: run `terraform plan` against the secondary state in an isolated environment and confirm it matches the deployed infrastructure.
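The CRR-with-RTC piece might be sketched like this (bucket, role, and rule names are placeholders; RTC requires a V2 rule with a `filter`, and metrics must be enabled alongside it):

```hcl
# Cross-region replication of the state bucket with a 15-minute RTC SLA
resource "aws_s3_bucket_replication_configuration" "state" {
  bucket = aws_s3_bucket.state_primary.id
  role   = aws_iam_role.replication.arn # assumed replication role

  rule {
    id     = "state-dr"
    status = "Enabled"

    filter {} # empty filter = replicate the whole bucket

    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket = aws_s3_bucket.state_secondary.arn

      # Replication Time Control: 15-minute SLA plus replication metrics
      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
```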

Follow-up: How would you ensure the failover state is consistent with the infrastructure actually deployed?

A developer runs `terraform destroy` in the wrong workspace, destroying the production database and wiping it from state. The state bucket has versioning enabled, so the version from 15 minutes ago is still in S3, but you don't know the exact HCL that created it. How do you recover?

Recover using a state-restore procedure: 1) Retrieve the previous state version from S3 versioning: `aws s3api get-object --bucket tf-state --key prod/terraform.tfstate --version-id ABC123 previous.tfstate`. 2) Validate it: `terraform state list -state=previous.tfstate`. 3) Push the restore: `terraform state push previous.tfstate`. 4) Plan without applying: `terraform plan -out=restore.plan` to see the proposed changes. 5) Review the plan carefully — resources the destroy actually deleted will show as creations. 6) Apply only if safe: `terraform apply restore.plan`. 7) Remember that restoring state does not restore data: recover the database itself from the most recent RDS snapshot. Add safeguards: `lifecycle { prevent_destroy = true }` on critical resources and a required approval step before any destroy in CI/CD.
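The `prevent_destroy` safeguard mentioned above might look like this (the resource details are illustrative, with required arguments elided):

```hcl
# Guardrail: Terraform refuses to plan any operation that would
# destroy this resource while prevent_destroy is set.
resource "aws_db_instance" "prod" {
  identifier     = "prod-db"        # placeholder
  engine         = "postgres"       # placeholder
  instance_class = "db.r6g.large"   # placeholder
  # ... remaining required arguments elided ...

  lifecycle {
    prevent_destroy = true
  }
}
```

Note that `prevent_destroy` only blocks destruction through Terraform; it does not protect against deletion via the console or CLI, so pair it with `deletion_protection` on the database itself.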

Follow-up: How would you automate detection and alert when someone attempts a destroy on production state?

Your Terraform state contains 150 resources but only 50 are still defined in HCL. Running `terraform plan` shows massive deletions. You need to identify which resources are safe to remove from state without affecting running infrastructure. What's your process?

Use state inspection plus infrastructure validation: 1) Export state: `terraform state pull > full.tfstate`. 2) Diff `terraform state list` against the resource addresses actually defined in HCL to enumerate the ~100 orphaned entries. 3) For each orphaned resource, first verify it still exists and is still needed, e.g. `aws ec2 describe-instances --instance-ids i-12345`. 4) If it should keep running but no longer be Terraform-managed (or is now managed elsewhere), remove it from state individually: `terraform state rm module.old.aws_instance.example` — `state rm` only makes Terraform forget the resource and never touches the real infrastructure. 5) If it should remain Terraform-managed, restore its HCL definition and re-import if necessary: `terraform import aws_instance.recovered i-12345` (import requires a matching resource block). 6) Run `terraform plan` after each removal to confirm no unexpected changes remain. 7) Document every removed resource for the audit trail.
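For the re-adoption path, Terraform 1.5+ offers a declarative alternative to the `terraform import` CLI: an `import` block, which is plannable and reviewable like any other change (the address and instance ID here are illustrative):

```hcl
# Declarative import: re-adopt an orphaned instance into state
# as part of a normal, reviewable plan/apply cycle.
import {
  to = aws_instance.recovered
  id = "i-12345" # placeholder instance ID
}

# With no matching resource block written yet, running
#   terraform plan -generate-config-out=generated.tf
# asks Terraform to draft the configuration from the real resource;
# review generated.tf, merge it into the codebase, then apply.
```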

Follow-up: How would you prevent this state drift in the first place with automation?
