Terraform Interview Questions

Performance and Large State Optimization


Your Terraform state file has 3000+ resources. `terraform plan` takes 45 minutes because it refreshes all resources against AWS API (rate limited). Developers are blocked waiting for feedback. How do you optimize?

Implement performance optimizations:

1) Disable refresh on plan: `terraform plan -refresh=false` skips the API refresh and finishes in seconds (it only diffs the HCL against the last-known state). Use it for rapid feedback, but run a full refreshing plan before any critical apply.
2) Selective refresh: `terraform apply -refresh-only -target=module.compute` (the standalone `terraform refresh` command is deprecated) refreshes only the compute module's resources. Much faster than a full refresh.
3) Refresh caching: Terraform has no built-in refresh cache. As a workaround, track the last refresh time (e.g. in a local file or CI artifact) and skip the refresh when it is recent enough.
4) Increase parallelism: `terraform plan -parallelism=50` refreshes up to 50 resources concurrently (default 10). A 5x increase in concurrency can cut refresh time substantially, but it also puts more pressure on the rate-limited AWS API, so tune it carefully.
5) Split state: instead of 3000 resources in one state, split into ~10 states of ~300 resources each. Each refreshes far faster, and `terraform_remote_state` data sources let them reference each other.
6) Prefer data sources for externally managed resources: they are read-only, so Terraform never plans changes for them, which shrinks the managed-resource count that must be refreshed and diffed.
7) Targeted testing: `terraform plan -target=aws_instance.web` gives quick feedback on a specific resource.
8) Monitor: `time terraform plan` shows elapsed time; track it across releases to catch degradation early.
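Point 6 can be sketched in HCL; a hedged example, where the VPC and its `Name` tag are hypothetical, not taken from the scenario:

```hcl
# Look up a rarely-changing, externally managed VPC instead of managing it
# in this (already large) state. Tag value is an assumed example.
data "aws_vpc" "shared" {
  filter {
    name   = "tag:Name"
    values = ["shared-services"]
  }
}

# Downstream resources reference the data source; Terraform never plans
# changes for it, so it does not add to the managed-resource count.
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.shared.id
}
```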

Follow-up: How would you detect when state size or refresh time becomes problematic?

You split Terraform state into 10 modules (networking, compute, data, etc.). Now a resource in module A depends on output from module B (VPC from networking, security group from compute). Cross-module references cause dependency issues. Design clean module dependencies.

Use `terraform_remote_state` for cross-module references:

1) Module A (networking) exports: `output "vpc_id" { value = aws_vpc.main.id }`.
2) Module B (compute) reads A's state: `data "terraform_remote_state" "network" { backend = "s3" config = { bucket = "tf-state" key = "networking/terraform.tfstate" region = "us-east-1" } }`.
3) Use it in resources: `resource "aws_security_group" "app" { vpc_id = data.terraform_remote_state.network.outputs.vpc_id }`.
4) Deployment order: apply networking first, then compute. CI/CD should enforce this: stage 1 deploys networking, stage 2 deploys compute.
5) Prevent circular dependencies: A must never reference B if B references A. Document the module hierarchy so the direction is unambiguous.
6) Enforce ordering in the pipeline, not in HCL: `depends_on` cannot reach across state boundaries, so networking-before-compute ordering has to live in the CI/CD pipeline (or a wrapper tool such as Terragrunt), not in the compute configuration.
7) Export a minimal interface: expose only the outputs other modules need; don't leak internal details.
8) Version the interface: if the networking module changes its outputs, treat that as a breaking change the compute module must handle deliberately.
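Laid out as files, the wiring above looks roughly like this (bucket, key, and region come from the answer; resource names are illustrative):

```hcl
# networking/outputs.tf -- the exporting side
output "vpc_id" {
  value = aws_vpc.main.id
}

# compute/main.tf -- the consuming side reads networking's state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "tf-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```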

Follow-up: How would you handle transient AWS API failures during cross-module refresh?

Your Terraform state file is 200MB with 5000 resources. It's becoming slow to manage and hard to edit manually if needed. You want to rotate old resources into archive state. Design state archival strategy.

Implement state lifecycle management:

1) Identify archival candidates: resources unchanged for, say, a year. Back up first: `terraform state pull > archive-2024.json`.
2) Remove from active state: `terraform state rm aws_instance.deprecated`. Note this only removes the resource from state; the underlying AWS resource keeps running, now unmanaged. Keep the backup so you can re-import if needed.
3) Archive module: `terraform state mv aws_instance.old module.archive.aws_instance.old` moves old resources under a `modules/archive/` module that owns deprecated resources, either in a separate state or clearly marked in the active one.
4) Separate backends: active resources in `prod.tfstate` (~1000 resources), archived ones in `archive-2024.tfstate` (~4000 resources).
5) Reference the archive if needed: `data "terraform_remote_state" "archive" { backend = "s3" config = { bucket = "tf-state" key = "archive-2024.tfstate" } }`.
6) Cleanup: once the archive has caused no issues for ~6 months, remove it entirely: `aws s3 rm s3://tf-state/archive-2024.tfstate`.
7) Monitor state size: alert if state exceeds ~500MB; size growth is an early warning of performance problems.
8) Retention policy: keep archives as long as compliance requires (e.g. 3 years), then delete.
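On Terraform 1.7+, the `terraform state rm` step can instead be expressed declaratively with a `removed` block, which forgets the resource on the next apply without destroying it (the resource address is illustrative):

```hcl
# Forget the deprecated instance: drop it from state but leave the
# underlying EC2 instance running (requires Terraform >= 1.7).
removed {
  from = aws_instance.deprecated

  lifecycle {
    destroy = false
  }
}
```

The advantage over the CLI command is that the removal is code-reviewed and applies consistently in every workspace.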

Consider splitting state by functionality instead:

1) Rather than time-based archival, split by domain: `prod-network.tfstate` (networking only), `prod-compute.tfstate` (EC2, ASG), `prod-data.tfstate` (RDS, DynamoDB).
2) Each state is smaller and refreshes faster.
3) Teams work independently: the network team owns the network state, the compute team owns the compute state.
4) Cross-references go through `terraform_remote_state` data sources.
5) Deployment order: networking first, then compute, which depends on it.
6) Benefit: 3-5 smaller states are easier to manage than one huge state.
7) Drawback: more coordination between teams.
8) Reach for this when the team grows past ~5 engineers or the state past ~1000 resources.
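The domain split amounts to giving each root module its own backend key. A minimal sketch, reusing the bucket name from the earlier answers and the keys named above:

```hcl
# network/backend.tf
terraform {
  backend "s3" {
    bucket = "tf-state"
    key    = "prod-network.tfstate"
    region = "us-east-1"
  }
}

# compute/backend.tf -- same bucket, different key, so each domain gets
# its own smaller, faster-refreshing state file.
terraform {
  backend "s3" {
    bucket = "tf-state"
    key    = "prod-compute.tfstate"
    region = "us-east-1"
  }
}
```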

Follow-up: How would you ensure archive state remains immutable for audit purposes?

You're using `count` to provision 100 EC2 instances dynamically based on list size. Changing list order (or removing one from middle) causes all instances after that to be destroyed and recreated. This is disruptive. How do you fix?

Use `for_each` with stable keys instead of `count`:

1) Old (fragile): `variable "instance_names" { default = ["web-1", "web-2", "web-3"] }` with `resource "aws_instance" "main" { count = length(var.instance_names) tags = { Name = var.instance_names[count.index] } }`. Because instances are addressed by list position, reordering or removing an element shifts every later index, so Terraform destroys and recreates those instances.
2) New (stable): `variable "instances" { default = { web-1 = { instance_type = "t3.medium" } web-2 = { instance_type = "t3.medium" } web-3 = { instance_type = "t3.large" } } }` with `resource "aws_instance" "main" { for_each = var.instances instance_type = each.value.instance_type tags = { Name = each.key } }`.
3) Now removing `web-1` destroys only `web-1`, not `web-2` and `web-3`.
4) Reordering has no effect; map keys are unordered.
5) Migration: remap existing state entries with `terraform state mv 'aws_instance.main[0]' 'aws_instance.main["web-1"]'`, once per instance.
6) Validate: after migration, `terraform plan` must show zero changes.
7) Document the convention: always use `for_each` with stable keys for variable-sized fleets.
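On Terraform 1.1+, the `terraform state mv` migration can also be done declaratively with `moved` blocks committed to the repo, so every workspace remaps on its next plan (addresses taken from the answer):

```hcl
# One moved block per instance: remap the positional count address to the
# stable for_each key. Safe to delete once all state has been migrated.
moved {
  from = aws_instance.main[0]
  to   = aws_instance.main["web-1"]
}

moved {
  from = aws_instance.main[1]
  to   = aws_instance.main["web-2"]
}

moved {
  from = aws_instance.main[2]
  to   = aws_instance.main["web-3"]
}
```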

Follow-up: How would you automate detection of problematic count usage in legacy code?

During `terraform apply`, a provider API call hangs (no response for 10 minutes). Terraform waits indefinitely. You can't cancel gracefully (kill process loses state lock). How do you design timeout and recovery?

Implement timeout and retry mechanisms:

1) Configure provider retries: `provider "aws" { max_retries = 5 }`. Many AWS resources also accept a `timeouts { create = "20m" }` block, so a hung create fails fast instead of waiting forever.
2) External timeout wrapper: `timeout 20m terraform apply` kills the process if it hasn't finished in 20 minutes.
3) Release the lock after a kill: `terraform force-unlock <LOCK_ID>` (the lock ID is printed in the lock error). Automate this in the wrapper's cleanup trap if your CI can capture the ID.
4) State recovery: after a kill, state may be inconsistent. Run `terraform apply -refresh-only` to resync state with AWS.
5) Identify the stuck resource: the logs show the last resource attempted. Run a targeted plan against it to check whether the creation actually succeeded despite the hang.
6) Retry with targeting: if the provider hung on one resource out of 100, re-apply only that one, e.g. `terraform apply -target='aws_instance.web[49]'`.
7) Escalate to AWS support: if the AWS API itself is hanging, investigate on the AWS side.
8) Circuit breaker: if observed API latency crosses a threshold (say 30 seconds), cancel the apply and alert instead of hanging indefinitely.
9) Monitor: track provider API response times and alert on sustained slowness.
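Points 2 and 3 can be combined in a small wrapper. A hedged sketch: the 20-minute limit, the `terraform` invocation, and the `LOCK_ID` handling are assumptions to adapt, not a prescribed workflow:

```shell
#!/usr/bin/env bash
# Run a long command under a hard timeout. GNU `timeout` exits with 124
# when the limit is hit, which we surface so the caller can clean up.
run_with_timeout() {
  local limit="$1"; shift
  timeout --signal=INT "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}; the state lock may still be held" >&2
  fi
  return "$rc"
}

# Assumed usage (LOCK_ID comes from the lock error message):
#   run_with_timeout 20m terraform apply -auto-approve ||
#     terraform force-unlock "$LOCK_ID"
```

Sending SIGINT first gives Terraform a chance to shut down gracefully before the hard kill.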

Follow-up: How would you prevent provider hangs from causing cascading failures in CI/CD?

Your Terraform deploys 500 resources. Most are independent, but 50 have complex dependencies. `terraform apply` serializes everything, even independent resources. You want parallelism but dependency graphs complicate it. How do you optimize?

Use explicit `depends_on` sparingly and lean on implicit dependencies:

1) Implicit dependencies: if resource A references B (e.g. `vpc_id = aws_vpc.main.id`), Terraform infers the edge automatically; no explicit `depends_on` needed.
2) Explicit only when necessary: add `depends_on = [aws_security_group.app]` only when A needs B but never references any of B's attributes.
3) Increase parallelism: `terraform apply -parallelism=50` runs up to 50 independent resources concurrently (default 10). With a wide graph, 500 resources can complete in roughly 10 waves, though dependency chains cap the real speedup.
4) Inspect the graph: `terraform graph | dot -Tsvg > graph.svg` visualizes dependencies; look for long chains.
5) Refactor long chains: a 50-resource chain (A->B->C->...->Z) is inherently serial. Restructure so unrelated resources hang off the chain's early nodes rather than its tail, or split the chain across modules.
6) Avoid unnecessary edges: reference only the attributes a resource genuinely needs; every reference is a graph edge that can serialize otherwise-independent work.
7) Observe concurrency: watch the live apply output (or run with `TF_LOG=debug`) to see how many resources are in flight at once.
8) Benchmark: `time terraform apply` before and after the parallelism optimization.
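Points 1 and 2 side by side, with illustrative resource names:

```hcl
# Implicit dependency: referencing aws_vpc.main.id IS the edge --
# Terraform creates the VPC before this subnet, no depends_on needed.
resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Explicit dependency: this instance never references the policy
# attachment's attributes, yet must not boot before the policy exists,
# so the edge has to be declared by hand.
resource "aws_instance" "worker" {
  ami           = "ami-12345678" # assumed AMI
  instance_type = "t3.medium"

  depends_on = [aws_iam_role_policy_attachment.worker]
}
```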

Follow-up: How would you detect if increasing parallelism causes race conditions?

Your organization uses Terraform Cloud with large state. Terraform Cloud API calls to state operations (plan, apply, output) are rate-limited to 30 req/min. When multiple teams apply simultaneously, they hit rate limits and get 429 errors. How do you handle this?

Implement rate-limit handling and coordination:

1) The constraint: ~30 state-related API requests per minute for the organization, shared across every team applying at once.
2) Option 1 - serialize applies: implement a queue so only one apply runs at a time per organization; CI/CD jobs wait their turn. Tradeoff: slower deploys.
3) Option 2 - split organizations: separate TFC orgs per team (auth-org, api-org, data-org), each with its own limit; three orgs triple the aggregate capacity (3 x 30 = 90 req/min).
4) Option 3 - upgrade the plan: higher TFC tiers and dedicated agents can come with higher limits; confirm the actual numbers with HashiCorp for your tier.
5) Option 4 - backoff with jitter: on a 429, wait (e.g. 60s) and retry with exponential backoff; random jitter (0-30s) prevents a thundering herd of synchronized retries.
6) Monitor: track 429 errors; alert if more than ~5% of applies hit the limit, a signal that capacity needs to grow.
7) Batch operations: group small applies into a single workspace run to reduce API calls.
8) Cache outputs: if multiple teams query the same outputs, fetch once and cache instead of repeating API calls.
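Option 4 as a small shell sketch. The attempt count, base delay, and wrapped command are assumptions; the jitter here is bounded by the base delay (slightly different from the fixed 0-30s above) so it collapses to zero when the base is zero:

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff plus jitter, for transient
# 429 rate-limit errors.
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"; shift 2
  local attempt=1
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    # base * 2^(attempt-1), plus random jitter bounded by the base delay
    local delay=$(( base_delay * (1 << (attempt - 1)) + RANDOM % (base_delay + 1) ))
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Assumed usage: retry_with_backoff 5 60 terraform output -json
```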

Follow-up: How would you advocate to TFC for higher rate limits?
