You have 500 resources across 50 modules. Application servers depend on databases, databases depend on VPC, etc. Running `terraform apply` takes forever because it's creating resources sequentially instead of in parallel. How do you optimize the dependency graph?
Identify and eliminate unnecessary dependencies: 1) Visualize the graph: `terraform graph | dot -Tsvg > graph.svg` (requires Graphviz). Look for long chains where each resource waits on the previous one. 2) Identify implicit dependencies: some resources reference each other through variable chains when they could be independent. Example: two security groups that are not actually dependent on each other but both reference the VPC. Keep them independent: `resource "aws_security_group" "app" { vpc_id = aws_vpc.main.id }` and `resource "aws_security_group" "db" { vpc_id = aws_vpc.main.id }` run in parallel (both depend on the VPC, but not on each other). 3) Use `depends_on` only when genuinely necessary — every explicit `depends_on` serializes the graph further: `resource "aws_instance" "main" { depends_on = [aws_security_group.main] }`. 4) Remove implicit dependencies by refactoring: move shared data sources to a separate module and have both consumers read from it. 5) Run `terraform apply -parallelism=50` to increase parallel workers (the default is 10). 6) Monitor: watch the streamed apply output — resources whose `Creating...` lines interleave are being processed in parallel.
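The split in step 2 can be sketched as follows (resource names and the CIDR are illustrative):

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Both security groups reference only the VPC, not each other,
# so Terraform can create them in parallel once the VPC exists.
resource "aws_security_group" "app" {
  vpc_id = aws_vpc.main.id # depends only on the VPC
}

resource "aws_security_group" "db" {
  vpc_id = aws_vpc.main.id # independent of the "app" group
}
```

Avoid referencing `aws_security_group.app.id` from the `db` group unless a rule genuinely requires it — that one reference would serialize the two.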
Follow-up: How would you detect if adding parallelism causes race conditions or state corruption?
Two resources have complex interdependencies: SecurityGroup depends on VPC, VPC depends on Route tables, Route tables depend on SecurityGroup indirectly through application logic. Terraform can't resolve the cycle. How do you break it?
Refactor to eliminate the circular dependency: 1) Identify the cycle: Terraform reports it directly — `terraform validate` or `terraform plan` fails with `Error: Cycle:` followed by the resources involved (the `terraform graph` output itself never contains the word "circular"). 2) Typical cause: resource A's outputs are used by B, and B's outputs are used by A. 3) Solution: separate into stages. Stage 1 creates foundational resources (VPC, subnets). Stage 2 creates dependent resources (security groups). 4) Example: `resource "aws_vpc" "main" { ... }` (stage 1); `resource "aws_security_group" "app" { vpc_id = aws_vpc.main.id }` (stage 2). Split into `terraform/networking/` (VPC, subnets) and `terraform/security/` (SGs) if the configuration is too complex to stage in one root module. 5) Use `depends_on` in one direction only: `depends_on = [aws_vpc.main]` on the SG; never reference the SG from the VPC. 6) Use data sources to break cycles: instead of the VPC module outputting security groups, have it output only the VPC ID and let the security stack query whatever it needs from that ID. 7) Validate: `terraform validate` should now pass; if not, the graph still has a cycle.
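The staged layout in step 4 might look like this — a hedged sketch, assuming an S3 backend; the bucket name, paths, and resource names are illustrative. The dependency flows one direction only: security reads networking's outputs, never the reverse.

```hcl
# terraform/networking/outputs.tf -- stage 1 exposes only IDs
output "vpc_id" {
  value = aws_vpc.main.id
}

# terraform/security/main.tf -- stage 2 consumes the ID one-way
data "terraform_remote_state" "networking" {
  backend = "s3" # assumption: adjust to your backend
  config = {
    bucket = "example-tf-state" # illustrative bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```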
Follow-up: How would you detect circular dependencies automatically in CI/CD?
You use lifecycle hook `prevent_destroy = true` to protect production databases. But now you need to replace the instance due to maintenance. How do you safely bypass this protection?
Temporarily disable `prevent_destroy` for the planned replacement: 1) Schedule a maintenance window. 2) Modify the HCL: remove or comment out `lifecycle { prevent_destroy = true }`. 3) Run `terraform plan` to review the replacement strategy. For databases, add `create_before_destroy = true` to minimize downtime. 4) Apply: `terraform apply` recreates the resource. 5) For stateful resources such as databases, ensure data is preserved — e.g. set `final_snapshot_identifier` (with `skip_final_snapshot = false`) so RDS takes a snapshot before deletion. 6) Validate that the application keeps working during the swap (read replica, cached connections). 7) Re-enable protection: uncomment `prevent_destroy = true`. 8) Commit to git, documenting why and when the maintenance was done. 9) Note that you cannot parameterize this toggle: lifecycle meta-arguments accept only literal values, so `prevent_destroy = !var.allow_replace` is invalid HCL — the bypass must be an actual code change, which is arguably a feature, since it forces the change through code review. 10) Alert: log all replacements of protected resources and have on-call review each one.
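To make the snapshot safeguard concrete, a minimal sketch (identifier names are illustrative):

```hcl
# Protect the production DB, but guarantee a final snapshot
# exists so a planned replacement cannot lose data.
resource "aws_db_instance" "main" {
  identifier                = "prod-db"       # illustrative
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final" # illustrative

  lifecycle {
    # Comment this out (in a reviewed commit) only for the
    # maintenance window, then restore it immediately after.
    prevent_destroy = true
  }
}
```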
Follow-up: How would you handle this scenario if application has zero-downtime deployment?
You want to replace an EC2 instance (create new, delete old) but during the replacement window, the ALB target group briefly loses the instance causing dropped requests. Design zero-downtime replacement.
Use `create_before_destroy` with health checks: 1) Enable it: `resource "aws_instance" "app" { lifecycle { create_before_destroy = true } }`. 2) Terraform then creates the new instance (and its target group attachment) before destroying the old one, so the target group is never empty. 3) Tune the health check — note the target group resource needs its name label (`resource "aws_lb_target_group" "app"`, not just the type): `resource "aws_lb_target_group" "app" { health_check { healthy_threshold = 2 unhealthy_threshold = 2 timeout = 3 interval = 30 } }`. 4) During replacement: the new instance joins the target group and the ALB routes traffic to it after 2 successful health checks (~60 seconds at a 30-second interval). The old instance keeps serving traffic during the ramp-up. 5) Connection draining: set `deregistration_delay = 30` on the target group so existing connections complete before the old instance is removed. 6) Monitor ALB metrics during the replacement; if request latency spikes, investigate the health checks. 7) Test: dry-run in non-prod to verify the zero-downtime behavior. 8) Document that a replacement takes roughly 2 minutes (health checks plus draining).
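Pulled together, the pieces might look like this (names, AMI variable, and the VPC reference are illustrative assumptions):

```hcl
resource "aws_lb_target_group" "app" {
  name                 = "app-tg"
  port                 = 80
  protocol             = "HTTP"
  vpc_id               = aws_vpc.main.id # assumption: defined elsewhere
  deregistration_delay = 30              # let in-flight requests finish

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    interval            = 30
  }
}

resource "aws_instance" "app" {
  ami           = var.app_ami # assumption: defined elsewhere
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true # new instance exists before old is gone
  }
}

resource "aws_lb_target_group_attachment" "app" {
  target_group_arn = aws_lb_target_group.app.arn
  target_id        = aws_instance.app.id
}
```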
Follow-up: How would you handle this for a database instance that can't be duplicated?
You use `ignore_changes` to allow manual modifications to an auto-scaling group's desired_capacity. But Terraform constantly wants to reset it. Design lifecycle rules to allow manual scaling.
Use selective `ignore_changes`: 1) Add a lifecycle block: `resource "aws_autoscaling_group" "main" { lifecycle { ignore_changes = [desired_capacity] } }`. 2) This allows manual scaling via the AWS Console or API without Terraform fighting it. 3) Terraform still manages everything else (`min_size`, `max_size`, launch template). 4) For permanent capacity changes: update the HCL (`desired_capacity = 10`). 5) `terraform apply` won't push that change (because of `ignore_changes`), but git records the intent; to actually enforce a new baseline, temporarily remove the attribute from `ignore_changes` or make the same change via the Console. 6) Document for the team: `desired_capacity` is tuned in the AWS Console during incidents, then the HCL is updated afterward for permanent changes. 7) Prevent surprises with a comment in the HCL: `# NOTE: desired_capacity ignored to allow manual scaling. Update HCL after incident.` 8) Monitoring: set up a CloudWatch alarm that fires if `desired_capacity` differs from the expected value for more than 24 hours. 9) Periodic review: a script compares ignored values against HCL intent and flags divergence for review.
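A sketch of the full resource (subnet and launch template references are illustrative assumptions):

```hcl
resource "aws_autoscaling_group" "main" {
  # NOTE: desired_capacity is ignored below to allow manual scaling
  # during incidents. Update this value afterward to record intent.
  desired_capacity    = 4
  min_size            = 2
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids # assumption: defined elsewhere

  launch_template {
    id      = aws_launch_template.main.id # assumption: defined elsewhere
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
```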
Follow-up: How would you enforce that manual scaling changes are documented in a ticket?
You have a Lambda function that needs to be recreated when environment variables change, but Terraform doesn't detect these as drift. Using `triggers` could cause unnecessary recreations. Design smart update logic.
Force replacement with `replace_triggered_by` (a `triggers` argument exists on `null_resource` and `terraform_data`, not on `aws_lambda_function` or inside `lifecycle`): 1) First, note that Terraform does detect changes to a Lambda's `environment` block and applies them as an in-place update; forcing full recreation is only needed when an in-place update isn't enough. 2) To recreate on selected changes, track them in a helper resource: `resource "terraform_data" "env" { triggers_replace = jsonencode(var.environment_variables) }` (Terraform >= 1.4; on older versions use `null_resource` with a `triggers` map). 3) Reference it from the Lambda: `resource "aws_lambda_function" "worker" { lifecycle { replace_triggered_by = [terraform_data.env] } }` — the function is replaced only when the tracked value changes. 4) For hash-based tracking: `triggers_replace = md5(jsonencode({ env = var.env, vars = var.vars }))`. 5) Narrow the trigger to only the variables you care about: `triggers_replace = jsonencode({ db_password = var.db_password, database_url = var.db_url })`. 6) Combine with code changes: `triggers_replace = jsonencode({ code = filemd5("index.py"), env = jsonencode(var.env) })` — though for code, the native `source_code_hash` argument is the better mechanism. 7) Monitor: log all Lambda recreations; if they happen too often, investigate the trigger. 8) Test: verify the Lambda loses no state across recreation (it should be stateless). 9) Document which changes trigger a rebuild versus an in-place update.
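Since `aws_lambda_function` has no `triggers` argument of its own, the supported pattern is a `terraform_data` helper plus `replace_triggered_by` (requires Terraform >= 1.4) — a sketch with illustrative names; the IAM role and variables are assumed to be defined elsewhere:

```hcl
resource "terraform_data" "env" {
  # Replaced whenever the hash of the selected variables changes,
  # which in turn forces the Lambda below to be replaced.
  triggers_replace = sha1(jsonencode({
    db_url      = var.db_url
    db_password = var.db_password
  }))
}

resource "aws_lambda_function" "worker" {
  function_name    = "worker"                # illustrative
  role             = aws_iam_role.worker.arn # assumption: defined elsewhere
  handler          = "index.handler"
  runtime          = "python3.12"
  filename         = "worker.zip"
  source_code_hash = filebase64sha256("worker.zip")

  environment {
    variables = {
      DATABASE_URL = var.db_url
    }
  }

  lifecycle {
    # Full replacement (not an in-place update) when the
    # tracked variables change.
    replace_triggered_by = [terraform_data.env]
  }
}
```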
Follow-up: How would you prevent Lambda recreation during harmless changes (e.g., adding unused environment variable)?
During `terraform apply`, an RDS database is being replaced (old deleted, new created). Halfway through, the provider crashes. State is half-updated. Running apply again recreates what already exists. How do you design idempotent lifecycle operations?
Implement idempotent operations with state tracking: 1) Use `create_before_destroy = true` so the new RDS instance is created before the old one is deleted. If the provider crashes mid-way, state shows both; re-running apply deletes the old one (as intended) and skips creating the new one because it already exists. 2) Split the replacement into explicit stages with checkpoints instead of one big operation: stage 1 creates the new DB and verifies it with a health check; only after success does stage 2 delete the old one. 3) Add a manual gate: `prevent_destroy = true` makes any plan that would destroy the DB fail outright, so the replacement can only proceed after a deliberate, reviewed edit removes the flag once the new DB is verified. 4) Use state locking so no two applies run simultaneously. 5) Rely on Terraform's reconciliation: re-running `terraform apply` refreshes state and only changes what still differs, effectively resuming from the crash point. 6) Add observability: log each lifecycle action (create, update, delete) with a timestamp so a crash's exact failure point is visible. 7) Test: intentionally kill the provider process mid-apply in non-prod and verify the rerun succeeds.
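A hedged sketch of the gated lifecycle (identifiers are illustrative):

```hcl
resource "aws_db_instance" "main" {
  identifier                = "prod-db" # illustrative
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-pre-replace" # illustrative

  lifecycle {
    # Replacement DB exists before the old one is destroyed.
    create_before_destroy = true
    # Manual gate: any plan that would destroy this resource fails
    # until the flag is removed in a reviewed commit, after the
    # replacement has been verified.
    prevent_destroy = true
  }
}
```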
Follow-up: How would you automate recovery from mid-operation crashes?
You want to run custom scripts before destroying a resource (backup database, notify monitoring system). After creating a resource, run additional setup (install agent, configure). Design pre/post lifecycle hooks.
Use the `local-exec` provisioner for lifecycle hooks: 1) Before destroy: `resource "aws_db_instance" "main" { provisioner "local-exec" { when = destroy command = "bash backup.sh ${self.db_name}" } }` — this runs the backup script before the RDS instance is deleted. Note that destroy-time provisioners may reference only `self` (plus `count.index`/`each.key`), not other resources or variables. 2) After create: a provisioner without a `when` argument runs at creation time: `provisioner "local-exec" { command = "bash setup.sh ${self.address}" }` installs the agent post-creation (an RDS instance exposes `address`; `private_ip` is an EC2 attribute). 3) Store output: redirect to a file for audit: `command = "bash setup.sh ... > /var/log/setup.log 2>&1"`. 4) Error handling: a destroy-time provisioner that exits non-zero fails the run and the resource is not destroyed; use `on_failure = continue` to ignore errors and destroy anyway. 5) Avoid provisioners when possible: prefer `user_data`, cloud-init, or configuration management (Ansible); provisioners are a last resort. 6) For databases: use RDS event subscriptions instead of provisioners for backup triggers. 7) Document why the provisioners are needed and what they do. 8) Test: verify the destroy-time hook by running `terraform apply -destroy` (or `terraform destroy`) in non-prod.
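Both hooks together might be sketched as follows (the script names are illustrative; the database arguments are elided):

```hcl
resource "aws_db_instance" "main" {
  # ... database arguments ...

  # Creation-time hook (the default when no `when` is set).
  provisioner "local-exec" {
    command = "bash setup.sh ${self.address}" # illustrative script
  }

  # Destroy-time hook: runs before Terraform deletes the resource.
  # May reference only `self` (plus count.index / each.key).
  provisioner "local-exec" {
    when       = destroy
    command    = "bash backup.sh ${self.identifier} >> /var/log/backup.log 2>&1"
    on_failure = continue # destroy proceeds even if the backup fails
  }
}
```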
Follow-up: How would you handle a situation where pre-destroy script fails but you still need to destroy the resource?