Terraform Interview Questions

Cross-Account and Multi-Region Patterns


Your company has 5 AWS accounts (dev, staging, prod, security, audit). A single Terraform deployment needs resources in all accounts. Current approach: run terraform 5 times with different credentials. This is error-prone and doesn't ensure consistency. Design a unified deployment approach.

Use provider aliases for cross-account management:

1) Define one provider per account, each assuming a role in that account: `provider "aws" { alias = "dev" assume_role { role_arn = "arn:aws:iam::111111111111:role/TerraformRole" } }` and similarly for prod, staging, security, and audit.
2) Reference the alias on resources: `resource "aws_s3_bucket" "dev_logs" { provider = aws.dev bucket = "logs-dev" }` and `resource "aws_s3_bucket" "prod_logs" { provider = aws.prod bucket = "logs-prod" }`.
3) A single apply then provisions all five accounts in one run. Note that apply is not atomic: a failure mid-run leaves earlier changes in place, which is why plan review matters.
4) Use modules for reusability, instantiated once per account: `module "logging_dev" { source = "./modules/logging" providers = { aws = aws.dev } }` and `module "logging_prod" { source = "./modules/logging" providers = { aws = aws.prod } }`.
5) Backend state: keep it in a central account, encrypted, with access tightly scoped.
6) Deploy: one `terraform apply` provisions all 5 accounts.
7) Validate: run `terraform plan` first to review changes across all accounts before applying.
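Assembled into a single file, the pattern looks roughly like this (account IDs, role name, and module path are placeholders):

```hcl
# One provider block per account, each assuming a deployment role there.
provider "aws" {
  alias = "dev"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/TerraformRole" # placeholder
  }
}

provider "aws" {
  alias = "prod"
  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/TerraformRole" # placeholder
  }
}

# One reusable module, instantiated once per account with an explicit provider.
module "logging_dev" {
  source    = "./modules/logging"
  providers = { aws = aws.dev }
}

module "logging_prod" {
  source    = "./modules/logging"
  providers = { aws = aws.prod }
}
```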

Follow-up: How would you handle authentication failures in one account without affecting others?

You need to deploy the same infrastructure to 6 AWS regions (us-east-1, us-west-2, eu-west-1, etc.). Replicating HCL 6 times creates maintenance burden. Design multi-region approach.

Use per-region provider aliases with a shared module:

1) Define one provider block per region. Provider blocks cannot use `for_each` — they must be written out statically — and alias names must be valid identifiers (no hyphens): `provider "aws" { alias = "use1" region = "us-east-1" }`, `provider "aws" { alias = "usw2" region = "us-west-2" }`, and so on.
2) Put the shared infrastructure in a module and instantiate it once per region. A module with `for_each` cannot vary its provider per instance, so one module block per region is the standard pattern: `module "region_use1" { source = "./modules/region" providers = { aws = aws.use1 } }`.
3) Parameterize per-region differences through variables rather than duplicated HCL, e.g. `module "region_use1" { ... cidr_block = "10.0.0.0/16" }`, driven by a lookup map such as `region_config = var.region_config["us-east-1"]`.
4) State: a single backend tracks all regions.
5) Deploy: `terraform apply` provisions all regions in one run.
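A minimal multi-region layout under these constraints might look like the following (alias names, module path, and CIDRs are placeholders):

```hcl
# Provider aliases must be declared statically and cannot contain hyphens.
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "usw2"
  region = "us-west-2"
}

# One module block per region: Terraform cannot select a provider dynamically
# inside for_each, so each instantiation names its alias explicitly.
module "region_use1" {
  source     = "./modules/region"
  providers  = { aws = aws.use1 }
  cidr_block = "10.0.0.0/16" # per-region variable declared in the module
}

module "region_usw2" {
  source     = "./modules/region"
  providers  = { aws = aws.usw2 }
  cidr_block = "10.1.0.0/16"
}
```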

Follow-up: How would you handle region-specific resources that don't exist in all regions?

You've deployed infrastructure to 6 regions via single Terraform config. A bug in one region's security group needs fixing quickly. Fixing HCL and applying globally is too risky. How do you target fix to one region?

Use targeted apply:

1) Fix the HCL for the affected region.
2) Preview the blast radius first: `terraform plan -target=module.region_usw2` shows only that region's changes.
3) Apply to the single region: `terraform apply -target=module.region_usw2`, or narrow further to one resource: `terraform apply -target=module.region_usw2.aws_security_group.main`.
4) For emergencies, `-auto-approve` skips the confirmation prompt: `terraform apply -target=... -auto-approve`.
5) For safety, even targeted applies should pass through an approval gate in CI/CD. Note that Terraform warns on `-target` because it bypasses the full dependency graph — treat it as an incident tool, not a routine workflow.
6) Document in the HCL why the region-specific fix exists and when it should be rolled out to the other regions.
7) Test: before rolling out further, verify the fix works in the first region.
8) Gradual rollout: apply to one region, monitor, then repeat for the next.
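The gradual rollout in the last step could be scripted along these lines — `smoke-test.sh` and the `module.region_*` naming are hypothetical stand-ins for your own validation step and module layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Roll the fix out one region at a time, validating between regions.
for region in us-west-2 us-east-1 eu-west-1; do
  target="module.region_${region//-/_}"   # e.g. module.region_us_west_2 (hypothetical naming)
  terraform plan  -target="$target" -out="fix-${region}.tfplan"
  terraform apply "fix-${region}.tfplan"
  # Stop the rollout on the first region that fails validation.
  ./smoke-test.sh "$region" || { echo "validation failed in ${region}" >&2; exit 1; }
done
```

Because `-target` is given at plan time and the saved plan is applied verbatim, each iteration touches exactly one region.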

Follow-up: How would you automate gradual region rollout with validation between each?

Cross-account infrastructure creates cross-account relationships: VPC peering from prod account to audit account, S3 bucket in prod accessible by role in audit account. How do you manage these relationships?

Define the relationship as resources on both sides:

1) In the prod account, create the peering request: `resource "aws_vpc_peering_connection" "to-audit" { provider = aws.prod vpc_id = aws_vpc.prod.id peer_vpc_id = aws_vpc.audit.id peer_owner_id = data.aws_caller_identity.audit.account_id }`.
2) In the audit account, accept it: `resource "aws_vpc_peering_connection_accepter" "from-prod" { provider = aws.audit vpc_peering_connection_id = aws_vpc_peering_connection.to-audit.id auto_accept = true }`.
3) For S3, the prod bucket policy grants the audit role read access: `resource "aws_s3_bucket_policy" "prod_bucket" { bucket = aws_s3_bucket.prod.id policy = jsonencode({ Version = "2012-10-17" Statement = [{ Effect = "Allow" Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.audit.account_id}:role/AuditRole" } Action = "s3:GetObject" Resource = "${aws_s3_bucket.prod.arn}/*" }] }) }`.
4) A single apply creates both sides: Terraform infers the ordering from the references (the peering request is created before its accepter).
5) Add `depends_on` only where Terraform cannot infer the order from references.
6) Document which account owns which resource, and why.
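One detail the peering resources alone don't cover: traffic only flows once each side adds a route. A sketch, assuming route tables named `prod` and `audit` exist and using placeholder CIDRs:

```hcl
# Route from the prod VPC to the audit VPC over the peering connection.
resource "aws_route" "prod_to_audit" {
  provider                  = aws.prod
  route_table_id            = aws_route_table.prod.id
  destination_cidr_block    = "10.2.0.0/16" # audit VPC CIDR (placeholder)
  vpc_peering_connection_id = aws_vpc_peering_connection.to-audit.id
}

# And the return route in the audit account.
resource "aws_route" "audit_to_prod" {
  provider                  = aws.audit
  route_table_id            = aws_route_table.audit.id
  destination_cidr_block    = "10.0.0.0/16" # prod VPC CIDR (placeholder)
  vpc_peering_connection_id = aws_vpc_peering_connection.to-audit.id
}
```

Security group rules referencing the peer CIDR are the usual remaining step.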

Follow-up: How would you handle account-to-account trust policies safely to prevent over-permissioning?

Your multi-region state file grows to 2000+ resources. Terraform plan takes 30 minutes because it refreshes all regions. Adding new region takes 45 minutes for first plan. How do you optimize refresh performance?

Optimize refresh and state operations:

1) Skip refresh when you know AWS hasn't drifted: `terraform plan -refresh=false -out=plan.tfplan` checks only HCL changes.
2) Refresh selectively: `terraform plan -refresh-only -target='aws_vpc.main["us-east-1"]'`. (The standalone `terraform refresh` command is deprecated in favor of `-refresh-only`.)
3) Target at plan time, not apply time: `-target` cannot be combined with a saved plan file, so run `terraform plan -target=module.region_euw1 -out=plan.tfplan`, then `terraform apply plan.tfplan`.
4) Raise parallelism: `terraform apply -parallelism=50 plan.tfplan` allows 50 concurrent resource operations (default 10). This speeds refresh across many resources, though provider API rate limits become the ceiling.
5) Split state files: instead of a single state for all regions, use `terraform/us-east/`, `terraform/us-west/`, etc. Each smaller state refreshes faster; reference across states with `terraform_remote_state` data sources.
6) Cache provider plugins with `TF_PLUGIN_CACHE_DIR` and pin versions in the dependency lock file (`terraform providers lock`) so `init` doesn't re-download them.
7) Monitor: add timing logs to identify the slowest regions.
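With split state files, a cross-state reference might look like this (bucket, key, and output names are placeholders — the `us-east` configuration is assumed to export `vpc_cidr` and `peering_id` outputs):

```hcl
# In terraform/us-west/: read outputs published by the us-east configuration.
data "terraform_remote_state" "us_east" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"   # placeholder
    key    = "us-east/terraform.tfstate" # placeholder
    region = "us-east-1"
  }
}

# Example consumer: route to the peer region's CIDR, read from the remote state.
resource "aws_route" "to_us_east" {
  route_table_id            = aws_route_table.main.id
  destination_cidr_block    = data.terraform_remote_state.us_east.outputs.vpc_cidr
  vpc_peering_connection_id = data.terraform_remote_state.us_east.outputs.peering_id
}
```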

Follow-up: How would you handle cross-region data dependencies with split state files?

You're deploying infrastructure to 3 AWS accounts and 2 regions (prod-us-east, prod-eu, dev-us-east). Different environments need different CIDR blocks, instance types, etc. How do you manage this combinatorial explosion of configurations?

Use an environment matrix with variable overrides:

1) Define the matrix: `variable "environments" { default = { prod-us-east = { account_id = "111111111111", region = "us-east-1", cidr = "10.0.0.0/16", instance_type = "t3.large" }, prod-eu = { account_id = "222222222222", region = "eu-west-1", cidr = "10.1.0.0/16", instance_type = "t3.large" }, dev-us-east = { account_id = "333333333333", region = "us-east-1", cidr = "10.2.0.0/16", instance_type = "t3.micro" } } }`.
2) Loop over the matrix: `module "env" { for_each = var.environments source = "./modules/env" environment_name = each.key environment_config = each.value }`. Caveat: `providers` cannot be selected dynamically per `for_each` instance, so group environments sharing an account/region under one module block, or use a separate root configuration per account/region combination.
3) Each module instance receives only its own config object.
4) Validate with `terraform console`: test config lookups before applying.
5) Keep configs in data files where helpful: `yamldecode(file("environments.yaml"))` or `jsondecode(file("environments.json"))`; plain `.tfvars` files are loaded by Terraform directly, without `file()`.
6) Optionally expose a single-environment view: `locals { current_env = var.environments[var.target_env] }` for roots that manage one environment at a time.
7) Deploy: `terraform apply -target='module.env["prod-us-east"]'` for a single environment, or a bare `terraform apply` for all.
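The matrix itself can be typed so malformed entries fail at plan time; a sketch with placeholder values:

```hcl
variable "environments" {
  # An explicit object type rejects entries with missing or extra keys.
  type = map(object({
    account_id    = string
    region        = string
    cidr          = string
    instance_type = string
  }))
  default = {
    prod-us-east = { account_id = "111111111111", region = "us-east-1", cidr = "10.0.0.0/16", instance_type = "t3.large" }
    prod-eu      = { account_id = "222222222222", region = "eu-west-1", cidr = "10.1.0.0/16", instance_type = "t3.large" }
    dev-us-east  = { account_id = "333333333333", region = "us-east-1", cidr = "10.2.0.0/16", instance_type = "t3.micro" }
  }
}

module "env" {
  for_each = var.environments
  source   = "./modules/env"

  environment_name   = each.key
  environment_config = each.value
  # Note: providers cannot vary per for_each instance; environments needing
  # different accounts/regions require separate module blocks or root configs.
}
```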

Follow-up: How would you detect configuration conflicts (e.g., CIDR overlaps) across environments?

A production incident occurs in one region: network is misconfigured. You need to rollback that region to previous state while other regions continue operating. Design rollback strategy.

Use region-specific state recovery:

1) Identify the affected region.
2) Keep state backups: snapshot state regularly, e.g. `terraform state pull > backup.json && aws s3 cp backup.json s3://backups/$(date +%Y%m%d).json`. With an S3 backend, enabling bucket versioning gives you this for free.
3) With a single shared state, surgical recovery is painful: you can inspect a backup with `jq` to find the affected region's resources, but there is no supported command to merge a subset of an old state into the current one. In practice you either `terraform state rm` the broken entries and `terraform import` the correct resources, or roll back the code and re-apply.
4) Validate: `terraform plan -target=module.region_usw2` should show no unexpected diff once state matches AWS.
5) The faster path is separate state files per region: rollback then means restoring one file. From that region's working directory: `terraform state push region-backup.json` (add `-force` if the backup's serial is older than the current state's).
6) Remember that restoring state does not roll back AWS itself; after the restore, run plan and apply so the real infrastructure converges to the known-good configuration.
7) Document when the rollback happened, what was restored, and the incident's root cause.
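Assuming per-region state directories, backup and rollback reduce to a few commands (bucket and file names are placeholders):

```shell
# Daily backup of one region's state.
cd terraform/us-west-2
terraform state pull > "state-$(date +%Y%m%d).json"
aws s3 cp "state-$(date +%Y%m%d).json" s3://terraform-state-backups/us-west-2/

# Rollback: overwrite the remote state from a known-good snapshot.
terraform state push -force state-20240101.json  # -force needed if the backup's serial is older
terraform plan                                   # confirm state matches AWS before any apply
```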

Follow-up: How would you prevent the issue that caused the rollback in the first place?

You have data stored in prod account S3 bucket. A batch job in dev account needs to read this data. Cross-account access must be minimal (principle of least privilege) and auditable. Design secure access.

Use fine-grained IAM policies with auditing. Cross-account S3 access requires an allow on both sides: a bucket policy in prod and an identity policy in dev.

1) In prod, the bucket policy grants the specific dev role read-only access: `resource "aws_s3_bucket_policy" "prod_data" { provider = aws.prod bucket = aws_s3_bucket.data.id policy = jsonencode({ Version = "2012-10-17" Statement = [{ Sid = "AllowDevRead" Effect = "Allow" Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.dev.account_id}:role/BatchJobRole" } Action = ["s3:GetObject", "s3:ListBucket"] Resource = [aws_s3_bucket.data.arn, "${aws_s3_bucket.data.arn}/*"] }] }) }`.
2) In dev, create the role the batch job runs as: `resource "aws_iam_role" "batch_job" { provider = aws.dev assume_role_policy = jsonencode({ ... allow ecs task role ... }) }`.
3) Attach an identity policy scoped to the same actions and resources: `resource "aws_iam_role_policy" "batch_access" { provider = aws.dev role = aws_iam_role.batch_job.id policy = jsonencode({ Version = "2012-10-17" Statement = [{ Effect = "Allow" Action = ["s3:GetObject", "s3:ListBucket"] Resource = [aws_s3_bucket.data.arn, "${aws_s3_bucket.data.arn}/*"] }] }) }`.
4) Audit: enable CloudTrail in prod, including S3 data events, to log all cross-account object access.
5) Monitor: set up a CloudWatch alert if dev-account access volume increases anomalously.
6) Rotate: if the dev team or job changes, update the IAM policies immediately.
7) Test: verify the dev role can read but not write or delete.
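The audit step needs S3 data events, which CloudTrail does not log by default. A sketch, assuming a log bucket `aws_s3_bucket.trail_logs` with a CloudTrail-permitting bucket policy already exists:

```hcl
# Log every read of objects in the shared data bucket, so cross-account
# GetObject calls from the dev role are auditable.
resource "aws_cloudtrail" "s3_audit" {
  provider       = aws.prod
  name           = "s3-data-events"
  s3_bucket_name = aws_s3_bucket.trail_logs.id # assumed pre-existing log bucket

  event_selector {
    read_write_type           = "ReadOnly"
    include_management_events = false

    data_resource {
      type   = "AWS::S3::Object"
      values = ["${aws_s3_bucket.data.arn}/"] # trailing slash = all objects in the bucket
    }
  }
}
```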

Follow-up: How would you detect and alert on unauthorized cross-account access attempts?
