Terraform Interview Questions

Data Sources and Remote State References


Your VPC was created manually before Terraform adoption. Now Terraform modules need to reference that VPC ID. Recreating the VPC via Terraform is risky. How do you query existing infrastructure without managing it?

Use data sources to query existing resources:
1) Query the VPC by tag: `data "aws_vpc" "main" { tags = { Name = "prod-vpc" } }`.
2) Use its attributes: `resource "aws_security_group" "app" { vpc_id = data.aws_vpc.main.id }`. Terraform reads the VPC ID without managing the VPC.
3) Data sources vs. resources: resources create and manage infrastructure; data sources are read-only lookups, so Terraform never plans changes against them.
4) For subnets: `data "aws_subnets" "private" { filter { name = "tag:Type" values = ["private"] } }` queries all private subnets.
5) For security groups: `data "aws_security_group" "default" { vpc_id = data.aws_vpc.main.id name = "default" }` finds the default SG.
6) Advantage: Terraform manages only the new resources; the existing VPC stays untouched.
7) Danger: if the VPC is later deleted manually, the data source lookup fails and every plan breaks. Add monitoring that verifies referenced resources still exist.
8) Documentation: record that the VPC is external (not managed by Terraform) and when it can be deprecated.
9) Migration path: when it is safe to bring the VPC under Terraform, convert the data source to a resource block and import it: `terraform import aws_vpc.main vpc-12345`.
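The pattern above, assembled into one minimal sketch (the `Name` tag value and the `tag:Type` convention are assumptions for illustration):

```hcl
# Read-only lookup of the manually created VPC by its Name tag
data "aws_vpc" "main" {
  tags = {
    Name = "prod-vpc" # assumed tag value
  }
}

# Private subnets within that VPC, selected by an assumed Type tag
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  filter {
    name   = "tag:Type"
    values = ["private"]
  }
}

# A new, Terraform-managed resource that references the unmanaged VPC
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.main.id
}
```

Only `aws_security_group.app` appears in state as a managed object; the VPC and subnets are re-read on each plan.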

Follow-up: How would you detect if a referenced data source resource no longer exists?

You have 3 environments (dev, staging, prod) managed in separate Terraform states. App deployment in staging needs the prod database endpoint. How do you reference cross-environment state?

Use `terraform_remote_state` for cross-environment references:
1) In the prod configuration (`terraform/prod/`), output the database endpoint: `output "rds_endpoint" { value = aws_db_instance.main.endpoint }`.
2) In the staging configuration (`terraform/staging/`), reference the prod state: `data "terraform_remote_state" "prod" { backend = "s3" config = { bucket = "tf-state" key = "prod/terraform.tfstate" region = "us-east-1" } }`.
3) Use the output in a staging resource, for example by publishing it to Parameter Store: `resource "aws_ssm_parameter" "prod_db_endpoint" { name = "/staging/prod-db-endpoint" type = "String" value = data.terraform_remote_state.prod.outputs.rds_endpoint }`.
4) This reads the prod state file, extracts the output, and uses it in a staging resource.
5) Benefit: staging can read prod configuration without managing prod resources.
6) Deployment order: prod must deploy first so its state file and outputs exist; staging depends on it.
7) Backup: if the prod state is lost, staging plans break. Keep prod state backups.
8) Permissions: use a separate AWS account per environment. The staging role can read the prod state bucket (`s3:GetObject`) but not write to it.
9) Monitoring: alert if the prod state is unavailable; staging deploys will fail.
10) Documentation: record which outputs each environment provides and which environments consume them.
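Wired together, the staging side might look like this sketch (the bucket name, state key, parameter name, and `rds_endpoint` output name are assumptions):

```hcl
# Staging reads prod's state file from S3 (read-only access)
data "terraform_remote_state" "prod" {
  backend = "s3"
  config = {
    bucket = "tf-state"               # assumed state bucket
    key    = "prod/terraform.tfstate" # assumed state key
    region = "us-east-1"
  }
}

# Expose the prod endpoint to staging workloads via SSM Parameter Store
resource "aws_ssm_parameter" "prod_db_endpoint" {
  name  = "/staging/prod-db-endpoint"
  type  = "String"
  value = data.terraform_remote_state.prod.outputs.rds_endpoint
}
```

Note the asymmetry: staging consumes `outputs.rds_endpoint` but holds no provider credentials for prod, so it cannot modify prod resources.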

Follow-up: How would you handle circular dependencies between environment states?

You use `terraform_remote_state` for cross-environment references. A team member accidentally destroys the prod state file. Staging tries to read it and gets an error, so the app deployment fails. How do you prevent and recover?

Protect remote state and handle failures gracefully:
1) Prevention:
1a) State bucket versioning: `aws s3api put-bucket-versioning --bucket tf-state --versioning-configuration Status=Enabled`.
1b) MFA delete: pass `--versioning-configuration Status=Enabled,MFADelete=Enabled` together with `--mfa` so permanently deleting versions requires an MFA code.
1c) A bucket policy that denies object deletion.
1d) Enable CloudTrail data events on the state bucket.
1e) Read-only backup: copy the state to a separate, immutable bucket daily.
2) Staging resilience: handle a missing output with the `try()` function: `endpoint = try(data.terraform_remote_state.prod.outputs.rds_endpoint, "default-endpoint")` falls back to a default. Caveat: `try()` covers a missing output key; if the state object itself is unreadable, the data source read fails before `try()` evaluates.
3) Alerts: CloudWatch alarm if state bucket access is denied or the object is deleted.
4) Recovery from backup: `aws s3 cp s3://backups/prod-state-2024-03-15.json s3://tf-state/prod/terraform.tfstate` restores the file (with versioning enabled, you can instead restore the previous object version).
5) State verification: after restore, confirm staging can read it: run `terraform console` and evaluate `data.terraform_remote_state.prod.outputs`.
6) Incident postmortem: investigate how the state was deleted. Was it accidental? Did the bucket policy fail?
7) Communication: notify the team that prod state is restored and staging is safe again.
8) Documentation: keep the backup/restore procedure in a runbook.
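The `try()` fallback from step 2 in context, as a sketch (the placeholder endpoint value is an assumption, and the caveat from step 2 applies):

```hcl
data "terraform_remote_state" "prod" {
  backend = "s3"
  config = {
    bucket = "tf-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Falls back to a placeholder if the rds_endpoint output is absent,
  # so staging plans can proceed. If the state object itself cannot be
  # read, the data source errors before this expression is evaluated.
  prod_db_endpoint = try(
    data.terraform_remote_state.prod.outputs.rds_endpoint,
    "db-placeholder.internal:5432" # assumed default
  )
}
```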

Follow-up: How would you prevent accidental state deletion in the first place?

You're querying 50 AMIs (Amazon Machine Images) by filter to get the latest. Writing 50 separate data sources is tedious. How do you query multiple resources efficiently?

Use data source filtering and `for_each` for bulk queries:
1) Single data source with a filter: `data "aws_ami" "latest" { most_recent = true owners = ["099720109477"] filter { name = "name" values = ["ubuntu/images/*ubuntu-jammy-22.04-*"] } }` gets the latest Ubuntu 22.04. Note that `owners` is required, and Ubuntu AMI names carry a path-style prefix, so anchor the pattern accordingly.
2) For multiple AMI types, build a map and iterate: `locals { ami_filters = { ubuntu = { name = "ubuntu/images/*ubuntu-jammy-*", owner = "099720109477" } amazon-linux = { name = "amzn2-ami-hvm-*", owner = "137112412989" } windows = { name = "Windows_Server-2022-English-Full-Base-*", owner = "801119661308" } } }` then `data "aws_ami" "main" { for_each = local.ami_filters most_recent = true owners = [each.value.owner] filter { name = "name" values = [each.value.name] } }`.
3) Collect the results: `locals { ami_ids = { for k, v in data.aws_ami.main : k => v.id } }`. Now all AMI IDs are keyed by type.
4) Reference them: `resource "aws_instance" "ubuntu" { ami = data.aws_ami.main["ubuntu"].id }`.
5) Advanced: query by tag (for AMIs you own): `data "aws_ami" "by_tag" { owners = ["self"] most_recent = true filter { name = "tag:Version" values = ["v1.2.0"] } }` finds the AMI with a specific tag.
6) Caching: Terraform reads each data source once per plan/apply; there is no re-querying within a single operation.
7) Test: use `terraform console` to verify the filters work: evaluate `data.aws_ami.main["ubuntu"].id`.
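The `for_each` pattern assembled into one sketch. The owner IDs are the commonly published Canonical and Amazon account IDs and the name patterns are assumptions; verify both for your region:

```hcl
locals {
  ami_filters = {
    ubuntu = {
      name  = "ubuntu/images/*ubuntu-jammy-22.04-*"
      owner = "099720109477" # Canonical
    }
    amazon-linux = {
      name  = "amzn2-ami-hvm-*"
      owner = "137112412989" # Amazon
    }
  }
}

# One data block, one lookup per map entry
data "aws_ami" "main" {
  for_each    = local.ami_filters
  most_recent = true
  owners      = [each.value.owner]

  filter {
    name   = "name"
    values = [each.value.name]
  }
}

# All resolved AMI IDs keyed by type, e.g. local.ami_ids["ubuntu"]
locals {
  ami_ids = { for k, v in data.aws_ami.main : k => v.id }
}
```

Adding a new AMI type is now a one-line change to `local.ami_filters` rather than a new data block.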

Follow-up: How would you handle a data source query that returns multiple matches when you expect exactly one?

You have 2000-resource Terraform state. Refreshing all data sources to validate they still exist in AWS takes 15 minutes. This slows development. How do you optimize data source refresh?

Optimize data source performance:
1) Avoid unnecessary dependencies: use `depends_on` on data sources sparingly; a spurious dependency forces the read to be deferred and re-evaluated more often than needed.
2) Selective refresh: `terraform apply -refresh-only -target=aws_instance.app` refreshes only the targeted address (the legacy `terraform refresh -target=...` still works).
3) Plan without refresh: `terraform plan -refresh=false` skips re-syncing managed resources for faster feedback. Note that data sources are still read at plan time, which is why step 5 matters for slow queries. Do a full refresh before applying.
4) Lean on state: Terraform already stores the last data source results in state, but re-reads them on each plan; a separate outputs layer is the only real cache.
5) Separate data module: keep slow lookups in their own root module (`terraform/data/`) and apply it only when the underlying data changes. App modules read the pre-resolved values via remote state.
6) Parallelism: `-parallelism=50` raises concurrent operations from the default of 10, so many data sources are read simultaneously.
7) Narrow the queries: instead of a broad `aws_subnets` query, add filters such as `filter { name = "tag:Environment" values = ["prod"] }`. Narrower queries return faster.
8) Monitor: track refresh time per data source and alert if any exceeds ~10 seconds (it might indicate an AWS API issue).
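The "separate data module" idea from step 5 as a sketch; the bucket/key names, AMI filter, and output name are assumptions:

```hcl
# --- terraform/data/main.tf: slow lookups live here, applied rarely ---
data "aws_ami" "app" {
  most_recent = true
  owners      = ["099720109477"] # Canonical (assumed)
  filter {
    name   = "name"
    values = ["ubuntu/images/*ubuntu-jammy-22.04-*"]
  }
}

output "app_ami_id" {
  value = data.aws_ami.app.id
}

# --- terraform/app/main.tf: reads the pre-resolved ID, no AMI query ---
data "terraform_remote_state" "shared_data" {
  backend = "s3"
  config = {
    bucket = "tf-state"               # assumed
    key    = "data/terraform.tfstate" # assumed
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = data.terraform_remote_state.shared_data.outputs.app_ami_id
  instance_type = "t3.micro"
}
```

The app module now performs one small S3 read instead of dozens of provider API queries on every plan.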

Follow-up: How would you handle a situation where data source queries are being rate-limited by AWS?

You reference production state from staging environment using `terraform_remote_state`. A junior engineer modifies the staging code to update (not just read) a resource from prod state. They accidentally modify prod infrastructure. How do you prevent this?

Enforce read-only remote state references:
1) IAM permissions: the staging role gets only `s3:GetObject` (plus `s3:ListBucket`) on the prod state bucket, no write permissions. Note that `terraform_remote_state` is read-only by design; the real risk is staging code gaining a provider configuration or credentials that point at prod.
2) Code review: reviewers check that prod is referenced only through `terraform_remote_state` outputs, and that no staging provider block, resource, or import targets prod.
3) Policy as code: a Sentinel or OPA/Conftest rule on the plan JSON (plain tflint cannot express this) that denies staging plans which create or modify resources in the prod account.
4) Testing: verify that a staging plan touches only staging resources, for example by inspecting `terraform show -json` output of the saved plan.
5) Monitoring: CloudTrail alerts on unexpected writes to the prod state bucket or prod APIs from the staging role.
6) Documentation: make clear to the team that remote state is for reading outputs only, never for modifying resources.
7) Output hygiene: signal intent in the prod code itself: `output "rds_endpoint" { value = aws_db_instance.main.endpoint sensitive = true description = "READ-ONLY: consumed by staging" }`.
8) Separate teams and accounts: prod state is managed by the platform team; the staging team can read it but not modify it. Trust but verify.
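The read-only grant from step 1 as a sketch; the bucket name, key, and the `aws_iam_role.staging` reference are assumptions:

```hcl
# Staging may list the bucket and read the prod state object, nothing else
data "aws_iam_policy_document" "read_prod_state" {
  statement {
    sid       = "ListStateBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::tf-state"]
  }

  statement {
    sid       = "ReadProdState"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::tf-state/prod/terraform.tfstate"]
  }
  # Deliberately no s3:PutObject / s3:DeleteObject: staging cannot
  # overwrite or remove prod state even by accident.
}

resource "aws_iam_role_policy" "staging_read_prod_state" {
  name   = "read-prod-state"
  role   = aws_iam_role.staging.id # assumed existing staging role
  policy = data.aws_iam_policy_document.read_prod_state.json
}
```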

Follow-up: How would you detect if unauthorized modifications to prod state were attempted?
