Terraform Interview Questions

Testing with Terratest and Validation

You're building Terraform modules for a platform. A team member's change to the networking module causes security group rules to break in production. Testing found it too late. Design a testing framework that catches this before deploy.

Implement multi-layer testing with Terratest: 1) Unit tests: `terraform validate` ensures syntax and internal consistency. Add to every PR. 2) Lint: `terraform fmt -check` enforces style. 3) Security scanning: `tfsec` flags insecure configurations: open security groups, unencrypted storage, etc. 4) Terratest unit tests: test each module in isolation. Example: `func TestNetworkingModule(t *testing.T) { opts := &terraform.Options{TerraformDir: "../modules/networking"}; defer terraform.Destroy(t, opts); terraform.InitAndApply(t, opts); vpcID := terraform.Output(t, opts, "vpc_id"); assert.NotEmpty(t, vpcID) }`. 5) Terratest integration tests: deploy the full stack and verify components work together. 6) Infrastructure tests: after apply, query the AWS API to verify resources match intent, e.g. fetch the security group via the AWS SDK and `assert.Equal(t, expectedRules, group.IpPermissions)`. 7) CI/CD: run all tests on every PR and block merge if any fail. 8) Document: show the team how to add tests when adding features.
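Step 6 above, verifying applied resources against intent, can be sketched without touching AWS by comparing an expected rule set to a decoded API-style response. Everything here (the `ipPermission` struct, the embedded JSON) is a hypothetical stand-in for a real `DescribeSecurityGroups` payload:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// ipPermission mirrors the subset of an EC2 DescribeSecurityGroups
// response we care about (field names follow the AWS API).
type ipPermission struct {
	IpProtocol string `json:"IpProtocol"`
	FromPort   int    `json:"FromPort"`
	ToPort     int    `json:"ToPort"`
}

// key renders a rule in a canonical, comparable form.
func key(p ipPermission) string {
	return fmt.Sprintf("%s:%d-%d", p.IpProtocol, p.FromPort, p.ToPort)
}

// rulesMatch compares rule sets order-insensitively, so a reordered
// API response does not produce a false failure.
func rulesMatch(expected, actual []ipPermission) bool {
	if len(expected) != len(actual) {
		return false
	}
	var e, a []string
	for _, p := range expected {
		e = append(e, key(p))
	}
	for _, p := range actual {
		a = append(a, key(p))
	}
	sort.Strings(e)
	sort.Strings(a)
	for i := range e {
		if e[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	// In a real test this JSON would come from the AWS API after
	// terraform apply; here it is an embedded sample.
	raw := `[{"IpProtocol":"tcp","FromPort":443,"ToPort":443},
	         {"IpProtocol":"tcp","FromPort":80,"ToPort":80}]`
	var actual []ipPermission
	if err := json.Unmarshal([]byte(raw), &actual); err != nil {
		panic(err)
	}
	expected := []ipPermission{
		{IpProtocol: "tcp", FromPort: 80, ToPort: 80},
		{IpProtocol: "tcp", FromPort: 443, ToPort: 443},
	}
	fmt.Println("rules match:", rulesMatch(expected, actual))
}
```

The order-insensitive comparison matters because AWS does not guarantee rule ordering in its responses.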

Follow-up: How would you structure tests for modules used across 50 different root modules?

Your Terratest suite deploys real AWS resources for testing. This takes 20 minutes per test, costs money, and is fragile (AWS API rate limits). How do you make testing faster?

Implement a layered testing strategy: 1) Fast syntactic tests (10 seconds): `terraform validate`, `terraform fmt`, `tfsec`. Run on every commit. 2) Medium unit tests (1 minute): mock AWS with LocalStack, or use `mock_provider` blocks in `terraform test` (Terraform 1.7+), to exercise HCL logic without real AWS calls. 3) Selective integration tests (20 minutes): run full AWS deployments only for changed modules; skip unchanged ones. Use test flags: `go test -run TestNetworkingModule ./test` to run a specific test. 4) Nightly full suite: run all tests against real AWS in a staging account. 5) Cost optimization: reuse deployed resources across tests instead of creating fresh ones each time. 6) Use cheaper, shorter-lived resources in tests: spot instances instead of on-demand, single-AZ dev-tier RDS instead of multi-AZ. 7) Parallel testing: run multiple tests simultaneously with `t.Parallel()` in Go, or across machines. 8) Cache results: if a test passed, skip redeploying an identical config. 9) Document which tests are required before merge (fast ones) vs. nightly (slow ones).
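The result-caching idea in step 8 can be sketched as a content hash over a module's files: if the hash matches the one recorded at the last green run, skip the slow redeploy. The `moduleHash`/`shouldRetest` helpers and the in-memory file map are illustrative, not a real pipeline:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// moduleHash fingerprints a module from its file contents. In a real
// pipeline the map would be built by walking the module directory;
// here the files are passed in directly.
func moduleHash(files map[string]string) string {
	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic order regardless of map iteration
	h := sha256.New()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte(files[n]))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// shouldRetest reports whether a module's integration tests need to
// run, given the hash recorded at the last green run.
func shouldRetest(lastGreen string, files map[string]string) bool {
	return moduleHash(files) != lastGreen
}

func main() {
	files := map[string]string{"main.tf": `resource "aws_vpc" "main" {}`}
	green := moduleHash(files)
	fmt.Println("unchanged module retests:", shouldRetest(green, files))
	files["main.tf"] += "\n# edited"
	fmt.Println("edited module retests:", shouldRetest(green, files))
}
```

Note the hash covers file names as well as contents, so renaming a file also invalidates the cache.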

Follow-up: How would you handle test resource cleanup if tests fail mid-way?

You want to validate that a Terraform module correctly provisions infrastructure without actually creating AWS resources. For example, verify security group rules are correctly set before deploying. Design validation without AWS.

Use plan-based testing and static analysis: 1) Inspect the plan without applying: `terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json`, then `jq '.resource_changes[]' plan.json` to see what would be created. 2) Write validation scripts against the plan: parse the JSON and verify security groups have the expected rules: `jq '.resource_changes[] | select(.type=="aws_security_group_rule") | .change.after' plan.json | verify_rules.py`. 3) Use `terraform test` (1.6+): write `.tftest.hcl` files that assert on plan results: `run "sg_rules" { command = plan  assert { condition = aws_security_group_rule.ingress.from_port == 443  error_message = "ingress must allow 443" } }`. 4) Policy as code: Sentinel policies validate the plan before apply, e.g. a `check_sg_rules` policy whose rule asserts every `aws_security_group_rule` has a `from_port` between 0 and 65535. 5) Schema validation: `terraform validate` checks resource attributes against provider schemas. 6) Combined, validation and static analysis catch most issues without AWS.
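The plan inspection in step 2 can also be done in Go instead of `jq`. This sketch parses an embedded sample shaped like `terraform show -json` output and extracts the planned security-group ports; the sample JSON is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planJSON models the slice of `terraform show -json plan.tfplan`
// output we inspect: resource_changes with the planned "after" state.
type planJSON struct {
	ResourceChanges []struct {
		Type   string `json:"type"`
		Change struct {
			After map[string]any `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// plannedPorts collects from_port values for every planned
// aws_security_group_rule, without touching AWS.
func plannedPorts(raw []byte) ([]float64, error) {
	var p planJSON
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	var ports []float64
	for _, rc := range p.ResourceChanges {
		if rc.Type != "aws_security_group_rule" {
			continue
		}
		// JSON numbers decode as float64 in a map[string]any.
		if fp, ok := rc.Change.After["from_port"].(float64); ok {
			ports = append(ports, fp)
		}
	}
	return ports, nil
}

func main() {
	// Embedded sample standing in for a real plan file.
	raw := []byte(`{"resource_changes":[
	  {"type":"aws_security_group_rule","change":{"after":{"from_port":443,"to_port":443}}},
	  {"type":"aws_instance","change":{"after":{"ami":"ami-123"}}}]}`)
	ports, err := plannedPorts(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("planned ingress ports:", ports)
}
```

A CI step would run this against the real `plan.json` and fail the build if an unexpected port appears.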

Follow-up: How would you test modules that depend on external data sources (e.g., reading current VPC)?

You have 20 Terratest tests. Three are flaky: pass sometimes, fail sometimes due to AWS API inconsistencies or timing issues. How do you make tests reliable?

Implement robust test patterns: 1) Add retry logic: `terraform.Apply(t, options)` might fail on a rate limit. Wrap it: `for attempts := 0; attempts < 3; attempts++ { if _, err := terraform.ApplyE(t, options); err == nil { break }; time.Sleep(30 * time.Second) }` (the `E` variant returns the error instead of failing the test immediately). 2) Add wait conditions: after creating an RDS instance, wait for its status to become `available` using an AWS SDK waiter before asserting on it. 3) Isolate tests: give each test unique resource names, e.g. `fmt.Sprintf("test-sg-%s", random.UniqueId())` with Terratest's `random` module, to prevent conflicts. 4) Mock external dependencies: if a test calls an external API, mock the response, e.g. with `github.com/jarcoal/httpmock`. 5) Increase timeouts: AWS operations take time; set generous ones. 6) Clean up explicitly: `defer terraform.Destroy(t, options)` ensures cleanup even on test failure. Add a post-destroy check that lists remaining test resources and fails loudly if any are left behind. 7) Log extensively: print API calls, responses, and timing to make debugging easier. 8) Reproduce in isolation: run a flaky test 10 times in a row to reproduce the failure, then debug.

Follow-up: How would you integrate flaky test detection into CI/CD to prevent false negatives?

Your Terratest suite is comprehensive but takes 2 hours to run. CI/CD gates all PRs behind this test. Developers wait too long for feedback. How do you optimize test throughput?

Implement a staged test pipeline: 1) Stage 1 - Pre-commit (30 seconds): `terraform validate`, `fmt`, and `tfsec` run locally before push, blocking bad commits immediately. 2) Stage 2 - Fast tests (5 minutes): validate, lint, and policy checks run in CI; block merge on failure. 3) Stage 3 - Selective unit tests (15 minutes): run only tests for changed modules; skip the rest. Use `git diff --name-only origin/main -- modules/` to detect changes. 4) Stage 4 - Full integration tests (2 hours): run nightly on the main branch, not on every PR. 5) Parallel execution: split tests across multiple CI runners; `go test -parallel 4 ./test`. 6) Cache: store built artifacts and the Terraform plugin cache to skip re-downloading providers. 7) Smaller test data: use minimal resources in tests (t3.micro instead of t3.large). 8) Feature flags: put complex tests behind a toggle and run them only when the relevant module changed. 9) Document: show developers which tests run on PR (fast) vs. on main (slow).
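Stage 3's change detection can be sketched as a mapping from `git diff` output paths to module test targets. The `modules/<name>/` layout and the `Test<Name>Module` naming convention are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// changedModules maps `git diff --name-only origin/main` output to the
// set of module names whose tests should run.
func changedModules(paths []string) []string {
	seen := map[string]bool{}
	for _, p := range paths {
		parts := strings.Split(p, "/")
		if len(parts) >= 2 && parts[0] == "modules" {
			seen[parts[1]] = true
		}
	}
	mods := make([]string, 0, len(seen))
	for m := range seen {
		mods = append(mods, m)
	}
	sort.Strings(mods)
	return mods
}

// testTarget derives a `go test -run` pattern from a module name,
// assuming a Test<Name>Module naming convention.
func testTarget(module string) string {
	return "Test" + strings.ToUpper(module[:1]) + module[1:] + "Module"
}

func main() {
	diff := []string{
		"modules/networking/main.tf",
		"modules/networking/variables.tf",
		"modules/iam/policies.tf",
		"README.md", // not under modules/ -- triggers nothing
	}
	for _, m := range changedModules(diff) {
		fmt.Printf("go test -run %s ./test\n", testTarget(m))
	}
}
```

In CI the `diff` slice would come from running the git command and splitting its output on newlines.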

Follow-up: How would you prevent test optimizations from missing regressions?

A team member adds complex HCL logic with nested conditionals and for_each. Testing shows it works, but a later change breaks it. You need to validate HCL logic is correct without deploying. Design HCL validation strategy.

Use static HCL analysis and dry-run validation: 1) Syntax validation: `terraform validate` catches syntax errors; run it on every change. 2) Semantic validation: `terraform plan -refresh=false` against local state catches logic errors without AWS API calls. 3) Use `terraform console` interactively to test complex expressions, e.g. `{ for k, v in var.services : k => v.memory }` to verify `for_each` logic. 4) Extract complex logic into named `locals` so it can be evaluated in `terraform console` and asserted on with `terraform test`, instead of being buried inline. 5) Document assumptions: complex logic should have comments explaining intent. 6) Code review: peer-review complex HCL. 7) Use `tflint` to catch common mistakes: unused variables, invalid resource arguments, deprecated syntax. 8) Implement custom checks: write Go scripts that parse the HCL AST and verify project rules such as "every security group rule has a description". 9) Test on staging: deploy to a staging environment and verify behavior matches expectations before prod.
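Step 8's custom checks can be approximated with a naive text scan; a production tool should parse HCL properly (e.g. with `github.com/hashicorp/hcl/v2`) rather than pattern-match. Under that caveat, this sketch flags `aws_security_group_rule` blocks lacking a `description`:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// checkDescriptions returns the names of aws_security_group_rule
// resources whose block body has no description attribute. It locates
// block headers with a regex and finds each body with a brace count --
// a crude stand-in for a real HCL parser.
func checkDescriptions(src string) []string {
	header := regexp.MustCompile(`resource\s+"aws_security_group_rule"\s+"([^"]+)"\s*\{`)
	var missing []string
	for _, loc := range header.FindAllStringSubmatchIndex(src, -1) {
		name := src[loc[2]:loc[3]]
		// Scan forward to the matching closing brace.
		depth, i := 1, loc[1]
		for i < len(src) && depth > 0 {
			switch src[i] {
			case '{':
				depth++
			case '}':
				depth--
			}
			i++
		}
		body := src[loc[1]:i]
		if !strings.Contains(body, "description") {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	src := `
resource "aws_security_group_rule" "http" {
  description = "allow http"
  from_port   = 80
}
resource "aws_security_group_rule" "https" {
  from_port = 443
}
`
	fmt.Println("rules missing description:", checkDescriptions(src))
}
```

The brace-count scan breaks on braces inside strings or heredocs, which is exactly why the real tool should use an HCL parser.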

Follow-up: How would you automate detection of logic bugs in complex HCL?

You want to validate that a module follows company standards: security groups must be explicitly defined (not default), all resources must be tagged, IAM policies must follow principle of least privilege. Design automated compliance validation.

Use policy-as-code frameworks: 1) Sentinel (Terraform Cloud/Enterprise): a `require_tags` policy whose rule asserts every resource's `tags` include `Environment` and `Owner`. 2) Custom rules: a `require_explicit_sg` policy asserting every `aws_security_group` sets `vpc_id` explicitly rather than falling back to the default VPC. 3) OPA (Open Policy Agent) with Rego: `deny[msg] { resource := input.resource_changes[_]; resource.type == "aws_security_group"; not resource.change.after.ingress; msg := sprintf("SG %s missing ingress rules", [resource.name]) }`. 4) Checkov: `checkov -f main.tf --check CKV_AWS_21` runs a specific built-in check. 5) For custom standards: write a Go validation tool that parses the plan JSON and verifies tags exist, security groups are explicit, and IAM policies are scoped. 6) Integrate into CI: run `checkov` on every PR and block merge if violations are found. 7) Enforce with Sentinel in Terraform Cloud: hard failure on policy violation. 8) Document standards: show the team which rules are checked and why they matter.
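The custom Go validation tool from step 5 might look like this sketch: it parses plan JSON (the embedded sample is hypothetical) and reports resources missing required tags, mirroring an OPA deny rule:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requiredTags lists the tags company standards demand on every resource.
var requiredTags = []string{"Environment", "Owner"}

type resourceChange struct {
	Address string `json:"address"`
	Change  struct {
		After struct {
			Tags map[string]string `json:"tags"`
		} `json:"after"`
	} `json:"change"`
}

// violations returns one message per resource missing a required tag,
// in the spirit of an OPA deny rule but as a plain Go check over the
// `terraform show -json` plan representation.
func violations(raw []byte) ([]string, error) {
	var p struct {
		ResourceChanges []resourceChange `json:"resource_changes"`
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	var msgs []string
	for _, rc := range p.ResourceChanges {
		for _, tag := range requiredTags {
			if _, ok := rc.Change.After.Tags[tag]; !ok {
				msgs = append(msgs, fmt.Sprintf("%s missing tag %q", rc.Address, tag))
			}
		}
	}
	return msgs, nil
}

func main() {
	raw := []byte(`{"resource_changes":[
	  {"address":"aws_instance.web","change":{"after":{"tags":{"Environment":"prod","Owner":"platform"}}}},
	  {"address":"aws_s3_bucket.logs","change":{"after":{"tags":{"Environment":"prod"}}}}]}`)
	msgs, err := violations(raw)
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println("DENY:", m)
	}
}
```

A CI wrapper would exit non-zero when `violations` returns anything, blocking the merge.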

Follow-up: How would you handle exceptions to compliance rules (e.g., temporary elevated IAM permissions)?

You're testing a module that interacts with existing infrastructure (imports state of VPC created outside Terraform). Tests can't replicate existing state consistently. How do you test modules with external dependencies?

Use data sources for external dependencies: 1) The module should look up the existing VPC via a data source: `data "aws_vpc" "main" { id = var.vpc_id }` instead of creating one. 2) Test setup: create the VPC once (outside the Terraform test), and reuse it across tests. 3) Mock the data source: in `terraform test` (Terraform 1.7+), a `mock_provider "aws"` block with an `override_data` block can return a canned VPC such as `vpc-123` without any AWS calls. 4) Integration tests: point at a real VPC, e.g. `vpc_id = "vpc-prod-main"` in the test's `terraform.tfvars`. 5) For temporary test VPCs: have Terratest create a VPC before the test, pass its ID to the module, and clean up after. 6) Isolate: each test uses its own VPC to prevent conflicts. 7) Document that the module expects the VPC to already exist and takes its ID as input. 8) Add validation: verify the lookup succeeded with a postcondition (Terraform 1.2+): `data "aws_vpc" "main" { id = var.vpc_id  lifecycle { postcondition { condition = self.id != ""  error_message = "VPC not found" } } }`.
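The create-before/destroy-after pattern from step 5 boils down to a fixture wrapper that guarantees teardown even when the test body fails, the same guarantee `defer terraform.Destroy` gives in a real Terratest test. The `create`/`destroy` fakes here stand in for real AWS calls:

```go
package main

import "fmt"

// withTestVPC provisions an external dependency, runs the test body,
// and guarantees teardown even if the body panics -- deferred calls
// run during panic unwinding.
func withTestVPC(create func() string, destroy func(string), body func(vpcID string)) {
	id := create()
	defer destroy(id)
	body(id)
}

func main() {
	// Fakes standing in for real AWS provisioning calls.
	create := func() string { fmt.Println("create vpc"); return "vpc-test-123" }
	destroy := func(id string) { fmt.Println("destroy", id) }

	func() {
		defer func() { recover() }() // swallow the simulated test failure
		withTestVPC(create, destroy, func(id string) {
			fmt.Println("testing with", id)
			panic("assertion failed mid-test")
		})
	}()
	fmt.Println("cleanup ran despite failure")
}
```

Because `destroy` is deferred inside `withTestVPC`, it fires before the panic propagates out, so the fixture is cleaned up whether the body returns normally or not.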

Follow-up: How would you test a module that depends on 3 different existing resources?
