Terraform Interview Questions

Refactoring: Monolith to Modules


Your main.tf is 2000 lines. All infrastructure in one file: networking, compute, data, monitoring. Adding features is hard, testing is hard, team coordination is a mess. Time to modularize. How do you refactor without downtime?

Phased refactoring with state management:

1) Plan the refactoring. Identify logical boundaries: networking (VPC, subnets, gateways), compute (EC2, ASG, ALB), data (RDS, S3, DynamoDB), monitoring (CloudWatch, SNS).
2) Create the module structure: `mkdir -p modules/{networking,compute,data,monitoring}`. Each module gets its own main.tf, variables.tf, and outputs.tf.
3) Move code incrementally. For the networking module, move the VPC, subnet, and gateway HCL from main.tf into modules/networking/main.tf.
4) Update the root main.tf: replace the moved resource blocks with a module call: `module "networking" { source = "./modules/networking" vpc_cidr = var.vpc_cidr }`.
5) Migrate state before applying: `terraform state mv aws_vpc.main module.networking.aws_vpc.main` for each moved resource (on Terraform 1.1+, `moved` blocks can do this declaratively).
6) Verify: `terraform state list` now shows the `module.networking.*` addresses, and `terraform plan -refresh=false` shows zero changes.
7) Commit with history: each module move is a separate commit for easy rollback.
8) Validate: `terraform validate` passes. Deploy to non-prod first.
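The state migration in steps 4 and 5 can also be expressed declaratively. A minimal sketch, assuming Terraform 1.1+ and the resource addresses used above:

```hcl
# Root main.tf: call the extracted module, then record the address change.
module "networking" {
  source   = "./modules/networking"
  vpc_cidr = var.vpc_cidr
}

# A `moved` block (Terraform 1.1+) replaces manual `terraform state mv`:
# the next apply rewrites the state address without touching the VPC itself.
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}
```

One `moved` block per relocated resource; they can be deleted once every collaborator's state has been upgraded.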

Follow-up: How would you handle inter-resource dependencies when moving to modules?

During refactoring, a senior engineer points out that the networking module has 15 variables, many of them unused. The module's interface is unclear, and it's becoming a maintenance nightmare. How do you simplify and clarify?

Refactor module internals for clarity:

1) Audit variables: list all 15, then check which are actually referenced: `grep -rn "var\." modules/networking/` (don't grep only for `${var.}` interpolation; modern HCL mostly uses bare `var.x` references).
2) Remove unused variables. Commit with the reason: "Remove unused var X (never referenced in networking module)".
3) Group related variables: combine separate `vpc_cidr`, `vpc_name`, `vpc_tags` into a single object: `variable "vpc_config" { type = object({ cidr = string, name = string, tags = map(string) }) }`.
4) Add sensible defaults: `variable "availability_zones" { default = ["us-east-1a", "us-east-1b", "us-east-1c"] }` so callers aren't forced to specify everything.
5) Add validation: `variable "vpc_cidr" { validation { condition = can(cidrhost(var.vpc_cidr, 0)) error_message = "Must be a valid CIDR block." } }`.
6) Document: add clear comments: `# VPC configuration: defines network topology, availability zones, and routing`.
7) Create examples: `examples/simple.tfvars` shows minimal usage, `examples/advanced.tfvars` shows all options.
8) Test: ensure the module still works: `terraform plan` on the examples shows the expected resources.
9) Refactor callers: update the root module to the simplified interface: `module "network" { source = "./modules/networking" vpc_config = { cidr = "10.0.0.0/16", name = "main", tags = {} } }`.
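Steps 3 and 5 combine naturally into one declaration. A sketch of the grouped variable with validation (field names are illustrative):

```hcl
# modules/networking/variables.tf
variable "vpc_config" {
  description = "VPC configuration: network topology, naming, and tags"
  type = object({
    cidr = string
    name = string
    tags = map(string)
  })

  validation {
    condition     = can(cidrhost(var.vpc_config.cidr, 0))
    error_message = "vpc_config.cidr must be a valid CIDR block."
  }
}
```

Invalid input now fails at plan time with a clear message instead of surfacing as a confusing provider error mid-apply.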

Follow-up: How would you migrate existing callers to the simplified variable structure without breaking?

You've refactored 1000-line monolith into 5 modules. Deployment takes 2 hours because modules are applied sequentially (networking must finish before compute). Developers are frustrated by slow deploy cycle. Can this be parallelized?

Analyze and parallelize where possible:

1) Dependency analysis: draw the module dependency graph. Networking depends on nothing (foundational). Compute depends on Networking. Data depends on Networking. Monitoring depends on Compute and Data.
2) Derive stages: Stage 1: Networking alone. Stage 2 (parallel): Compute and Data simultaneously, since both depend on Networking but not on each other. Stage 3: Monitoring.
3) Implement in CI/CD: job 1 runs `cd terraform/networking && terraform apply`; after it succeeds, job 2 (`cd terraform/compute && terraform apply`) and job 3 (`cd terraform/data && terraform apply`) run simultaneously.
4) Use Make or a shell script: `make deploy` runs stages sequentially; within each stage, modules run in parallel.
5) Cross-module references: use the `terraform_remote_state` data source. The compute stack queries the networking state for the VPC ID: `data "terraform_remote_state" "network" { ... }`. Note that parallel applies require splitting into separate root configurations with separate state files, not just modules within one state.
6) Testing: validate each module independently first to catch errors early.
7) Monitor: log deployment times and measure the benefit. If Compute and Data dominate, expect roughly 40% faster (2 hours to ~1.2 hours), though the actual gain depends on the longest path through the dependency graph.
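The cross-stack lookup in step 5 might look like this, assuming an S3 backend and a `vpc_id` output published by the networking stack (bucket and key names are placeholders):

```hcl
# terraform/compute/main.tf: read the networking stack's outputs
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state" # placeholder bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

The compute stack only reads networking's published outputs, so the two states stay decoupled and can be applied by separate CI jobs.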

Follow-up: How would you detect if one module's apply is much slower than expected and blocking other modules?

After modularizing, the team discovers that 3 modules have significant code duplication: each defines IAM roles, VPC configuration, and tagging logic independently. You decide to extract shared sub-modules to reduce the duplication. Design the shared module pattern.

Create foundation modules and a composition pattern:

1) Extract shared logic into a `modules/foundation/` subdirectory.
2) Create sub-modules: `modules/foundation/iam-roles/`, `modules/foundation/vpc-base/`, `modules/foundation/tagging/`.
3) Each handles one concern: iam-roles creates common role shapes (compute-role, data-role, lambda-role); vpc-base creates a VPC with standard configuration; tagging applies consistent tags.
4) Update composition modules: `modules/compute/` now calls `module "compute_role" { source = "../foundation/iam-roles" role_type = "compute" }` instead of defining roles inline.
5) Shared tagging: `module "tagging" { source = "../foundation/tagging" resource_name = "compute-${var.environment}" owner = var.owner }` applies tags consistently.
6) Versioning: treat foundation modules as a shared library. Bump the version when they change so consumers update explicitly.
7) Documentation: show the team what the foundation modules provide, with examples in `foundation/examples/`.
8) Testing: test foundation modules independently; all composition modules inherit the tested behavior.
9) Governance: require code review for foundation module changes, since they impact every consumer.
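The tagging sub-module from step 5 can be as small as a few variables and one output. A minimal sketch (variable and tag names are assumptions):

```hcl
# modules/foundation/tagging/main.tf
variable "resource_name" { type = string }
variable "owner"         { type = string }
variable "environment"   { type = string }

output "tags" {
  value = {
    Name        = var.resource_name
    Owner       = var.owner
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

Consumers then pass `tags = module.tagging.tags` to their resources, so a change to the tag policy lands in exactly one place.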

Follow-up: How would you handle different consumers needing slightly different variations of shared foundation modules?

You're refactoring monolith to modules. A critical resource (production database) exists in state but refactoring requires moving it from `aws_db_instance.main` to `module.data.aws_db_instance.main`. You can't afford downtime. How do you do this safely?

Use state mv for zero-downtime refactoring:

1) Plan carefully: rehearse the refactoring in non-prod first. Run a full test cycle.
2) Back up state: `terraform state pull > prod-backup.tfstate`. Keep this for quick rollback.
3) Move the `aws_db_instance` resource block (and its dependencies) from the root main.tf into modules/data/main.tf.
4) In the root main.tf, replace the resource block with `module "data" { source = "./modules/data" }`.
5) Before applying, migrate state: `terraform state mv aws_db_instance.main module.data.aws_db_instance.main`.
6) Validate: `terraform state show module.data.aws_db_instance.main` shows the RDS instance, and `terraform plan -refresh=false` shows zero changes.
7) Apply: `terraform apply` is a no-op because state already reflects the module structure. The database itself is never touched.
8) Verify: connect to RDS and confirm it's still serving traffic. No downtime.
9) Rollback plan: if something goes wrong, `terraform state push prod-backup.tfstate` reverts state immediately (restore the old HCL as well).
10) Commit: document the state mv command and why it was needed, so the next team won't repeat this.
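While the database resource sits in its new home, a `lifecycle` guard is cheap insurance: any plan that would destroy or replace it fails loudly. A sketch with most required `aws_db_instance` arguments elided:

```hcl
# modules/data/main.tf (sketch; engine, storage, and credential
# arguments elided for brevity)
resource "aws_db_instance" "main" {
  identifier     = var.identifier
  instance_class = var.instance_class

  lifecycle {
    # Fail any plan that would delete or replace the production database.
    prevent_destroy = true
  }
}
```

If a refactoring mistake ever produces a destroy-and-recreate plan, Terraform aborts at plan time instead of taking the database down.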

Follow-up: How would you coordinate this with application teams that depend on the database?

Monolith is now 5 modules. Different teams own different modules: Platform team owns networking, DevOps team owns compute, Data team owns data. Coordinating changes is difficult. Design governance for module ownership.

Implement module governance and ownership:

1) Document ownership: create `OWNERS.md` mapping module to team: Networking -> Platform, Compute -> DevOps, Data -> Data, etc.
2) Code review requirements: PRs changing a module require approval from the owning team. Use a GitHub CODEOWNERS file: `modules/networking/ @platform-team`.
3) API contracts: treat module inputs/outputs as contracts. The networking module's `output "vpc_id"` is a contract that DevOps depends on; deprecate outputs deliberately.
4) Breaking changes: require a 2-week deprecation notice. The old output stays for 2 weeks, then is removed, giving consumers time to update.
5) Cross-team PRs: when DevOps needs a networking change, DevOps opens the PR and Platform reviews before merge.
6) Testing: each team owns the tests for its modules. Platform tests networking; DevOps tests compute.
7) Integration tests: a separate team owns the root module and integration tests.
8) Versioning: tag each module change as a release (`v1.2.0`) so consumers upgrade at their own pace.
9) SLA: Platform commits to a support SLA for the networking module: PRs reviewed within 24 hours.
10) Documentation: each team documents its module's status, roadmap, and known issues.
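Step 8's versioning works by pinning module sources to release tags, so a consumer only picks up a new version when it deliberately edits the `ref`. A sketch with a placeholder repository URL:

```hcl
# DevOps root module: pinned to the Platform team's v1.2.0 release
module "networking" {
  source = "git::https://example.com/platform/terraform-modules.git//networking?ref=v1.2.0"

  vpc_cidr = "10.0.0.0/16"
}
```

Upgrading is then an explicit, reviewable diff (`?ref=v1.2.0` to `?ref=v1.3.0`) rather than an accidental side effect of someone else's merge.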

Follow-up: How would you handle disputes between teams about module responsibility?

Post-refactoring, you want to prevent regression: someone copying-pasting code from main.tf back into monolithic structure. How do you enforce modular architecture?

Implement architecture enforcement:

1) Pre-commit hook: scan main.tf for resource blocks that belong in modules. If `aws_db_instance` appears in the root main.tf (it belongs in the data module), reject the commit: `grep "resource \"aws_db_instance\"" main.tf && exit 1`.
2) CI check: policy as code via a custom script or policy tool. The root main.tf should contain only module calls, not resource definitions (rare exceptions aside).
3) Terraform validate: catches HCL errors.
4) Terraform fmt: enforces style, catches obvious mistakes.
5) Code review: reviewers watch for monolithic patterns.
6) Linting: tflint's terraform ruleset enforces standard module layout; a stricter "root module only calls modules" check needs a custom ruleset or script.
7) Documentation: README shows approved patterns. GOOD: `module "data" { source = "..." }` in root. BAD: `resource "aws_db_instance" { ... }` in root.
8) Team training: onboard new engineers on the modular approach, with refactoring examples.
9) Incident postmortem: if a regression happens (code copied back into the monolith), review what allowed it and fix the process.
10) Automated remediation: when a regression is detected, open a PR that moves the code back into its module.
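The lint gate in step 6 might start from tflint's bundled terraform ruleset; `terraform_standard_module_structure` enforces the main.tf/variables.tf/outputs.tf layout (rule and preset names per the tflint terraform ruleset; stricter root-module policy would be custom). A sketch of `.tflint.hcl`:

```hcl
# .tflint.hcl
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

rule "terraform_standard_module_structure" {
  enabled = true
}
```

Running `tflint --recursive` in CI then flags modules that drift from the agreed layout before a human ever reviews the PR.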

Follow-up: How would you handle legitimate exceptions where resources must stay in root module?
