AWS Interview Questions

CloudFormation vs CDK vs Terraform


Your organization has 200 AWS services across prod/staging/dev environments. You currently use CloudFormation templates (YAML) maintained in Git. Templates total 50K lines with 100+ stack interdependencies, and change velocity is high (10 deployments/week). Developers complain: "YAML is hard to debug, we want loops/functions." You're considering CDK or Terraform. Justify your choice to the infrastructure team.

Decision framework for a 200-service org: (1) CloudFormation YAML status quo: Pros: (a) AWS-native, no additional tools, integrated with the AWS Console. (b) 50K lines is manageable with the nested-stacks pattern (split into logical groups). Cons: YAML is declarative but brittle (indentation errors, limited validation before deployment), has no functions or loops (teams resort to Jinja2 pre-processor hacks), and change velocity slows as templates are rewritten by hand. (2) CDK evaluation: (a) Abstraction layer: write infra in Python/TypeScript, which synthesizes to CloudFormation templates. (b) Pros: a full programming language (loops, functions, classes), better IDE support, reusable constructs (L3 components); 200 services can collapse into 30-50 reusable constructs. (c) Cons: AWS-only (not portable), adds a synth/compile step to each deploy, learning curve for the ops team. (d) Fit: excellent for AWS-only shops, fits the current CloudFormation constraints, accelerates velocity. (3) Terraform evaluation: (a) Multi-cloud: Terraform works on AWS, GCP, Azure; future-proof if the company pivots. (b) Pros: strong typing in HCL, more readable than YAML, powerful module system (the Terraform Registry hosts 10K+ community modules). (c) Cons: state file management (must be stored in a remote backend, adding complexity), not AWS-native (learning curve), and team workflows are easiest on paid Terraform Cloud/Enterprise tiers (roughly $200+/month at this scale). (d) Fit: better for multi-cloud or smaller teams (10-20 people); for a 200-service org, state complexity grows quickly. (4) Recommendation: CDK. Reasoning: (a) Lowest migration friction: CDK synthesizes to CloudFormation, so existing templates can be replaced gradually. (b) Language power: loops and functions solve the YAML pain points in week 1. (c) AWS-optimized: no state file to manage (CloudFormation tracks state server-side). (d) Velocity: 100 devs writing CDK constructs can reuse them instantly (Terraform modules have a steeper adoption curve). (e) Cost: $0 (CDK is open source) vs Terraform Enterprise at ~$2.4K/year. (5) Migration plan: (a) Weeks 1-2: convert 5 core stacks to CDK (EC2, RDS, VPC); document patterns. (b) Weeks 3-8: convert the remaining stacks; 200 services map to ~40 stacks, roughly 10 stacks per week with the team. (c) Week 9+: ops maintains the CDK codebase (no more YAML drudgery). Timeline: 8-10 weeks. (6) Alternative: if a firm deadline prevents an 8-week migration, use CDK for new services and maintain CloudFormation for existing ones (hybrid for 12-18 months). More expensive but pragmatic.
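The loops and functions developers are asking for look like this on the Terraform side; a minimal sketch, assuming hypothetical service names and log-group paths (the same effect is achieved in CDK with a plain Python/TypeScript loop):

```hcl
# Hypothetical example: one resource block fans out over a list of
# services, replacing N copy-pasted YAML resource definitions.
variable "services" {
  type    = list(string)
  default = ["billing", "auth", "search"]
}

resource "aws_cloudwatch_log_group" "service_logs" {
  for_each          = toset(var.services)
  name              = "/app/${each.key}"   # illustrative naming convention
  retention_in_days = 30
}
```

Adding a service becomes a one-line change to the list rather than a new template block, which is the velocity gain the complaint is really about.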

Follow-up: You chose CDK but half your team only knows YAML/ops, not Python. Training takes 6 weeks. In the meantime, do you block deployments or use hybrid CDK+CloudFormation?

Your team uses Terraform to manage 150 resources across 3 AWS accounts (prod, staging, dev). Terraform state is stored in S3 (backend). Last week, a developer accidentally ran `terraform destroy` against the prod state before applying changes. All infrastructure was marked for deletion (not actually deleted, but state was corrupted). You recovered from backups but spent 8 hours fixing state. How do you prevent this in a team of 12 developers?

Prevent state corruption via Terraform Cloud/Enterprise + access controls: (1) Root cause: the state file was mutable by any developer with S3 access; no approval workflow, no immutable history. (2) Solution #1: Terraform Cloud (HashiCorp's managed offering): (a) Centralized state management (Terraform-managed, not DIY S3). (b) Locking: automatic state locks during apply (prevents concurrent applies). (c) VCS integration: every Terraform change requires a GitHub PR; changes are applied only after PR approval. (d) Cost: roughly $200-500/month per team at this size. Prevents future incidents. (3) Solution #2: DIY S3 backend with guards (cheaper): (a) Enable S3 versioning on the state bucket (recover any corrupted state in seconds). (b) Enable S3 MFA Delete (requires MFA to permanently delete object versions, blocking accidents). (c) Add a bucket policy denying DeleteObject except for a specific IAM role (s3-state-admin); developers cannot delete. (d) State locking: use a DynamoDB lock table for Terraform state locks (prevents concurrent applies). (e) Cost: S3 versioning = $0.023/GB-month (cheap), DynamoDB lock table = ~$0.25/month. (4) Operational controls: (a) Code review: all Terraform changes go through GitHub PRs; a lead reviews before approval. (b) Plan review: require `terraform plan` output in the PR description so reviewers see the impact before apply. (c) Approval workflow: only the deploy bot (CI/CD) can apply; no human runs `terraform apply` locally. (d) Audit trail: CloudTrail logs all S3 state file access (read/write/delete); a weekly audit alerts on any unexpected changes. (5) Implementation: 1 week (S3 policy, DynamoDB lock setup, GitHub Actions CI/CD integration, docs). (6) Training: a 1-hour team session: "Never run terraform destroy locally, only in CI/CD." Reinforce: "If you accidentally corrupt state, restore the previous state version via S3 versioning and clear any stale lock." Result: prevents future incidents and enables safe team deployments.
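Solution #2 can be sketched in HCL; the bucket and table names below are assumptions, not real infrastructure:

```hcl
# Remote backend with locking: the versioned S3 bucket holds the state,
# and a DynamoDB table provides the lock so two applies cannot run
# concurrently. Bucket/table names are hypothetical. The lock table
# needs a string partition key named "LockID".
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

With versioning enabled on the bucket, every apply leaves a recoverable prior state object, which is exactly the rollback path the training message describes.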

Follow-up: Terraform state file is now locked by a CI/CD run that crashed. The lock isn't released (DynamoDB entry stuck). Developers are blocked. How do you recover?

You're comparing CloudFormation StackSets vs Terraform + CI/CD for multi-account deployments. Your company has 50 AWS accounts (customers, each a separate account). You need to deploy an SNS topic + Lambda to all 50 accounts in <5 minutes. Which tool and why?

CloudFormation StackSets is purpose-built for multi-account deployments; use it. Comparison: (1) CloudFormation StackSets: (a) Native multi-account deployment orchestration: deploy a single template to 50 accounts across regions in parallel. (b) Operation: create the StackSet in a central admin account and specify target accounts (or OUs); StackSets deploys to each automatically. (c) Deployment time: 3-5 minutes for 50 accounts (parallel). (d) Pros: simple, AWS-native, built-in failure handling (configurable tolerance; halt when an account fails). (e) Cons: limited to resources CloudFormation supports (not arbitrary third-party tooling), subject to per-account stack-instance quotas. (f) Cost: free (CloudFormation itself has no charge for AWS resource types; you pay only for the provisioned resources). (2) Terraform + CI/CD: (a) Approach: store Terraform configs in Git; a CI/CD pipeline loops over the 50 accounts, applying sequentially or in parallel. (b) Deployment time: 15-30 minutes (sequential) or 5-10 minutes (10-way parallel, requiring 10 CI/CD worker agents). (c) Pros: portable across clouds, fine-grained control, explicit state management. (d) Cons: requires CI/CD infrastructure (Jenkins or GitHub Actions; 10 workers = cost), one state file per account (50 state files = management overhead), rollback is manual (terraform destroy per account). (e) Cost: $500-2K/month for CI/CD workers and state backends. (3) Verdict for 50 accounts: StackSets. Rationale: (a) 3-5 min deployment is native (no custom CI/CD). (b) 50 accounts is exactly the high-cardinality use case StackSets was built for. (c) Cost is ~$0 (vs $1K+/month for Terraform CI/CD). (d) Change management: when an account fails to deploy, the StackSet can halt automatically (Terraform requires manual intervention). (e) Rollback: re-deploying a previous template version updates every instance in one operation, vs per-account terraform destroy/apply. (4) Implementation: 2 weeks (design the StackSet template, test with 5 accounts, then deploy to all 50; create a runbook for operators). (5) Note: if you need infrastructure-as-code across GCP/Azure too, Terraform is unavoidable. But for AWS-only, StackSets is faster and cheaper.
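One bridge worth knowing: teams already standardized on Terraform can still get StackSets' fan-out by declaring the StackSet itself in HCL via the AWS provider. A hedged sketch, assuming a service-managed StackSet run from the org management account; the template file and OU ID are placeholders:

```hcl
# Sketch: one StackSet deploying the SNS+Lambda template to every
# account in a target OU. "sns_lambda.yaml" and the OU ID are
# hypothetical placeholders.
resource "aws_cloudformation_stack_set" "sns_lambda" {
  name             = "sns-lambda-fanout"
  permission_model = "SERVICE_MANAGED"

  auto_deployment {
    enabled = true   # new accounts joining the OU get the stack too
  }

  template_body = file("${path.module}/sns_lambda.yaml")
}

resource "aws_cloudformation_stack_set_instance" "all_accounts" {
  stack_set_name = aws_cloudformation_stack_set.sns_lambda.name
  region         = "us-east-1"

  deployment_targets {
    organizational_unit_ids = ["ou-xxxx-example"]
  }
}
```

This keeps StackSets' parallel deployment and failure tolerance while leaving Git/PR review in the existing Terraform workflow.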

Follow-up: StackSet deployment to 50 accounts partially fails on account #23 (permissions issue). Other 49 accounts have SNS topic deployed but account #23 is stuck in CREATE_IN_PROGRESS. How do you fix without rolling back the other 49?

Your company has a hybrid cloud strategy: AWS (primary) + GCP (backup). You need to deploy identical infrastructure (databases, APIs, storage) to both clouds. You're torn between Terraform (portable) and CloudFormation (AWS-optimized). Your team knows CloudFormation well but not HCL. Time-to-deploy is critical. What's your recommendation?

Use Terraform for portability, but design around your team's expertise. Detailed analysis: (1) Terraform multi-cloud capability: (a) One language and one workflow for both clouds. (b) Providers: the AWS and Google providers are configured side by side in the same codebase. (c) Resources remain cloud-specific, so you still rewrite resource definitions for their GCP equivalents; what you share is tooling, modules, and review workflow, not resource schemas. (d) Cons: Terraform learning curve (HCL, state management, modules) with no existing team expertise (6+ weeks of training). (2) CloudFormation + Terraform hybrid: (a) AWS: use CloudFormation (team expertise, fast). (b) GCP: use Terraform (the de facto choice for GCP IaC). (c) Cons: two tools, no portability, organizational complexity (ops maintains both). (d) Pros: leverages existing CloudFormation skills; deploy to AWS in 2 days vs 8 weeks of Terraform ramp-up. (3) Recommendation: Terraform now, CloudFormation sunset later. Rationale: (a) Short-term pain (8 weeks of training) vs long-term gain (one tool for two clouds, reusable modules). (b) Investment: 8 weeks of team training = $50-80K, but it saves $100-200K/year in ops maintenance (one tool, no context switching). (c) Risk: if the hybrid-cloud strategy is abandoned (company pivots to AWS-only), CloudFormation was the right choice; if it succeeds, Terraform was essential. (4) Migration strategy to minimize pain: (a) Weeks 1-3: train the team on Terraform (3-hour sessions, ~15 hours total per dev). (b) Weeks 4-6: run CloudFormation and Terraform in parallel; developers write Terraform for new services and maintain old CloudFormation stacks. (c) Weeks 7-12: convert 20% of CloudFormation stacks to Terraform (highest-ROI stacks first). (d) Week 13+: convert the remaining stacks at ~10% per month. (5) Cost: training = $10K (contractor), parallel tooling = $500/month (Terraform Cloud), team time = ~$50K opportunity cost; total ~$65K over 6 months. ROI: saves ~$150K/year vs operating two tools long-term; break-even in about 6 months. (6) Fallback: if training is too slow, use Terraform for new GCP services and keep CloudFormation for AWS, with a planned 2-year migration to all-Terraform (less disruptive).
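Point (1) — one workflow, cloud-specific resources — looks like this in practice; all names, the project ID, and buckets below are illustrative assumptions:

```hcl
# One HCL codebase, two providers. The resource types remain
# cloud-specific: an S3 bucket and its GCS counterpart are separate
# resources even though the intent is identical. Names/IDs are
# hypothetical.
provider "aws" {
  region = "us-east-1"
}

provider "google" {
  project = "acme-backup"     # hypothetical GCP project ID
  region  = "us-central1"
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "acme-artifacts-primary"
}

resource "google_storage_bucket" "artifacts" {
  name     = "acme-artifacts-backup"
  location = "US"
}
```

The value is that both blocks go through the same plan/review/apply pipeline, even though each had to be written against its own provider's schema.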

Follow-up: You're running Terraform and CloudFormation in parallel. A developer modifies a resource in CloudFormation, then Terraform tries to deploy the same resource. State divergence. How do you resolve?

Your team has 500 CloudFormation templates (nested stacks). When you change a parent stack, it can take 20+ minutes to update all child stacks (AWS processes serially). Developers are frustrated with deployment velocity. Someone suggests: "Let's migrate to CDK for faster feedback loops." How do you evaluate if CDK actually solves this?

CDK won't solve this; the bottleneck is the CloudFormation service, not the template language. Analysis: (1) Root cause: CloudFormation processes nested-stack updates in dependency order, largely serially. 500 stacks × 2 min per stack = 1000 min theoretical worst case; even with batching of independent stacks, 50-100 min. This is an AWS orchestration limitation, not a template-format problem. (2) What CDK does: (a) Syntax sugar: Python/TypeScript instead of YAML. (b) Synthesis: CDK compiles to CloudFormation templates in seconds. (c) Deployment: still goes through CloudFormation, with the same serial-update behavior. (d) CDK does NOT speed up CloudFormation updates: total time = CDK synth (seconds) + CloudFormation update (20 min) ≈ 20 min. No gain. (3) Actual solutions to speed deployment: (a) Decouple stacks: instead of 500 nested stacks, design ~50 independent stacks (10:1 reduction). Each stack update = 1-2 min, and a typical change touches one stack, so total = 2-10 min (10x faster). Requires an architecture rethink, but worth it. (b) Stack policies: protect stable resources (e.g. RDS) from accidental updates; note that CloudFormation already skips unchanged resources, so the real speed wins come from splitting stacks, not policies. (c) Change sets: preview changes without applying, so developers review the proposed diff before deploying instead of deploying blind. (d) Parallelize: if deployment takes 20 min and you deploy 10x/day, that's 3+ hours/day blocked. Independent stacks can be updated concurrently (e.g. driven by a script or Lambda, subject to CloudFormation's concurrent-operation limits), potentially cutting 20 min to a few minutes. (4) Verdict: CDK will NOT solve this specific problem. The issue is architecture (500 nested stacks), not syntax. (a) First: decouple the architecture (weeks 1-4, architectural review). (b) If still too slow: investigate parallel stack updates. (c) CDK is fine for new work but won't fix legacy template velocity. (5) Alternative: if velocity is critical, adopt deployment-orchestration tools (Spinnaker, Harness) that manage CloudFormation in parallel with feature flags and canary deployments. But this adds operational complexity (the ops team may need to grow, e.g. from 5 to 10 engineers).

Follow-up: You decoupled 500 stacks into 50. Deployment time dropped to 8 min (good). But now you have cross-stack dependencies (Stack A needs output from Stack B). If Stack B fails, Stack A deployment is stuck. How do you handle stack dependencies safely?

Your infrastructure team has moved from CloudFormation to Terraform. After 6 months, you've accumulated 10TB of historical Terraform state files, multiple Git branches with divergent configs, and 3 developers holding different locally cached versions of the same state file. State is inconsistent. A junior dev runs `terraform apply` and accidentally scales production from 5 to 50 m5.xlarge instances (wrong environment variable). How do you prevent this?

Implement Terraform state hygiene + policy guardrails: (1) State file chaos diagnosis: (a) 10TB of state (including historical versions) suggests a lack of organization; a Terraform state file should typically be well under 100MB per stack. (b) Multiple versions in local caches = no source of truth; developers run `terraform apply` against stale local state, diverging from reality. (c) Solution: centralize all state in Terraform Cloud, or in S3 with DynamoDB locking (single source of truth). (2) Environment variable accident prevention: (a) Root cause: the developer had something like `export TF_VAR_environment=prod` in their shell, then ran `terraform apply`; Terraform read the variable and applied to prod. (b) Controls: (i) Remove environment variables from shells; use tfvars files instead (prod.tfvars, staging.tfvars). (ii) Require an explicit flag: `terraform apply -var-file=prod.tfvars`. (iii) Add a safety check in Terraform itself, validating the environment value:

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev/staging/prod."
  }
}
```

(iv) Pre-commit hook: block apply if the environment is not explicitly specified. (3) State file reorganization: (a) Consolidate 10TB to ~500GB by deleting old versions (keep 12 months); use S3 lifecycle policies to archive older versions to Glacier. (b) Separate state by concern: vpc.tfstate, rds.tfstate, iam.tfstate (not monolithic), ~50MB each. (c) Link states via remote data sources (terraform_remote_state) for cross-stack dependencies. (4) Deployment approval workflow: (a) Plan-only CI/CD: `terraform plan -out=tfplan`, with the output attached to the PR; humans review before apply. (b) Approval gate: require 2 approvals for prod changes (policy in GitHub Actions). (c) Apply via CI/CD only (never locally): developers cannot run `terraform apply` against production, only the deploy bot can. (5) Detect divergence: (a) Daily reconciliation job: fetch actual AWS resources (describe-instances, etc.) and compare to Terraform state; alert if drift exceeds 5%. (b) AWS Config rules: detect manual changes made outside Terraform and auto-remediate by reverting to the Terraform-managed configuration. (6) Implementation: 3 weeks (Terraform Cloud setup, pre-commit hooks, CI/CD approval gates, Config rules, state cleanup). Cost: Terraform Cloud ~$500/month, AWS Config ~$200/month. Payoff: prevents repeat incidents (the production scaling accident alone risked $20K+ in downtime).
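Splitting state by concern relies on terraform_remote_state to wire stacks together; a minimal sketch, assuming hypothetical bucket/key names and that the VPC stack declares a `private_subnet_ids` output:

```hcl
# Read the VPC stack's outputs from its own state file instead of
# hard-coding IDs. Bucket, key, and the output name are hypothetical.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "prod/vpc.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_subnet_group" "main" {
  name       = "prod-db"
  subnet_ids = data.terraform_remote_state.vpc.outputs.private_subnet_ids
}
```

Each stack stays small and independently applyable, while dependencies flow through declared outputs rather than copy-pasted IDs.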

Follow-up: Prod environment is now deploy-only via CI/CD, but a production incident requires immediate scaling to 50 instances. CI/CD pipeline takes 15 min to run. Incident needs fix in 2 min. Do you allow manual override, or stick with CI/CD-only policy?

Your CDK codebase has 80K lines of Python (20 services). You hire a new team that specializes in Terraform. They ask: "Can we contribute to infra code?" Currently, CDK knowledge is concentrated (3 devs). Terraform knowledge is broader in industry. What's your strategy for knowledge distribution?

Hybrid CDK+Terraform strategy for team knowledge distribution: (1) Current state: 80K lines of CDK (Python), 3 devs. New hires know Terraform, not CDK. Knowledge bottleneck. (2) Options: (a) Force new hires to learn CDK: 6-week ramp-up, slows onboarding, frustration. (b) Migrate everything to Terraform: a 12-week effort that discards existing CDK expertise. (c) Hybrid approach (recommended): use CDK for core infrastructure, Terraform for new services/modules. (3) Hybrid design: (a) CDK maintains: VPC, IAM, core networking, shared resources (3 devs, ~4K lines of the codebase, stable). (b) Terraform manages: application microservices, customer-specific stacks, ephemeral resources (new team, ~1K lines/month growth, high velocity). (c) Integration: CDK exports outputs (VPC ID, subnet IDs, IAM role ARN) to SSM Parameter Store; Terraform reads them via a data source. (d) This allows the two teams to work independently without blocking each other. (4) Implementation: (a) Week 1: identify CDK services that are stable (rarely change); keep them in CDK. (b) Weeks 2-3: migrate 20% of the newer services to reusable Terraform modules. (c) Week 4+: new hires write Terraform modules for new services. (d) The CDK team maintains core infrastructure and doesn't need to grow. (5) Knowledge transition: (a) Pair programming: 1 CDK dev + 1 Terraform dev on each new service, so expertise cross-pollinates in both directions. (b) Documentation: maintain a runbook for each tool; new hires start with the runbook (1 hour), then pair (8 hours). (c) Gradual sunsetting: over 18 months, CDK handles <10% of infrastructure; eventually deprecate it if the maintenance burden grows. (6) Cost: minimal (same infrastructure, different tooling). Risk: two tools = ops complexity. Payoff: team velocity +40% (reduced CDK bottleneck).
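The CDK→SSM→Terraform handoff in (3)(c) can be sketched on the Terraform side; the parameter path is an assumed naming convention, not a real one:

```hcl
# CDK writes the VPC ID to SSM Parameter Store; Terraform reads it at
# plan time, so the two codebases never share state directly. The
# parameter path "/infra/core/vpc-id" is a hypothetical convention.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/core/vpc-id"
}

resource "aws_security_group" "lambda" {
  name   = "svc-lambda"                       # illustrative name
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

Because the value is fetched fresh on every plan, SSM acts as the contract boundary between the two teams' tools.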

Follow-up: You've split CDK and Terraform, but a new service requires both VPC (CDK) and Lambda (Terraform). CDK exports VPC ID to SSM, but Terraform data source is stale (cached). Lambda can't connect to correct VPC. How do you ensure fresh references?
