Your team uses local Terraform but faces challenges: state management across team members, reproducibility issues (different versions), and compliance requirements. You're evaluating Terraform Cloud. What business benefits justify the switch?
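Business benefits and technical advantages:
1) State management: state is centralized, encrypted at rest, and versioned. No more state files on laptops.
2) Reproducibility: each workspace pins a Terraform version, and remote runs honor the committed `.terraform.lock.hcl`, so every team member gets the same core and provider versions.
3) Audit trail: every run records who applied which change and when, which addresses the compliance requirement.
4) Cost estimation: on plans that include it, TFC shows the estimated cost impact before apply, which helps catch accidentally expensive deployments.
5) Policy as code: Sentinel policies enforce company standards automatically on every run.
6) Collaboration: team members review plans in the UI before anything is applied.
7) VCS integration: a speculative plan runs on each PR, and a merge triggers a run that applies automatically if auto-apply is enabled or waits for confirmation otherwise.
8) Remote runs: TFC executes plans and applies, not developer laptops. Cloud credentials live in workspace variables instead of on laptops, which reduces the attack surface.
9) State locking: built in, so two runs cannot write the same state at once.
10) Disaster recovery: every state version is retained and an earlier version can be restored. Paid tiers carry an uptime SLA; confirm the current figure for your plan rather than assuming one.
Cost: pricing has changed over time, so check the current pricing page before quoting a number. The legacy free tier was capped by users (5), not by workspaces, and current plans bill per managed resource with a free allowance. For teams of 10+, the cost is usually small compared with the risk of manual state management.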
Follow-up: What migration concerns would prevent switching to Terraform Cloud?
You're migrating 50 local Terraform states to Terraform Cloud. Some team members have stale states, others have never run Terraform. How do you coordinate this migration without downtime?
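Phased migration strategy. Moving state does not touch the infrastructure itself, so the risk is conflicting applies during the move, not downtime.
1) Phase 1, pilot (week 1): migrate the dev environment to TFC. It is non-critical and low risk, and the team gets familiar with the UI.
2) Phase 2, staging (weeks 2-3): migrate staging. The state is larger and the stakes are higher. Test disaster recovery by rolling back to an earlier state version and verifying that recovery works.
3) Phase 3, production (week 4+): migrate prod with two approvals and a maintenance window.
4) For each migration:
   a) Pick the authoritative state. Where several copies exist, compare the `lineage` and `serial` fields, take the newest copy of the correct lineage, confirm it with `terraform plan` against the real infrastructure, and back it up. Discard stale copies.
   b) Create the TFC workspace and add a `cloud` block to the configuration, then run `terraform login` and `terraform init`, which offers to migrate the existing local state. `terraform state push` is the manual alternative.
   c) Verify that `terraform plan` shows zero changes.
   d) Queue one run from TFC and confirm it also reports no changes. This proves that workspace variables and credentials match what was used locally.
5) Coordinate the team: announce the migration date. No local applies after migration begins.
6) Rollback plan: keep the backup of every local state. If the migration fails, remove the `cloud` block and re-initialize against the backup.
7) Training: show the team how to review, approve, and apply in the TFC UI. Members who have never run Terraform only need a TFC account and `terraform login`.
8) Retire local state: a local state file cannot be locked centrally, so after a successful migration archive the local state files and remove them from working directories to prevent accidental use.
9) Validate: run `terraform state pull` against the workspace and compare the resource list with the backup of the old local state. They should match.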
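The "plan shows zero changes" check can be automated across 50 states instead of being read by eye. The following is a minimal sketch, assuming the plan was exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`; the function name `pending_changes` and the sample data are illustrative, not part of any Terraform tooling. It does not inspect output changes.

```python
import json

# Planned actions that leave real infrastructure untouched.
HARMLESS = (["no-op"], ["read"])


def pending_changes(plan: dict) -> list[str]:
    """Return 'address: actions' for every resource the plan would modify."""
    changed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions and actions not in HARMLESS:
            changed.append(f"{rc['address']}: {'/'.join(actions)}")
    return changed


if __name__ == "__main__":
    # Hypothetical excerpt of `terraform show -json tfplan` output.
    sample = json.loads("""
    {"resource_changes": [
      {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
      {"address": "aws_instance.web", "change": {"actions": ["update"]}}
    ]}
    """)
    problems = pending_changes(sample)
    if problems:
        print("Migration check failed, plan is not empty:")
        for line in problems:
            print("  " + line)
    else:
        print("OK: no changes after migration")
```

An empty result for a workspace is the signal that the migrated state matches the infrastructure; anything else should block the cutover for that workspace.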
Follow-up: How would you handle a team member who refuses to migrate away from local state?
Your Terraform Cloud organization has 30 workspaces. Different teams manage different workspaces but there are overlapping resources (shared VPC, shared databases). How do you prevent conflicts and ensure consistency?
Implement workspace governance:
1) Organize by ownership: separate teams manage separate workspaces. Team A: workspace-auth, workspace-user-api. Team B: workspace-billing.
2) Give each shared resource exactly one owner: the shared VPC and databases are managed in workspace-infrastructure. Other workspaces read its outputs through the `terraform_remote_state` data source (or the `tfe_outputs` data source), with state sharing limited to the workspaces that need it.
3) Workspace permissions: TFC allows granular team access. Team A cannot access Team B's workspaces.
4) State locking: TFC manages locks. Only one apply at a time per workspace.
5) Sentinel policies enforce consistency. A policy cannot see which person edited a resource, but it can restrict which workspace may change one. As a sketch using the `tfplan/v2` (as `tfplan`) and `tfrun` imports: `main = rule { all tfplan.resource_changes as _, rc { rc.type is not "aws_vpc" or rc.change.actions contains "no-op" or tfrun.workspace.name is "workspace-infrastructure" } }`.
6) Tagging strategy: all resources are tagged with the owning team, for example `Owner = "TeamA"`. Monitoring can then verify that a team only modifies resources it owns.
7) Documentation: keep a CODEOWNERS file that maps each workspace's configuration directory to its owning team, so PRs are routed to the right reviewers.
8) Audit: TFC audit logs show all applies. Review them monthly for conflicts.
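One way to find existing overlaps is to scan every workspace's state for the same real resource being managed twice. This is a minimal sketch, assuming each state has been fetched with `terraform state pull` and parsed from JSON (state format version 4, where each resource has `mode`, `type`, and `instances[].attributes`); the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_shared_ownership(states: dict[str, dict]) -> dict[tuple[str, str], list[str]]:
    """Map (resource type, resource id) to the workspaces that manage it,
    keeping only resources managed by more than one workspace."""
    owners: dict[tuple[str, str], set[str]] = defaultdict(set)
    for workspace, state in states.items():
        for res in state.get("resources", []):
            # Data sources only read; they do not create ownership conflicts.
            if res.get("mode") != "managed":
                continue
            for inst in res.get("instances", []):
                rid = (inst.get("attributes") or {}).get("id")
                if rid:
                    owners[(res["type"], rid)].add(workspace)
    return {key: sorted(ws) for key, ws in owners.items() if len(ws) > 1}


if __name__ == "__main__":
    # Hypothetical states: both workspaces manage the same VPC.
    vpc = {"mode": "managed", "type": "aws_vpc", "name": "shared",
           "instances": [{"attributes": {"id": "vpc-123"}}]}
    states = {
        "workspace-infrastructure": {"resources": [vpc]},
        "workspace-billing": {"resources": [vpc]},
    }
    for (rtype, rid), workspaces in find_shared_ownership(states).items():
        print(f"{rtype} {rid} is managed by: {', '.join(workspaces)}")
```

Any resource reported here should be kept in one workspace and removed from the others with `terraform state rm`, then consumed through outputs.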
Follow-up: How would you detect and prevent unauthorized cross-team resource modifications?
You want to enforce company policy: all infrastructure must be tagged with cost center, environment, and owner. Using local Terraform, developers often forget tags. How does Terraform Cloud help enforce this?
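Use Sentinel policies for compliance:
1) Write the policy: import `tfplan/v2` as `tfplan`, select the entries of `tfplan.resource_changes` that are being created or updated and have a `tags` attribute, and for each such `rc` require `all ["CostCenter", "Environment", "Owner"] as t { rc.change.after.tags contains t }`.
2) Add it to a policy set attached to the workspace.
3) On every run: TFC evaluates the policy after the plan and before the apply. If tags are missing, the policy check fails and the developer must fix the configuration.
4) Enforcement levels: `advisory` only logs a warning and never blocks. `soft-mandatory` blocks the run but lets an authorized user override it, ideally with a comment recording the reason. `hard-mandatory` blocks the run with no override.
5) Custom rules: for sensitive resources (RDS, IAM), require additional tags. For `aws_db_instance`, for example, also require `DataClassification` and `BackupPolicy`.
6) Combine with cost reporting: TFC cost estimation is per run, not per tag. The per-tag view ("Cost Center X is $50/month") comes from the cloud provider's billing reports once the tags are activated for cost allocation.
7) Make compliance the default: run tasks can pass or fail a run but cannot modify the plan, so they cannot add tags. Use the AWS provider's `default_tags` block so common tags are applied automatically.
8) Regular audits: track policy check results per workspace over time. Set a target of 100% of resources tagged.
9) Training: show the team examples of proper tagging alongside the policies. Make it easy to comply.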
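The logic of the tagging rule can be prototyped outside Sentinel, for example as a pre-commit or CI check. The sketch below is Python, not Sentinel, and assumes the plan JSON produced by `terraform show -json`; the function name is illustrative. It prefers `tags_all` when present so that tags supplied through the AWS provider's `default_tags` count, and it assumes tag values are known at plan time.

```python
REQUIRED_TAGS = ("CostCenter", "Environment", "Owner")


def missing_tags(plan: dict, required=REQUIRED_TAGS) -> dict[str, list[str]]:
    """Return {resource address: [missing tag keys]} for created or updated resources."""
    problems = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue
        after = change.get("after") or {}
        # Skip resource types that have no tags attribute at all.
        if "tags" not in after and "tags_all" not in after:
            continue
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = [key for key in required if key not in tags]
        if missing:
            problems[rc["address"]] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical plan excerpt: the bucket lacks two of the three tags.
    plan = {"resource_changes": [
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["create"],
                    "after": {"tags": {"Owner": "TeamA"}}}},
    ]}
    for address, keys in missing_tags(plan).items():
        print(f"{address} is missing tags: {', '.join(keys)}")
```

Running the same rule locally gives developers the failure before the run reaches the policy check in TFC.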
Follow-up: How would you handle legitimate exceptions to tagging policies?
Your Terraform Cloud workspaces have grown complex, and multiple teams contribute. A PR adds a resource that looks innocent but creates a security vulnerability. How does TFC prevent this?
Use Sentinel policies and run tasks for security validation:
1) Sentinel policy against public databases. With `tfplan/v2` imported as `tfplan`, a rule along these lines fails the run if a database would be public: `main = rule { all tfplan.resource_changes as _, rc { rc.type is not "aws_db_instance" or rc.change.after is null or rc.change.after.publicly_accessible is false } }`.
2) Add further checks in the same shape, for example require `rc.change.after.encrypted is true` for every `aws_ebs_volume`.
3) Custom run tasks: after the plan, TFC calls an external HTTPS service, which can run a script such as `check_security_group_rules.py` to verify that no rule is overly permissive. A run task set to mandatory blocks the apply if it fails.
4) Integration with external tools: scanners such as Checkov can run behind a run task endpoint or earlier in CI, for example `checkov -d . --framework terraform`.
5) Cost analysis: if the plan shows an unusual cost spike (for example 100 new instances), alert. It may be a mistake.
6) Manual approval gate: require two PR reviews through branch protection for the prod configuration, and disable auto-apply so someone must review the plan and confirm the apply.
7) Gradual rollout: apply to dev and staging first. Monitor for issues before prod.
8) Audit trail: TFC logs show exactly who approved what. The security team reviews them quarterly.
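The source names a `check_security_group_rules.py` script without showing it. The following is one possible sketch of its core, assuming it receives the plan JSON from `terraform show -json` and that only ports 80 and 443 may be open to the internet; the allowed ports, function names, and sample data are assumptions. It covers `aws_security_group` inline ingress blocks and `aws_security_group_rule` resources only.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
ALLOWED_PUBLIC_PORTS = {80, 443}  # assumption: adjust to company policy


def _is_open(rule: dict) -> bool:
    """True if the rule admits traffic from any address."""
    cidrs = set(rule.get("cidr_blocks") or []) | set(rule.get("ipv6_cidr_blocks") or [])
    return bool(cidrs & OPEN_CIDRS)


def _port_allowed(rule: dict) -> bool:
    """True only for a single-port rule on an explicitly allowed port."""
    low, high = rule.get("from_port"), rule.get("to_port")
    return low is not None and low == high and low in ALLOWED_PUBLIC_PORTS


def open_ingress_findings(plan: dict) -> list[str]:
    """List ingress rules that expose disallowed ports to the internet."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_security_group":
            rules = after.get("ingress") or []
        elif rc.get("type") == "aws_security_group_rule" and after.get("type") == "ingress":
            rules = [after]
        else:
            continue
        for rule in rules:
            if _is_open(rule) and not _port_allowed(rule):
                findings.append(
                    f"{rc['address']}: ports {rule.get('from_port')}-{rule.get('to_port')} "
                    "open to the internet"
                )
    return findings


if __name__ == "__main__":
    # Hypothetical plan excerpt: SSH open to the world.
    plan = {"resource_changes": [
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]}}},
    ]}
    for finding in open_ingress_findings(plan):
        print(finding)
```

A service behind a mandatory run task would report failure to TFC whenever this list is non-empty.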
Follow-up: How would you prevent policy bypasses by developers editing Sentinel policies?
Your organization has strict compliance requirements: all infrastructure changes must be approved by the security team before apply. Terraform Cloud has an approval flow, but you need to integrate with external approvers who are not TFC team members. How do you design this?
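Use TFC run tasks with external approvers:
1) After the plan: TFC sends the run task request to an external compliance service, including a callback URL and a token. The service looks up whether the security team has approved the run, for example via an internal endpoint such as `https://compliance-api.company.com/approve/$run_id`.
2) The security team approves in its own system (Slack, PagerDuty, or a custom tool).
3) On approval: the compliance system calls the run task callback URL to mark the task as passed.
4) If there is no approval: the task is reported as failed and, because the run task is mandatory, the apply is blocked.
5) Implement a timeout: if no approval arrives within 24 hours, the request expires and the developer must replan. Run tasks have a bounded callback window, so confirm the current limit in the documentation. If approvals can take longer than that, fail the task with a "pending approval" message and have the compliance system trigger a new run once approval is recorded.
6) Audit: log all approval requests, approvers, and times in the compliance database.
7) Emergencies: provide a documented break-glass path that requires VP approval through a separate process. A chat command such as `/terraform apply --force-override-security` would be a custom command handled by the compliance system, not a Terraform CLI flag.
8) SLA tracking: measure the time from plan to approval. If it exceeds 1 hour, escalate.
9) Document: show the security team how to review plans and what to look for.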
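The decision and callback steps above can be sketched as follows. This is a hedged illustration, not HashiCorp's reference implementation: the payload shape (`task-results` with a `status` of `passed`, `failed`, or `running`) and the bearer-token header reflect the run task callback format as I understand it and should be checked against the current API documentation. The function names and the 24-hour window are taken from or invented for this scenario.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

APPROVAL_TTL = timedelta(hours=24)  # from the scenario: approvals expire after 24 hours


def decide(approved_by, requested_at, now, ttl=APPROVAL_TTL):
    """Return (status, message) for the run task based on the approval record."""
    if approved_by:
        return "passed", f"Approved by {approved_by}"
    if now - requested_at > ttl:
        return "failed", "Approval window expired; queue a new plan"
    return "running", "Waiting for security approval"


def build_callback(status, message):
    """Build the JSON:API body sent back to the run task callback URL."""
    return {"data": {"type": "task-results",
                     "attributes": {"status": status, "message": message}}}


def send_callback(callback_url, access_token, payload):
    """PATCH the result to the callback URL that TFC supplied in its request."""
    request = urllib.request.Request(
        callback_url,
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/vnd.api+json",
                 "Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    requested = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
    status, message = decide("security-lead", requested, now)
    print(json.dumps(build_callback(status, message), indent=2))
```

Keeping `decide` free of network calls makes the approval rules easy to unit test, while `send_callback` is the only part that talks to TFC.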
Follow-up: How would you handle a situation where security approval is delayed due to holidays?
You're running Terraform Enterprise (self-hosted). You need to scale to 100+ concurrent applies globally. Network latency between regions and the central TFE instance causes slowdowns. How do you optimize?
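Scale run capacity and move execution closer to the infrastructure. TFE does not provide live multi-primary state replication between installations, so the design separates capacity, locality, and disaster recovery:
1) Capacity: run TFE in active/active mode, where several application nodes behind a load balancer share an external PostgreSQL database, object storage, and Redis.
2) Locality: deploy agent pools in each region (us-east, eu-west, ap-southeast) so plans and applies execute next to the target infrastructure and only state and logs cross regions.
3) Regional installations: if teams need a fully local control plane, run separate TFE installations per region, each owning its own workspaces and state. Treat this as partitioning, not replication.
4) Disaster recovery: replicate the database and object storage from the primary region (us-east) to a standby region and keep a standby installation ready. A replication lag under 5 seconds is a reasonable target to monitor.
5) VCS webhooks: webhooks go to the TFE instance that owns the workspace. With regional installations, connect each repository to its regional instance, so eu repositories talk to the eu instance.
6) Run tasks: host expensive tasks (security scans) in each region instead of centrally.
7) Database: use managed PostgreSQL (RDS) with Multi-AZ. Treat cross-region read replicas as a disaster recovery standby rather than a read path unless HashiCorp documents otherwise.
8) Network: use private connectivity (VPN or AWS PrivateLink) between agents and TFE to keep latency low and traffic off the public internet.
9) State locking: TFE locks workspaces itself. DynamoDB lock tables belong to the S3 backend and are not needed here.
10) DNS: Route 53 latency or geolocation routing only helps with regional installations. The CLI connects to whatever `hostname` is set in the `cloud` block.
11) Monitor: track apply latency per region. Set an SLA that applies complete within 5 minutes, and alert if it is exceeded.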
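The per-region latency SLA can be checked with a small script over run durations. This is a minimal sketch, assuming durations in seconds have already been collected per region (for example from run timestamps); the nearest-rank 95th percentile, the function names, and the sample numbers are illustrative choices, not something the source prescribes.

```python
import math

SLA_SECONDS = 5 * 60  # from the scenario: applies complete within 5 minutes


def p95(durations: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def sla_breaches(durations_by_region: dict[str, list[float]],
                 sla: float = SLA_SECONDS) -> dict[str, float]:
    """Return {region: p95 duration} for regions whose p95 exceeds the SLA."""
    breaches = {}
    for region, durations in durations_by_region.items():
        if not durations:
            continue  # no runs in this region, nothing to evaluate
        value = p95(durations)
        if value > sla:
            breaches[region] = value
    return breaches


if __name__ == "__main__":
    # Hypothetical apply durations in seconds.
    samples = {
        "us-east": [120, 150, 200],
        "eu-west": [100, 280, 400],
        "ap-southeast": [],
    }
    for region, value in sla_breaches(samples).items():
        print(f"ALERT {region}: p95 apply time {value:.0f}s exceeds {SLA_SECONDS}s")
```

Using a percentile rather than the mean keeps one unusually slow apply from hiding behind many fast ones, and one fast region from masking a slow one.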
Follow-up: How would you handle state consistency issues during region failover?
Your organization uses Terraform Enterprise but compliance audit requires proving that no infrastructure changes happened outside of Terraform. How do you prove this?
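Implement a comprehensive audit trail and detection:
1) TFE audit logs: export them to immutable storage (S3 with Object Lock, or versioning plus MFA delete). Every apply is logged with user, timestamp, and resources changed.
2) CloudTrail: enable it in all AWS accounts and log all API calls. Compare write events with TFE runs: any mutating call not made by the role TFE uses is a change outside Terraform.
3) AWS Config: record configuration changes and evaluate rules, with an aggregator for a cross-account view. Config does not read Terraform state, so use it as a timeline of changes to correlate with TFE runs rather than as a state comparison.
4) Periodic drift checks: run a weekly `terraform plan -refresh-only` to list drift, or `terraform plan -out=tfplan && terraform show -json tfplan | jq '.resource_drift, .resource_changes'` for machine-readable detail. `terraform plan -json` on its own emits a log stream, not a plan document. TFE's built-in drift detection can automate this where available.
5) Policy enforcement: Sentinel only evaluates during Terraform runs, so it cannot block manual changes. Alert on drift and on CloudTrail findings instead.
6) IAM restrictions: remove write access to the AWS Console and API from developers. All changes go through TFE.
7) Detective controls: alert on mutating AWS API calls from any principal other than TFE, such as developer machines.
8) Documentation: show the auditor the TFE logs, CloudTrail records, and drift reports together to demonstrate that no changes happened outside Terraform.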
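The CloudTrail comparison can be sketched as a filter over event records. This is a minimal example, assuming records in the standard CloudTrail JSON shape (`eventName`, `readOnly`, `userIdentity`) and that TFE applies through a dedicated IAM role; the role ARN and function names are hypothetical. Changes made by AWS services themselves (for example Auto Scaling) will also appear and need their own allowlist.

```python
# Hypothetical: the IAM role that TFE assumes for applies.
TFE_ROLE_ARNS = {"arn:aws:iam::123456789012:role/tfe-apply"}


def principal_arn(event: dict) -> str:
    """Resolve the role behind an assumed-role session, else the caller ARN."""
    identity = event.get("userIdentity") or {}
    issuer = (identity.get("sessionContext") or {}).get("sessionIssuer") or {}
    return issuer.get("arn") or identity.get("arn", "unknown")


def changes_outside_terraform(events: list[dict], allowed=TFE_ROLE_ARNS) -> list[tuple]:
    """Return (time, event name, principal) for write events not made by TFE."""
    findings = []
    for event in events:
        # Events without a readOnly field are treated as writes to stay conservative.
        if event.get("readOnly", False):
            continue
        arn = principal_arn(event)
        if arn not in allowed:
            findings.append((event.get("eventTime"), event.get("eventName"), arn))
    return findings


if __name__ == "__main__":
    # Hypothetical CloudTrail records.
    events = [
        {"eventTime": "2024-01-01T10:00:00Z", "eventName": "RunInstances", "readOnly": False,
         "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/tfe-apply/run-1",
                          "sessionContext": {"sessionIssuer": {
                              "arn": "arn:aws:iam::123456789012:role/tfe-apply"}}}},
        {"eventTime": "2024-01-01T11:00:00Z", "eventName": "AuthorizeSecurityGroupIngress",
         "readOnly": False,
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/dev-alice"}},
    ]
    for when, name, who in changes_outside_terraform(events):
        print(f"{when} {name} by {who}")
```

An empty findings list over the audit period, stored next to the TFE audit logs, is the evidence an auditor can verify independently.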
Follow-up: How would you handle a situation where emergency manual infrastructure changes were necessary?