Your team uses GitOps for Terraform: plans are posted as PR comments, and approval and apply are triggered manually via commands. This is tedious: someone has to watch the PR, wait for the plan, comment an approval, and apply by hand. You want full automation. Evaluate Atlantis.
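Atlantis automates most of this workflow:
1) Automatic plan on PR: a GitHub webhook triggers Atlantis to run `terraform plan` and post the result as a PR comment. No manual trigger.
2) Approval and apply: a reviewer approves the PR in GitHub, then someone comments `atlantis apply`. Atlantis runs the apply and posts the output. There is no separate `atlantis approve` command; approval is the ordinary PR review. With automerge enabled, Atlantis merges the PR once all planned projects have been applied.
3) Lock management: when a PR plans a project, Atlantis locks that directory and workspace until the PR is merged, closed, or unlocked, so two PRs can't apply conflicting plans. This is in addition to the state locking done by the Terraform backend.
4) Audit trail: plan and apply output, and who triggered them, stay on the PR as comments. Git history + PR comments + Atlantis server logs = full audit.
5) Multiple environments: a single Atlantis instance can manage dev, staging, and prod. Each environment has separate state and approval requirements.
6) Deployment: host Atlantis on a server or on Kubernetes. The GitHub webhook points to the Atlantis `/events` endpoint.
7) Configuration: an `atlantis.yaml` at the repo root defines how each Terraform project is handled. Example: `projects: [{name: prod, dir: terraform/prod, workspace: prod, apply_requirements: [approved, mergeable]}]` under `version: 3`. Setting `apply_requirements` from the repo file only works if the server-side repo config lists it in `allowed_overrides`.
8) Cost: open source and self-hosted, so there is no license fee; you pay for the infrastructure it runs on. Scales to teams of 100+.
9) Alternative: Terraform Cloud (now HCP Terraform) might be simpler if you need less customization. Choose Atlantis if you want full control.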
Follow-up: What are the operational requirements for running Atlantis at scale?
You've deployed Atlantis. A team member submits a PR that would delete the production database. They comment `atlantis apply` and it applies without your review. How do you prevent this?
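Implement multi-layer approval gates:
1) Atlantis configuration: require approval for prod in `atlantis.yaml`: `projects: [{name: prod, dir: terraform/prod, apply_requirements: [approved, mergeable]}]`. `approved` needs at least one approving review from someone other than the author; `mergeable` makes Atlantis refuse to apply until GitHub considers the PR mergeable.
2) Require code review: `atlantis.yaml` can't demand a specific number of reviewers. Use GitHub branch protection to require 2 approving reviews. Because of `mergeable`, Atlantis then blocks `atlantis apply` until branch protection is satisfied. In the default flow the apply runs before the merge, not after it.
3) Use policy as code: Sentinel belongs to Terraform Cloud/Enterprise. Atlantis integrates Conftest (OPA/Rego) policy checks instead, or you can add a custom workflow step that fails on dangerous operations: `workflows: {prod: {apply: {steps: [{run: bash validate_plan.sh}, apply]}}}`. Custom workflows in the repo file must be allowed by the server-side repo config.
4) Restrict who can approve: `apply_requirements: [approved]` is satisfied by a GitHub PR approval, not by an Atlantis comment. Use CODEOWNERS and branch protection so that only leads' reviews count for prod paths.
5) Pre-apply checks: before `apply`, run a script that inspects the plan as JSON rather than grepping text: `terraform show -json "$PLANFILE" | jq -e '[.resource_changes[] | select(.change.actions | index("delete"))] | length == 0'`. A non-zero exit fails the step and blocks the apply.
6) Slack notifications: Atlantis's built-in Slack webhook reports apply results. A pre-apply message such as "Production apply pending. Resources to destroy: [db-instance]." has to come from a custom `run` step that posts to Slack; approving from Slack is not a built-in feature.
7) Lock prod: block prod applies during business hours unless escalated, for example with a time check in the same `run` step.
8) Training: make sure the team knows that anyone who can comment can trigger an apply once the requirements are met, and that reviewing the plan before approving is critical.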
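The pre-apply destruction check can be made more precise than a text search by reading the JSON form of the plan. The following is a minimal sketch, assuming the plan has been exported with `terraform show -json`; the script name `validate_plan.py` and the `CRITICAL_TYPES` list are hypothetical and would need to match your own stack.

```python
import json
import sys

# Resource types treated as critical. Hypothetical list; adjust to your stack.
CRITICAL_TYPES = {"aws_db_instance", "aws_rds_cluster", "aws_s3_bucket"}


def find_destroyed(plan: dict, critical_types=CRITICAL_TYPES) -> list[str]:
    """Return addresses of critical resources that the plan would delete.

    A replacement appears as ["delete", "create"] or ["create", "delete"],
    so replacements are reported too.
    """
    destroyed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in critical_types:
            destroyed.append(rc["address"])
    return destroyed


def main(plan_json_path: str) -> int:
    """Exit code 1 blocks the apply, 0 lets it continue."""
    with open(plan_json_path) as fh:
        plan = json.load(fh)
    destroyed = find_destroyed(plan)
    for address in destroyed:
        print(f"BLOCKED: plan destroys critical resource {address}", file=sys.stderr)
    return 1 if destroyed else 0


# Only runs when a plan JSON path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

In a custom apply workflow, a `run` step along the lines of `terraform show -json "$PLANFILE" > plan.json && python3 validate_plan.py plan.json`, placed before the `apply` step, would stop the apply whenever the script exits non-zero.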
Follow-up: How would you implement automatic rollback if apply fails?
You use Atlantis with multiple teams. Team A manages dev, Team B manages prod. The Atlantis configuration grows complex. Managing 30 projects with different approval requirements is becoming unmaintainable. How do you organize Atlantis at scale?
Scale Atlantis with governance:
1) Separate Atlantis instances: instead of a single Atlantis managing everything, deploy Atlantis-dev (dev projects, self-service) and Atlantis-prod (prod, strict approval). Each instance gets its own GitHub webhook and is scoped to its own repos or projects.
2) Or centralized with team-based control: a single Atlantis, with the `--gh-team-allowlist` server flag restricting which GitHub teams may run `plan` and `apply`. Whose approval counts for prod is decided in GitHub (CODEOWNERS plus branch protection), for example only the prod-admins team.
3) Define projects clearly: Atlantis reads one `atlantis.yaml` at the repo root, not one per directory. Declare a project per environment there (`dir: terraform/dev`, `dir: terraform/prod`), each with its own settings, and keep organization-wide defaults in the server-side repo config.
4) Approval requirements: tier by risk. dev: `apply_requirements: [mergeable]` (no explicit approval). staging: `apply_requirements: [approved]`. prod: `apply_requirements: [approved, mergeable]` plus custom script validation.
5) Team ownership: use CODEOWNERS so GitHub shows who should review. Atlantis does not read CODEOWNERS itself; it honors it indirectly when branch protection requires code-owner review and the project requires `mergeable`.
6) Runbooks: document the approval process. What constitutes a "safe apply"?
7) Monitoring: track apply success rate per team. If Team A has an 80% failure rate, investigate.
8) Backup: keep Atlantis configuration in Git and Terraform state in a remote backend. If Atlantis crashes, redeploy from config; only in-flight locks and unapplied plans in its data directory are lost.
9) Scaling: for 100+ projects, consider a managed solution (Terraform Cloud) instead of self-hosted Atlantis.
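With 30 projects, the risk tiers are easier to keep consistent when the project list is generated rather than hand-edited. The sketch below renders a repo-level `atlantis.yaml` from (name, tier) pairs using the tier requirements listed above. The directory layout `terraform/<tier>/<name>`, the `<name>-<tier>` naming, and the workspace-per-tier convention are assumptions made for illustration.

```python
# Apply requirements per risk tier, taken from the tiers described above.
TIER_REQUIREMENTS = {
    "dev": ["mergeable"],
    "staging": ["approved"],
    "prod": ["approved", "mergeable"],
}


def render_atlantis_yaml(projects: list[tuple[str, str]]) -> str:
    """Render atlantis.yaml text for a list of (name, tier) pairs.

    Assumes the hypothetical layout terraform/<tier>/<name> and a Terraform
    workspace named after the tier. An unknown tier raises KeyError, so a
    typo cannot silently produce a project with no requirements.
    """
    lines = ["version: 3", "projects:"]
    for name, tier in sorted(projects):
        requirements = ", ".join(TIER_REQUIREMENTS[tier])
        lines += [
            f"- name: {name}-{tier}",
            f"  dir: terraform/{tier}/{name}",
            f"  workspace: {tier}",
            f"  apply_requirements: [{requirements}]",
        ]
    return "\n".join(lines) + "\n"


# Example: print(render_atlantis_yaml([("network", "prod"), ("app", "dev")]))
```

Generating the file in CI and failing the build when the committed `atlantis.yaml` differs is one way to keep a prod project from quietly losing its `approved` requirement.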
Follow-up: How would you detect if Atlantis is causing unintended applies due to misconfiguration?
Atlantis deployment uses Kubernetes. Pod crashes mid-apply. Terraform apply partially succeeds (50 resources created, error on resource 51). Atlantis pod restarts. State is inconsistent. How do you prevent and recover?
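Design resilient Atlantis:
1) Pod crash prevention: allocate sufficient resources. The Atlantis server is lightweight, but the Terraform processes it spawns can be heavy. Set requests and limits from observed usage (`memory: 512Mi` is only a starting point) and monitor CPU/memory.
2) Persistent state: store Atlantis data (locks, plans, cloned repos) on a persistent volume mounted at its data directory: `volumeMounts: [{name: atlantis-data, mountPath: /atlantis}]` with `--data-dir` pointing at that path. If the pod crashes, the new pod recovers locks and pending plans.
3) Locking database: by default Atlantis keeps locks in a BoltDB file inside the data directory. Redis is the supported external alternative (`--locking-db-type=redis`); PostgreSQL is not a supported backend.
4) Apply timeout: bound the apply time, for example by replacing the built-in apply step with a custom `run` step such as `timeout 30m terraform apply "$PLANFILE"`. Set the limit generously, since an interrupted apply can itself leave partial changes.
5) On crash: release locks. Commenting `atlantis unlock` on the PR (or using the Atlantis UI) discards that PR's Atlantis locks and plans. A stale Terraform state lock is released separately with `terraform force-unlock <LOCK_ID>`.
6) Terraform state recovery: Terraform writes state as resources complete, so the 50 created resources are normally already recorded. Run `terraform plan` (it refreshes by default) to see the remaining work; use `terraform apply -refresh-only` instead of the deprecated `terraform refresh` to sync state with AWS. A resource that was mid-creation when the pod died may exist without being in state and needs `terraform import`.
7) Manual intervention: if Atlantis is broken, fall back to manual Terraform: `kubectl exec` into the Atlantis pod (or use a workstation with access to the same backend), run `terraform apply`, and fix issues.
8) Monitoring: alert on pod crashes. Investigate root cause.
9) Testing: chaos engineering. Intentionally kill Atlantis during an apply in a non-production environment and verify that recovery works.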
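After a partial apply, the first recovery question is whether anything is still pending. One way to script that check is `terraform plan -detailed-exitcode`, which exits 0 when state and configuration agree, 2 when changes remain, and 1 on error. The sketch below wraps that call; the function names and the returned labels are illustrative, not part of Terraform or Atlantis.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "in-sync"          # nothing left to apply
    if code == 2:
        return "changes-pending"  # the partial apply left work to do
    return "error"                # exit code 1 or anything unexpected


def remaining_work(workdir: str) -> str:
    """Run a plan in workdir and report whether work remains.

    Requires the terraform binary on PATH and an initialized working
    directory; the 60s lock timeout is an arbitrary illustrative value.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false",
         "-lock-timeout=60s"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A "changes-pending" result after a crash is expected; an "error" result usually points at a stale state lock or a resource that exists in the cloud but not in state.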
Follow-up: How would you prevent developers from bypassing Atlantis and running terraform directly?
You integrate Atlantis with your CI/CD system. Test failures should block apply, but Atlantis doesn't know about CI test results: it only sees the PR and the comments on it. How do you integrate test results into Atlantis approval?
Integrate CI/CD with Atlantis approval:
1) GitHub status checks: the CI pipeline reports commit statuses (for example lint, test, security-scan), and branch protection marks them as required. With `apply_requirements: [mergeable]`, Atlantis refuses to apply while any required check is failing or pending.
2) Pushing results into Atlantis: Atlantis exposes no endpoint for storing CI results, so a CI-to-Atlantis webhook is not an option out of the box. Have the apply workflow pull the result instead (next item).
3) External validation: before `apply`, run a script that queries the CI system: `curl -H "Authorization: Bearer $CI_TOKEN" https://ci.company.com/results/$PULL_NUM | jq -e '.status != "failed"'`. `PULL_NUM` is the PR number Atlantis provides to `run` steps. If the check fails, the step exits non-zero and the apply is blocked.
4) Custom workflow steps: "run tasks" are a Terraform Cloud feature. The Atlantis equivalent is a `run` step in a custom workflow: `workflows: {default: {apply: {steps: [{run: bash check_ci_status.sh}, apply]}}}`.
5) Slack integration: notify Slack when it is safe to apply: "PR passed tests. Safe to apply".
6) For safety: require human approval even if tests pass. The approver manually verifies the test results.
7) Documentation: make it clear to the team that apply requires tests passing + human approval.
8) SLA: CI should complete within 10 min. If it takes longer, escalate. The apply stays blocked until the checks finish.
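The external validation step can also be written as a small script that fails closed. The sketch below queries the CI results URL from the example above and allows the apply only when the reported status equals a passing value. The response shape (`{"status": ...}`), the value `"passed"`, and the script name `check_ci_status.py` are assumptions about a hypothetical CI API.

```python
import json
import os
import sys
import urllib.request

# URL pattern from the curl example above; PASSING_STATUS is an assumed value.
CI_RESULTS_URL = "https://ci.company.com/results/{pr}"
PASSING_STATUS = "passed"


def ci_allows_apply(payload) -> bool:
    """Allow apply only on an explicit pass; pending or unknown blocks."""
    return isinstance(payload, dict) and payload.get("status") == PASSING_STATUS


def fetch_ci_result(pr_number: str, token: str):
    """Fetch the CI result document for a pull request."""
    request = urllib.request.Request(
        CI_RESULTS_URL.format(pr=pr_number),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def main(pr_number: str) -> int:
    """Exit code 0 lets the apply continue; anything else blocks it."""
    try:
        payload = fetch_ci_result(pr_number, os.environ["CI_TOKEN"])
    except (OSError, KeyError, ValueError) as exc:
        # Network failure, missing token, or invalid JSON: fail closed.
        print(f"CI status unavailable, blocking apply: {exc!r}", file=sys.stderr)
        return 1
    if not ci_allows_apply(payload):
        print("CI has not passed, blocking apply", file=sys.stderr)
        return 1
    return 0


# Only runs when a PR number is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

From a custom apply workflow this would be invoked as something like `python3 check_ci_status.py "$PULL_NUM"`, where `PULL_NUM` is the pull request number Atlantis exposes to `run` steps. Treating anything other than an explicit pass as a block means a pending or unreachable CI never lets an apply through.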
Follow-up: How would you handle a flaky test that sometimes passes and sometimes fails?