Ansible Interview Questions

Rolling Updates and Serial Strategy


Your production web application runs on 1000 servers. Deployments require updating all servers to new code. When you deploy to all 1000 simultaneously, the application goes down for 5 minutes (startup time) and users experience a total outage. How do you implement zero-downtime rolling updates?

Implement a serial strategy with health checks. Use `serial: 50` to deploy to 50 servers at a time instead of all 1000. For each batch: 1) remove the 50 servers from the load balancer, 2) deploy the new code, 3) restart the service, 4) verify health checks pass, 5) add the servers back to the load balancer, 6) repeat with the next 50. This keeps 950 servers serving traffic at all times. Verify health between batches: after each batch deploys, confirm the application is responding; if health checks fail, halt the deployment and investigate. Shift traffic gradually through the load balancer: send a small percentage of traffic to updated servers and increase it over time. For database-dependent apps, run database migrations first (separately from the code deployment), then deploy code. Use connection draining: give existing connections 30 seconds to complete before restarting a server. Rely on load balancer health checks so traffic only reaches servers verified healthy. If deployment fails on any batch, roll that batch back to the previous version. Monitor error rates during the deployment and alert if they increase. Test the rolling update process in a staging environment to verify it is actually zero-downtime.
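The batching pattern above can be sketched as a playbook. This is a minimal sketch, not a drop-in solution: the group names, paths, backend name, and port are assumptions, and the HAProxy tasks are one example of load balancer integration (`community.general.haproxy` is a real module).

```yaml
# rolling-update.yml -- sketch; host groups, paths, and ports are assumptions
- name: Rolling update, 50 hosts per batch
  hosts: webservers
  serial: 50
  max_fail_percentage: 0          # abort the deploy if any batch fails
  pre_tasks:
    - name: Drain host from the HAProxy backend before updating it
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: app_backend
        drain: true
        wait: true
      delegate_to: "{{ item }}"
      loop: "{{ groups['loadbalancers'] }}"
  tasks:
    - name: Deploy the new application release
      ansible.builtin.unarchive:
        src: "/releases/app-{{ app_version }}.tar.gz"
        dest: /opt/app
    - name: Restart the service
      ansible.builtin.service:
        name: app
        state: restarted
  post_tasks:
    - name: Wait until the health endpoint returns 200
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"
      register: health
      retries: 10
      delay: 5
      until: health.status == 200
    - name: Re-enable the host in the HAProxy backend
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: app_backend
      delegate_to: "{{ item }}"
      loop: "{{ groups['loadbalancers'] }}"
```

Because `serial: 50` restarts only 50 hosts per batch, the remaining 950 stay behind the load balancer throughout.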

Follow-up: How would you implement automatic rollback if errors spike above threshold during rolling update?

Your rolling update deploys to 1000 servers using `serial: 100`. The process takes over 3 hours (10 batches × 20 minutes per batch). Teams need faster deployments, but simply increasing the serial size causes outages. How do you accelerate rolling updates?

Optimize each batch's processing time, and increase batch size only as far as is provably safe. Measure current timing to identify the longest steps (copying files, restarting, health checks). Optimize the file copy: use rsync to transfer only changed files (incremental). Pre-stage releases to servers during off-hours so the deployment only activates pre-staged code (a symlink swap). Optimize the service restart: use a warm restart that doesn't kill all connections, and pre-warm caches so the application starts faster. Parallelize within a batch: instead of serializing tasks, use `async` with polling for concurrency. Raise the batch size (e.g. `serial: 250`) only if the application can tolerate that many servers down simultaneously; test the maximum safe serial size in staging. Use an escalating batch schedule: small batches initially (canary: 1, early adopters: 50), larger batches later (100, 200). Dry-run the deployment against a test environment first. Deploy independent regions in parallel (US and Europe simultaneously). Implement quick rollback: if errors spike, auto-rollback within seconds by reverting symlinks and restarting. Make health checks fail fast: a check that completes in 10 seconds instead of 30 saves time on every batch.
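Two of these ideas compose naturally: `serial` accepts a list of escalating batch sizes, and a pre-staged release reduces activation to a symlink swap. A sketch (release paths and service name are assumptions):

```yaml
# Escalating batch sizes plus pre-staged release activation
- name: Activate a pre-staged release
  hosts: webservers
  serial: [1, 50, 100, 200]      # canary first, then progressively larger batches
  tasks:
    - name: Point the "current" symlink at the pre-staged release
      ansible.builtin.file:
        src: "/opt/app/releases/{{ app_version }}"
        dest: /opt/app/current
        state: link
    - name: Warm restart the service
      ansible.builtin.service:
        name: app
        state: reloaded          # reload instead of restart where the service supports it
```

After the listed sizes are exhausted, Ansible keeps using the last size (200) for the remaining batches, so the schedule front-loads the risk onto small batches.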

Follow-up: How would you implement canary deployment where 1% of traffic goes to new code first?

Your rolling update processes N servers, but one server hangs during its health check (it takes 5 minutes to respond). The entire deployment stalls waiting: the play won't proceed because the serial batch isn't complete. How do you handle slow hosts in rolling updates?

Implement per-host timeouts and skip slow hosts. Add `timeout: 300` to critical tasks so a task exceeding 5 minutes fails instead of hanging the batch. Use `ignore_errors: true` on non-critical health checks: if the check times out, proceed anyway (this assumes the server is OK, so reserve it for advisory checks). Run long operations with `async: 300` and `poll: 30` so Ansible checks status every 30 seconds rather than blocking on the task. Set `max_fail_percentage: 20` so up to 20% of a batch can fail or time out without stopping the entire deployment. Optimize the health check itself: a simple TCP connection check is faster than a full HTTP request. Validate connectivity before the deployment starts and mark unreachable hosts offline. Increase `forks` so hosts within a batch are processed in parallel, reducing batch wait time. Retry selectively: tag hosts that time out and retry them in a dedicated cleanup play after the main deployment, rather than holding up the batch. Allow task-level overrides so teams can set custom timeouts for their applications, and document the expected health check time.
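The timeout and failure-tolerance keywords combine as follows (a sketch; the deploy script, port, and thresholds are illustrative):

```yaml
# Rolling update tolerant of slow or stuck hosts
- name: Rolling update with per-task timeouts
  hosts: webservers
  serial: 100
  max_fail_percentage: 20        # up to 20 hosts per batch may fail without aborting
  tasks:
    - name: Deploy new code
      ansible.builtin.command: /opt/app/bin/deploy.sh
      timeout: 300               # fail this task after 5 minutes instead of hanging
    - name: Fast TCP health check instead of a full HTTP probe
      ansible.builtin.wait_for:
        port: 8080
        host: "{{ inventory_hostname }}"
        timeout: 30
      ignore_errors: true        # advisory check; a timeout does not fail the host
```

With `max_fail_percentage: 20`, a single hung host is marked failed after its timeout and the batch completes on the remaining 99.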

Follow-up: How would you implement health check that validates not just availability but application correctness?

Your rolling update strategy is `serial: 100`, but teams deploy multiple services simultaneously (web app, backend API, database). These services have dependency chains: the database must update first, then the API, then the web app. Currently each service rolls out serially but independently, causing deployment conflicts. How do you coordinate dependent service deployments?

Implement orchestrated multi-service deployment using Tower workflows. Create a playbook per service (database-deploy.yml, api-deploy.yml, web-deploy.yml) and chain them in a workflow: 1) run database-deploy and wait for completion, 2) run api-deploy and wait for completion, 3) run web-deploy. This enforces the dependency order; within each playbook, use `serial` for rolling updates. Run independent services (e.g. web and cache) in parallel. Validate between services: after the database deploys, run a migration playbook to prepare data; after the API deploys, verify it connects to the updated database. Have all playbooks read the same inventory so they target consistent hosts. Alternatively, orchestrate everything in one playbook as ordered plays: a database play first, then API, then web, with verification tasks between them. Make deployments idempotent: if a run is interrupted mid-way, restart from the service where it failed, and store deployment state in Tower to track progress. Use workflow conditional branching: on API deploy success, proceed to web; on failure, skip web and remediate the API. Document service dependencies clearly so teams understand the deployment order.
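The single-playbook variant relies on the fact that plays in one file run in order. A sketch (group names, scripts, and the health URL are assumptions):

```yaml
# site-deploy.yml -- dependency-ordered plays in one run
- name: Update the database tier first
  hosts: database_servers
  serial: 1
  tasks:
    - name: Apply schema migrations
      ansible.builtin.command: /opt/db/bin/migrate.sh

- name: Update the API tier once the database play has finished
  hosts: appservers
  serial: "10%"
  tasks:
    - name: Deploy the API release
      ansible.builtin.command: /opt/api/bin/deploy.sh
    - name: Verify the API can reach the updated database
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8081/health/db"

- name: Update the web tier last
  hosts: webservers
  serial: "50%"
  tasks:
    - name: Deploy the web release
      ansible.builtin.command: /opt/web/bin/deploy.sh
```

Ansible does not start the API play until every batch of the database play has completed, which is exactly the ordering guarantee the dependency chain needs.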

Follow-up: How would you implement version compatibility checking between services before deployment?

Your rolling update deploys stateful application (e.g., Kafka, Elasticsearch cluster). Updating cluster nodes sequentially risks data loss or cluster outage if quorum is lost. How do you safely update stateful clusters?

Use stateful cluster update patterns. Pattern 1 (blue-green): deploy a new cluster in parallel, migrate data, then switch traffic. Pattern 2 (rolling with quorum): update only a minority of nodes at a time so quorum is never lost; for a 5-node Kafka or Elasticsearch cluster, update 2 nodes (keeping 3 for quorum), then the remainder. Use cluster-aware health checks: verify cluster health (quorum, replication) before starting and after each update. Drain before updating: move traffic or partitions off a node first; for Kafka, reassign partitions away from the node before restarting it. Structure each node's update as drain (`pre_tasks`), update (`tasks`), rebalance (`post_tasks`). After updates, rebalance partitions/shards to ensure even data distribution. If the upgrade includes breaking changes, pre-migrate data so old and new versions can coexist. Roll out incrementally: update one node, wait for the cluster to rebalance, verify all nodes are consistent, then update the next. After each node, run automated state verification to confirm cluster state is unchanged. Keep the previous version available for quick rollback if issues are detected. Test the process against a test cluster first, and document the cluster topology and update requirements.
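For Elasticsearch, the drain/update/verify cycle maps onto real cluster APIs (`_cluster/settings`, `_cluster/health`); the host group, service name, and timeouts below are assumptions:

```yaml
# One-node-at-a-time stateful update with a cluster-health gate
- name: Rolling Elasticsearch upgrade
  hosts: es_cluster
  serial: 1                      # never take more than one node out of the cluster
  tasks:
    - name: Disable shard allocation before restarting the node
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:9200/_cluster/settings"
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: primaries
    - name: Restart the node on the upgraded version
      ansible.builtin.service:
        name: elasticsearch
        state: restarted
    - name: Re-enable shard allocation
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:9200/_cluster/settings"
        method: PUT
        body_format: json
        body:
          transient:
            cluster.routing.allocation.enable: null
    - name: Wait for cluster health to return to green before the next node
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:9200/_cluster/health?wait_for_status=green&timeout=300s"
```

The final task is the quorum safeguard: if the cluster never returns to green, the play fails and no further nodes are touched.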

Follow-up: How would you implement leader election strategy during rolling updates of distributed systems?

Your rolling update uses `serial: 100` with 10 batches. Each batch takes 20 minutes. Between batches, you want to pause and allow manual validation: "Does application look good? Proceed with next batch?" Currently deployment is fully automated. How do you add manual approval gates in rolling updates?

Implement an approval workflow using Tower workflows with approval nodes. Keep `serial: 100` in the playbook, but split the rollout across a workflow: 1) run the batch 1 job, 2) hit an approval node (pause and wait), 3) if approved, run batch 2, and so on. Tower provides a UI for approvers to approve or reject, and approval nodes support a timeout: if no decision arrives within 30 minutes, fail the node and automatically roll back the changes. Make approval conditional: critical batches require approval, routine batches auto-proceed. Use Tower's notification integrations to alert approvers (Slack, email). Before each approval node, run health checks and surface the results, along with a metrics dashboard showing error rates, latency, and traffic during the batch, so approvers review data rather than guess. Require an approval quorum (e.g. 2 of 3 approvers) and escalate to a senior approver if a decision is delayed more than 5 minutes. Log who approved or rejected each batch and when, for audit. Document the approval criteria: what should approvers look for before approving?
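Without Tower, a coarse version of the gate can be expressed directly in the playbook with the `pause` module, which interrupts each serial batch for a human decision (deploy script and group are assumptions):

```yaml
# Manual gate without Tower: pause between serial batches for a human check
- name: Rolling update with a manual gate per batch
  hosts: webservers
  serial: 100
  tasks:
    - name: Deploy new code
      ansible.builtin.command: /opt/app/bin/deploy.sh
    - name: Wait for an operator to confirm the batch looks healthy
      ansible.builtin.pause:
        prompt: "Batch deployed. Check dashboards, then press Enter to continue (Ctrl+C then A to abort)"
```

This lacks the audit trail, quorum, and timeout behavior of Tower approval nodes, which is why the workflow approach is preferable for production.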

Follow-up: How would you implement automatic rollback if approval is rejected by validators?

Your rolling update handles N server groups with different roles: web servers, app servers, cache servers, database servers. All use same playbook but have different configurations. Using single `serial: 100` treats all roles identically. Web server update can tolerate 50% capacity loss, but cache server update needs >80% capacity available. How do you apply different serial strategies per role?

Implement a role-aware serial strategy: a separate play per role, each with its own serial value. Play 1: `hosts: webservers`, `serial: 50%` (half the web tier at a time). Play 2: `hosts: appservers`, `serial: 10%` (conservative). Play 3: `hosts: cache_servers`, `serial: 25%`. Play 4: `hosts: database_servers`, `serial: 1` (one database at a time). Each play respects its role's capacity constraints. Use role-aware health checks: web servers check load balancer metrics, cache servers check cache hit rate, database servers check replication lag. To avoid hardcoding, define a per-role variable such as `serial_percent` (in play vars or extra vars, since `serial` is evaluated at the play level, not per host) and use `serial: "{{ serial_percent }}%"`. Order the plays by dependency: web servers first (stateless), then app servers, then cache, then database. Omit `serial` (all hosts in parallel) only when that is safe for the role, and vary the value by environment with a Jinja expression, e.g. `serial: "{{ '10%' if env == 'production' else '50%' }}"`. Test each role's serial strategy independently in staging, and monitor each role during deployment to verify the chosen value is safe.
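The multi-play layout looks like this; the shared task file and group names are assumptions:

```yaml
# Per-role serial values via separate plays sharing one task file
- hosts: webservers
  serial: "50%"                  # web tier tolerates 50% capacity loss
  tasks:
    - ansible.builtin.import_tasks: deploy_tasks.yml

- hosts: appservers
  serial: "10%"
  tasks:
    - ansible.builtin.import_tasks: deploy_tasks.yml

- hosts: cache_servers
  serial: "25%"
  tasks:
    - ansible.builtin.import_tasks: deploy_tasks.yml

- hosts: database_servers
  serial: 1                      # one database node at a time
  tasks:
    - ansible.builtin.import_tasks: deploy_tasks.yml
```

Sharing `deploy_tasks.yml` keeps the deployment logic identical across roles while each play carries only its own batching policy.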

Follow-up: How would you implement dynamic serial adjustment based on real-time error rates?

Your rolling update with `serial: 100` deploys code to 1000 servers. The deployment is 70% complete (700 servers updated) when a critical bug is discovered in the new code. You need an immediate rollback of all affected servers to the previous version, and the partially updated state is risky. How do you implement safe rollback during rolling updates?

Implement rollback capability and safe handling of the partial state. Snapshot before deploying: record the current version state (a symlink to a version directory, a database version tag, etc.) so rollback is just reverting the symlink or tag, which is near-instant; avoid manually reverting versions host by host (slow and error-prone). Make rollback idempotent so it can run multiple times safely. Ensure data compatibility: new and old code must coexist against the same database; if the schema is incompatible, plan the data migration separately. After rollback, run health checks to verify every server is on the same version and services are running correctly. Provide an emergency override: a Tower button that stops the ongoing deployment and rolls back all 700 updated servers. Track which servers were updated and which were not, so a rollback only touches the updated ones and leaves the rest as-is. Guarantee zero data loss: if the new version wrote data, rollback must not lose it, so use version-agnostic data formats. Notify teams during the rollback (Slack, email). Afterward, run a post-rollback analysis to determine the root cause and plan the fix, and rehearse the rollback procedure regularly to ensure it works when needed.
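With the symlink snapshot strategy, the emergency rollback play is small. A sketch (paths, the `previous_version` variable, and the version command are assumptions):

```yaml
# rollback.yml -- revert the release symlink on the updated hosts
- name: Emergency rollback to the previous release
  hosts: webservers
  serial: "25%"                  # roll back fast, but not every host at once
  tasks:
    - name: Point "current" back at the previous release
      ansible.builtin.file:
        src: "/opt/app/releases/{{ previous_version }}"
        dest: /opt/app/current
        state: link
    - name: Restart the service on the reverted release
      ansible.builtin.service:
        name: app
        state: restarted
    - name: Confirm the running version matches the rollback target
      ansible.builtin.command: /opt/app/current/bin/app --version
      register: running_version
      changed_when: false
      failed_when: previous_version not in running_version.stdout
```

Because `file: state: link` is idempotent, re-running this play against hosts that were never updated (or already rolled back) is harmless, which is what makes the partial-state rollback safe.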

Follow-up: How would you implement feature flags as alternative to rolling back code?
