Ansible Interview Questions

Pull vs Push Architecture


Your company runs push-based Ansible from a central control node. The control node becomes a bottleneck: managing 10,000 servers requires significant resources, and if the control node fails, all automation stops. You're considering switching to a pull-based model to eliminate this single point of failure. What are the architectural tradeoffs?

Push model: a central control node gives coordinated deployments, immediate feedback, and strong consistency, but it requires high availability on the control node and sufficient network bandwidth. Pull model: each server runs a cron job that pulls its own configuration, so load is decentralized and the architecture scales horizontally with no central bottleneck.

Tradeoffs: push gives immediate control ("deploy now"); pull has variable latency (changes wait for the next cron run). For 10,000 servers, push demands a substantial control node (16+ cores, 64 GB RAM), while pull spreads the load across the fleet. A hybrid approach works well: push for critical, coordinated changes; pull for continuous configuration management. Recommendation: push for deployments (where you want immediate control), pull for compliance and configuration (where eventual consistency is acceptable).

Implement pull with `ansible-pull`: each server runs `ansible-pull` from cron, retrieving the playbook from Git, which becomes the configuration source of truth. Pull provides automatic remediation (if a server drifts, the next pull corrects it); push provides auditability (you know exactly when changes occurred). For reliability, cluster the control node in the push model, and run a highly available Git repository in the pull model. Choose based on consistency requirements: critical systems need push (immediate), general infrastructure can use pull (eventual).
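A minimal pull setup is a single crontab entry per server; the repository URL below is a placeholder and the schedule is illustrative:

```shell
# Run ansible-pull every 5 minutes (illustrative schedule).
# -U: Git repository to clone/update (hypothetical URL)
# -C: branch or tag to check out
# local.yml is ansible-pull's conventional default playbook name.
*/5 * * * * ansible-pull -U https://git.example.com/infra/config.git -C main local.yml >> /var/log/ansible-pull.log 2>&1
```

Because Git is the source of truth here, every configuration change is a commit — which is what makes the rollback and tagging strategies in the next question possible.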

Follow-up: How would you implement hybrid push-pull where critical changes use push and routine updates use pull?

Your organization uses pull-based Ansible with servers executing `ansible-pull` every 5 minutes from Git. Configuration changes are tested but sometimes broken configs are merged to master accidentally. Before broken config reaches 10,000 servers via next cron run, you need rapid rollback. How do you implement safe pull-based deployments?

Implement a Git branching strategy: main is stable (production-ready), staging is for testing. Servers pull from a specific tag, not HEAD: after testing, create a tag such as v1.2.3 and point servers at it. Treat tags as immutable — moving a tag requires manual approval. Before merging to main, run the playbook in a test environment, and use CI/CD to validate playbook syntax and lint.

Rate-limit pulls with jitter: servers pull at random offsets to avoid a thundering herd after a config update. Phase the rollout with inventory groups: canary servers pull every minute, production servers every 5 minutes; monitor the canaries for 5 minutes before promoting a tag fleet-wide.

For rollback, use `git revert` to create a new commit, tag it v1.2.4, and servers pick it up automatically on their next pull. Add post-pull health checks that verify the service is running after configuration; if a health check fails, the playbook generates an alert. Use Git hooks or branch protection to prevent merging broken configs by requiring a successful test run before merge. Optionally run `ansible-playbook --check` after the pull to preview changes before applying them.
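The jitter and tag-pinning ideas can be combined in a small wrapper script invoked from cron; a sketch, assuming your release process writes the approved tag to a file (path and URL hypothetical):

```shell
#!/bin/sh
# Sleep a random 0-299 seconds so thousands of servers don't hit Git at once.
sleep $(( $(od -An -N2 -tu2 /dev/urandom) % 300 ))

# Pull only the approved release tag, never HEAD of main.
TAG="$(cat /etc/ansible/release-tag)"   # e.g. "v1.2.3", written by the release pipeline
ansible-pull -U https://git.example.com/infra/config.git -C "$TAG" local.yml
```

Rolling back then means updating `/etc/ansible/release-tag` (or repointing whatever mechanism distributes the tag) — no per-server action required.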

Follow-up: How would you implement canary deployment in pull model where 10 servers pull first, then gradual rollout?

Your push-based Ansible connects to 5000 servers daily for deployments. Network connectivity is poor in some regions: timeouts are common, deployments fail intermittently. Adding retry logic helps but deployments still fail unpredictably. How do you add resilience to push-based deployments?

Add retry logic on critical tasks: `retries: 3` and `delay: 10`, together with an `until` condition (Ansible only retries a task when `until` is present). Tune the SSH connection timeout (`timeout` in ansible.cfg): raise it on slow links, or lower it to fail fast and retry sooner. Distinguish transient failures (network timeout) from permanent ones (SSH auth failure), and use `failed_when` and `changed_when` to classify results correctly. Ansible's `delay` is fixed, so approximate exponential backoff (1s, 5s, 30s) with wrapper logic or staged retries rather than hammering the server.

For large deployments, batch with `serial: 100` to deploy to 100 servers at a time; if a batch fails, investigate before continuing. Add a fallback path: if a server is unreachable over SSH, fall back to an alternative channel (agentless API, SNMP). Stage the rollout: canary (5% of servers), early adopters (25%), then the main fleet, monitoring each stage before proceeding.

Tower adds automatic recovery: built-in job retries, plus the ability to route jobs to different execution nodes. For poor-network regions, deploy local execution nodes in each region to reduce hop count. Reuse SSH connections with multiplexing (`ControlMaster=auto ControlPersist=60s`) to cut per-task connection overhead.
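A retried task in Ansible needs `until` alongside `retries`/`delay`; a sketch with illustrative module arguments (the artifact URL and paths are assumptions):

```yaml
- name: Download application artifact, retrying transient network failures
  ansible.builtin.get_url:
    url: "https://artifacts.example.com/app-{{ app_version }}.tar.gz"  # hypothetical URL
    dest: /opt/app/app.tar.gz
  register: download
  retries: 3              # retry up to 3 times...
  delay: 10               # ...waiting 10 seconds between attempts
  until: download is succeeded
```

Without the `until` line, `retries` and `delay` are ignored and the task fails on its first error — a common source of "retry logic that doesn't retry".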

Follow-up: How would you implement push-based deployment with automatic failover to pull when push fails?

Your pull-based system has servers pulling configuration every 5 minutes. A critical security vulnerability requires immediate patching. The next pull won't happen for up to 5 minutes. You can't wait. How do you implement urgent push-override in pull-based system?

Build an out-of-band trigger for emergencies. Use a message queue (AMQP, Kafka) that pull servers subscribe to: in normal operation they pull every 5 minutes, but when the infrastructure team publishes an "urgent: security patch" message, subscribers pull immediately. Alternatively, a webhook or Git hook on an emergency branch can trigger immediate execution via the control node, pushing notification of the patch to servers.

Tower/AWX callbacks offer another channel: servers poll Tower for urgent work and execute as soon as any is available. A dual-channel design combines both: the standard 5-minute pull plus an emergency channel that triggers immediately. For truly critical systems, consider a hybrid that keeps a small push control node for emergencies only.

When signaled, servers run an immediate out-of-schedule `ansible-pull` (the `--force` flag makes it run the playbook even if the Git fetch fails). Monitoring can drive this too: an alert fires and orchestration force-pulls on the affected servers. Document the emergency procedure and rehearse it regularly.
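The simplest orchestration-driven override is to fan out an immediate pull over SSH from a management host; a sketch, assuming SSH access and a hypothetical `hosts.txt` listing affected servers one per line:

```shell
#!/bin/sh
# Force an immediate out-of-schedule pull on every affected host.
# Branch name "emergency" and repo URL are illustrative.
while read -r host; do
  ssh "$host" 'ansible-pull -U https://git.example.com/infra/config.git -C emergency local.yml' &
done < hosts.txt
wait   # block until every host has finished its pull
```

This is effectively a one-shot push layered on top of the pull system — useful as a stopgap while a queue- or webhook-based emergency channel is built out.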

Follow-up: How would you implement prioritized pull where security patches pull before routine updates?

Your push-based deployment must coordinate across three datacenters: US-East, US-West, Europe. Deployments are sequential (East → West → Europe) taking 3 hours total. Teams want faster deployments. However, you need to ensure East is healthy before proceeding to West. How do you optimize multi-region push deployments?

Pipeline the regions with validation gates instead of strict end-to-end sequencing. Use Tower workflows: deploy to US-East, run a health-check gate, and start US-West as soon as East passes — don't wait for each region to fully finish before the next begins. Define region-specific inventory groups (`us_east_servers`, `us_west_servers`, `europe_servers`) so each play targets one region, and batch within a region with `serial: 25%`.

Between regions, validate: check East's application health before proceeding, and pause West automatically if East fails. Model each region as a Tower job template and chain them in a workflow with success/failure branches. Add cross-region data validation, confirming data has replicated to all regions before marking the deployment complete. Where coordination allows, launch regional deployments as async tasks and wait for all to complete; use `meta: refresh_inventory` if deployment status is reflected in dynamic inventory. Monitor regional latency to pick optimal deployment timing.
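The region-then-gate pattern can be expressed in a single playbook, since plays abort the run on failure; a sketch with illustrative group names, endpoint, and task file:

```yaml
# Pipelined multi-region rollout: each region is a play,
# gated by a health-check play on the previous region.
- name: Deploy to US-East
  hosts: us_east_servers
  serial: "25%"                # batch within the region
  tasks:
    - name: Roll out application
      ansible.builtin.import_tasks: deploy.yml   # hypothetical shared task file

- name: Validate US-East before continuing
  hosts: us_east_servers
  tasks:
    - name: Application health endpoint must return 200
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"  # illustrative endpoint
        status_code: 200
      register: health
      retries: 5
      delay: 10
      until: health is succeeded

- name: Deploy to US-West (reached only if the East plays succeeded)
  hosts: us_west_servers
  serial: "25%"
  tasks:
    - ansible.builtin.import_tasks: deploy.yml
```

In Tower, the same structure maps onto a workflow with one job template per region and the health check as an intermediate node.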

Follow-up: How would you implement automatic rollback if regional deployment fails health checks?

Your pull-based infrastructure uses Git as source of truth. Two teams push conflicting configs simultaneously: Team A updates nginx config, Team B updates firewall rules. Git merge conflict occurs. The system can't resolve the conflict automatically. How do you prevent/handle configuration conflicts in pull model?

Use a Git workflow with code review and branch-based deployments. Teams work on feature branches (`git checkout -b team-a/nginx-config`), merge to staging first for CI/CD testing, and only after tests pass merge to main. Protect main with branch rules: require code review and a successful pipeline before merge, and diff against main before merging to surface conflicts early.

Prevent most conflicts structurally with ownership-based separation: Team A owns `roles/nginx/`, Team B owns `roles/firewall/`, and each team modifies only its own directory tree. A CODEOWNERS file encodes who must review changes to each path. The pull playbook sources both roles, but each team controls its own role, and the inventory determines which roles apply to which servers, further reducing overlap. A branching strategy such as git-flow (stable main, develop as integration point) keeps integration orderly.

For conflicts that still occur, rely on Git's three-way merge plus mandatory human review, and test the merged, combined configuration in CI before it deploys.
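A CODEOWNERS file (GitHub/GitLab syntax) makes the ownership split enforceable at review time; the team handles and paths below are illustrative:

```
# Each team must review changes under its own role tree.
/roles/nginx/     @org/team-a
/roles/firewall/  @org/team-b

# Shared inventory changes require sign-off from both teams.
/inventory/       @org/team-a @org/team-b
```

Combined with branch protection ("require review from code owners"), this blocks Team A from merging changes to Team B's roles without Team B's approval — most conflicts are then prevented rather than resolved.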

Follow-up: How would you implement GitOps-style deployment where Git is the sole source of truth?

Your company's push-based deployment requires strict change control: all production changes must be approved, scheduled during maintenance windows, and have rollback plans. Currently this is manual: teams create tickets, get approvals, schedule deployments. How do you automate this workflow?

Automate change control as a Tower workflow:

1. Change request submission (via form or API)
2. Approval gate (security/ops sign-off, using Tower's built-in workflow approval nodes)
3. Scheduling (execution allowed only inside the maintenance window)
4. Pre-deployment validation (syntax check, test-environment run)
5. Deployment
6. Post-deployment health check (smoke tests)
7. Automatic rollback if the health check fails

Integrate with the ticketing system so every deployment must reference a valid change ticket. Use Tower RBAC to control who can approve: only ops team members approve production changes; submission generates an alert and approvers confirm in the Tower UI. Give workflows scheduled start times so they cannot execute outside the maintenance window. Add pre-flight checks that verify the system is in the expected state before deploying. Log all approvals, changes, and rollbacks to the compliance system for an audit trail, and use Slack/Teams integration to notify teams of change progress. Provide a dry-run capability: teams test the deployment in staging and verify health checks pass before production.
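The maintenance-window gate (step 3) can also live in the playbook itself as a pre-flight assertion; a sketch where the window variables are assumptions:

```yaml
- name: Refuse to deploy outside the maintenance window
  hosts: localhost
  gather_facts: true            # needed for ansible_date_time
  vars:
    maintenance_start_hour: 2   # hypothetical window 02:00-04:00
    maintenance_end_hour: 4
  tasks:
    - name: Check current hour against the approved window
      ansible.builtin.assert:
        that:
          - ansible_date_time.hour | int >= maintenance_start_hour | int
          - ansible_date_time.hour | int <  maintenance_end_hour | int
        fail_msg: "Outside maintenance window - aborting deployment"
```

Placing this as the first play means a misfired or manually launched job fails immediately instead of touching production, independent of Tower's scheduler.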

Follow-up: How would you implement emergency change procedures that bypass normal approval workflow?

Your push-based Ansible deployment scales to 50,000 servers globally. A single control node can't handle this scale. You implement Tower with multiple execution nodes. Now deployment coordination is complex: how do you ensure consistency across globally distributed execution?

Distribute execution but centralize state. Organize execution nodes into Tower instance groups by region (`us-east-nodes`, `us-west-nodes`, `europe-nodes`), each running deployments to its own region independently. Coordinate across regions with Tower workflows: execute the US deployment, wait for validation, then execute Europe. All Tower instances read and write a shared PostgreSQL database, which keeps job state consistent across the cluster.

Prevent concurrent modification with deployment locking — for example, disallowing simultaneous runs of a job template, or an explicit lock acquired at the start of the workflow. Handle regional differences with variables (`region: us-east`) driving conditional logic, and mirror inventory to each execution node as a cache synced from the Tower database. Make deployments resumable with checkpoints: save state after each critical step, so a partial failure replays from the last checkpoint rather than from the beginning. Use Tower webhooks for inter-region hand-off (one region's completion triggers the next), distributed tracing to follow progress across execution nodes, and the Tower API to monitor jobs programmatically and aggregate results.
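Regional differences are cleanest as group variables, so a single playbook serves every region; file path and values below are illustrative:

```yaml
# group_vars/us_east.yml  (one file per regional inventory group)
region: us-east
artifact_mirror: "https://mirror-us-east.example.com"   # hypothetical regional mirror
```

A task can then reference `{{ artifact_mirror }}` or branch on `region` with `when:`, and each regional instance group runs the identical playbook against its own inventory — configuration diverges by data, not by code.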

Follow-up: How would you implement disaster recovery where entire region's execution infrastructure fails?
