Ansible Interview Questions

Execution Model and SSH Connection Plugins

Your infrastructure team is deploying Ansible across 500 Linux servers in 3 datacenters with intermittent SSH connectivity issues causing task failures on 5-10% of nodes. You need to ensure all tasks complete reliably without rerunning the entire playbook.

Enable pipelining = True under [ssh_connection] in ansible.cfg to cut SSH round-trips per task, and reuse connections with ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s so each host pays the handshake cost only once per window. Raise forks from the default 5 to 10-15 for more parallelism, and set retries in the [ssh_connection] section so transient SSH failures are retried automatically. For flaky tasks, add retries: 3 and delay: 5 together with an until condition. Debug with ANSIBLE_DEBUG=1 and callback plugins. Use the serial: [1, 5, 10] play keyword for rolling batches — start with 1 host, expand to 5 after success, then 10. This isolates intermittent node failures and prevents cascading issues.
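A minimal sketch of the retry-plus-rolling-batch pattern described above (the host group and service name are hypothetical):

```yaml
# Rolling batches via the serial play keyword, with per-task retries
- hosts: webservers
  serial: [1, 5, 10]           # 1 host first, then 5, then 10
  tasks:
    - name: Restart the application service
      ansible.builtin.service:
        name: myapp            # hypothetical service name
        state: restarted
      register: result
      retries: 3
      delay: 5
      until: result is succeeded
```

With serial, each batch must succeed before the next, larger batch starts, which contains the blast radius of flaky nodes.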

Follow-up: How would you handle a scenario where SSH keys rotate every 30 minutes on sensitive nodes, and how would you verify the playbook completes without key expiration causing mid-run failures?

A production deployment of 200 Windows servers requires mixed Ansible orchestration for configuration management. Your team uses SSH for Linux but needs WinRM for Windows with different authentication strategies. Current setup causes timeout errors on half the Windows connections during peak hours.

Configure WinRM per inventory host group with ansible_connection=winrm. Set explicit ports and authentication: ansible_port=5985 for HTTP (5986 for HTTPS) and ansible_winrm_transport=ntlm or kerberos. Raise the WinRM timeouts: ansible_winrm_read_timeout_sec=60 with a lower ansible_winrm_operation_timeout_sec. Set ansible_async_dir on Windows hosts to prevent temporary-file conflicts. Run long operations asynchronously — async: 3600 with poll: 30 — so they survive transient WinRM session drops. Tune connection timeouts in ansible.cfg. Monitor with callback plugins to capture connection state, and use block with an always section to clean up WinRM sessions between plays.
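As a sketch, the WinRM settings above might live in group vars (ports, transport, and timeout values depend on your listener configuration):

```yaml
# group_vars/windows.yml — WinRM connection settings (illustrative values)
ansible_connection: winrm
ansible_port: 5985                       # HTTP listener; use 5986 for HTTPS
ansible_winrm_transport: ntlm            # or kerberos for domain-joined hosts
ansible_winrm_read_timeout_sec: 60
ansible_winrm_operation_timeout_sec: 45  # keep lower than the read timeout
```

and a long-running task would be pushed into the background so a dropped session cannot kill it:

```yaml
- name: Long-running installer (survives WinRM timeouts)
  ansible.windows.win_package:
    path: C:\temp\installer.msi          # hypothetical path
  async: 3600
  poll: 30
```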

Follow-up: If your organization rotates WinRM credentials via an external API every 6 hours, how would you dynamically refresh the ansible_password variable without reloading the inventory?

Your Ansible control node processes 50 concurrent playbooks daily, each managing 20-50 hosts. Users report variable execution times (2 minutes to 15 minutes for identical plays). You need to identify and fix the bottleneck while maintaining security isolation between playbook runs.

Profile execution with the profile_tasks callback (callbacks_enabled = ansible.posix.profile_tasks in ansible.cfg) to identify slow tasks. Check SSH connection reuse: enable ControlMaster=auto with a 30-60 second ControlPersist. Tune forks: benchmark CPU/memory on the control node, start with forks=20, and increase until CPU hits roughly 70% or memory 60%. Set gather_facts: false where facts are unneeded, and cache facts with fact_caching=jsonfile and fact_caching_timeout=86400. To skip interpreter discovery, set ansible_python_interpreter=/usr/bin/python3.11 explicitly per host. Reuse SSH sessions across playbook runs via ControlPersist, and enable pipelining to reduce per-task command executions. Finally, keep CPU-intensive work off the control node: minimize local_action and delegate_to: localhost tasks, and let target hosts do the heavy lifting.
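A hedged ansible.cfg sketch of the tuning described above (the cache path and fork count are illustrative starting points):

```ini
# ansible.cfg — profiling, fact caching, and connection reuse
[defaults]
# Per-task timing summary at the end of each run
callbacks_enabled = ansible.posix.profile_tasks
# Starting point; raise while watching control-node CPU/memory
forks = 20
# Gather facts only when the cache has expired
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```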

Follow-up: How would you handle a scenario where one host takes 8 minutes to execute a task while others finish in 30 seconds, without knowing which host is problematic until runtime?

Your organization runs Ansible from a bastion host to manage 1000+ servers across air-gapped networks. Direct SSH access is restricted; all traffic routes through jump hosts. Current setup has 30% SSH timeout failures during backup windows when jump host CPU maxes out.

Route through the bastion with ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p bastion.internal"' (or the simpler -o ProxyJump=bastion.internal) per host group in inventory. Multiplex connections: ControlMaster=auto, ControlPath=/tmp/ansible-%h-%p-%r, ControlPersist=600, so each host tunnels through the bastion only once per window. Set StrictHostKeyChecking=no only for trusted networks. For chained jump hosts, use ProxyJump=bastion1,bastion2. Monitor jump-host load: if CPU exceeds 80%, reduce forks or add serial: 10 to the play. Add health checks: before the main playbook, verify bastion reachability (for example, probing its SSH port with wait_for from localhost) and skip unreachable hosts conditionally. A long ControlPersist keeps SSH sessions cached between runs, so the jump host pays the handshake cost once rather than per task.
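A group-vars sketch of the bastion routing (hostnames and the socket path are illustrative):

```yaml
# group_vars/dc1.yml — chained jump hosts with connection multiplexing
ansible_ssh_common_args: >-
  -o ProxyJump=bastion1.internal,bastion2.internal
  -o ControlMaster=auto
  -o ControlPath=/tmp/ansible-%h-%p-%r
  -o ControlPersist=600
```

ProxyJump chains the two bastions in order; the ControlPersist socket means each target host is tunneled through them only once per 10-minute window.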

Follow-up: If your jump hosts are geographically distributed and connection latency varies from 50ms to 500ms, how would you automatically select the optimal path and handle failover to secondary proxies?

A critical 24/7 system requires real-time playbook execution with sub-second latency. Standard Ansible SSH overhead causes 200ms delays per task. Your team needs to execute 50+ tasks reliably without sacrificing security or auditability.

Use pipelining=True to reduce SSH round-trips to one per task. Combine with fact_caching=redis to avoid repeated fact gathering across playbook runs. Fire off I/O-bound operations asynchronously (async with poll: 0, collecting results later via async_status) so they never block the play. Pick a strategy suited to low latency: free or host_pinned instead of linear, or a custom strategy plugin with aggressive caching. Use local_action (or delegate_to: localhost) to run non-remote operations on the control node, avoiding SSH overhead entirely. Keep persistent SSH multiplexing: ControlMaster=auto with ControlPersist=3600 to reuse connections for a full hour. For auditability, integrate callback plugins that log to syslog/ELK asynchronously without blocking task execution. Test with production-like latency by using tc (traffic control) to inject delays and verify thresholds.
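A low-latency ansible.cfg sketch under these assumptions (a Redis instance on the control node's default port; all values are illustrative):

```ini
# ansible.cfg — low-latency tuning
[defaults]
strategy = free
# Redis-backed fact cache; needs the redis Python package on the control node
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 3600

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=3600s
```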

Follow-up: If your infrastructure requires cryptographic verification of every task execution and you need less than 100ms overhead per verified task, how would you design the verification callback to avoid blocking?

Your Ansible deployment spans on-prem data centers and 3 cloud providers (AWS, Azure, GCP). Network policies restrict inter-cloud SSH. You need to manage all infrastructure from a single control node using a unified playbook that handles region-specific latency and regional network policies.

Use inventory plugins to source hosts dynamically from each cloud provider's API, with region-specific groups and vars. Leave ansible_connection=smart (the default), which selects the best available SSH implementation, preferring OpenSSH where ControlPersist is supported. Set per-region connection settings in inventory vars: [aws_region_1:vars] with ansible_ssh_common_args='-o ConnectTimeout=10 -o StrictHostKeyChecking=no'. Use separate proxy hosts per region — a bastion-host inventory var referenced from ProxyJump/ProxyCommand can vary per environment. Set async: 300 and poll: 10 for cross-region tasks to absorb 20+ second latency spikes. Run tasks in parallel within a region but serially across regions, via serial batches or a custom strategy plugin. Log to ANSIBLE_LOG_PATH=/var/log/ansible-multiregion.log and parse region-specific metrics. Use meta: reset_connection between region switches to clear stale connection state.
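A sketch of a cached dynamic-inventory source for one provider, using the amazon.aws.aws_ec2 plugin (regions and the cache path are illustrative):

```yaml
# aws_ec2.yml — dynamic inventory with region groups and local caching
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - eu-west-1
keyed_groups:
  - key: placement.region     # builds groups like aws_region_us_east_1
    prefix: aws_region
cache: true                   # also relevant to the rate-limit follow-up
cache_plugin: jsonfile
cache_connection: /tmp/aws_inventory_cache
cache_timeout: 3600
```

Equivalent plugins (azure.azcollection.azure_rm, google.cloud.gcp_compute) cover the other providers with the same group/vars pattern.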

Follow-up: If cloud provider APIs rate-limit host discovery to 100 requests/minute and your infrastructure has 3000 hosts across providers, how would you design inventory caching and refresh logic without exceeding limits?

Your security team requires SSH key rotation every 30 days. After rotation, existing Ansible connections fail because the old key is invalidated. You need a zero-downtime key rotation strategy for 500+ production servers managed by daily Ansible playbooks.

Rotate keys with overlapping validity periods: install the new key 5 days before the old one expires so both work during the transition. (Note that StrictHostKeyChecking=accept-new governs server host keys, not client authentication keys; set it only if host keys also rotate.) Use AuthorizedKeysCommand on target hosts to fetch authorized keys from a central store such as HashiCorp Vault, so rotation never touches authorized_keys files. In Ansible, refresh credentials mid-playbook with set_fact tasks that update ansible_ssh_private_key_file from the external secret store. Run a pre-flight playbook that validates SSH connectivity before the main playbook; if a connection fails, trigger an automatic key refresh from the backup key store. Log all key rotation events to syslog for audit compliance. Test with Molecule before production: create a temporary container with the old key, rotate, and verify connectivity persists. Long-lived multiplexed connections (ControlPersist) also ride out a rotation, since already-established sessions are unaffected by key invalidation.
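The pre-flight check might be sketched like this (the fallback key path is hypothetical):

```yaml
# preflight.yml — validate SSH before the main run, fall back to the rotated key
- hosts: all
  gather_facts: false
  tasks:
    - name: Probe connectivity with the current key
      ansible.builtin.ping:
      register: reach
      ignore_unreachable: true

    - name: Switch to the rotated key if the old one was rejected
      ansible.builtin.set_fact:
        ansible_ssh_private_key_file: /etc/ansible/keys/new_id_ed25519  # hypothetical path
      when: reach.unreachable | default(false)
```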

Follow-up: If SSH key rotation triggers an emergency revocation (security incident) and you need to switch all 500 servers to a new key within 60 seconds, how would you design the emergency switchover without blocking ongoing playbooks?

Your Ansible control node uses a shared SSH key file with 1000+ developers in a large org. You're required to move to individual SSH keys per developer for security compliance. Current playbooks hard-code ansible_ssh_private_key_file, making per-developer key routing impossible without rewriting thousands of tasks.

Move to SSH agent-based key management: drop ansible_ssh_private_key_file from playbooks and let the SSH agent supply keys. Developers load their own key before running playbooks: ssh-add ~/.ssh/personal-key. In ansible.cfg, keep ssh_args=-o PubkeyAuthentication=yes without naming a key file; ssh then tries every key loaded in the agent. For CI/CD, point at the runner's agent via the SSH_AUTH_SOCK environment variable, or select an identity in ssh_config with IdentityFile ~/.ssh/developer-%u (%u expands to the local username). Alternatively, derive the key path from the invoking user, e.g. ansible_ssh_private_key_file: "/home/{{ lookup('env', 'USER') }}/.ssh/ansible_ed25519". Add a wrapper script that validates key ownership and permissions before playbook execution. For multi-tenant playbooks, store per-tenant key assignments in Vault and decrypt them dynamically. Verify with ansible all -m ping that each developer's key authenticates against the target hosts.
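A sketch of the agent-based setup (the key path in the usage note is illustrative):

```ini
# ansible.cfg — no hard-coded key file; rely on the developer's SSH agent
[ssh_connection]
# -o IdentityFile is deliberately absent: ssh offers every key the agent holds
ssh_args = -o PubkeyAuthentication=yes
```

Each developer runs ssh-add ~/.ssh/personal-key before invoking ansible-playbook; in CI, export SSH_AUTH_SOCK so Ansible's ssh processes reach the runner's agent socket.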

Follow-up: If your compliance team requires audit logs showing which developer executed which task on which server, and SSH agent forwarding is disabled for security, how would you implement impersonation logging without storing credentials?
