Your Ansible playbook deploys to 5000 servers and takes 2 hours to complete. Most time is spent gathering facts from each server (30 seconds per 500 servers). The playbook doesn't need all facts. How do you optimize fact gathering?
Use `gather_subset` to collect only the facts you need: at the play level, `gather_subset: ["min"]` gathers only basic facts (hostname, kernel, basic network data) instead of the full set, cutting gathering time dramatically—on the order of 30 seconds down to a few seconds per batch of 500 servers. Alternatively set `gather_facts: false` and call the `setup` module explicitly only in the plays or tasks that need facts. Enable fact caching with a Redis backend in ansible.cfg (`fact_caching = redis`, `fact_caching_timeout = 3600`, together with `gathering = smart`): the first run gathers and caches facts, and subsequent runs reuse the cache until the timeout expires. Use `gather_timeout: 10` to fail fast on slow or unreachable hosts rather than blocking the play. Gather only specific facts with the `setup` module's filter, e.g. `setup: filter=ansible_eth*`, or restrict a single gathering pass with `gather_subset: ["network"]` when only network facts are needed. Finally, monitor fact-gathering time and alert if it increases—a sudden rise usually indicates infrastructure problems.
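A minimal sketch of the first two techniques—subset gathering and on-demand `setup`—under the assumption that the hosts have standard `eth*` interface names:

```yaml
# Play 1: gather only the minimal fact subset.
- hosts: all
  gather_facts: true
  gather_subset:
    - min            # hostname, kernel, basic network; skips hardware, mounts, etc.
  tasks:
    - ansible.builtin.debug:
        var: ansible_hostname

# Play 2: skip automatic gathering; call setup only where facts are needed.
- hosts: all
  gather_facts: false
  tasks:
    - name: Gather only interface facts for this one task
      ansible.builtin.setup:
        filter: "ansible_eth*"
```

Pair this with `fact_caching = redis` and `fact_caching_timeout = 3600` under `[defaults]` in ansible.cfg so later runs reuse the cached facts.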
Follow-up: How would you implement fact caching across multiple playbook runs with cache invalidation?
Your playbook loops over 10,000 items (hosts, packages, users). Loop overhead causes execution to slow significantly. Simple math: 10,000 iterations × 1 second per task = 2.7 hours. How do you optimize looping?
Use bulk operations instead of loops where possible. Instead of `loop: "{{ packages }}"` installing each package in its own task invocation, pass the whole list to the module—`name: "{{ packages }}"` makes apt/yum/dnf perform a single transaction. Prefer `loop` over the legacy `with_items`; either way, the win comes from eliminating per-item task overhead, not from the loop keyword. For network-intensive items, run them asynchronously: `async: 300` with `poll: 0`, then collect results with `async_status`. Note that `serial` and `forks` control host batching and host-level parallelism, not loop items—a loop always runs its items sequentially on each host, so restructure the task (bulk input, async, or filters) rather than tune parallelism. Use `include_tasks` in loops instead of long task-level loops for cleaner code. For data processing, use Jinja2 filters instead of loops: `{{ items | map(attribute='name') | list }}`. Pre-compute values in `vars:` before the loop so they aren't recalculated on every iteration. Profile before optimizing: if processing 10,000 items takes more than about 10 minutes, the bottleneck is likely network or I/O, not the loop machinery itself.
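The bulk-versus-loop difference can be sketched with the apt module (the `packages` variable is assumed to be a list defined elsewhere):

```yaml
# Slow: one apt transaction per package (N task invocations).
- name: Install packages one at a time
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  loop: "{{ packages }}"

# Fast: a single apt transaction for the whole list.
- name: Install all packages in one transaction
  ansible.builtin.apt:
    name: "{{ packages }}"
    state: present
```

The second form also produces one changed/ok result instead of N, which keeps output and registered data small.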
Follow-up: How would you implement progress tracking for long-running loops with 10,000+ items?
Your Ansible playbook uses `register` to capture output from every task. This registers massive dicts that consume memory and slow playbook. A playbook running on 5000 servers with 100 tasks creates millions of registered variables. How do you optimize memory usage?
Use `register` selectively—only register variables that subsequent tasks actually read. When you must register a large result, immediately extract just the fields you need (`set_fact: result_code: "{{ result.rc }}"`) and then overwrite the large variable with an empty value via `set_fact`—Ansible has no true unregister, but overwriting frees the bulk of the data. Use `no_log: true` to keep sensitive registered output out of logs and callback plugins. On loops, use `loop_control: label:` so registered loop results don't carry every item's full payload into the output. For genuinely large intermediate data, write to files on the managed host instead of registering, then read back only what you need (e.g. with `slurp`). Use `--diff` only on critical plays to reduce output volume. Monitor the control node: if the ansible-playbook process grows past roughly 1 GB, audit the playbook for unnecessary registrations. Profile per-task cost by enabling the `ansible.posix.profile_tasks` callback (`callbacks_enabled = ansible.posix.profile_tasks` in ansible.cfg).
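A sketch of the register-extract-discard pattern; the `generate_report` command is a hypothetical stand-in for any task producing large output:

```yaml
- name: Run a command that produces a large result
  ansible.builtin.command: generate_report --full   # hypothetical command
  register: report_result
  changed_when: false
  no_log: true          # keep the large payload out of logs/callbacks

- name: Keep only the fields we need, then drop the big payload
  ansible.builtin.set_fact:
    report_rc: "{{ report_result.rc }}"
    report_result: {}   # overwrite; Ansible has no true "unregister"
```

After the second task, only the small `report_rc` value persists for the rest of the play.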
Follow-up: How would you implement garbage collection for long-running playbooks to prevent memory leaks?
Your playbook runs on 10,000 servers with `forks: 100` (100 parallel tasks). On 5000 servers, performance is good; on 10,000 servers, performance degrades catastrophically. Too many simultaneous SSH connections saturate control node network. How do you optimize for scale?
Implement adaptive parallelism: start with `forks: 50` and raise it incrementally as long as the control node stays healthy. Reduce per-task SSH overhead with `pipelining = True` in ansible.cfg (fewer SSH operations per task) and reuse connections with OpenSSH ControlPersist via `ssh_args`. Use `delegate_to: localhost` or `ansible_connection: local` for tasks that don't actually need remote execution. Use `serial` to batch hosts—e.g. `serial: 1000` keeps only one batch's worth of connections open at a time instead of attempting all 10,000 at once. Monitor control node CPU, memory, and network during execution; if resources saturate, reduce `forks` or increase control node capacity. Distribute execution: split the 10,000 servers into groups and run separate playbook invocations per group, or use AWX/Tower instance groups and execution nodes to spread jobs across multiple control nodes. Profile SSH connection overhead empirically: test with `forks: 50`, `100`, and `200` to find the optimal value for your control node.
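The connection-tuning settings above can be sketched in ansible.cfg; the numbers are starting points to adjust against observed control-node load, not prescriptions:

```ini
# ansible.cfg
[defaults]
# Raise gradually while watching control-node CPU/memory/network.
forks = 50

[ssh_connection]
# Fewer SSH operations per task (requires sudoers without 'requiretty').
pipelining = True
# Reuse SSH connections across tasks instead of reconnecting each time.
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```

With ControlPersist, each host pays the TCP/SSH handshake cost once per run rather than once per task.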
Follow-up: How would you implement dynamic fork tuning based on real-time system metrics?
Your playbook deploys application updates to production: first runs on canary server (should take 5 min), then 1000 servers (should take 30 min). Currently it takes 2 hours. Playbook waits for all 1000 servers to reach exact same state before proceeding. Sequential waiting causes delays. How do you optimize deployment strategy?
Use a `serial` strategy instead of all-at-once: `serial: [1, 100, 500]` deploys to the canary first (1 server), then to early adopters (100), then in batches of 500 until the fleet is done—the last list entry repeats for all remaining hosts. This provides checkpoint capability. Use `max_fail_percentage: 10` to abort the play if more than 10% of a batch fails, preventing a bad build from reaching the full fleet. Run health checks between batches: after each serial batch, verify deployment health before proceeding. Within a batch, use async tasks to improve parallelism: `async: 300` with `poll: 0`, then `async_status` to collect results, rather than waiting on hosts one at a time. Leave `any_errors_fatal` at its default of `false` so isolated host failures don't stop the entire play. Track per-host health after the update so partial failures are visible. For application deployments, consider a blue-green pattern: deploy the new version to the idle environment, validate it, then switch traffic over—the cutover is instant, eliminating sequential waiting.
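A sketch of a staged rollout with an inter-batch health gate; the `app_deploy` role and the `/health` endpoint on port 8080 are assumptions about the application:

```yaml
- hosts: app_servers
  serial: [1, 100, 500]      # canary, early adopters, then 500-host batches
  max_fail_percentage: 10    # abort if >10% of any batch fails
  tasks:
    - name: Deploy the new release
      ansible.builtin.include_role:
        name: app_deploy     # hypothetical deployment role

    - name: Health check must pass before the next batch starts
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"   # assumed endpoint
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

Because `serial` completes all tasks for a batch before starting the next, a failing health check stops the rollout at the current batch boundary.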
Follow-up: How would you implement canary deployment analytics to determine if canary passed validation automatically?
Your Ansible control node has 200GB RAM, but playbook runs use <5GB while waiting. Storage I/O is the bottleneck: task execution waits for file operations, template rendering, and remote file copying. How do you optimize storage I/O?
Use SSD storage for the Ansible working directory and fact cache to reduce I/O latency. Put Ansible temp directories on tmpfs (an in-memory filesystem) to speed up template rendering and file staging: for example, `ANSIBLE_REMOTE_TMP=/dev/shm/ansible` uses `/dev/shm` instead of disk. Choose the SSH file-transfer method (`transfer_method` in the `[ssh_connection]` section: `sftp`, `scp`, or `piped`) based on performance testing. For large transfers, use the `synchronize` module (rsync), which compresses data in transit and sends only deltas—the `copy` module has no compression option. Cache facts in Redis (in-memory, very fast). Pre-compute values in the playbook rather than in templates that would otherwise read files repeatedly. Cache remote files locally: use `fetch` after the first retrieval, then reference the local copy. Use the `stat` module to check file state cheaply, and `find` with filters to limit results on large directory trees. Profile I/O operations: run the playbook under `strace` to identify expensive I/O patterns.
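The rsync-based transfer can be sketched as follows; the source and destination paths are examples:

```yaml
# Compressed, delta-only transfer -- much cheaper than the copy module
# for large payloads, since unchanged files are skipped entirely.
- name: Push release artifacts with rsync
  ansible.posix.synchronize:
    src: /srv/releases/current/       # example path on the control node
    dest: /opt/app/releases/current/  # example path on the target
    compress: true
    delete: false
```

On repeat runs, rsync's delta algorithm transfers only changed blocks, which is where most of the I/O savings come from.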
Follow-up: How would you implement parallel file transfer for deploying thousands of files?
Your playbook uses Tower for automation. Tower job execution spans 500 servers but 450 servers have no tasks (filtered out by conditionals). Playbook still spends 30 seconds connecting to these servers for no reason. How do you avoid connection overhead for filtered hosts?
Filter the inventory before the main play runs. In a `pre_tasks` section or an initial play, use `group_by` (e.g. `group_by: key=role_{{ server_role }}`) to build dynamic groups from inventory variables—run that play with `gather_facts: false` and `connection: local` so it never opens SSH connections—then point subsequent plays at the resulting group. Use `hosts: webservers` instead of `hosts: all` to target only the servers that need the play. Filter at the Tower level: use job template limits or smart inventories so excluded hosts are never part of the job. Note that a `when: server_role == 'webserver'` conditional alone does not help—Ansible still connects to the host before skipping its tasks. Parameterize the target instead: pass `--extra-vars target_group=webservers` and use `hosts: "{{ target_group | default('all') }}"` in the play. Alternatively, pre-compute the target host list with a quick inventory query and launch the playbook with `--limit` against that list. Use `add_host` to build targeted groups during execution, and `meta: refresh_inventory` if the inventory source changes mid-run. Cache filtered groups so subsequent playbooks don't re-filter.
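A sketch of the connection-free grouping pattern, assuming each host has a `server_role` inventory variable:

```yaml
# Play 1: build role-based groups from inventory vars without SSH-ing anywhere.
- hosts: all
  gather_facts: false
  connection: local
  tasks:
    - name: Group hosts by their server_role inventory variable
      ansible.builtin.group_by:
        key: "role_{{ server_role }}"

# Play 2: only hosts in role_webserver are ever connected to.
- hosts: role_webserver
  tasks:
    - name: Deploy web tier
      ansible.builtin.debug:
        msg: "deploying to {{ inventory_hostname }}"
```

The first play touches no remote machines, so the 450 filtered-out servers cost nothing beyond inventory parsing.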
Follow-up: How would you implement query-based inventory filtering for complex multi-criteria selection?
Your Ansible playbook calls external APIs (monitoring, ticketing, deployment systems) during task execution. A slow API call blocks entire playbook. One 30-second API call serializes 100 playbook tasks. How do you optimize external system integration?
Use async tasks for API calls: `async: 300` with `poll: 0` fires the request and returns immediately so other tasks continue. Collect results later with `async_status`, registering the result and polling with `until: api_result.finished` plus `retries` and `delay`. Batch API calls: instead of 100 individual calls, make 10 bulk calls. Use a webhook pattern: the playbook emits an event and returns immediately while the external system processes it asynchronously. Cache API responses so an identical call returns the cached result instead of hitting the API again. Set API timeouts (e.g. the `uri` module's `timeout` parameter): if the API doesn't respond within 10 seconds, time out and continue with fallback behavior. Implement a circuit breaker: if an API fails repeatedly, stop calling it rather than retrying. Queue API calls locally and let a background job drain the queue asynchronously. Parallelize API calls with Tower workflows by creating parallel job templates. For critical APIs, implement a fallback: if the primary API fails, use a secondary data source or a cached result.
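The fire-and-collect pattern can be sketched with the `uri` module; the monitoring endpoint URL and JSON body are illustrative assumptions:

```yaml
# Fire the API call without blocking the play.
- name: Notify monitoring system asynchronously
  ansible.builtin.uri:
    url: "https://monitoring.example.com/api/deploy-events"  # example endpoint
    method: POST
    body_format: json
    body:
      host: "{{ inventory_hostname }}"
      status: "deployed"
    timeout: 10          # fail this call fast instead of hanging the task
  async: 300
  poll: 0
  register: api_job

# ... other tasks run here while the call is in flight ...

- name: Collect the API result later
  ansible.builtin.async_status:
    jid: "{{ api_job.ansible_job_id }}"
  register: api_result
  until: api_result.finished
  retries: 30
  delay: 10
```

The 30-second call no longer serializes the play: the play pays for it only at the collection point, and only if it hasn't finished by then.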
Follow-up: How would you implement automatic performance degradation when external systems are slow?