Your team maintains 50 Ansible playbooks that all need the same nginx role, but each team member has their own copy in their playbook directories. This causes maintenance nightmares: updates need to be manually copied to 50 locations. How do you centralize and version this shared role?
Create an Ansible Collection to encapsulate the nginx role with all of its dependencies. Structure it as `collections/ansible_collections/myorg/common/roles/nginx/`. Publish it to an internal Artifactory or private Galaxy server. Define the version in `galaxy.yml` and use semantic versioning (e.g., 2.1.0). Reference it in all playbooks via `requirements.yml` (`- name: myorg.common`) and install with `ansible-galaxy collection install -r requirements.yml`. Implement CI/CD to test the collection (syntax check, lint, Molecule tests) on every commit. Store `galaxy.yml` and the changelog in version control. Teams can then reference the role consistently by its fully qualified collection name: `myorg.common.nginx`. Publish deprecation notices in the collection metadata for breaking changes, and monitor collection downloads/usage through Galaxy analytics.
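The pattern can be sketched as a pinned `requirements.yml` plus a playbook that consumes the role by its fully qualified collection name (the `myorg.common` names come from the scenario; the server URL is a placeholder):

```yaml
# requirements.yml -- pin the shared collection to a semantic version
collections:
  - name: myorg.common
    version: "2.1.0"
    source: https://galaxy.internal.example.com/   # internal Galaxy/Artifactory

# site.yml -- every team references the same role by FQCN
- hosts: webservers
  roles:
    - role: myorg.common.nginx
```

Because every playbook resolves the role through the collection, bumping the version in one `requirements.yml` line replaces fifty manual copies.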
Follow-up: How would you implement zero-downtime updates to a production collection that multiple teams depend on?
You need to use 30 community roles from Galaxy (nginx, postgresql, redis, etc.), but each has different quality levels, maintenance schedules, and potential security vulnerabilities. How do you safely integrate external Galaxy content into production?
Implement a curated role acceptance process. Use `requirements.yml` to pin exact versions of all community roles; never use `latest` or unversioned constraints. Implement CI/CD scanning: 1) run `ansible-lint` against all roles to identify issues, 2) check for known CVEs in role dependencies, 3) run Molecule tests to verify role behavior in your environment. Create wrapper roles around community roles that add your organization's requirements (monitoring, logging, compliance). Store all roles internally after vetting; don't pull from Galaxy at deployment time. Implement quarterly security audits of all roles. For critical roles (security-related), fork to your internal repo and maintain patches. Prefer roles whose maintainers publish a changelog and document breaking changes. Monitor GitHub and community announcements for security fixes to the roles you consume.
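A pinned `requirements.yml` for vetted community content might look like this (role/collection names and version numbers are illustrative, and the `source` points at the internal mirror rather than public Galaxy):

```yaml
# requirements.yml -- exact pins only; never "latest" or open ranges
roles:
  - name: geerlingguy.nginx
    version: "3.1.4"          # vetted release, stored internally
collections:
  - name: community.postgresql
    version: "3.4.0"
    source: https://galaxy-mirror.internal.example.com/
```

With exact pins, a compromised or broken upstream release cannot reach production until it has passed the acceptance process and the pin is deliberately bumped.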
Follow-up: How would you handle role updates when a community role introduces a breaking change?
Your company is scaling from 5 teams to 50 teams, each needing similar infrastructure roles but with team-specific customizations. Creating 50 copies of roles defeats the purpose. How do you design roles for multi-tenant environments?
Design roles as parameterized templates using role variables with sensible defaults. Example nginx role variables: `nginx_port: 80`, `nginx_config_template: default.conf.j2`, `nginx_modules: []`, `nginx_ssl_enabled: false`. Teams override in their playbooks or group_vars. Create collections that contain multiple related roles and plugins working together. Implement a collection hierarchy: base collection (core infrastructure), team collection (team-specific extensions). Use role dependencies to compose functionality: a web-app role depends on nginx and postgresql roles. Implement Ansible plugins within collections for complex logic. Use variable inheritance through group_vars hierarchy so team-level configs override defaults. Document role variables extensively with examples. Implement configuration validation in pre_tasks to catch misconfiguration early. Version collections for breaking change management.
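The parameterization above can be sketched as role defaults plus a per-team override, with an assert in `pre_tasks` to fail fast on misconfiguration (file paths and the team name are illustrative):

```yaml
# roles/nginx/defaults/main.yml -- defaults every team inherits
nginx_port: 80
nginx_config_template: default.conf.j2
nginx_modules: []
nginx_ssl_enabled: false

# group_vars/team_payments/nginx.yml -- one team's overrides
nginx_port: 8443
nginx_ssl_enabled: true

# pre_tasks in the consuming playbook -- validate before anything runs
- ansible.builtin.assert:
    that:
      - nginx_port | int > 0
      - nginx_port | int < 65536
      - not nginx_ssl_enabled or nginx_port != 80
    fail_msg: "Invalid nginx configuration for this team"
```

One role, fifty variable files: each team customizes through `group_vars` precedence instead of copying the role.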
Follow-up: How would you test 50 team-specific configurations against role updates without manual testing?
A community role you depend on hasn't been updated in 2 years, has critical security issues, and the maintainer is unresponsive. Your team is considering forking it, but maintaining a fork long-term is expensive. What's your decision framework?
Implement a structured decision: 1) Check if vulnerabilities are exploitable in your usage (might be acceptable risk), 2) Assess maintenance burden: if <10 lines of patch code, fork internally and maintain it. If complex, consider replacing with alternative role. 3) Contribute patches upstream via PR—if rejected, fork is justified. When forking: create internal collection in your Artifactory, apply necessary security patches, add comprehensive tests, pin exact version in requirements.yml. Set reminder to monitor original repo for new versions. Consider reaching out to maintainer offering to co-maintain or asking for transfer to organization. If no response after 30 days, proceed with fork. Document fork reasoning in README. After forking, prioritize replacing with maintained alternative in roadmap. Implement automated alerts for security vulnerabilities in forked roles.
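When the fork decision is made, the fork is consumed like any other pinned dependency, sourced from the internal repository (repository URL, role name, and tag scheme below are assumptions for illustration):

```yaml
# requirements.yml -- consume the internal fork, pinned to a tag
roles:
  - name: myorg_hardened_redis        # forked from the unmaintained upstream
    src: https://git.internal.example.com/ansible/forked-redis-role.git
    scm: git
    version: "v1.8.2-patched.1"       # upstream v1.8.2 + internal CVE patches
```

Tagging forks as `<upstream version>-patched.N` keeps the relationship to the original release visible, which simplifies both monitoring upstream and eventually migrating off the fork.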
Follow-up: How would you structure a contribution back to the community when you've improved a forked role?
Your Ansible Tower deployment syncs 200 collections from Galaxy during initialization. This takes 8 minutes, blocking new job launches during scale-up events. How do you optimize collection downloads?
Implement a tiered caching strategy. Use a private Artifactory instance or internal Galaxy mirror to cache collections locally. Pre-download all collections during the Tower container build rather than at runtime: bake collections into the container image. Implement incremental syncs so only updated versions are downloaded. Pin exact versions in `requirements.yml` so the resolver reuses cached artifacts instead of re-resolving and re-downloading on every sync, and exclude unnecessary dependencies. Serve collections from a mirror that supports HTTP keep-alive and compression to cut per-download overhead. Split large Tower environments across multiple instances with a shared collection cache. Implement parallel collection installation by sharding `requirements.yml`. Monitor collection download latency; if individual downloads exceed 5 seconds, investigate upstream connection issues. Consider Artifactory's P2P cache distribution for large deployments, and pre-cache collections on all Tower instances before peak traffic.
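Baking collections into the image can be sketched with an `ansible-builder` execution environment definition, which resolves the pinned `requirements.yml` once at build time instead of on every job launch (the base image name is an example):

```yaml
# execution-environment.yml -- collections baked in at image build time
version: 3
images:
  base_image:
    name: registry.redhat.io/ansible-automation-platform-24/ee-minimal-rhel9:latest
dependencies:
  galaxy: requirements.yml   # exact pins; downloaded once during the build
```

Jobs then start with zero Galaxy traffic, and scale-up events only pull a prebuilt image from the registry.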
Follow-up: How would you implement collection versioning strategy when teams require different versions of the same collection?
Your collection contains 200 roles and 50 plugins. A junior engineer accidentally includes malformed code in a plugin that breaks all playbooks using the collection. How do you implement quality gates to prevent this?
Implement comprehensive CI/CD for collections. Every PR must pass: 1) `ansible-lint` at strict level (fail on warnings), 2) `ansible-test` for unit tests, 3) `ansible-playbook --syntax-check` on all playbooks, 4) Molecule tests for each role validating functionality, 5) security scanning (bandit for Python code in plugins), 6) documentation coverage checks. Require peer code review from senior engineers. Implement pre-commit hooks to catch basic issues before PR. Use Galaxy collection metadata validation. Implement versioning: only release tagged versions to Galaxy, beta versions for testing. For critical collections, implement staged rollout: release as beta first, monitor usage for 1 week, then promote to stable. Implement automated testing against multiple Ansible versions. Use `ansible-galaxy collection build` with validation before publishing.
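The gates above can be sketched as a CI pipeline; the CI system (GitHub Actions here) and step ordering are assumptions, and `ansible-test` expects to run from within the collection's `ansible_collections/<namespace>/<name>` checkout path:

```yaml
# .github/workflows/ci.yml -- quality gates run on every PR
name: collection-ci
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible-core ansible-lint molecule bandit
      - run: ansible-lint --strict          # lint, warnings are failures
      - run: ansible-test sanity --docker   # collection sanity/unit checks
      - run: bandit -r plugins/             # scan Python code in plugins
      - run: molecule test                  # per-role functional tests
```

Any single failing step blocks the merge, so a malformed plugin never reaches a tagged release.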
Follow-up: How would you rollback a collection version if a critical bug is discovered after release?
Your Ansible deployment requires custom modules that don't exist in community Galaxy. The team debated whether to create plugins, modules, or roles to encapsulate this logic. What's your architectural decision framework?
Use this decision tree: 1) If it's orchestration/conditionals/workflow logic → task/role, 2) If it modifies target system state → module, 3) If it processes/filters data without target system interaction → plugin. Example: custom module if deploying company-specific application, plugin if transforming JSON configuration, role if orchestrating multiple steps. Create modules inside collections with full Python testing. Implement modules as idempotent with state checking. Create lookup plugins for data fetching (API calls), filters for data transformation, callbacks for custom logging. Use shell/command modules sparingly—prefer Python modules for reliability and cross-platform support. Structure collection with `plugins/modules/`, `plugins/lookup/`, `plugins/filter/` directories. Document required Python dependencies in collection meta. Version modules like roles—never break existing playbooks in minor updates.
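The module-versus-plugin split can be illustrated in a playbook; `myorg.apps.deploy_app` (a module) and `myorg.apps.to_ini` (a filter) are hypothetical names used only to show where each kind of logic lives:

```yaml
- hosts: app_servers
  tasks:
    - name: Changes target system state -> custom module
      myorg.apps.deploy_app:
        name: billing
        version: "4.2.0"
        state: present     # idempotent; module checks current state first

    - name: Transforms data, no target interaction -> filter plugin
      ansible.builtin.copy:
        content: "{{ app_config | myorg.apps.to_ini }}"
        dest: /etc/billing/app.ini
```

The module runs on the managed host and must be idempotent; the filter runs on the controller and only reshapes data.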
Follow-up: How would you test a custom module across multiple Ansible versions without manual testing?
Your team maintains 3 internal collections used by 200 playbooks. A breaking change in collection A breaks 80% of dependent playbooks. You need a migration path. How do you handle this gracefully?
Implement a deprecation lifecycle: 1) ship deprecation warnings in v1.5 ahead of the v2.0 release, 2) document breaking changes with migration examples, 3) provide a v1.5-compat branch that remains stable. Use feature flags in collection code to support both old and new syntax temporarily. Create a migration guide with before/after examples. Implement automated code migration tools if possible (e.g., a script that updates `requirements.yml` and renamed variables). Bump the major version (v2.0) for breaking changes. Support v1.x for at least 6 months, patching security issues. Communicate breaking changes 2 sprints in advance through team meetings. Implement Tower notifications alerting teams when collection updates are available. Provide examples of migrating dependent playbooks. For critical changes, consider providing a v1-to-v2 adapter collection that acts as a shim. Test the impact of breaking changes against the internal playbook suite before release.
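Deprecation warnings can be shipped in the collection itself via `meta/runtime.yml` plugin routing, so every consuming playbook sees the notice at runtime before v2.0 lands (the module name and versions are illustrative):

```yaml
# meta/runtime.yml -- deprecation notice delivered with v1.5
requires_ansible: ">=2.14"
plugin_routing:
  modules:
    legacy_deploy:
      deprecation:
        removal_version: "3.0.0"
        warning_text: >-
          legacy_deploy is deprecated; use myorg.common.deploy_app
          and see the v2 migration guide.
```

Because the warning fires on every invocation, teams discover their exposure during normal runs rather than when the breaking release ships.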
Follow-up: How would you implement automatic collection updates without breaking dependent playbooks?