Your organization has 50 microservices, each with its own CI/CD workflow. They're 90% identical: checkout, install dependencies, run tests, build Docker image, push to registry. You have 50 near-duplicate YAML files, and a recent security update requires changes in all 50 workflows. It took the team 8 hours to update them all. How do you prevent this from happening again?
Use reusable workflows. Create a single `common-ci.yml` under `.github/workflows/` in a central repo (for example, your org's `.github` repo) that contains the shared build pipeline logic—reusable workflows must live in a `.github/workflows/` directory of the repo that hosts them. Each microservice's workflow calls it: `jobs: build: uses: org/repo/.github/workflows/common-ci.yml@main`. Reusable workflows accept inputs (`inputs: node-version: default: "18"`) and secrets (`secrets: inherit` or explicit mapping). Benefits: (1) single source of truth—update once, all services inherit the change, (2) consistency—no drift between workflows, (3) governance—org can enforce required checks without modifying individual repos. When GitHub updates the Actions runner, or you need a security patch, change it in one place. The downside: debugging is harder (you're calling another workflow), so reusable workflows should stay simple—orchestrate complex logic elsewhere.
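A minimal sketch of the two sides (the org name `my-org`, repo names, and filenames are illustrative):

```yaml
# .github/workflows/common-ci.yml in the central repo
name: common-ci
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: "18"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci && npm test

# --- separate file: .github/workflows/ci.yml in each microservice ---
name: ci
on: [push]
jobs:
  build:
    uses: my-org/workflows/.github/workflows/common-ci.yml@main
    with:
      node-version: "20"
    secrets: inherit
```

Each microservice's workflow shrinks to a handful of lines; the 8-hour mass update becomes one commit to the central repo.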
Follow-up: How would you version a reusable workflow so teams can opt-in to updates gradually instead of immediately?
You created a reusable workflow that all 50 services depend on. You made a breaking change (renamed an input parameter) and committed to the main branch. All 50 services suddenly fail because they're calling a parameter that no longer exists. This breaks the entire CI/CD pipeline for an hour. What's your policy going forward?
Never make breaking changes to reusable workflows without versioning. Use Git tags and branches: (1) Semantic versioning: tag releases (v1.0.0, v1.1.0, v2.0.0). Callers reference versions explicitly: `uses: org/repo/.github/workflows/common-ci.yml@v1.0.0`. (2) Major versions get separate branches: backport fixes to a v1 branch while v2 development continues on main, and keep a moving v1 tag pointing at the latest v1 release. (3) Deprecation period: when renaming a parameter, support both the old and new names for two releases, warn in the job output, then remove the old name in the next major version. (4) Test coverage: treat reusable workflows like library code—maintain test workflows that verify backward compatibility, e.g. that callers written against v1.0.0 still pass on v1.1.0. (5) Communication: announce breaking changes in your team Slack and give two weeks' notice before removal. For rapid rollback: services can pin to a specific tag or commit SHA instead of tracking main, which puts the timing of upgrades in their hands.
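A deprecation shim for the renamed input might look like this (input names are hypothetical—the point is accepting both names for a transition period and warning on the old one):

```yaml
# Reusable workflow accepting both the old and new input name
on:
  workflow_call:
    inputs:
      image-name:        # new name
        type: string
        default: ""
      docker-image:      # deprecated alias, removed in the next major version
        type: string
        default: ""
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Warn callers still using the deprecated input
        if: inputs.docker-image != ''
        run: echo "::warning::'docker-image' is deprecated; use 'image-name'"
      - name: Build with whichever input was provided
        run: echo "Building ${{ inputs.image-name || inputs.docker-image }}"
```

Callers pin the version they've validated, e.g. `uses: org/repo/.github/workflows/common-ci.yml@v1.2.0`, and upgrade on their own schedule.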
Follow-up: Design a testing strategy that verifies a new reusable workflow version doesn't break existing callers.
You're composing a complex deployment workflow from 4 reusable workflows: auth (OIDC), build (Docker), deploy (Kubernetes), notify (Slack). Each has its own checkout, environment setup, and error handling. A deployment fails in the middle, and you get 8 separate Slack notifications—one from each step. The notifications are out of sync, making debugging harder.
Reusable workflows don't share state—each runs in isolation. To coordinate, use job outputs and conditional logic: (1) Have the "auth" workflow output credentials or status, the "build" workflow consumes that output and passes it forward, etc. Use `needs: [auth, build, deploy]` to create a dependency chain and `${{ needs.build.outputs.image-tag }}` to pass data. (2) For error handling: create a single "notify" job at the end with `if: always()`, which checks the status of upstream jobs: `if: failure()` sends a failure notification with all context. (3) Alternatively, create a meta-reusable-workflow that orchestrates the 4 sub-workflows and handles notifications centrally—the parent workflow owns the notification logic. (4) For non-blocking notifications, emit a structured event (e.g., JSON to a Kafka topic or webhook) that a centralized notification service consumes, correlates, and sends a single message.
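The dependency chain plus single-notification pattern from points (1) and (2) can be sketched as a parent workflow (repo paths and the `image-tag` output are assumptions; the called `_build.yml` must declare that output under `workflow_call: outputs:`):

```yaml
name: deploy-pipeline
on: [push]
jobs:
  auth:
    uses: my-org/workflows/.github/workflows/_auth.yml@v1
  build:
    needs: auth
    uses: my-org/workflows/.github/workflows/_build.yml@v1
  deploy:
    needs: build
    uses: my-org/workflows/.github/workflows/_deploy.yml@v1
    with:
      image-tag: ${{ needs.build.outputs.image-tag }}
  notify:
    # Runs regardless of upstream success/failure; sends exactly one message
    needs: [auth, build, deploy]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Summarize the whole run in one Slack message
        run: |
          echo "auth=${{ needs.auth.result }}" \
               "build=${{ needs.build.result }}" \
               "deploy=${{ needs.deploy.result }}"
          # Post this summary to Slack here (webhook call omitted)
```

The sub-workflows drop their individual notification steps; only the parent owns notification logic, so one failure produces one coherent message.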
Follow-up: How would you implement a reusable workflow that accepts a list of "hook" workflows to run conditionally before/after the main build?
You're using reusable workflows for Node.js services. One service needs Java 17 installed in the build step, but the reusable workflow assumes Node.js only. You want to avoid duplicating the entire workflow just to change one line. What's your approach?
Reusable workflows should be flexible without becoming bloated. Options: (1) Add an input for extra setup commands: `inputs: extra-setup-steps: type: string description: 'Custom setup commands'`, and in the workflow run `run: ${{ inputs.extra-setup-steps }}`. Callers pass custom setup: `with: extra-setup-steps: 'sudo apt-get install -y openjdk-17-jdk'`. (Be aware this injects caller-controlled shell into the workflow, so restrict who may call it.) (2) Use a matrix: `strategy: matrix: runner: [node, java, python]` and conditionally run steps with `if: matrix.runner == 'java'`. But this is messy. (3) Better: a plugin/hook convention. Note that `uses:` does not accept expressions, so you cannot pass an action reference like `actions/setup-java@v4` as an input and run it directly; instead, have the reusable workflow check out the caller's repo and run a composite action at a well-known path (e.g. `./.github/actions/extra-setup`) if one exists. (4) Simplest: accept a broad input like `runtime-versions` and parse it inside the workflow: `with: runtime-versions: 'node=18,java=17'`. The workflow sets up both. The key is avoiding input sprawl—if a workflow has more than about 5 inputs, it's probably doing too much and should be split.
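Option (4) can be sketched like this—a simplified version that only checks which runtimes are requested (a real workflow would also extract the version numbers from the input string; the versions below are hardcoded for brevity):

```yaml
on:
  workflow_call:
    inputs:
      runtime-versions:
        type: string
        default: "node=18"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Node if requested
        if: contains(inputs.runtime-versions, 'node=')
        uses: actions/setup-node@v4
        with:
          node-version: "18"   # parse from the input in a real workflow
      - name: Set up Java if requested
        if: contains(inputs.runtime-versions, 'java=')
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "17"   # parse from the input in a real workflow
```

The Java-needing service calls it with `runtime-versions: 'node=18,java=17'`; the other 49 services keep the default and never hit the Java step.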
Follow-up: Design a reusable workflow that supports arbitrary runtime setups (Node, Java, Python, Go) without forking logic per language.
Your team uses a reusable workflow for deployments. The workflow accepts a secret input (AWS credentials). During testing, a junior engineer accidentally typed the secret value into the workflow call: `secrets: aws-key: 'AKIA....'`. The value is now in the workflow file, potentially exposed in git history and CI logs.
Immediately: (1) invalidate the AWS credentials, (2) rewrite git history to remove the secret using `git filter-repo`, (3) audit CloudTrail for any unauthorized actions. For the future: (1) secrets should never be passed as literal values in YAML—always use GitHub Secrets or federated authentication (OIDC). The correct syntax is `secrets: aws-key: ${{ secrets.AWS_KEY }}`, which references a secret stored in GitHub's vault, not the value itself. (2) GitHub masks secret values in logs, but only if they're stored in GitHub Secrets. Custom values passed inline aren't masked. (3) Implement a pre-commit hook: regex scan for suspicious patterns (`AKIA`, `-----BEGIN PRIVATE KEY`, etc.) and reject commits. (4) For reusable workflows, document clearly: "Never pass credentials as literal values; always use GitHub Secrets." (5) Use branch protection rules requiring code review—a reviewer would catch hardcoded secrets.
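A minimal version of the pre-merge scan from point (3), run as a CI check on pull requests (the regexes cover only a few common patterns; a dedicated tool like gitleaks casts a wider net):

```yaml
name: secret-scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Grep workflow files for credential-looking strings
        run: |
          # grep exits 0 when a match is found, which fails the check
          if grep -rEn 'AKIA[0-9A-Z]{16}|-----BEGIN (RSA |EC )?PRIVATE KEY' .github/workflows/; then
            echo "::error::Possible hardcoded secret in a workflow file"
            exit 1
          fi
          echo "No credential patterns found"
```

Combined with branch protection requiring this check, a literal `AKIA...` value can't reach the main branch unnoticed.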
Follow-up: How would you implement a GitHub Action that scans workflow YAML files for accidentally-hardcoded secrets?
You're composing a multi-step deployment: Stage 1 builds artifacts (2 hours), Stage 2 deploys to staging (30 min), Stage 3 deploys to production (30 min). Each stage is a separate reusable workflow. Stage 1 generates an artifact ID. When Stage 2 fails partway through, the job stops. When you manually re-run Stage 2, it fails again because it can't find the artifact from Stage 1 (which already ran days ago and was cleaned up).
Reusable workflows don't share artifacts by default—each job gets its own workspace. To chain workflows with artifact persistence: (1) Export the artifact ID from Stage 1 as a job output: `outputs: artifact-id: ${{ steps.build.outputs.artifact-id }}`. (2) In Stage 2, consume it: `needs: [build]` and use `${{ needs.build.outputs.artifact-id }}`. (3) If the artifact is a file (not just metadata), don't delete it after Stage 1; store it in persistent storage (S3, artifact registry, container registry). Stage 2 downloads it by ID. (4) For re-runs: keep artifacts alive for 30+ days (GitHub default is 90). If re-run happens after artifact expiry, fail gracefully with a clear message: "Artifact expired; re-run Stage 1 to rebuild." (5) Better architecture: store build artifacts in a container registry (Docker Hub, ECR, GCR) indexed by commit SHA. Stages reference the same registry image; it persists independent of GitHub's artifact cleanup.
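The graceful-expiry check from points (4) and (5) might look like this in Stage 2, keyed by commit SHA in external storage (the S3 bucket name and key layout are hypothetical; AWS credentials via OIDC are assumed to be configured earlier in the job):

```yaml
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - name: Verify the Stage 1 artifact still exists, then fetch it
        run: |
          KEY="builds/${GITHUB_SHA}/app.tar.gz"
          # head-object exits non-zero if the object is missing/expired
          if ! aws s3api head-object --bucket my-build-artifacts --key "$KEY"; then
            echo "::error::Artifact expired; re-run Stage 1 to rebuild"
            exit 1
          fi
          aws s3 cp "s3://my-build-artifacts/$KEY" app.tar.gz
```

Because the artifact lives in S3 rather than GitHub's per-run artifact store, a manual re-run of Stage 2 days later finds it by the same SHA-derived key—or fails with an actionable message instead of a cryptic missing-artifact error.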
Follow-up: Design a multi-stage deployment workflow that gracefully handles artifact expiry and allows manual re-runs after days of inactivity.
Your org has 100+ reusable workflows across 3 repos (.github, terraform-modules, deployment-platform). Engineers struggle to discover which workflows exist, what inputs they accept, and which version to use. A new team onboarded last month and accidentally wrote their own CI workflow because they didn't know a reusable one existed. You need a discovery mechanism.
Create a centralized registry: (1) Document all reusable workflows in a README or wiki. Include: workflow name, purpose, inputs/outputs, version, example usage. (2) Use a simple format: maintain a YAML registry file in your main .github repo listing all workflows and their schemas. Reference it in onboarding docs. (3) Build a discovery action: create a custom GitHub Action that queries this registry and prints available workflows + usage examples. New teams run it during setup: `actions/log-workflows-available@v1`. (4) Implement workflow naming conventions: prefix reusable workflows with `_` (e.g., `_build.yml`, `_deploy.yml`) so they're visually distinct from job workflows. (5) Add rich metadata: include a `description`, `author`, `last-updated`, and `usage` field in the workflow comments. (6) Automate docs: use a script to extract metadata from workflow files and auto-generate a discovery page that's kept in sync.
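The registry file from point (2) could be as simple as this (the schema is entirely illustrative—invent fields that match what your teams actually need to know):

```yaml
# workflows-registry.yml in the org's .github repo
workflows:
  - name: _build.yml
    repo: my-org/.github
    version: v2
    purpose: Build and test a Node.js service
    inputs:
      node-version: "Node.js version (string, default '18')"
    outputs:
      image-tag: "Tag of the pushed container image"
    example: |
      jobs:
        build:
          uses: my-org/.github/.github/workflows/_build.yml@v2
          with:
            node-version: "20"
  - name: _deploy.yml
    repo: my-org/deployment-platform
    version: v1
    purpose: Deploy a container image to Kubernetes
```

A small script can diff this file against the actual `.github/workflows/` contents in CI, so the registry can't silently drift out of date—the same script can render it into the auto-generated discovery page from point (6).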
Follow-up: Design a GitHub Action that validates reusable workflow syntax and auto-generates documentation from workflow metadata.
You have a reusable workflow for Docker builds that's used by 30 services. It builds the image, scans for vulnerabilities (via Trivy), and pushes to ECR. One service's build suddenly fails with "image too large (2 GB)." The reusable workflow's vulnerability scan takes 30 minutes because of the image size. This service's CI now times out regularly. Changing the workflow would affect all 30 services. How do you solve this without duplicating the workflow?
Add conditional logic to the reusable workflow: (1) Introduce an input for vulnerability scanning: `inputs: scan-vulnerabilities: type: boolean default: true`. (2) In the workflow step, conditionally run: `if: inputs.scan-vulnerabilities == true`. (3) The problematic service calls: `with: scan-vulnerabilities: false`. (4) Better: add an input for scan timeout: `inputs: scan-timeout: type: string default: '30m'` and pass it to the scanning step. (5) For the large image issue, add optional image optimization: input for base image tag, multi-stage build instructions, or pre-build cleanup steps. (6) If the service is an outlier, create a specialized workflow for large images that shares the common parts via composition: call the shared `_build-image.yml`, then a different `_scan-large-images.yml`. This keeps the main workflow lightweight for 29 services while allowing customization for the outlier.
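Points (1) through (4) combined, sketched as the relevant slice of the reusable workflow (the `trivy` invocation assumes the CLI is available on the runner):

```yaml
on:
  workflow_call:
    inputs:
      scan-vulnerabilities:
        type: boolean
        default: true
      scan-timeout:
        type: string
        default: "30m"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t app:${{ github.sha }} .
      - name: Trivy scan (skippable for outlier services)
        if: inputs.scan-vulnerabilities
        run: trivy image --timeout ${{ inputs.scan-timeout }} app:${{ github.sha }}
```

The 29 unaffected services change nothing (the defaults preserve current behavior); the large-image service opts out with `with: scan-vulnerabilities: false`, or raises `scan-timeout` while it works on shrinking the image.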
Follow-up: Design a system where reusable workflows can auto-detect problematic conditions (timeouts, memory issues) and switch to alternative strategies mid-run.