Your team maintains a Node.js library that must support Node 16, 18, and 20. You set up a matrix build with all 3 versions. The tests pass for 16 and 20, but fail for 18. You want to skip 18 for now and let the build succeed (not block on this one failure). How do you exclude 18 from the matrix without editing the workflow YAML?
Matrix exclusions live directly in the workflow YAML: `strategy: matrix: node: [16, 18, 20], exclude: [{node: 18}]` removes the Node 18 job — but that means editing the workflow, which the question rules out. Options that avoid touching the YAML: (1) Store the supported versions outside the workflow (a repository variable or a config file), parse it in a setup job, and feed the result to the matrix via `fromJSON(...)`. GitHub has no native templated matrices, so a setup job that outputs the matrix as JSON is the standard workaround. (2) Better approach: fix the failing test rather than excluding the version. (3) If Node 18 has a known issue, set `continue-on-error: true` on that job (GitHub Actions has no `allow-failure` key). Note this doesn't exclude the job — it still runs and fails — it only stops the failure from blocking the workflow. Pair it with `strategy: fail-fast: false` so one failure doesn't cancel the remaining matrix jobs. (4) Use a pre-test check: query a project-level CI config file (e.g., `.github/ci-config.json`) for supported versions and construct the matrix dynamically, again by having a setup job output the matrix JSON.
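The `continue-on-error` approach can be sketched as follows — a minimal workflow fragment, assuming a standard npm test suite and current `actions/checkout`/`actions/setup-node` versions:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false               # one version failing doesn't cancel the others
      matrix:
        node: [16, 18, 20]
    # Only the known-bad version may fail without failing the workflow
    continue-on-error: ${{ matrix.node == 18 }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci && npm test
```

Because the expression references `matrix.node`, only the Node 18 job becomes non-blocking; 16 and 20 still gate the build as before.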
Follow-up: How would you dynamically generate a test matrix based on dependencies listed in your package.json?
You have a matrix build testing 3 Node versions × 5 operating systems (macOS, Windows, Linux, ARM, Android) = 15 jobs. The full matrix takes 45 minutes. The team is waiting for CI results before deploying. You want to speed this up: some combinations are redundant (e.g., Node 18 on Windows is the same as Node 18 on Linux). How do you optimize the matrix?
First, question whether all combinations are needed. If the Node version is the real variable, not the OS, test every Node version on a single OS (Linux, typically the fastest runners), and test across OSes only for the latest version. Use matrix exclusions or a hierarchical matrix: (1) Primary matrix: Node 16, 18, 20 on Linux. (2) Secondary matrix: Node 20 on macOS, Windows, ARM, and Android. This reduces the job count from 15 to 7. (3) Further optimize by ordering tests: run fast unit tests across the whole matrix early and slow integration tests later, only on Linux/Node 20; use job dependencies and outputs to gate the expensive jobs. (4) Use conditional logic, e.g. `if: matrix.node == 20 && matrix.os == 'ubuntu-latest'`, to run expensive tests only for specific combinations. (5) Parallelize differently: a job runs on exactly one runner, so you can't cover multiple OSes inside one job — but you can cover multiple Node versions, since switching versions on one runner is cheap. Run 5 parallel jobs (one per OS), each testing all Node versions sequentially. Trade-off: less parallelism per job, but you pay GitHub's per-job startup overhead 5 times instead of 15.
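The two-tier layout above might look like this. This is a sketch using GitHub-hosted runner labels only — ARM and Android would need self-hosted runners or emulation, so they are omitted here, and the npm commands are assumptions about the project:

```yaml
jobs:
  node-matrix:                  # primary tier: every Node version, Linux only
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [16, 18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci && npm test

  os-matrix:                    # secondary tier: latest Node on the other OSes
    needs: node-matrix          # only spend the expensive runners if tier 1 passed
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci && npm test
```

Gating the secondary tier on the primary one also means a broken commit fails fast on cheap Linux runners before any macOS/Windows minutes are consumed.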
Follow-up: Design a tiered matrix strategy where fast tests run on all combinations and slow tests run only on critical configurations.
Your team has a monorepo with 10 services. You want to run tests for only the services that changed in a PR. You set up a matrix, but it is static (hard-coded in the YAML), so you'd have to edit the workflow every time the monorepo structure changes. You want the matrix to be dynamic.
Use dynamic matrix generation: (1) Create a setup job that detects which services changed using git diff: `git diff origin/main...HEAD --name-only | grep '^services/' | cut -d/ -f2 | sort -u` (check out with `fetch-depth: 0` so the merge base is available). (2) Output this as a JSON array: `echo 'services=["auth","api"]' >> "$GITHUB_OUTPUT"`. (3) In the test job, reference this output: `strategy: matrix: service: ${{ fromJSON(needs.setup.outputs.services) }}`. (4) The test job uses `needs: [setup]` to create the dependency. (5) Caveat: GitHub can't expand a matrix from an empty array — the job fails at matrix evaluation time rather than creating zero jobs. Guard the test job with `if: needs.setup.outputs.services != '[]'`, or always emit at least a "smoke test" entry so the workflow still reports success when nothing changed. (6) For efficiency, also generate a matrix of affected tests per service, not just services, so you don't run every test in every changed service.
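Putting the steps above together — a sketch assuming services live under `services/<name>/` and each has its own `npm test`; the `origin/main` base ref and the directory layout are assumptions about the repo:

```yaml
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.changed.outputs.services }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full history so the merge-base diff works
      - id: changed
        run: |
          changed=$(git diff origin/main...HEAD --name-only \
            | grep '^services/' | cut -d/ -f2 | sort -u || true)
          # Convert the newline-separated list to a compact JSON array, dropping blanks
          services=$(printf '%s\n' "$changed" | jq -R . | jq -cs 'map(select(length > 0))')
          echo "services=$services" >> "$GITHUB_OUTPUT"

  test:
    needs: setup
    if: needs.setup.outputs.services != '[]'   # skip entirely when nothing changed
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: ${{ fromJSON(needs.setup.outputs.services) }}
    steps:
      - uses: actions/checkout@v4
      - run: cd "services/${{ matrix.service }}" && npm ci && npm test
```

The `|| true` matters: `grep` exits non-zero when nothing matches, which would otherwise fail the step under the runner's `pipefail` shell options.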
Follow-up: Design a system where the matrix adapts based on which services failed tests in the previous run.
Your CI matrix has 20 jobs, and they usually finish in 15 minutes. One day, one job takes 2 hours (network timeout in integration tests), while the other 19 finish quickly. The entire build is now gated on that one slow job. The team is blocked. How do you prevent a single slow job from blocking the whole build in the future?
Separate critical and non-critical jobs: (1) Mark slow or flaky tests with `continue-on-error: true` (GitHub Actions has no `allow-failure` key) so their failures don't fail the workflow or block deployments. (2) Use a required-checks strategy: designate a subset of jobs as "required" (fast, reliable tests) and the rest as "optional" (slow, flaky tests), then configure GitHub branch protection to require only the critical checks. Branch protection matches on check names, so give the critical jobs stable, distinct names. (3) Implement job timeouts: `timeout-minutes: 30` kills a hung job instead of letting it run toward the 6-hour default limit. (4) Better: split the matrix into separate workflows with different timing targets — a critical path (fast tests) that must pass to unblock, and a secondary path (slow tests) that runs in parallel and can fail without blocking. (5) Use a job-dependency DAG: build (fast), then tests in parallel, then deploy gated only on the build and fast tests — don't make deploy wait for slow tests. Send slow-test results to a monitoring dashboard; alert on failures but don't block deployments.
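A minimal sketch of the split, assuming `npm test` is the fast suite and a hypothetical `test:integration` script is the slow one — only `unit-tests` would be listed as a required check in branch protection:

```yaml
jobs:
  unit-tests:                 # fast and reliable: make this the required check
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test

  integration-tests:          # slow and network-dependent: non-blocking
    runs-on: ubuntu-latest
    timeout-minutes: 30       # kill a hung job instead of waiting 2 hours
    continue-on-error: true   # failures show in the UI but don't gate merges
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:integration
```

With this shape, the 2-hour network timeout from the scenario would have been cut off at 30 minutes and still wouldn't have blocked anyone.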
Follow-up: Design a tiered CI strategy where fast tests unblock deployments, and slow tests run asynchronously.
You have a matrix that tests Python versions 3.9, 3.10, 3.11, 3.12. A breaking change in Python 3.12's standard library broke your tests. You want to exclude 3.12 temporarily, but another engineer keeps accidentally re-including it by reverting your change. How do you prevent this?
Encode the exclusion outside the workflow YAML: (1) Create a `.github/matrix-config.json` file with the supported versions and exclusions: `{ "python": ["3.9", "3.10", "3.11"], "exclude": ["3.12"] }`. (2) In the workflow setup step, read this file and compute the matrix: `matrix=$(jq -c '.python' .github/matrix-config.json)`. (3) The matrix YAML then becomes: `matrix: python: ${{ fromJSON(needs.setup.outputs.matrix) }}`. (4) An engineer changing this file triggers a PR review — easier to catch than an accidental edit buried in workflow YAML. (5) Add a CI check that the config file is valid JSON. (6) Document clearly: "Modify `.github/matrix-config.json` to change the test matrix, not the workflow YAML." (7) Make the exclusion visibly temporary: add an entry with an issue link, e.g. `"exclude-3.12": "See issue #1234 for status"`, so it's obvious when it's safe to re-enable.
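Wiring the config file into the workflow might look like this — a sketch assuming the `.github/matrix-config.json` shape above and a pytest-based suite (the `requirements.txt`/`pytest` commands are assumptions):

```yaml
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      versions: ${{ steps.cfg.outputs.versions }}
    steps:
      - uses: actions/checkout@v4
      - id: cfg
        run: |
          # Single source of truth: the checked-in config file, not this YAML
          versions=$(jq -c '.python' .github/matrix-config.json)
          echo "versions=$versions" >> "$GITHUB_OUTPUT"

  test:
    needs: setup
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ${{ fromJSON(needs.setup.outputs.versions) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
      - run: pip install -r requirements.txt && pytest
```

Reverting the exclusion now requires editing a small, single-purpose JSON file whose diff is hard to miss in review.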
Follow-up: How would you automate re-enabling an excluded version once a fix is available?
Your matrix tests your app on 4 browsers (Chrome, Firefox, Safari, Edge) × 3 OS (Windows, macOS, Linux) = 12 jobs. Each job runs for 2 hours. A team member mistakenly ran the workflow on every commit to a feature branch, triggering 12 jobs × 50 commits = 600 jobs in 24 hours. This quadrupled your GitHub Actions bill and consumed all concurrent runners. How do you prevent this?
Implement concurrency controls: (1) Use `concurrency` to cap runs per branch: `concurrency: group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: true`. This ensures only one workflow run per branch; new commits cancel in-progress runs for older commits. (2) Reduce matrix combinations on feature branches: select the matrix values with a conditional expression — `fromJSON` on the full list when `github.ref == 'refs/heads/main'`, a smaller list otherwise — or split into two workflows with different branch triggers. (3) Implement job-level timeouts and resource limits. (4) Require approval before running expensive workflows: use `workflow_dispatch`, or gate the big jobs behind a protected environment with required reviewers. (5) Set GitHub Actions spending limits at the org level so runaway usage stops at a threshold. (6) Rate-limit matrix expansions: if a computed matrix would create more than ~10 jobs, require manual approval (e.g., via the environment gate from point 4). (7) For feature branches, run a smaller matrix (2 browsers × 1 OS = 2 jobs) for quick feedback, then run the full matrix only on main or before merge.
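The concurrency group and the branch-scoped matrix can be combined in one workflow — a sketch assuming a hypothetical `test:e2e` npm script that accepts a `--browser` flag:

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true    # a new push cancels the run for the previous commit

jobs:
  browser-tests:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # Full matrix on main; a single smoke combination everywhere else
        browser: ${{ github.ref == 'refs/heads/main' && fromJSON('["chrome", "firefox", "safari", "edge"]') || fromJSON('["chrome"]') }}
        os: ${{ github.ref == 'refs/heads/main' && fromJSON('["windows-latest", "macos-latest", "ubuntu-latest"]') || fromJSON('["ubuntu-latest"]') }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:e2e -- --browser "${{ matrix.browser }}"
```

On a feature branch this expands to a single job; on main it expands to the full 4 × 3 grid. The `&& ... || ...` pattern works here because `fromJSON` of a non-empty array is always truthy.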
Follow-up: Design a matrix strategy that reduces test scope on PRs but runs full tests on main and releases.
You have a complex matrix with OS, Node version, database type (PostgreSQL, MySQL), and Redis version. 2 of the 48 combinations fail consistently: Node 16 + PostgreSQL + Redis 6, and Node 20 + MySQL. These are edge cases you're not ready to fix. You want to exclude them without bloating the workflow YAML with too much logic.
Use a matrix exclusion list with clear documentation: (1) Define the matrix variables, then exclude the known-failing combinations: `strategy: matrix: node: [16, 20], db: [postgres, mysql], redis: [6, 7], exclude: [{node: 16, db: postgres, redis: 6}, {node: 20, db: mysql}]`. (2) Add comments explaining why: `# TODO: Node 16 + PostgreSQL + Redis 6 leaks memory; fix tracked in #1234`. (3) To keep the workflow DRY, extract the matrix and exclusions to a separate config file (`.github/test-matrix.json`) and load it: `matrix: ${{ fromJSON(needs.setup.outputs.matrix) }}`. (4) Set up an automated check: every 30 days, run a separate workflow that re-enables all exclusions in dry-run mode to detect when the underlying issues are fixed, and report the findings. (5) Link each exclusion to an issue: when the issue closes, remove the exclusion. (6) For large exclusion lists (more than ~5 items), seriously question whether the matrix is too complex — consider splitting it into separate workflows.
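In full workflow syntax, the exclusion list from point (1) reads as follows (the issue number is the one from the scenario; everything else matches the question's matrix axes):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [16, 20]
        db: [postgres, mysql]
        redis: [6, 7]
        exclude:
          # TODO: Node 16 + PostgreSQL + Redis 6 leaks memory; fix tracked in #1234
          - node: 16
            db: postgres
            redis: 6
          # Node 20 + MySQL fails regardless of Redis version, so redis is omitted:
          # an exclude entry matches every combination of the unspecified axes.
          - node: 20
            db: mysql
    steps:
      - run: echo "node=${{ matrix.node }} db=${{ matrix.db }} redis=${{ matrix.redis }}"
```

Note the partial-match behavior in the second entry — leaving out `redis` excludes both the Redis 6 and Redis 7 variants, which keeps the list short.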
Follow-up: How would you implement an automated system that periodically tests excluded matrix combinations to detect when they're fixed?