Your CI workflow builds an artifact (compiled binary, 200 MB). The artifact is used by a deployment job that runs 1 hour later. You uploaded the artifact to GitHub's artifact storage, but 90 days later, GitHub automatically deleted it. A deployment job later that month tried to download the deleted artifact and failed. How do you handle artifact retention?
GitHub retains artifacts for 90 days by default (configurable per repository or organization: public repos can set at most 90 days, private repos up to 400). For long-term storage: (1) If the artifact is needed long-term, don't rely on GitHub's temporary storage. (2) Upload to a persistent location: S3, GCS, an artifact registry, or a Docker registry. These persist indefinitely (or until you delete them). Example: after building, push the binary to S3: `aws s3 cp binary.tar.gz s3://my-bucket/releases/v1.0.0/`. (3) For Docker images, push to a registry (Docker Hub, ECR, GCR). Images are versioned and persist indefinitely. (4) For GitHub artifacts specifically: use `retention-days` to control how long artifacts are kept. The default is 90; increase it for critical artifacts where your repo's maximum allows: `uses: actions/upload-artifact@v4 with: retention-days: 365` (private repos only; public repos cap at 90). (5) Strategy: use GitHub artifacts for short-lived data (test results, logs, intermediate builds used within hours). Use external storage for long-lived data (releases, Docker images, binaries deployed to production). (6) For compliance (e.g., retaining build artifacts for 7 years for audits), use S3 with Glacier archival: store cold data in Glacier (much cheaper) for long retention.
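A minimal sketch combining (2) and (4) — the bucket name `my-bucket`, the `AWS_ROLE_ARN` secret, and the `make build` step are placeholders, and the OIDC-based AWS auth assumes a role already configured for this repo:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC auth to AWS
      contents: read
    steps:
      - uses: actions/checkout@v4
      - run: make build && tar czf binary.tar.gz bin/
      # Short-lived copy for later jobs in this workflow run
      - uses: actions/upload-artifact@v4
        with:
          name: binary
          path: binary.tar.gz
          retention-days: 7
      # Durable copy for releases and audits
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - run: aws s3 cp binary.tar.gz "s3://my-bucket/releases/${GITHUB_REF_NAME}/"
```

The same artifact exists in two tiers: GitHub's storage serves the deployment job an hour later; S3 serves everything after that.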
Follow-up: Design an artifact lifecycle management system that balances storage cost and availability.
Your workflow uploads a 500 MB artifact. The next job downloads it. This seems wasteful: the artifact is stored on GitHub's servers, then downloaded back to the runner. You're paying for storage and bandwidth. Is there a way to avoid this?
GitHub's artifact storage is meant to be temporary and is efficient enough for most use cases. However, if the artifact is used only within the same workflow run (no external download): (1) Use job outputs instead of artifacts: if the data is small (metadata, file paths, strings), pass it via `echo "key=value" >> "$GITHUB_OUTPUT"` and reference it in the next job: `${{ needs.build.outputs.key }}`. (2) For large binary artifacts: (a) use a container registry: build the Docker image, push to ECR/Docker Hub, then pull it in the deployment job. No intermediate storage on GitHub. (b) Use S3 (or similar) directly: the build job writes to S3, the deployment job reads from S3. No GitHub artifact storage. (3) If you must use GitHub artifacts: compress them (`tar.gz` instead of a raw binary) to reduce storage and download time. (4) For same-workflow scenarios: GitHub-hosted runners give each job a fresh VM, so there is no shared persistent volume; a self-hosted runner with local storage can share files between jobs on the same machine. (5) For most cases: GitHub's artifact storage + transfer is fast (same data center, optimized) and not a bottleneck. Focus on optimizing build time, not artifact transfer.
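A sketch of (1) and (2a) together — the registry host `registry.example.com` and image name `myapp` are placeholders; only a short tag string crosses job boundaries, while the 500 MB payload travels through the registry:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - id: meta
        run: echo "tag=myapp:${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
      # Push the image instead of uploading a large artifact to GitHub
      - run: |
          docker build -t "registry.example.com/${{ steps.meta.outputs.tag }}" .
          docker push "registry.example.com/${{ steps.meta.outputs.tag }}"
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: docker pull "registry.example.com/${{ needs.build.outputs.image-tag }}"
```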
Follow-up: When would you use artifacts vs. container registry vs. S3 for each scenario?
Your workflow has multiple jobs: build, test, deploy. Each job produces outputs (e.g., build produces `build-id`, test produces `test-results`). A downstream job needs to use outputs from both. Currently, you're manually tracking outputs in comments and passing via env vars. This is error-prone.
Use GitHub's job outputs systematically: (1) Each job defines outputs in its final step: `- name: Export metadata run: echo "build-id=$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT"`, and maps them at the job level via `outputs:`. (2) In the next job, reference outputs: `jobs: deploy: needs: [build, test] steps: - run: echo ${{ needs.build.outputs.build-id }}`. (3) For multi-dependency scenarios: `needs: [build, test, security-scan]` makes outputs from all upstream jobs available. (4) To prevent confusion, document outputs clearly: (a) add comments in each job explaining what it outputs, (b) use consistent naming conventions (`build-id`, `test-summary`, etc.), (c) validate output formats (is it a UUID, a file path?). (5) For complex data (structured JSON), use `jq` to parse and extract: `test_summary=$(jq -r .summary test-result.json) && echo "summary=$test_summary" >> "$GITHUB_OUTPUT"`. (6) Limitations: outputs are strings, capped at roughly 1 MB each. For large data, use artifacts. (7) For outputs used across multiple workflows (not just jobs), store them in a database or artifact registry and query via an external service.
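Steps (1)–(3) can be sketched as a workflow fragment — job and output names are illustrative. Two common gotchas it avoids: the producing step must have an `id`, and the job must map step outputs to job outputs in its `outputs:` block, or downstream `needs` references silently resolve to empty strings:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      build-id: ${{ steps.export.outputs.build-id }}
    steps:
      - uses: actions/checkout@v4
      - name: Export metadata
        id: export
        run: echo "build-id=$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT"
  test:
    runs-on: ubuntu-latest
    outputs:
      test-summary: ${{ steps.summarize.outputs.summary }}
    steps:
      - id: summarize
        run: echo "summary=42 passed, 0 failed" >> "$GITHUB_OUTPUT"
  deploy:
    needs: [build, test]   # outputs from all upstream jobs become available
    runs-on: ubuntu-latest
    steps:
      - run: |
          echo "Deploying build ${{ needs.build.outputs.build-id }}"
          echo "Tests: ${{ needs.test.outputs.test-summary }}"
```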
Follow-up: How would you handle complex, nested JSON outputs across multiple jobs?
Your workflow stores test results in an artifact. A month later, you want to analyze trends: "How has test success rate changed over 30 days?" The test results are scattered across 30 different artifact uploads (one per day). How do you aggregate them?
Don't rely on GitHub artifacts for long-term analytics. Instead: (1) Export to a time-series database: after each test run, parse results and push to InfluxDB, Prometheus, or CloudWatch. Query historical data via the database's API. (2) Use GitHub's repository insights: GitHub provides some analytics (workflow runs, success rate). The API (`/repos/{owner}/{repo}/actions/runs`) returns metadata for all runs. Query this: filter runs by date, count passes/failures. (3) For structured data: store test results in a data warehouse (BigQuery, Redshift). Each workflow run pushes results there. Then run analytics queries. (4) Alternative: post-process artifacts. Create a nightly job that downloads all artifacts from the past 30 days, aggregates them, and outputs a summary report. This is slow but works. (5) Use a third-party service: tools like Datadog (CI Visibility), New Relic, or Buildkite Test Analytics integrate with GitHub and collect workflow metrics automatically. (6) Store in a human-accessible format: instead of binary artifacts, upload JSON or CSV summaries. These are easy to query, version, and audit. (7) To build up historical data: start pushing to a database now. Past runs won't be backfilled, but future data will be comprehensive.
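A sketch of (3)/(6): a step appended to the test job that publishes a small, date-stamped JSON summary to a queryable bucket after every run. The bucket name `my-metrics-bucket` and the `test-result.json` shape (with `passed`/`failed` counts) are assumptions:

```yaml
      - name: Publish daily test summary
        if: always()   # record failing runs too, or the trend is biased
        run: |
          # Assumes the test step wrote test-result.json with pass/fail counts
          jq '{date: now | strftime("%Y-%m-%d"), passed: .passed, failed: .failed}' \
            test-result.json > summary.json
          aws s3 cp summary.json \
            "s3://my-metrics-bucket/test-summaries/$(date +%Y-%m-%d)-${GITHUB_RUN_ID}.json"
```

Thirty days later, the trend question becomes a trivial query over thirty small JSON files instead of thirty artifact downloads.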
Follow-up: Design a continuous testing analytics system that tracks trends over time.
Your team uses artifacts to pass data between jobs. A build job uploads artifact `build-output.tar.gz`. The deployment job tries to download it but gets "artifact not found." The build job completed successfully. Why might this happen?
Possible causes: (1) Artifact retention expired: if the build job ran >90 days ago (or past your configured retention), GitHub deleted the artifact. Check how long ago the build ran; if it's past retention, re-run the build to recreate the artifact. (2) Artifact scope: artifacts are scoped to the workflow run. If the deployment runs in a separate workflow (e.g., a `workflow_run`-triggered deploy), it won't see the build workflow's artifacts by default; `actions/download-artifact@v4` can reach another run only if you pass `run-id:` and a `github-token:`. (3) Name mismatch: the artifact is uploaded with name `build-output` but the download step looks for `build-output.tar.gz`. GitHub stores artifacts under the name you provide; the filename inside might be `build-output.tar.gz`, but the artifact name is `build-output`. (4) Build job didn't complete: if the build job was cancelled or failed before the upload step, no artifact was created. Check the build job's logs. (5) Conditional upload: the build job has `if: success()` on the upload step, so if an earlier step failed, the upload was skipped. (6) Permission issue: downloading artifacts via the API requires `actions: read` permission on the token. (7) Solution: ensure build and deploy jobs run in the same workflow invocation. Use `needs: [build]` to create the dependency. Check that the artifact name matches exactly. For long-lived artifacts, upload to S3 instead of GitHub's temporary storage.
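A minimal correct wiring for causes (2), (3), and (7) — the build command is a placeholder. Note that the artifact `name` (not the filename inside it) is what must match between upload and download:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build && tar czf build-output.tar.gz dist/
      - uses: actions/upload-artifact@v4
        with:
          name: build-output          # the artifact's name...
          path: build-output.tar.gz   # ...the filename inside is separate
  deploy:
    needs: build   # same workflow run, so the artifact is in scope
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: build-output          # must match the upload name exactly
```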
Follow-up: How would you implement a system that automatically retries artifact downloads with fallback to alternative storage?
Your workflow produces test reports (HTML, JSON, screenshots). Some reports are huge (50 MB for screenshots). Uploading and storing these artifacts is expensive. But QA needs access to them for analysis. How do you balance storage cost with accessibility?
Implement tiered artifact storage: (1) Short-lived: store high-volume, large artifacts (screenshots) for only 7 days. Set `retention-days: 7`. (2) Medium-term: structured data (JSON test results) for 30 days. (3) Long-term: summary reports (aggregated pass/fail counts) for as long as your repo's retention limit allows. (4) Compress aggressively: `tar.gz` can reduce size by ~70% for text. Screenshots won't compress much (PNG/JPEG are already compressed). Use `xz` for maximum compression (slower but smaller). (5) Selective upload: don't upload everything. Example: upload full reports only for failed tests, and screenshots only from failed scenarios. Runs where every test passes upload only the summary (tiny). (6) External storage: for expensive artifacts (video recordings, large test data), use S3 Glacier. Cost is on the order of $0.004–0.01/GB/month, versus roughly $0.25/GB/month for GitHub Actions storage overages. (7) Asynchronous processing: immediately after the test run, push results to S3 and let them expire from GitHub artifacts after 7 days. When QA needs old reports, they pull from S3. (8) Generate on-demand: instead of storing full HTML reports, store raw JSON results; QA generates HTML reports from JSON on demand (fast, cheaper). (9) Retention policy: enforce via workflow automation, e.g., a monthly job that archives old artifacts to Glacier.
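Tiers (1)–(3) plus the selective upload in (5) reduce to a few conditional upload steps in the test job — paths and retention values below are illustrative:

```yaml
      # Tier 1: bulky screenshots, only when tests fail, kept one week
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: screenshots
          path: test/screenshots/
          retention-days: 7
      # Tier 2: structured results for every run, kept a month
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: test-result.json
          retention-days: 30
      # Tier 3: tiny aggregate summary, kept at the repo's maximum
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: summary
          path: summary.json
          retention-days: 90
```

On a green run, only the two small JSON artifacts are uploaded; the 50 MB screenshot tier never leaves the runner.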
Follow-up: Design a cost-optimized artifact storage strategy with tiered retention and selective uploads.
Your CI publishes build artifacts (binaries, libraries) to a package registry (npm, PyPI, Maven). An artifact for version 1.0.0 is published. Later, you discover a critical security bug. You delete the artifact from the registry, but developers who already downloaded 1.0.0 still have the vulnerable version. They're unaware. How do you handle this?
Deleting artifacts doesn't reach developers who cached them. Better approach: (1) Never delete published artifacts; instead, mark them as "yanked" (deprecated). npm, PyPI, and Maven all support some form of yanking. The version is hidden from default installs, but cached copies still work for existing installations. (2) Release a patched version immediately: 1.0.1. Announce: "1.0.0 has a critical security bug; upgrade to 1.0.1." (3) Use vulnerability scanners: `npm audit`, `pip-audit`, and Maven dependency-check will flag 1.0.0 as vulnerable once an advisory is published. Developers see warnings. (4) Publish a security advisory: on GitHub, create a security advisory linking the vulnerability to the affected versions. Users get notifications. (5) For severity: if the bug is critical (remote code execution, data leak), also attach a prominent deprecation warning to the affected version in the registry so installers see it at install time. (6) For already-deployed artifacts: if binaries are deployed to production servers or Docker registries, you can't recall them. This is why deployment testing is critical: catch bugs before production. (7) Post-mortem: implement pre-publish checks to prevent publishing vulnerable versions. Use SBOM generation, CVE scanning, and security reviews before releasing.
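For npm, the yank/deprecate step from (1) and (5) could be run as a small workflow job — the package name `my-pkg`, versions, and the `NPM_TOKEN` secret are placeholders:

```yaml
jobs:
  yank:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/setup-node@v4
        with:
          registry-url: https://registry.npmjs.org
      # Cached installs of 1.0.0 keep working; new installs get a loud warning
      - run: npm deprecate my-pkg@1.0.0 "Critical security issue; upgrade to 1.0.1"
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```

PyPI (yank via the web UI or API) and Maven (relocation/deprecation metadata) have analogous mechanisms, though the commands differ.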
Follow-up: How would you implement pre-release security scanning to prevent publishing vulnerable artifacts?
Your workflow uploads multiple artifacts from different jobs: test-report, coverage, build-log, etc. The deployment job needs only the build-log. Downloading all artifacts wastes time and bandwidth. Can you download selectively?
Yes, use `actions/download-artifact@v4` with selective filters: (1) Download a specific artifact: `uses: actions/download-artifact@v4 with: name: build-log` fetches only that artifact. (2) If you don't specify `name:`, all artifacts are downloaded (slower). (3) For pattern-based downloads: v4 supports glob matching via the `pattern:` input (e.g., `pattern: test-*`), optionally combined with `merge-multiple: true`; the GitHub CLI offers the same via `gh run download <run-id> --pattern "test-*"`. (4) For conditional downloads: guard the download step with `if: failure()` to fetch logs only when something failed. (5) Alternative: use job outputs instead of artifacts for small data; reserve artifacts for large files. (6) Design artifacts to be modular: each job uploads one focused artifact, not a monolithic tarball. This enables selective downloads. (7) For monolithic artifacts: use paths within the artifact. Upload `build-output/report.html`, `build-output/log.txt`, etc., then extract selectively: `tar xzf build-output.tar.gz build-output/log.txt` (extracts only the log file, skipping report.html). This is efficient.
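Points (1) and (3) as download steps in the deployment job — the artifact names follow the scenario above:

```yaml
      # Fetch exactly one artifact by name
      - uses: actions/download-artifact@v4
        with:
          name: build-log
      # Or glob-match several and merge them into one directory
      - uses: actions/download-artifact@v4
        with:
          pattern: test-*
          merge-multiple: true
          path: all-test-artifacts/
```

Without `merge-multiple: true`, each matched artifact lands in its own subdirectory named after the artifact, which is often what you want for keeping per-job outputs separate.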
Follow-up: Design a modular artifact structure that supports selective download and processing.