Your team's CI job downloads npm dependencies (node_modules is 500 MB) every single run—takes 3 minutes per job. With 50 jobs/day, that's 150 minutes of wasted time. Network bandwidth is also expensive. You know caching exists but aren't sure how to set it up safely.
Use GitHub's caching action. Create a cache key based on the dependency lockfile: `uses: actions/cache@v4 with: path: node_modules key: npm-${{ hashFiles('package-lock.json') }} restore-keys: npm-`. GitHub evicts caches that haven't been accessed for 7 days and caps total cache storage per repository (currently 10 GB). On a cache hit, `npm install` finishes in seconds because nothing needs downloading. Key insight: always hash the lockfile (`package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`), never timestamps or branch names. This invalidates the cache exactly when dependencies actually change; a branch-name key would happily serve stale dependencies forever. For safety: (1) cache only dependencies, not build artifacts (they can be huge and stale). (2) Run `npm install` even on a cache hit—npm reconciles the cached modules against the lockfile. (3) Don't compress anything yourself—the cache action compresses the archive automatically. (4) Limit the cache to the data you need—don't cache `.git` or build outputs. One alternative worth knowing: cache npm's download cache (`~/.npm`) instead of `node_modules` and install with `npm ci`, since `npm ci` deletes `node_modules` before installing anyway.
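Written out as a full workflow step rather than an inline one-liner, the snippet above looks like this (job and step names are illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Restore node_modules keyed on the lockfile hash; the bare
      # "npm-" restore-key allows a prefix match after an exact miss.
      - uses: actions/cache@v4
        with:
          path: node_modules
          key: npm-${{ hashFiles('package-lock.json') }}
          restore-keys: npm-

      # Run install even on a hit so npm reconciles the restored
      # modules against the lockfile.
      - run: npm install
      - run: npm test
```

On an exact key hit the install step is nearly a no-op; on a `restore-keys` prefix hit, `npm install` only downloads the packages that changed.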
Follow-up: How would you handle cache invalidation for transitive dependency updates without modifying the lockfile?
You set up npm caching. Most jobs hit the cache and finish quickly. But occasionally, a job reports "cache not found" even though other jobs in the same branch are using the same dependencies. The jobs are inconsistent: some are fast (cache hit), some are slow (cache miss). Why?
GitHub caches are scoped by branch within the repo: a job can read caches created on its own branch or on the base/default branch, but not on sibling branches. Possible issues: (1) Cache key mismatch: the job computed a different key than previous runs (e.g., the hash inputs changed). Verify: the restore step logs the computed key—check that the fast and slow jobs logged identical keys. (2) Cache eviction: GitHub deletes caches that haven't been accessed in 7 days, and evicts least-recently-used entries once the repository exceeds its storage quota. Solution: a scheduled warmup workflow that restores the cache to keep it warm, or accept the occasional rebuild. (3) Lockfile changed: a commit touching `package-lock.json` produces a new key, so the first job after the change always misses. Check: `git log -- package-lock.json`. (4) Branch scoping: a feature-branch job cannot see caches created on another feature branch. Make sure the cache is also populated on the default branch so every PR has a fallback. (5) Runner/platform differences: caches are not automatically partitioned by runner OS, but modules built on one platform may not work on another—include `runner.os` in the key (e.g., `${{ runner.os }}-npm-...`) so each platform gets its own entry. (6) Cache upload timing: the cache is saved in a post-job step after the job finishes. If jobs run in parallel, later jobs may start before the first upload completes and see a miss. Solution: run the dependency-installation job first and gate the rest with `needs:`.
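Points (5) and (6) can be sketched as a two-job workflow: an OS-scoped key, and a dedicated install job that downstream jobs wait on so they never race the cache upload (job names are illustrative):

```yaml
jobs:
  warm-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: node_modules
          # Include the runner OS so platform-specific builds don't mix.
          key: ${{ runner.os }}-npm-${{ hashFiles('package-lock.json') }}
      - run: npm install   # populates the cache; saved in the post step

  test:
    needs: warm-cache      # the cache upload has completed by now
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: node_modules
          key: ${{ runner.os }}-npm-${{ hashFiles('package-lock.json') }}
      - run: npm test
```

The `needs:` edge trades a little parallelism for a guaranteed cache hit in every downstream job.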
Follow-up: How would you implement a "cache warmup" job that runs on schedule to prevent cache expiry?
You cache dependencies successfully for npm. Now you're adding Python (pip) and Java (Maven) to the same workflow. The combined caches are huge (1.5 GB per key variant), and GitHub caps a repository's total cache storage at 10 GB—with a few active branches, each producing its own entries, you're close to the limit. How do you optimize?
Separate caches by ecosystem and be selective about what you cache. Strategy: (1) Separate cache keys: `npm-${{ hashFiles('package-lock.json') }}`, `pip-${{ hashFiles('requirements.txt') }}`, `maven-${{ hashFiles('pom.xml') }}`. Don't create one monolithic cache—a change in any one ecosystem would invalidate everything. (2) Cache only what's expensive to rebuild: npm packages (yes), Maven artifacts (yes, `~/.m2/repository` is slow to repopulate), Python packages (maybe—`pip install` is often fast, but with hundreds of packages or compiled wheels, cache them). (3) Exclude rebuildable content: don't cache build outputs, compiled binaries, or test results. (4) Prefer setup actions with built-in caching: `actions/setup-node@v4` caches npm when you set `cache: npm`, `actions/setup-python@v5` caches pip with `cache: pip`, and `actions/setup-java@v4` caches Maven with `cache: maven`. These cache the package managers' download caches (smaller than installed trees) and manage keys for you. (5) If caching Maven manually, cache only `~/.m2/repository`, never `target/` directories. (6) Monitor usage: the repository's Actions → Caches page lists entries and sizes. GitHub evicts least-recently-used caches once the repo exceeds its quota, and you can delete entries explicitly with `gh cache delete` or the REST API. (7) For very large dependency sets, bake them into a Docker image instead of caching—pulling one pre-built image is often faster than restoring a multi-GB cache.
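Points (1) and (4) combine into a sketch where every ecosystem uses its setup action's built-in caching (versions and Node/Python/Java versions are illustrative):

```yaml
steps:
  - uses: actions/checkout@v4

  # Caches ~/.npm keyed on the lockfile automatically.
  - uses: actions/setup-node@v4
    with:
      node-version: 18
      cache: npm

  # Caches pip's download cache keyed on requirements files.
  - uses: actions/setup-python@v5
    with:
      python-version: '3.12'
      cache: pip

  # Caches ~/.m2/repository keyed on pom.xml.
  - uses: actions/setup-java@v4
    with:
      distribution: temurin
      java-version: '17'
      cache: maven
```

Each setup action maintains its own independent cache entry, so a `pom.xml` change never invalidates the npm or pip caches.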
Follow-up: Design a multi-language project where each language's dependencies are cached independently with TTL enforcement.
A PR updates a dependency in package-lock.json from version 1.0.0 to 1.1.0. The CI job runs, hits the old cache (version 1.0.0), and proceeds with the old dependency. The tests pass, but when deployed to production, the new dependency breaks. How do you prevent this?
This is a cache invalidation failure. Find where the stale modules slipped through: (1) The cache key should include `hashFiles('package-lock.json')`, so a lockfile change must produce a new key. If the key uses something else (e.g., `github.sha` or a branch name), stale caches can match. Also inspect `restore-keys`: a prefix fallback like `npm-` restores an old `node_modules` after a miss, which is only safe if you still run an install afterwards. (2) Use `npm ci` instead of `npm install`—it removes `node_modules` and installs exactly what the lockfile specifies (the yarn equivalent is `yarn install --frozen-lockfile`, pnpm's is `pnpm install --frozen-lockfile`). This prevents version drift regardless of what the cache restored. (3) Debug: print the computed cache key before and after the dependency update. If it's identical, the hash inputs are wrong. (4) Add a validation step after install: verify the installed version matches the lockfile, e.g. run `npm ls express` and compare against the version recorded in `package-lock.json`. (5) For extra safety, skip cache restore on dependency-update PRs: label them (Dependabot applies a `dependencies` label by default), then guard the cache step with `if: "!contains(github.event.pull_request.labels.*.name, 'dependencies')"`. (6) Better still: make the mismatch fail the build—`npm ci` already exits nonzero when `package.json` and the lockfile disagree.
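Points (2) and (4) can be sketched as follows; note that `npm ci` deletes a restored `node_modules`, which is why this sketch caches `~/.npm` instead (the `jq` queries and the `express` package are illustrative):

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/cache@v4
    with:
      path: ~/.npm
      key: npm-${{ hashFiles('package-lock.json') }}

  # Clean install: exactly the lockfile, fails on any mismatch
  # between package.json and package-lock.json.
  - run: npm ci

  # Belt-and-braces: compare the installed version against the
  # version the lockfile records.
  - name: Verify installed version matches lockfile
    run: |
      locked=$(jq -r '.packages["node_modules/express"].version' package-lock.json)
      installed=$(jq -r '.version' node_modules/express/package.json)
      echo "locked=$locked installed=$installed"
      test "$locked" = "$installed"
```

If the two versions ever diverge, the `test` command exits nonzero and the job fails—exactly the alert the follow-up question asks about.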
Follow-up: How would you detect and alert if the installed dependencies don't match the lockfile during CI?
Your Docker build caches layers. You have a `FROM node:18-alpine` base image layer that's cached. One day, Node 18 releases a security patch. Your local Docker cache still has the old version. CI picks up the new patch immediately (fresh download), but your local dev environment doesn't. Developers unknowingly test against an old, vulnerable version.
This is a layer-cache invalidation issue. The tag `node:18-alpine` is mutable—it is re-pointed when security patches ship—but Docker's build cache keys on the tag, not the content, so a locally cached base layer is never refreshed. To fix: (1) Always re-pull base images in CI: `docker build --pull` checks the registry for a newer image behind the tag. (Reserve `--no-cache` for full rebuilds—it disables layer caching entirely.) (2) Rebuild on a schedule: a cron-triggered workflow that rebuilds and pushes images weekly picks up patched bases even when no code changed. (3) Pin by digest: `FROM node:18-alpine@sha256:abcd1234` is immutable and auditable. Update the digest via explicit commits (Dependabot and Renovate can automate this), so base-image upgrades are visible and reviewed. (4) For development: run `docker pull node:18-alpine` before building locally to refresh the tag. (5) In GitHub Actions, the Docker build action (`docker/build-push-action@v5`) has a `pull` option: `pull: true` always re-pulls base images. (6) Mind the layer-structure trade-off: combining RUN commands shrinks the image but makes cache invalidation coarser—keep slow, stable steps like dependency installation in their own early layers so they stay cached across rebuilds.
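Points (3) and (5) together, as a sketch (the digest and registry name are placeholders):

```yaml
# In the Dockerfile, pin the base by digest so builds are reproducible:
#   FROM node:18-alpine@sha256:abcd1234

steps:
  - uses: actions/checkout@v4
  - uses: docker/setup-buildx-action@v3
  - uses: docker/build-push-action@v5
    with:
      context: .
      pull: true          # always check the registry for a newer base
      push: true
      tags: registry.example.com/app:latest   # illustrative registry
```

With a digest pin, `pull: true` is cheap insurance: the digest never changes underneath you, and upgrades arrive only as reviewed Dockerfile commits.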
Follow-up: Design a system where developers automatically get alerts when base image security patches are available.
Your CI uses a dependency cache. On rare occasions, the cache becomes corrupted (some files are truncated). When the build pulls the corrupted cache, it uses broken files, the build fails, and the workflow blocks. Regenerating the cache requires a full rebuild (30 minutes).
Implement cache validation: (1) After restoring the cache, validate it. For npm: `npm ls --all` exits nonzero if the dependency tree is broken or packages are missing. For Python: `pip check` verifies that installed packages and their declared dependencies are compatible. (2) Compute a checksum when saving (e.g., a manifest file of file hashes stored inside the cached directory) and compare after restore. On mismatch, treat the cache as corrupted, fail fast, and rebuild. (3) Clean up suspect entries: if a restored cache precedes a failed job, delete it with `gh cache delete` (or the REST API) so future runs regenerate it. (4) Always have a fallback: if validation fails, fall back to a clean `npm install` with no cache. This trades speed for reliability and unblocks the workflow instead of failing it. (5) Run a scheduled (e.g., weekly) "cache validation" job that restores each cache, validates it, and alerts if corruption is detected. (6) Keep `actions/cache` up to date: recent versions store compressed archives, and a truncated archive typically fails to extract—surfacing as a restore failure rather than silently broken files.
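Points (1) and (4) as a sketch: restore, validate, and fall back to a clean install if validation fails (step ids are illustrative):

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/cache@v4
    id: cache
    with:
      path: node_modules
      key: npm-${{ hashFiles('package-lock.json') }}

  # Validate the restored tree; record the outcome instead of failing.
  - name: Validate cached modules
    id: validate
    if: steps.cache.outputs.cache-hit == 'true'
    run: npm ls --all
    continue-on-error: true

  # Clean install on a miss or if the restored cache was broken.
  - name: Fallback install
    if: steps.cache.outputs.cache-hit != 'true' || steps.validate.outcome == 'failure'
    run: |
      rm -rf node_modules
      npm install
```

`continue-on-error` plus the `steps.validate.outcome` check is what turns a corrupted cache from a blocked workflow into a 30-minute rebuild at worst.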
Follow-up: Design a cache health monitoring system that proactively identifies and removes corrupted caches.
You have a monorepo with 20 services. Each service has its own node_modules cache key. When service A's dependencies change, you want to bust only that service's cache, not all 20. However, if a transitive dependency (used by all services) changes, you want to bust all caches at once. How do you structure the cache key strategy?
Use hierarchical cache keys with a shared dependency hash: (1) Compute two hashes: one for service-specific dependencies (`services/auth/package-lock.json`) and one for shared dependencies (the root `package-lock.json`). (2) Cache key: `npm-${{ hashFiles('package-lock.json') }}-${{ hashFiles('services/auth/package-lock.json') }}`. (3) List restore keys (fallbacks) from most to least specific: first the root-hash prefix `npm-${{ hashFiles('package-lock.json') }}-` (same shared deps, different service deps), then the bare `npm-` prefix. (4) When root dependencies change, the root hash changes, so every service's key changes (all miss). When only service A's lockfile changes, only service A's key changes. (5) This requires discipline: shared dependencies must live in the root lockfile, service-specific ones in each service directory. (6) Alternative: split the workspace into multiple cache entries—one per service plus one for shared dependencies. Each extra restore/save step adds overhead, so reserve this for very large monorepos.
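The key and fallback ordering from points (2) and (3), written out for one service (repeat per service, substituting the path):

```yaml
# Cache step for the auth service.
- uses: actions/cache@v4
  with:
    path: services/auth/node_modules
    key: npm-${{ hashFiles('package-lock.json') }}-${{ hashFiles('services/auth/package-lock.json') }}
    restore-keys: |
      npm-${{ hashFiles('package-lock.json') }}-
      npm-
```

The first restore key matches any sibling cache built against the same shared dependencies, so even a full miss on the service hash restores most of the tree.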
Follow-up: Design a cache strategy for a monorepo that minimizes invalidation when any service's dependencies change.
You optimized caching and cut build time from 15 minutes to 5 minutes. Now the team is adding a database schema validation step that requires a full database dump (200 MB) to be cached. Caching this would hit GitHub's storage limits immediately. What's your alternative to caching?
Don't cache the database dump. Instead: (1) Store the dump in persistent storage outside GitHub's cache (S3, an artifact registry, Artifactory) and fetch it by URL: `curl -o schema.sql https://s3.example.com/db-schema.sql`. This bypasses GitHub's cache limits and allows longer retention. (2) Use a container with the schema pre-built: create a Docker image with the database and schema pre-initialized and push it to your registry. At test time, `docker pull` the image (layer-cached on the runner if present) and spin up a container—faster than downloading a dump and hydrating a fresh database. (3) Compute on demand: generate the schema at test time from migration scripts. If generation is cheap (under ~30 s), don't cache it at all—the overhead isn't worth it. (4) Use a service container in GitHub Actions: a `services:` block in the workflow runs PostgreSQL/MySQL in a container with the schema pre-loaded. It's ephemeral (nothing cached), and once the image is on the runner it starts in seconds. (5) Segment storage: keep small, fast-changing things (npm packages) in GitHub's cache and large, slow-changing things (database dumps) in external storage. This keeps the GitHub cache lean and your builds fast.
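Points (2) and (4) combine into a sketch like the following; the pre-built image name and database URL are illustrative, and a private registry would additionally need `credentials:` on the service:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        # Hypothetical image with the 200 MB schema baked in at build time.
        image: registry.example.com/ci/postgres-with-schema:latest
        ports:
          - 5432:5432
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 5s
          --health-timeout 5s
          --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - run: npm test
        env:
          DATABASE_URL: postgres://postgres@localhost:5432/app
```

The health-check options make the job wait until the database is actually accepting connections before tests run, so nothing here touches GitHub's cache quota.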
Follow-up: Design a caching strategy that uses containers for large schemas and GitHub cache for small dependencies.