GitHub Actions Interview Questions

Debugging and Troubleshooting Workflows


A workflow job fails with a cryptic error: "Error: The operation was not successful." There's no additional context. The logs show job execution but the error message is unhelpful. You have no idea why it failed. How do you troubleshoot?

Enable debug logging: (1) GitHub Actions has a built-in debug mode. Set a secret or variable named `ACTIONS_STEP_DEBUG` to `true` at the repo or org level. This prints detailed diagnostic logs for each step; setting `ACTIONS_RUNNER_DEBUG: true` additionally captures runner-level diagnostic logs. (2) Alternatively, check "Enable debug logging" in the re-run dialog when re-running the failed job. (3) Re-run the failed job with debug enabled. The logs will show step-by-step execution, environment variables, and error context. (4) Look for the actual error: the true error is often buried in the debug output, not in the main failure message. (5) Check runner logs: if the runner itself is the issue (not the action), SSH into the runner and check system logs: `/var/log/syslog` (Linux) or Event Viewer (Windows). (6) Use `set -x` in bash scripts to trace execution: every command is printed before running, showing exactly where it fails. (7) Add explicit error handling: instead of letting the job fail silently, capture and report the exit code: `rc=$?; if [ $rc -ne 0 ]; then echo "Command failed with code $rc"; exit $rc; fi`. Capture `$?` immediately; every command (including the test itself) overwrites it. (8) For Python/Node: use try/except (Python) or try/catch (Node) blocks and log exceptions. (9) Create a minimal reproduction: if the job is complex, isolate the failing step and run it standalone to understand the issue.
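Points (6) and (7) are easy to get wrong in bash, because `$?` is overwritten by every command, including the test that inspects it. A minimal sketch, where `flaky_step` is a hypothetical stand-in for a failing build command:

```shell
# Hypothetical stand-in for a command that fails with exit code 3.
flaky_step() {
  return 3
}

set -x            # point (6): print each command before it runs
flaky_step
rc=$?             # capture immediately: a later `$?` would report the
set +x            # exit code of the *test*, not of flaky_step
if [ "$rc" -ne 0 ]; then
  echo "flaky_step failed with exit code $rc"
fi
```

The trace from `set -x` goes to stderr, so it shows up in the job log interleaved with the step output.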

Follow-up: How would you implement structured logging to make debugging easier?

A workflow runs fine locally on your machine but fails in GitHub Actions. The error: "npm: command not found." You have Node.js installed locally. Why would the runner not have npm?

The runner environment is different from your local machine. Possible causes: (1) The runner image doesn't include npm. GitHub's standard Ubuntu runner includes Node, but custom runners might not. Check: `node --version && npm --version` in the workflow. If this step fails, npm isn't installed. (2) Wrong runner OS: if your workflow specifies `runs-on: windows-latest` but you tested on Linux, npm might be installed differently or missing. (3) The step that runs npm doesn't have it in PATH. If a previous step modifies PATH and removes Node's bin directory, npm won't be found. (4) For self-hosted runners: npm might not be installed on the runner machine. SSH into it and install Node/npm. (5) Solution: add an explicit setup step: `uses: actions/setup-node@v4 with: node-version: '18'`. This ensures Node/npm are available. (6) Or, containerize: `container: node:18-alpine` runs the job inside a Docker container with Node pre-installed. (7) Verify: add a debug step at the beginning: `- run: which npm && npm --version`. This confirms npm is available before running your actual steps.
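Points (5) and (7) can be sketched as a workflow fragment; the Node version and package commands are illustrative:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '18'
      # Fail fast with a clear message if npm is somehow still missing
      - name: Verify toolchain
        run: which npm && npm --version
      - run: npm ci && npm test
```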

Follow-up: How would you create a runner health check that validates all required tools are installed?

A workflow step downloads a file from an external URL. Occasionally (10% of the time), the download fails with a timeout. The failure is intermittent: sometimes the same URL downloads fine, sometimes it times out. The job fails, and the entire workflow is blocked. How do you handle flaky external dependencies?

Implement resilience for external calls: (1) Add retry logic: use a shell loop or a dedicated action like `nick-invision/retry@v2`: `uses: nick-invision/retry@v2 with: timeout_minutes: 10 max_attempts: 3 command: curl -O https://example.com/file.zip`. (2) Set timeouts: `curl --max-time 30` or `wget --timeout=30` to fail fast instead of waiting indefinitely. (3) Use exponential backoff: don't retry immediately; wait between attempts. First retry after 2s, second after 4s. This gives the remote server time to recover. (4) Check URL health: before downloading, ping the URL: `curl -I https://example.com/file.zip`. If the server is down, fail immediately (rather than timing out). (5) Use fallback URLs: if the primary URL fails after 3 retries, try a mirror or cached version. (6) For critical dependencies: mirror them locally. Instead of downloading from an external source, maintain a copy in S3 or your own CDN. (7) For rate-limiting: GitHub Actions runners have shared IP addresses. If many jobs hit the same external server simultaneously, they might be rate-limited. Stagger downloads or use a shared cache. (8) Implement circuit breaker: if a URL has failed 10 times in the past hour, assume the service is down and skip the download (or alert). (9) Monitor: log all external calls and their latency. Track which URLs are flaky. Prioritize fixing or mirroring the worst offenders.
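Points (1) and (3) can be combined into a small reusable helper; a sketch assuming bash, where the command and attempt counts in the usage line are illustrative:

```shell
# Retry a command up to N times with exponential backoff (2s, 4s, 8s, ...).
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=2
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "attempt $attempt failed; retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  return 1
}

# Usage (URL is a placeholder):
# retry_with_backoff 3 curl --fail --max-time 30 -O https://example.com/file.zip
```

Note `curl --fail` in the usage line: without it, curl exits 0 on an HTTP error page, which would defeat the retry.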

Follow-up: Design a resilient dependency download system with caching and fallbacks.

A workflow runs 100+ steps. One step fails, and the entire workflow stops. You see only the failed step's error message. You don't know what state the system is in after 50 preceding steps. Debugging this is time-consuming. How do you better structure workflows for troubleshooting?

Implement structured workflow design: (1) Break workflows into smaller jobs: instead of one 100-step job, create 10 jobs with 10 steps each. This gives you isolation: if a job fails, you know exactly which phase failed. (2) Use meaningful names: `Build`, `Test`, `Deploy-Staging`, `Deploy-Prod`. These tell you the workflow stage. (3) Add checkpoints: after each major phase, output a summary: `echo "Build completed. Image: $IMAGE_ID"`. This confirms state. (4) Use status checks: after each critical step, verify success: `if [ $? -eq 0 ]; then echo "✓ Step succeeded"; else echo "✗ Step failed"; exit 1; fi`. (5) Logging: use structured logging (JSON). Each log line includes timestamp, step, status, context. Easy to parse and analyze. (6) Save state: after important steps, save the state to a file or artifact. If a later step fails, you can inspect the state file. (7) Use on-failure hooks: if a job fails, automatically capture state: logs, environment, snapshots. (8) For complex workflows: use a workflow orchestrator (GitHub Actions isn't ideal for 100+ step workflows; consider external tools like Argo, Jenkins for better visualization). (9) Dry-run mode: add a `--dry-run` flag to workflows. Simulate execution without side effects, catching errors early.
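Points (1) and (7) might look like the following workflow fragment; job, script, and artifact names are hypothetical:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: ./build.sh
      # On-failure hook: capture state for post-mortem debugging
      - name: Capture debug state
        if: failure()
        run: |
          env | sort > debug-env.txt
          cp -r logs/ debug-logs/ || true
      - name: Upload debug artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: debug-state
          path: |
            debug-env.txt
            debug-logs/

  test:
    needs: build   # isolation: a failure is attributed to a named phase
    runs-on: ubuntu-latest
    steps:
      - run: ./test.sh
```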

Follow-up: Design a workflow structure that maximizes debuggability and fault isolation.

You have a workflow that sometimes succeeds and sometimes fails based on random conditions (time of day, runner capacity, network). You can't reproduce it locally. How do you debug non-deterministic failures?

Identify the non-determinism: (1) Is it timing-related? If the test passes 80% of the time but fails 20%, it's likely a race condition or timeout issue. Run the test multiple times (`for i in {1..100}; do ./test.sh; done`) and check failure pattern. (2) Is it environment-dependent? Different runners (ubuntu-22.04 vs ubuntu-latest) have different configs. Check which runner failed vs. passed. (3) Use randomization + seeds: if the test involves randomness (shuffled order, random data), set a seed so it's reproducible: `RANDOM_SEED=42 ./test.sh`. (4) Add logging: log every decision point, timestamp, and branch taken. This helps identify where non-determinism occurs. (5) Increase test runs: run the test 100+ times in CI. Graph the results: is failure rate 5%, 50%, or 95%? This reveals severity and patterns. (6) Isolate variables: run the test with and without various conditions. E.g., test with caching enabled/disabled, with different Node versions, different OS. (7) Use thread sanitizers / race detectors: if the code has concurrency, tools like ThreadSanitizer or Go's `-race` flag detect data races. (8) For time-dependent tests: mock time. Use Jest's `jest.useFakeTimers()` or similar to control time, eliminating timing issues. (9) For external service tests: mock/stub the service responses. Eliminate network flakiness. (10) After identifying the root cause, add a fix: increase timeout, add synchronization, fix race condition.
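Points (1) and (5) can be sketched as a small bash helper that reports a failure rate; `./test.sh` in the usage line is a placeholder for the flaky test:

```shell
# Run a command N times and report how often it fails.
measure_flakiness() {
  local runs=$1; shift
  local failures=0 i
  for i in $(seq 1 "$runs"); do
    "$@" > /dev/null 2>&1 || failures=$((failures + 1))
  done
  echo "failure rate: $failures/$runs"
}

# Usage: measure_flakiness 100 ./test.sh
```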

Follow-up: How would you implement automated test isolation and flakiness detection?

A deployment workflow deploys code to production. It fails partway through: the database migration succeeds, but the application deployment fails. The database is now in a "new schema" state, but the old application is running (doesn't understand the new schema). Your system is broken. How do you prevent partial deployments?

Use transactions and atomic operations: (1) Design deployments to be atomic: either all steps succeed or all fail. If any step fails, roll back previous steps. (2) For databases: wrap migrations in transactions. If a migration fails partway, the transaction rolls back. (3) For application deployments: use blue-green deployments. Deploy the new version to a parallel environment (green), verify it works, then switch traffic from the old (blue) to new (green). If green fails, traffic stays on blue. (4) Use canary deployments: deploy to 1% of servers first, monitor for errors, then gradual rollout to 100%. If errors detected, automatic rollback. (5) Add pre-deployment checks: before starting deployment, verify the new code is compatible with the current database schema. Run compatibility tests. (6) For complex deployments: use a deployment tool with built-in rollback (e.g., Helm with rollback on failure, Spinnaker with deployment strategies). (7) Implement idempotent steps: if a step fails and is re-run, it should be safe. E.g., database migrations should be idempotent (running twice = running once). (8) Add a rollback job: if deployment fails, automatically trigger a rollback. Example: `if: failure()` run `./rollback.sh`. (9) For critical systems: require multiple approval stages and automated smoke tests after deployment. Catch failures immediately, before they affect users.
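Point (8) can be sketched as a workflow fragment; the script names are placeholders:

```yaml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - run: ./migrate.sh
  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh
  rollback:
    needs: [migrate, deploy]
    if: failure()        # runs only when an upstream job failed
    runs-on: ubuntu-latest
    steps:
      - run: ./rollback.sh
```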

Follow-up: Design a deployment strategy that prevents partial deployments and enables fast rollback.

Your workflow is slow: it takes 1 hour to build and deploy. Engineers complain: "I have to wait too long for CI feedback." You want to speed it up, but you're not sure where the bottleneck is. How do you profile a workflow?

Measure and profile: (1) Add timing to each step: prefix commands with `time` (e.g. `time npm run build`), or use `set -x` to trace where the time goes. (2) In the GitHub UI: each step shows its duration. Sort steps by duration; the longest steps are the bottlenecks. (3) Run a profile: commit a workflow variant that wraps each major command with `time` (`run: time npm run build`, `run: time npm test`) and read a summary from the logs, e.g. `Build: 30 min, Test: 20 min, Deploy: 10 min`. (4) Identify the top 3 bottlenecks: 80% of slowness is usually caused by 20% of the steps. Focus there. (5) Common bottlenecks: (a) downloading dependencies (fix with caching), (b) waiting for external services (mock or parallelize), (c) poor test parallelization (spread tests across workers), (d) slow build/compilation (optimize the build process, use incremental builds). (6) Parallel vs. sequential: are slow steps running in parallel? If they run sequentially and could run concurrently, split them into separate jobs. (7) Benchmark before/after optimization: measure time on an unoptimized branch vs. the optimized branch. (8) For 1-hour workflows: optimize aggressively. Aim for 15 minutes or less; that is roughly the threshold below which developers stay in context while waiting for CI.
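Point (1) can be wrapped in a small bash helper that records per-phase durations and prints a summary at the end; the phase commands here are stand-ins for the real build and test steps:

```shell
# Record the wall-clock duration of each named phase in an associative array.
declare -A durations
time_phase() {
  local name=$1; shift
  local start end
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  durations[$name]=$((end - start))
}

time_phase build sleep 1    # stand-in for `npm run build`
time_phase test  sleep 1    # stand-in for `npm test`

for name in "${!durations[@]}"; do
  echo "$name: ${durations[$name]}s"
done
```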

Follow-up: How would you create a performance dashboard that tracks workflow duration over time?

A self-hosted runner stops accepting jobs. GitHub shows it as "offline." You SSH into the runner machine and everything looks normal (disk space OK, network OK, process running). Why is the runner offline?

Common causes of offline runners: (1) Runner process crashed silently: check whether the runner process is running: `ps aux | grep Runner.Listener` or `systemctl status actions-runner`. If not, restart it from the runner directory: `sudo ./svc.sh start`. (2) Network connectivity: the runner lost its connection to GitHub. Check: `ping github.com`, `curl https://api.github.com`. If DNS fails, check `/etc/resolv.conf` or network settings. (3) Firewall / proxy blocking: GitHub's endpoints might be blocked. Check firewall rules or proxy settings. (4) Runner removed for inactivity: a self-hosted runner that stays disconnected for too long (on the order of 14 days) is automatically removed by GitHub and must be re-registered with a fresh registration token (registration tokens themselves expire after about an hour). (5) Disk space or memory exhausted: check `df -h` (disk) and `free -m` (memory). If full, clean up: delete old job artifacts, logs, temporary files. (6) Time sync issues: if the runner's clock is significantly skewed, authentication can fail. Check `date` on the runner and resync if needed: `timedatectl set-ntp true`. (7) GitHub infrastructure issue: check GitHub's status page. If GitHub Actions is experiencing problems, runners may appear offline. (8) Debug: check the runner's diagnostic logs in the `_diag` directory under the runner's install path, e.g. `tail -f "$(ls -t ~/actions-runner/_diag/Runner_*.log | head -1)"`. Look for connection errors and authentication failures. (9) Last resort: re-register the runner completely: stop it (`sudo ./svc.sh stop`), remove the old registration, and re-configure with a fresh token.
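A minimal health-check sketch combining points (1), (2), and (5); the process name and disk threshold are illustrative and the commands assume a Linux runner:

```shell
# Report each check as OK/FAIL without aborting on the first failure.
check() {
  local desc=$1; shift
  if "$@" > /dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
  fi
}

check "runner service running"  pgrep -f Runner.Listener
check "GitHub API reachable"    curl -sf --max-time 10 https://api.github.com
check "disk usage below 90%"    sh -c '[ "$(df / --output=pcent | tail -1 | tr -d " %")" -lt 90 ]'
```

Run from cron or a systemd timer and alert on any `FAIL` line.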

Follow-up: How would you implement automatic health checks for self-hosted runners?
