GitHub Actions Interview Questions

Larger Runners and GPU Builds


Your team's machine learning model training takes 4 hours on a standard runner (2 vCPU). A colleague suggests using a larger runner (8 vCPU, 32 GB RAM). Cost is 40x ($0.32/min instead of $0.008/min). The training might finish in 1.5 hours (3x faster due to parallelization). Is the larger runner worth it?

Calculate ROI: (1) Current cost: 4 hours × 60 min × $0.008 = $1.92 per training run. (2) Larger runner cost: 1.5 hours × 60 min × $0.32 = $28.80 per training run. (3) Cost difference: $28.80 - $1.92 = $26.88 extra per run. (4) For ROI: if training runs once per day, extra cost = $26.88 × 30 = $806/month. That's expensive. (5) However, speedup matters for feedback loops: faster training means faster iteration. If engineers iterate 5x per day, and faster training unblocks them sooner (saves context-switch time), the value might justify the cost. (6) Consider developer productivity: 4 hours waiting → 1.5 hours waiting. If this frees up engineers to do other work, the productivity gain might exceed the cost. (7) Better approach: (a) Optimize the training code first. Maybe parallelization or caching can achieve 2x speedup without extra cost. (b) Use GPU runners instead of CPU runners. GPUs might achieve 10x speedup for ML training at only 2x cost. (c) Schedule training during off-peak hours (nightly). Larger runners might have discounts then. (d) Use spot instances (if available) for non-critical training, slashing cost by 70%.
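The arithmetic above can be checked with a short script. The per-minute prices, durations, and the once-per-day cadence are the hypothetical figures from the question, not real GitHub rates:

```python
def run_cost(duration_hours: float, price_per_min: float) -> float:
    """Cost of a single workflow run in dollars."""
    return duration_hours * 60 * price_per_min

standard = run_cost(4.0, 0.008)      # standard runner, 4-hour run
larger = run_cost(1.5, 0.32)         # larger runner, 1.5-hour run
extra_per_run = larger - standard
monthly_extra = extra_per_run * 30   # one training run per day

print(f"standard: ${standard:.2f}/run, larger: ${larger:.2f}/run")
print(f"extra: ${extra_per_run:.2f}/run, ${monthly_extra:.2f}/month")
```

Swapping in your own prices and run frequency turns this into a quick go/no-go check before committing to the larger runner.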

Follow-up: How would you analyze and optimize ML training workloads for cost and speed?

Your team uses GitHub Actions GPU runners for GPU-accelerated builds. You set `runs-on: ubuntu-latest-4-cores-16gb-gpu`. Every build requires 30 minutes to install CUDA drivers and compile. The GPU acceleration only helps the last 5 minutes (actual model training). You're wasting 25 minutes on setup. How do you optimize?

Pre-bake the GPU environment: (1) Use a Docker image with CUDA pre-installed: instead of installing CUDA during the build, use an official NVIDIA image: `FROM nvidia/cuda:12.0.0-runtime-ubuntu22.04`. The image ships the CUDA toolkit and runtime libraries (the GPU driver itself comes from the host via the NVIDIA container toolkit). (2) Build a custom Docker image: add your project's dependencies (Python packages, ML frameworks) on top of the CUDA image and push it to a registry. When the workflow runs, `docker pull` it (cached after the first download). Setup time drops from 30 min to ~2 min. (3) Use a persistent runner: if possible, use a self-hosted GPU runner on your infrastructure. The environment persists between runs—no reinstalls. (4) Cache the CUDA toolkit: GitHub's cache action can cache the install directory and restore it on re-runs. Example: `actions/cache@v4` with `path: /usr/local/cuda` and `key: cuda-12.0-${{ runner.os }}`. (5) For workflows that don't need the full GPU setup: create two paths. Light builds use CPU (fast, cheap); only when specific tests run does the GPU environment activate. (6) For ML training: containerize your training code. The container already has PyTorch, TensorFlow, and CUDA. Spin it up with GPU support: `docker run --gpus all ...`. Overhead is minimal.
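As a sketch, a job that runs inside a pre-built CUDA container might look like the fragment below; the image name, registry, and `--gpus` option are assumptions to adapt to your setup:

```yaml
jobs:
  train:
    runs-on: ubuntu-latest-4-cores-16gb-gpu
    container:
      image: ghcr.io/my-org/ml-cuda:12.0   # hypothetical pre-baked image: CUDA + framework deps
      options: --gpus all                  # expose the host GPU to the container
    steps:
      - uses: actions/checkout@v4
      - run: python train.py               # no driver install, no 30-minute setup
```

The 25 minutes of per-build setup moves into the image build, which happens once per dependency change instead of once per workflow run.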

Follow-up: Design a GPU-accelerated workflow that minimizes environment setup overhead.

You want to use GPU runners for image processing tasks. GitHub Actions' GPU runners are limited (few concurrent instances). If 3 teams all request GPU runners simultaneously, they queue for hours. How do you manage GPU resource contention?

Implement GPU resource management: (1) Set concurrency limits: use concurrency groups to keep GPU jobs from piling up in parallel. Example: `concurrency: group: gpu-pool, cancel-in-progress: false`. Only one GPU job runs at a time per repository (concurrency groups are repository-scoped, so org-wide serialization needs an external lock or queuing service); the rest queue. (2) Prioritize: for high-priority jobs (production releases), allow queue-jumping. Implement job prioritization via an external queuing service. (3) Batch schedule: instead of running GPU jobs on demand, schedule them during off-peak hours (2-4 AM). Teams submit jobs, and they run in batches at night. (4) Use timeouts: `timeout-minutes: 60` prevents a stuck job from occupying the GPU indefinitely. (5) Measure utilization: track GPU usage over time. If average utilization is <50%, GPU runners are over-provisioned; downsize. If >90%, upsize or implement throttling. (6) For unpredictable spikes: use auto-scaling. If queue depth exceeds 3 jobs, spin up additional GPU runners; after the queue empties, shut them down. (7) Hybrid approach: (a) small/quick GPU jobs run immediately, (b) large jobs queue and run at night, (c) non-critical jobs use a CPU approximation (not perfect but cheap). (8) For team fairness: implement per-team quotas. Each team gets 10 GPU hours/week, enforced via custom authorization checks. (9) Consider cloud alternatives: GPU-as-a-service (AWS EC2 GPU instances, GCP, Azure) might offer better economics than GitHub's runners if usage is high.
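Points (1) and (4) can be sketched as a workflow fragment; the `gpu` runner label and the workload step are placeholders, and note the group only serializes jobs within the repositories that share it:

```yaml
concurrency:
  group: gpu-pool            # every GPU workflow in the repo shares this group name
  cancel-in-progress: false  # queue behind the running job instead of cancelling it

jobs:
  gpu-job:
    runs-on: gpu               # hypothetical GPU runner label
    timeout-minutes: 60        # a stuck job cannot hold the GPU indefinitely
    steps:
      - run: ./run_gpu_task.sh # placeholder workload
```

With `cancel-in-progress: false`, later jobs wait their turn rather than killing the in-flight GPU job, which is usually what you want for long-running training or processing work.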

Follow-up: Design a fair GPU resource allocation system across multiple teams.

Your team uses large runners for Docker image builds. Each build creates a 5 GB Docker image. Runner disk space is limited, and you're running out. The build fails: "No space left on device." What's your strategy?

Manage disk space on large runners: (1) Clean up between builds: add a cleanup step that removes old images, build cache, and temporary files. `docker system prune -a --volumes` removes all unused images and volumes. (2) For Docker images: don't store multiple versions on the runner. Build, push to a registry (ECR, Docker Hub), then delete the local image. (3) Use multi-stage builds to minimize final image size; intermediate layers are discarded. (4) Shrink images: use smaller base images (alpine instead of ubuntu can save hundreds of MB). (5) For large runners: monitor disk usage in the workflow with `df -h`. If usage exceeds 80%, fail and alert. (6) Implement disk quotas: on self-hosted runners, set a filesystem quota so no single build consumes more than 50% of available disk. (7) For persistent storage needs: mount an external volume (EBS, NFS) instead of using the runner's local disk. (8) For custom actions: don't ship test data, documentation, or development dependencies with the action. (9) For CI workflows: after each stage, verify disk space hasn't ballooned. If a step creates unexpectedly large files, investigate and fix.
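The monitoring in point (5) can be a small guard script run between build steps. This is a minimal sketch using only the standard library; the 80% threshold and the `/` path are assumptions to tune:

```python
import shutil

def disk_usage_percent(path: str = "/") -> float:
    """Percentage of disk space currently used at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def check_disk(path: str = "/", threshold: float = 80.0) -> None:
    """Fail loudly (non-zero exit) when usage crosses the threshold."""
    used = disk_usage_percent(path)
    if used > threshold:
        raise SystemExit(f"Disk usage {used:.1f}% on {path} exceeds {threshold}%")
```

Run it as a workflow step (e.g. `python check_disk.py`): the raised `SystemExit` produces a non-zero exit code, which fails the job early with a clear message instead of a mid-build "No space left on device."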

Follow-up: Design a disk space monitoring system for large runners that prevents out-of-space failures.

Your team migrated from standard runners to large runners for CPU-intensive builds. The builds are now faster, but GitHub's monthly bill tripled. You have no visibility into which workflows/teams are driving the cost. How do you implement cost tracking for large runners?

Implement cost attribution for large runners: (1) GitHub's billing dashboard shows cost by runner type but not by workflow. Use the GitHub API for more detail: `gh api 'repos/{owner}/{repo}/actions/runs?status=completed'` lists completed runs, and `gh api repos/{owner}/{repo}/actions/runs/{run_id}/timing` returns billable time for a run. (2) Build a custom cost tracking system: for each workflow run, record workflow name, runner type, duration, and cost. (3) Use GitHub's runner labels to tag cost: workflows tag themselves with labels like `team:backend`, `cost-center:ml`. The billing system can then group costs by label. (4) Implement a "cost awareness" dashboard: show each team their monthly runner cost and the top cost drivers (most expensive workflows, most frequent workflows). (5) For cost control: set budgets per team. If a team exceeds budget, large runner access is disabled or requires approval. (6) For cost optimization: identify the most expensive workflows. Can they run on standard runners? Can they be optimized to run faster? (7) Implement reserved capacity: if a team needs 100+ large runner hours/month, consider negotiating a reserved rate (cheaper than on-demand). (8) For accountability: include cost in sprint metrics. Teams see the cost of their builds and have an incentive to optimize.
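The core of point (2) is a simple aggregation. A minimal sketch, assuming hypothetical per-minute rates and a list of run records already pulled from the Actions API:

```python
from collections import defaultdict

# Hypothetical per-minute rates; substitute the rates from your GitHub plan.
RATES = {"standard": 0.008, "large-8core": 0.032}

def attribute_costs(runs):
    """Group run cost by workflow name. `runs` is an iterable of dicts
    with keys: workflow, runner_type, minutes (e.g. from the runs API)."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["workflow"]] += run["minutes"] * RATES[run["runner_type"]]
    return dict(totals)

runs = [
    {"workflow": "build", "runner_type": "large-8core", "minutes": 30},
    {"workflow": "build", "runner_type": "large-8core", "minutes": 45},
    {"workflow": "lint",  "runner_type": "standard",    "minutes": 10},
]
totals = attribute_costs(runs)
print({k: round(v, 2) for k, v in totals.items()})  # {'build': 2.4, 'lint': 0.08}
```

Grouping by a `team` or `cost-center` key instead of `workflow` gives the per-team view needed for the budgeting in point (5).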

Follow-up: Design a cost attribution and budgeting system for GitHub Actions runners.

Your organization has self-hosted large runners on AWS EC2. You have 4 powerful instances running 24/7 as failover capacity (in case primary runners fail). But they're idle 80% of the time. Cost: $2K/month wasted. How do you optimize?

Optimize idle capacity: (1) Use spot instances: instead of on-demand instances, use AWS Spot instances (70% cheaper). You lose them if AWS needs capacity, but for non-critical failover, it's acceptable. (2) Auto-scaling: instead of running 4 instances 24/7, scale based on queue depth. When no jobs are queued, scale down to 1 instance. When queue grows, scale up to 4. (3) Implement queue-depth monitoring: query GitHub's API for pending jobs. If queue-depth > 5, scale up. If queue-depth < 2, scale down. (4) Use scheduled scaling: predict when runners are needed. If builds peak at 9 AM, scale up at 8:45 AM. Scale down at 6 PM. (5) Multi-purpose runners: instead of large runners sitting idle, use them for other tasks: (a) scheduled backups, (b) nightly tests, (c) batch processing. (6) For failover: use a smaller instance as passive backup. It doesn't run jobs, just monitors. If primary fails, promote backup in ~1 minute (cheaper than 4 idle instances). (7) Cost analysis: at $2K/month wasted, even a part-time engineer spending 10 hours optimizing pays for itself in a month. (8) For critical availability: if failover capacity is essential, accept the cost but monitor utilization. Quarterly, review: "Did failover save us?" If not, reduce allocation.
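The queue-depth rules in points (2) and (3) reduce to a small decision function. The thresholds and fleet bounds below are the example values from the answer, not recommendations:

```python
def desired_capacity(queue_depth: int, current: int,
                     min_runners: int = 1, max_runners: int = 4,
                     scale_up_at: int = 5, scale_down_at: int = 2) -> int:
    """Decide the next fleet size from the number of queued jobs."""
    if queue_depth > scale_up_at:
        return min(current + 1, max_runners)   # grow one instance at a time
    if queue_depth < scale_down_at:
        return max(current - 1, min_runners)   # shrink, but keep a warm runner
    return current                             # hysteresis band: do nothing

print(desired_capacity(6, 2))  # 3: queue is deep, add a runner
print(desired_capacity(0, 3))  # 2: queue drained, release one
print(desired_capacity(3, 2))  # 2: inside the band, hold steady
```

The gap between the up and down thresholds is deliberate: it prevents the fleet from flapping when queue depth hovers around a single cutoff.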

Follow-up: Design an auto-scaling runner system that balances cost and availability.

You use GitHub's large runners for builds. A workflow accidentally creates a runaway process that consumes all CPU and RAM. The runner becomes unresponsive. The job hangs indefinitely. GitHub's automatic timeout might not work if the runner is completely blocked. How do you prevent this?

Implement resource limits and runaway protection: (1) Set workflow timeouts: `timeout-minutes: 60` ensures the job exits after 60 min, even if it's stuck. (2) Set job-level resource limits: GitHub Actions doesn't natively support resource limits, but on self-hosted runners you can use Linux cgroups, e.g. via `systemd-run`, to cap a job at 50% CPU and 4 GB RAM. (3) Add health checks: during the job, periodically sample CPU/memory usage (e.g. with `ps aux`); if a process exceeds a threshold, kill it. (4) Rely on memory protection: when memory is exhausted, the kernel's OOM killer terminates the largest process automatically; tools like `earlyoom` act before the system becomes unresponsive. (5) For runaway processes: set a per-process limit, e.g. `ulimit -v 4194304` (4 GB of virtual memory per process). (6) Implement per-job isolation: run every job in a container with resource limits. If a process runs away, only its container is affected, not the entire runner. (7) Add alerting: if a job runs past 90% of its timeout (e.g. 54/60 minutes), alert ops and investigate whether it's genuinely long or stuck. (8) Recovery: after a stuck job, the runner might be in a bad state. Implement auto-recovery: restart the runner agent every N hours or after every stuck job. (9) Documentation: warn users: "Set `timeout-minutes` appropriately. If your job regularly times out, it might be stuck. Investigate."
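On a Linux self-hosted runner, points (2) and (5) can be approximated from Python's standard library by capping the child process before it starts. A sketch, using the example figures above (4 GB, one CPU-hour); `train.py` is a hypothetical entry point:

```python
import resource
import subprocess
import sys

MEM_BYTES = 4 * 1024 ** 3   # 4 GB address-space cap
CPU_SECONDS = 3600          # 1 hour of CPU time

def _apply_limits() -> None:
    # Runs in the child between fork and exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_AS, (MEM_BYTES, MEM_BYTES))
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))

def run_limited(cmd: list, wall_clock: int = 3600) -> int:
    """Run `cmd` under memory/CPU caps plus a wall-clock timeout."""
    return subprocess.run(cmd, preexec_fn=_apply_limits,
                          timeout=wall_clock).returncode

# e.g. run_limited([sys.executable, "train.py"])
```

A runaway child then fails with `MemoryError`/`SIGKILL` or `SIGXCPU` instead of freezing the whole runner; the `timeout=` argument additionally bounds wall-clock time even if the child sleeps.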

Follow-up: Design a resource-limited job execution environment that prevents runaway processes.

Your team uses GPU runners for ML model training. Training takes 2 hours. Every 30 minutes, you want to checkpoint the model (save progress) so that if the runner crashes, you can resume from the last checkpoint instead of restarting from scratch. How do you implement reliable checkpointing?

Implement fault-tolerant checkpointing: (1) In your training code, save state at a fixed interval, e.g. `if epoch % 10 == 0: save_checkpoint(f'model_epoch_{epoch}.pth')`. (2) Store checkpoints externally (not on the runner): upload to S3 or your artifact registry immediately after saving. Example: `aws s3 cp model_epoch_10.pth s3://my-bucket/checkpoints/`. (3) Before training starts, check for existing checkpoints: `latest_checkpoint = fetch_latest_checkpoint()`. If found, resume from it instead of from scratch. (4) Prefer framework support: some frameworks (PyTorch Lightning, Hugging Face) have built-in checkpointing; use it for reliability. (5) For runner failures: implement a retry mechanism in the workflow. If the job fails, re-run it; the training code resumes from the last checkpoint. (6) Implement backoff: if resuming from a checkpoint fails repeatedly, alert and give up. (7) For production models: use a more robust system. Kubernetes with persistent volumes can auto-resume training on failure; self-managed runners with local storage are less reliable. (8) Monitor checkpoint uploads: if the upload to S3 fails, the job should fail rather than continue as if the checkpoint were safe. Verify the upload succeeded before continuing. (9) Test recovery: periodically simulate a runner failure (kill the job after N minutes) and verify training resumes correctly from the checkpoint.
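A minimal local sketch of steps (1) and (3): atomic save plus resume-from-latest. The file name, epoch count, and interval are illustrative, and a real pipeline would mirror each checkpoint to S3 right after `save_checkpoint` returns:

```python
import os
import pickle

CKPT = "checkpoint.pkl"  # hypothetical local path; mirror to S3 after each save

def save_checkpoint(state: dict, path: str = CKPT) -> None:
    """Write-then-rename so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path: str = CKPT) -> dict:
    """Resume from the latest checkpoint, or start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0}

state = load_checkpoint()
for epoch in range(state["epoch"], 20):
    # ... one epoch of training would run here ...
    state = {"epoch": epoch + 1}
    if state["epoch"] % 10 == 0:  # checkpoint interval from the answer
        save_checkpoint(state)
```

Because the loop starts at `state["epoch"]`, re-running the same script after a crash skips completed epochs automatically; the write-then-rename pattern guarantees a reader never observes a torn checkpoint file.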

Follow-up: Design a fault-tolerant training system with periodic checkpointing and recovery.
