Your CI workflow deploys to a single staging environment. Two engineers push commits within 1 second of each other. Both trigger deploy workflows simultaneously. Both workflows try to deploy to the same staging server at the same time, causing corrupted deployments and database lock timeouts. The staging environment is unusable for 30 minutes. How do you prevent this?
Use concurrency groups to serialize deployments. Add to your workflow: `concurrency: { group: deploy-staging, cancel-in-progress: false }`. This ensures only one deployment to staging runs at a time. If a second deployment is triggered while the first is running, it queues and runs afterward. Set `cancel-in-progress: false` so in-progress deployments finish before the next one starts. Note that GitHub keeps at most one running and one pending run per concurrency group: if a third deployment is triggered while one is running and one is pending, the older pending run is cancelled in favor of the newer one. For stricter control, use a concurrency key that includes the branch: `group: deploy-${{ github.ref }}`. This allows parallel deployments on different branches (dev-branch deployments don't block main-branch deployments) but serializes deployments within the same branch. For multiple environments (staging, prod-us, prod-eu), use separate concurrency groups: `group: deploy-${{ inputs.environment }}`. Benefits: (1) prevents resource contention, (2) no database locks or corrupted state, (3) predictable deployment order. Downside: deployments queue, and because only the newest pending run survives, intermediate pushes never deploy individually: the latest push still waits for the in-progress deployment to finish. To mitigate, encourage small, frequent commits (faster deployments) rather than large batch commits (long waits).
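As a minimal sketch of the pieces above (the workflow name, branch filter, and `deploy.sh` script are illustrative assumptions, not a prescribed layout):

```yaml
name: deploy-staging
on:
  push:
    branches: [main]

# One queue per branch: different branches deploy in parallel,
# while pushes to the same branch are serialized.
concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false   # let an in-flight deployment finish

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: ./scripts/deploy.sh staging   # hypothetical deploy script
```

Swapping the group for `deploy-${{ inputs.environment }}` in a `workflow_dispatch`-triggered workflow gives one queue per environment instead of one per branch.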
Follow-up: How would you implement a concurrency strategy that prioritizes critical deployments (bug fixes) over normal deployments?
You set up concurrency with `cancel-in-progress: true` to allow new pushes to cancel old builds. An engineer pushes a commit, the build starts (will take 45 minutes). 10 seconds later, they realize they made a typo and push a fix. The old build is cancelled. But the old build had already deployed to production (the first 5 minutes of the 45-minute build). Now you have a half-deployed state in production.
This is a critical issue with `cancel-in-progress: true`—it's too aggressive once a deployment has started making changes. Best practice: (1) Never use `cancel-in-progress: true` for deployment workflows; deployments should complete even if new commits arrive. (2) Reserve `cancel-in-progress: true` for build/test workflows, where it's safe: if a new commit arrives, cancel the old test run and start fresh on the new code. (3) Segment your workflow by setting `concurrency` at the job level rather than the workflow level: `group: test-${{ github.ref }}, cancel-in-progress: true` on build/test jobs plus `group: deploy-${{ github.ref }}, cancel-in-progress: false` on deployment jobs. This cancels stale test runs but never cancels deployments. (4) If you do want cancellation for deployments (rare), make every step idempotent and only ever cancel before the deployment starts mutating production, never mid-way. (5) Use a status check: before allowing a deployment to run, verify that the previous deployment completed successfully. If it didn't, block the new deployment and alert the on-call team.
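A sketch of the segmented setup, with `concurrency` declared per job rather than per workflow (the script paths are placeholders):

```yaml
name: ci
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    concurrency:
      group: test-${{ github.ref }}
      cancel-in-progress: true    # a newer commit supersedes stale test runs
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/test.sh    # placeholder test entrypoint

  deploy:
    needs: test
    runs-on: ubuntu-latest
    concurrency:
      group: deploy-${{ github.ref }}
      cancel-in-progress: false   # a deployment, once started, always finishes
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh  # placeholder deploy entrypoint
```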
Follow-up: Design a workflow where only in-progress builds are cancelled, not deployments, even if new commits arrive.
Your team has a monorepo with 20 services. Each service has its own CI pipeline that deploys independently. You set up `concurrency: group: deploy` globally for the entire repo. Now, any push to any service serializes all deployments through a single queue. Service A deploys for 30 minutes, blocking Service B's 2-minute deployment. This is frustrating engineers. How do you fix this?
Use context-specific concurrency groups instead of a global group: (1) Include the service name in the concurrency key: `group: deploy-${{ matrix.service }}` (if using a matrix) or derive it from workflow inputs, e.g. `group: deploy-${{ inputs.service }}` (note the `env` context is not available in concurrency keys). This creates a separate queue per service. (2) Service A's deployments queue separately from Service B's, so they can deploy in parallel. (3) This prevents parallel deployments of the same service while allowing different services to deploy in parallel. (4) For shared resources (e.g., a single database, a single API gateway), you still need coordination. In that case: (a) use a separate concurrency group for those operations only: `group: shared-db-migration`, (b) serialize only critical database migrations, not all deployments, (c) optimize shared-resource operations to be fast (<1 min). (5) Alternative: use a workflow orchestrator (external lock service) instead of GitHub's concurrency. This gives more control: "allow up to 5 concurrent service deployments, but serialize database migrations." (6) Document the policy: "Each service deploys independently and in parallel. Shared resources (the database) are protected by locks and migrate serially."
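A sketch of per-service queues plus one serialized queue for the shared database (service names and script paths are made up for illustration):

```yaml
name: deploy-services
on:
  push:
    branches: [main]

jobs:
  migrate:
    runs-on: ubuntu-latest
    concurrency:
      group: shared-db-migration    # the one shared resource: strictly serial
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/migrate.sh   # placeholder migration script

  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [api, web, worker]   # illustrative service list
    concurrency:
      group: deploy-${{ matrix.service }}   # one queue per service
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh ${{ matrix.service }}
```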
Follow-up: Design a concurrency strategy for a monorepo where services deploy in parallel but share critical resources that must be updated serially.
You set up a concurrency group for deployments. A workflow is queued, waiting for the previous deployment to finish. The previous deployment is now taking 2 hours (unexpectedly slow). The queued workflow is waiting indefinitely. You need to manually cancel it because GitHub doesn't provide ETA or timeout feedback. How do you handle queued workflows better?
Implement monitoring and alerting for concurrency queues: (1) Set a job-level timeout: `timeout-minutes: 120` (GitHub has no workflow-wide timeout setting). If the job runs >2 hours, GitHub cancels it; note this caps running time, not time spent queued. (2) GitHub doesn't natively support timeouts on concurrency queues. Workaround: a scheduled monitoring job polls the GitHub Actions API every 5 minutes to check queue depth. If a run has been queued >30 minutes, alert the team (manually cancel or investigate why the previous job is stuck). (3) Implement a queue status dashboard: use the GitHub API to query workflow status, display it in Slack or a status page, and include estimated wait time based on average deployment duration. (4) Add a manual abort: expose a button in your CI/CD dashboard to cancel queued workflows. (5) Implement auto-abort: if a queued workflow is waiting >30 minutes, automatically cancel it with a notification: "Deployment queued too long; manual trigger required." (6) Optimize deployments to be fast: if most deployments are <5 minutes, queuing doesn't hurt much. If they're slow (>30 min), fix the root cause—deployments should be fast and predictable. (7) For long deployments, consider async patterns: trigger the deployment but don't wait in the workflow; instead, poll status separately.
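The polling workaround in (2) could look roughly like this: a scheduled watchdog that flags long-queued runs. The 30-minute threshold and the warning-annotation style are assumptions; a webhook call to Slack or PagerDuty could replace the `echo`:

```yaml
name: queue-watchdog
on:
  schedule:
    - cron: '*/5 * * * *'   # poll every 5 minutes

jobs:
  check-queue:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - name: Flag runs queued longer than 30 minutes
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # List queued runs created more than 30 minutes ago.
          cutoff=$(date -u -d '30 minutes ago' +%s)
          gh api "repos/${{ github.repository }}/actions/runs?status=queued" \
            --jq ".workflow_runs[]
                  | select((.created_at | fromdateiso8601) < $cutoff)
                  | .id" |
          while read -r run_id; do
            echo "::warning::Run $run_id has been queued for over 30 minutes"
            # Assumption: replace this with a curl to your incident webhook.
          done
```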
Follow-up: Design a monitoring system that alerts if workflow jobs are queued for >20 minutes.
Your team uses concurrency to serialize deployments to prod. An engineer commits a fix, the deployment queues. Before it runs, another engineer notices a bug in the fix and wants to bump it from the queue. But GitHub's concurrency doesn't support priority—jobs are FIFO. The queued fix needs to wait for the currently-running deployment to finish, even though a newer fix exists. How do you handle priority?
GitHub's native concurrency has no priority support; it keeps at most one pending run per group, served roughly first-in-first-out. Workarounds: (1) Use `cancel-in-progress: true` sparingly: if the new fix is critical and the old deployment is low-priority, cancel the old one and start the new one. But this requires manual coordination (notify the team, get approval). (2) Better: use branch-based concurrency: `group: deploy-${{ github.ref }}`. Main-branch deployments have one queue; feature branches have separate queues. Prioritize by always merging hotfix branches to main first, ensuring they deploy before other features. (3) For true priority, use an external queuing service (e.g., a Redis sorted set keyed by priority, or separate high- and low-priority queues; note that AWS SQS, FIFO or standard, has no built-in per-message priority). The workflow publishes to the queue instead of running inline; a separate service processes the queue respecting priorities. (4) Implement a manual override: a deploy-now command that admins can run (via webhook or dashboard) that immediately cancels the current deployment and runs the new one. Reserve this for emergencies. (5) Design for fast deployments: if each deployment is <5 minutes, queuing feels acceptable. If >20 minutes, users will complain about wait times. Optimize deployments first.
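The manual override in (4) can be sketched as a `workflow_dispatch` workflow that cancels any in-flight deploy run before starting a fresh one. The matched workflow name `deploy-staging` and the deploy script are assumptions:

```yaml
name: deploy-now
on:
  workflow_dispatch:

jobs:
  preempt:
    runs-on: ubuntu-latest
    steps:
      - name: Cancel in-flight deploy runs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Cancel every running instance of the (assumed) deploy workflow.
          gh api "repos/${{ github.repository }}/actions/runs?status=in_progress" \
            --jq '.workflow_runs[] | select(.name == "deploy-staging") | .id' |
          while read -r run_id; do
            gh run cancel "$run_id" --repo "${{ github.repository }}"
          done

  deploy:
    needs: preempt
    runs-on: ubuntu-latest
    concurrency:
      group: deploy-${{ github.ref }}
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh   # placeholder deploy script
```

Since cancelling a deployment mid-flight risks exactly the half-deployed state described earlier, this path should be admin-only and reserved for emergencies.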
Follow-up: Design a priority-based deployment queuing system using Redis.
You use concurrency for deployments: `group: deploy, cancel-in-progress: false`. A developer accidentally commits to main 50 times in rapid succession (git force-push, then re-push correctly). Each commit triggers a workflow. You now have 50 deployments queued, scheduled to run sequentially over the next 8 hours. Only the latest commit should deploy; the others are wasted. How do you prevent this?
First, note that GitHub's concurrency already mitigates this: a group holds at most one running and one pending run, so most of the 50 runs are auto-cancelled and only the newest pending run eventually deploys. To tighten further: (1) Use `concurrency: group: test-${{ github.ref }}, cancel-in-progress: true` for test jobs—new commits cancel old test runs. (2) Use `concurrency: group: deploy-${{ github.ref }}, cancel-in-progress: false` for deployment jobs—deployments queue, but only one runs. (3) Add a deploy gate: before deploying, verify that no newer commits exist on the branch by querying the branch head (via the GitHub API or `git ls-remote`) and comparing it to the workflow's `github.sha`; if a newer commit exists, skip the deployment. (Don't rely on `github.event.before != github.event.after`: those SHAs differ on every push, so that condition is always true.) (4) Alternatively: cancel queued deployments when newer commits arrive by using `cancel-in-progress: true` on the deploy job, but be careful—this requires the deployment to be truly idempotent (safe to cancel mid-run). (5) For this scenario (50 accidental commits): enable branch-protection rules that block force-pushes to main and require code review before merge. This prevents the root cause. (6) Rate-limiting: if >10 workflows trigger in <1 minute for the same branch, alert the team—something is wrong.
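The deploy gate in (3) might be sketched like this: compare the workflow's commit to the current branch head and skip the deploy step if a newer commit exists. The step id and deploy script are illustrative:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    concurrency:
      group: deploy-${{ github.ref }}
      cancel-in-progress: false
    steps:
      - name: Check whether a newer commit exists
        id: gate
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Fetch the current head SHA of the branch this run targets.
          head=$(gh api "repos/${{ github.repository }}/commits/${{ github.ref_name }}" --jq .sha)
          if [ "$head" != "${{ github.sha }}" ]; then
            echo "Branch head is $head; this run is for ${{ github.sha }}. Skipping."
            echo "stale=true" >> "$GITHUB_OUTPUT"
          fi

      - name: Deploy
        if: steps.gate.outputs.stale != 'true'
        run: ./scripts/deploy.sh   # placeholder deploy script
```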
Follow-up: How would you implement a smart deploy gate that cancels queued deployments when newer code is pushed?
Your team has a deployment workflow with concurrency. A workflow is running. The same branch gets a new push. Due to `cancel-in-progress: false`, the new run queues. But the developer who pushed the new code doesn't know a deployment is queued—they continue coding and push again, not realizing their previous push is blocked. You need better visibility into queued workflows.
Implement workflow status notifications: (1) In the workflow, detect when a run is waiting behind its concurrency group and post a notification to Slack: "Your deployment is queued behind an in-progress run. Estimated wait: 15 minutes." (2) GitHub Actions doesn't provide native queue-depth detection, so poll the REST API: `gh api "repos/{owner}/{repo}/actions/runs?branch=main&status=queued"`. Parse the response to show what's waiting. (3) Use GitHub's Deployment status API: create a deployment status with `state=queued` when waiting and `state=in_progress` when running. This shows in the PR UI. (4) Integrate with your CI/CD dashboard: display a real-time queue visualization so teams can see the current deployment, queued runs, and estimated wait times. (5) Send proactive alerts: if a run is queued >30 minutes, auto-comment on the PR: "Your deployment has been waiting 30 minutes. Please investigate." (6) For developers: `gh run list --branch main` shows recent runs; `gh run view <run-id>` shows details including queued status. Make this convenient—add a shell alias or `gh` alias. (7) Wire up GitHub's Slack integration so devs get real-time alerts when their runs queue, start, and complete.
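A hedged sketch of the notification step in (1)–(2), assuming a Slack incoming-webhook secret named `SLACK_WEBHOOK_URL` is configured:

```yaml
jobs:
  notify-queue:
    runs-on: ubuntu-latest
    steps:
      - name: Report queued runs for this branch
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Count runs currently queued for this branch via the REST API.
          queued=$(gh api \
            "repos/${{ github.repository }}/actions/runs?branch=${{ github.ref_name }}&status=queued" \
            --jq '.total_count')
          msg="Deployment for ${{ github.sha }} on ${{ github.ref_name }}: $queued run(s) currently queued"
          echo "::notice::$msg"
          # Assumption: SLACK_WEBHOOK_URL points at a Slack incoming webhook.
          curl -s -X POST -H 'Content-Type: application/json' \
            -d "{\"text\": \"$msg\"}" \
            "${{ secrets.SLACK_WEBHOOK_URL }}"
```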
Follow-up: Design a real-time dashboard showing workflow queue status for all active branches.
You have a complex multi-stage deployment with Build → Test → Deploy-Staging → Deploy-Prod. Each stage uses concurrency to prevent overlaps. A build job runs for 2 hours. A test job (dependent on the build) queues for 1 hour waiting for the build to finish. A deploy job queues for another hour. The total pipeline now takes 4+ hours, and developers are frustrated with slow feedback. What's your architecture fix?
Your concurrency strategy is too strict. Recommendations: (1) Don't serialize builds—they're independent. Remove concurrency from build jobs: let all builds run in parallel. (2) Tests can run in parallel too, as long as they don't modify shared state. Only serialize tests if they modify a database or shared resource. (3) Use concurrency only for deployments to shared environments (staging, prod). (4) Reorder the DAG: build in parallel, test in parallel, then deploy serially to staging (concurrency: group: deploy-staging), then sequentially deploy to prod environments. (5) Optimize build/test time: if build is 2 hours, investigate why. Multi-stage builds, better caching, parallel test runners can reduce this to 20 minutes. (6) Use fast-path deployments: deploy a quick smoke test (5 min) to staging, get feedback, then deploy full suite (30 min). This gives fast feedback without waiting for everything. (7) Implement canary deployments: deploy to 1% of prod servers first (auto), monitor for errors, then deploy to rest. This is faster than waiting for full pre-prod tests.
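Putting the recommendations together, a sketch of the restructured pipeline (job names and scripts are illustrative): builds and tests carry no concurrency constraint, and only the deploy stages are serialized:

```yaml
name: pipeline
on:
  push:
    branches: [main]

jobs:
  build:                     # no concurrency: every push builds in parallel
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build.sh

  test:                      # no concurrency: tests touch no shared state
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/test.sh

  deploy-staging:            # serialized: staging is a shared environment
    needs: test
    runs-on: ubuntu-latest
    concurrency:
      group: deploy-staging
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh staging

  deploy-prod:               # serialized, and gated on staging success
    needs: deploy-staging
    runs-on: ubuntu-latest
    concurrency:
      group: deploy-prod
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh prod
```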
Follow-up: Design a multi-stage pipeline that maximizes parallelism while protecting shared resources with targeted concurrency.