GitHub Actions Interview Questions

Workflow Dispatch and Manual Triggers

Your team needs to manually deploy code to production on-demand (outside of automatic CI/CD). For example: deploy a hotfix at 3 AM, or rollback to a previous version. You want a UI to trigger deployments without SSH-ing into infrastructure. How do you set this up?

Use GitHub's `workflow_dispatch` event:

1. Create a workflow that triggers manually: `on: workflow_dispatch`.
2. Add inputs to customize the deployment: e.g. a `version` string input (with a default such as `latest`) and an `environment` input with `type: choice` and `options: [staging, prod]`.
3. The workflow now appears in the GitHub UI: Actions tab → select workflow → "Run workflow" button. Enter the version and environment, then click Run.
4. Inside the workflow, read the inputs: `echo "Deploying ${{ inputs.version }} to ${{ inputs.environment }}"` (the older `github.event.inputs.*` form also works).
5. Things to know: (a) inputs are strings by default; use `type: boolean` or `type: choice` for other kinds, or parse complex values inside the workflow; (b) default values are optional but recommended, since they guide users; (c) inputs render in the GitHub UI as a form, which is friendlier than a CLI.
6. For approvals, reference a GitHub environment in the job (`environment: prod`) and configure that environment's protection rules — required reviewers and deployment branch restrictions — in the repository's Settings → Environments (they are not workflow YAML keys).
7. This is safer than giving every engineer SSH access to production: each deployment is audited, logged, and can be restricted to specific teams or individuals.
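Putting the pieces together, a minimal manual-deploy workflow could look like the following sketch (the workflow name, input choices, and the `prod`/`staging` environments are illustrative; the environment's protection rules live in repo settings):

```yaml
name: Manual Deploy
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to deploy'
        required: true
        default: 'latest'
      environment:
        description: 'Target environment'
        type: choice
        options:
          - staging
          - prod

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Protection rules (required reviewers, deployment branches) for this
    # environment are configured in Settings -> Environments, not here.
    environment: ${{ inputs.environment }}
    steps:
      - run: echo "Deploying ${{ inputs.version }} to ${{ inputs.environment }}"
```

With this in place, the "Run workflow" button in the Actions tab renders the two inputs as a small form.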

Follow-up: Design a workflow dispatch UI that gathers complex deployment parameters from users.

You set up a `workflow_dispatch` deployment. An engineer manually triggers a deployment at 3 AM. The deployment succeeds, but the engineer forgets to notify the team. The next morning, another engineer re-deploys without knowing the first deployment happened, causing version conflicts and data corruption.

Manual triggers need coordination:

1. Add notifications: when a deployment is triggered, post to Slack/Teams immediately, e.g. a step that runs `curl -X POST ${{ secrets.SLACK_WEBHOOK }} -d '{"text": "Deployment triggered: version=${{ github.event.inputs.version }}"}'`.
2. Add approval gates: require a second person to approve. Configure the environment's protection rules (Settings → Environments) with required reviewers, and optionally a wait timer (e.g. 5 minutes) that delays the job so a mistaken trigger can still be cancelled.
3. Implement deployment locks: before deploying, check whether a deployment is already in progress and block if so. Store deployment state in a file or database (`deployed_at`, `deployed_by`, `status=in_progress`), or use GitHub's built-in `concurrency` key to serialize runs.
4. Add an audit log: record every manual deployment — who, when, what version, result — so the team can later query what is deployed.
5. Use a deployment dashboard showing the currently deployed version per environment; engineers check it before deploying.
6. For critical deployments, require more ceremony: instead of just clicking "Run workflow," have a Slack command that gathers context, e.g. `/deploy version=1.0.0 environment=prod reason='hotfix for X'`, and stores that context before triggering the workflow.
7. Implement a deployment cooldown: after a successful deployment, block further deployments to the same environment for 30 minutes to prevent double-deploys.
8. For on-call workflows, designate one on-call engineer as the only person allowed to trigger deployments during their shift.
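The lock and the notification from the list above can be sketched with GitHub's native `concurrency` key plus a Slack step (the `SLACK_WEBHOOK` secret and group name are placeholders):

```yaml
name: Deploy
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to deploy'
        required: true

# One deployment at a time: later triggers queue behind the running one
# instead of racing it.
concurrency:
  group: deploy-prod
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Notify Slack that a deployment started
        run: |
          curl -sf -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"Deploy started by ${{ github.actor }}: version=${{ inputs.version }}\"}"
      - name: Deploy
        run: echo "deploying ${{ inputs.version }}"   # placeholder for the real deploy
```

`cancel-in-progress: false` is the important choice here: the second engineer's run waits rather than clobbering the first.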

Follow-up: Design a deployment notification and coordination system for manual triggers.

You have a `workflow_dispatch` workflow that requires 2 approvals (from a team of 5 engineers). An engineer triggers the workflow. The UI shows "Awaiting approval from reviewers." But you don't have visibility into who needs to approve, how many have approved, or when approvals will happen. How do you add transparency?

Enhance visibility into the approval workflow:

1. Use a custom approval system: instead of relying solely on GitHub's environment protection rules (which offer little transparency), orchestrate approvals yourself. For example, the deployment workflow posts a Slack message with "Approve" and "Deny" buttons; reviewers click, and the workflow continues or stops.
2. Use a GitHub App or an action for approval: `trstringer/manual-approval@v1` adds custom approval gates inside workflows.
3. Build a reviewer dashboard: query the GitHub API for pending approvals, required reviewers, and approval history, and display it in Slack or a web dashboard.
4. Post periodic updates: if a run has been waiting more than 5 minutes, post in Slack: "Deployment pending approval. Needed from: [list of required reviewers]."
5. Use GitHub's Deployments API: create a deployment with a pending status, then query it to show progress.
6. Add time-based escalation: if approval is still pending after 30 minutes, escalate to the manager or on-call engineer.
7. Implement a veto window: after the two required approvals, a third person has veto power for 5 minutes in case of a mistake. This prevents hasty deployments.
8. Log everything: store approval metadata (who approved, when, from which device/IP) for compliance and debugging.
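A sketch of the issue-based approval gate using the `trstringer/manual-approval` action mentioned above — the approver usernames and deploy script are placeholders, and the exact inputs should be checked against the action's README:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      issues: write   # the action opens an issue that reviewers comment on
    steps:
      - name: Wait for approval
        uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.TOKEN }}
          approvers: alice,bob,carol        # placeholder usernames
          minimum-approvals: 2
          issue-title: "Approve deployment of ${{ inputs.version }}"
      - name: Deploy
        run: ./deploy.sh "${{ inputs.version }}"   # placeholder script
```

Because the gate is an ordinary issue, everyone can see who has approved and who is still pending — the transparency the built-in environment gate lacks.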

Follow-up: Design a sophisticated approval system with transparency, escalation, and audit trails.

You have a `workflow_dispatch` workflow for manual deployments. An engineer accidentally triggers it with the wrong parameters (e.g., deploy v1.0.0 to prod instead of v1.1.0). By the time they notice (30 seconds later), the deployment is halfway done. They want to stop it immediately, but cancelling a run mid-deployment can leave the system half-updated and inconsistent.

Implement safeguards against accidental triggers:

1. Add a confirmation pause: before the real deployment starts, a first job announces the parameters and waits (or an environment wait timer delays the job), giving time to abort.
2. Validate inputs before making any changes: `if [[ "${{ github.event.inputs.version }}" != "v1."* ]]; then echo "Invalid version"; exit 1; fi`.
3. Add a dry-run mode: an optional boolean input (e.g. `dry_run`) that makes the workflow simulate the deployment without actually changing anything.
4. Require explicit confirmation in the UI: add a boolean input, rendered as a checkbox ("I confirm I'm deploying to prod"), that must be ticked.
5. For critical deployments, confirm out-of-band: the workflow sends the engineer a Slack message — "You triggered deployment of v1.0.0 to prod. React with ✓ to confirm, ✗ to abort."
6. If the user can't confirm within 5 minutes, the workflow auto-aborts.
7. Use semantic versioning validation: if the input is a version, verify it follows semver and fail immediately if not.
8. Add a rollback job: if the deployment fails or is cancelled midway, automatically roll back to the previous version to minimize damage from mistakes.
9. Route high-risk deployments through staging first: deploy to staging (low impact), test, then promote to prod.
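The validation and dry-run ideas from the list above can be sketched as follows (the semver regex, input names, and `deploy.sh` are illustrative):

```yaml
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Semver tag, e.g. v1.1.0'
        required: true
      dry_run:
        description: 'Simulate only, change nothing'
        type: boolean
        default: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Validate version before touching anything
        run: |
          if [[ ! "${{ inputs.version }}" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
            echo "::error::'${{ inputs.version }}' is not a valid semver tag"
            exit 1
          fi
      - name: Deploy (or simulate)
        run: |
          if [[ "${{ inputs.dry_run }}" == "true" ]]; then
            echo "Dry run: would deploy ${{ inputs.version }}"
          else
            ./deploy.sh "${{ inputs.version }}"   # placeholder script
          fi
```

Defaulting `dry_run` to `true` makes the safe path the lazy path: an engineer must deliberately flip it to do a real deployment.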

Follow-up: Design a safe-by-default manual deployment system with confirmations and rollback.

You have multiple `workflow_dispatch` workflows: deploy-api, deploy-web, deploy-db. They need to run in sequence (deploy DB first, then API, then web). An engineer manually triggers all three, but they start in parallel, causing conflicts. The DB migration hasn't finished by the time the API tries to start.

Orchestrate sequential manual deployments:

1. Create a "master" workflow that takes a single input (e.g. version) and calls the three deployments as reusable workflows in sequence. Each called workflow must declare `on: workflow_call`, and the master chains them with `uses:` plus `needs:` so each waits for the previous one to complete.
2. Job dependencies do the sequencing: `needs: deploy-db` ensures deploy-api starts only after deploy-db succeeds.
3. Alternatively, use the `workflow_run` trigger: when one workflow completes successfully, it can trigger the next.
4. `needs` already skips a job when a dependency failed; add explicit conditions (e.g. `if: success()`) only where you need finer control.
5. Implement atomic deployments: if any step fails, automatically roll back the steps that already ran.
6. Keep the UI simple: expose a single "Deploy All" button to engineers; internally, the master workflow coordinates the sequence.
7. Add a dependency graph: visualize which deployments are in progress, queued, or failed, in GitHub or a custom dashboard.
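A sketch of such a master workflow — the paths are illustrative, and each called file must itself declare `on: workflow_call` with a matching `version` input:

```yaml
name: Deploy All
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to deploy everywhere'
        required: true

jobs:
  deploy-db:
    uses: ./.github/workflows/deploy-db.yml
    with:
      version: ${{ inputs.version }}
  deploy-api:
    needs: deploy-db          # waits for the DB migration to finish
    uses: ./.github/workflows/deploy-api.yml
    with:
      version: ${{ inputs.version }}
  deploy-web:
    needs: deploy-api         # waits for the API rollout to finish
    uses: ./.github/workflows/deploy-web.yml
    with:
      version: ${{ inputs.version }}
```

If `deploy-db` fails, `deploy-api` and `deploy-web` are skipped automatically, which prevents exactly the race described in the question.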

Follow-up: Design an orchestration system for complex multi-step manual deployments.

Your team uses `workflow_dispatch` for manual deployments. Every deployment goes through Slack notifications and approvals. But sometimes, the Slack bot is down, notifications fail, and approvals get lost. The workflow assumes notifications succeeded but they didn't. How do you handle failure in dispatch workflows?

Add error handling and retries:

1. For Slack notifications, wrap the call in a retry loop, e.g. the `nick-fields/retry` action (formerly `nick-invision/retry`) to retry 3 times before giving up.
2. For approvals, fall back if Slack is down: GitHub-native mechanisms (`trstringer/manual-approval`, or environment required reviewers in the GitHub UI) can take over when Slack fails.
3. Use `continue-on-error: true` for non-critical steps: notifications are nice-to-have, not critical. If Slack fails, the deployment should continue but log a warning.
4. For critical steps (database migration, traffic switching), fail hard; don't use `continue-on-error` there.
5. Add an end-to-end status check: after the workflow completes, verify every step succeeded; if not, alert: "Deployment completed with errors. Please review."
6. Make deployments idempotent: if a run partially succeeded, re-running it must be safe. Check whether the version is already deployed and skip if so.
7. Run a health check after deployment: the workflow queries the deployed service to confirm it's healthy, and rolls back automatically if the check fails.
8. For ops debugging, log all workflow steps to a database so teams can later answer "What happened during the deployment at 3 AM?" in post-mortems.
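A sketch combining a retried, best-effort notification with a hard-failing critical step and a post-deploy health check (the retry action is the one named in the text; the scripts, URL, and webhook secret are placeholders):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Notify Slack (retried, non-critical)
        continue-on-error: true          # a dead Slack must not block the deploy
        uses: nick-invision/retry@v2
        with:
          max_attempts: 3
          timeout_minutes: 1
          command: >
            curl -sf -X POST "${{ secrets.SLACK_WEBHOOK }}"
            -d '{"text": "Deployment starting"}'
      - name: Migrate database (critical, must fail loudly)
        run: ./migrate.sh                # placeholder; no continue-on-error here
      - name: Health check, roll back on failure
        run: |
          if ! curl -sf https://example.com/healthz; then
            echo "::error::Health check failed, rolling back"
            ./rollback.sh                # placeholder
            exit 1
          fi
```

The split is the point: notifications may silently degrade, but the migration and the health check never do.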

Follow-up: How would you implement comprehensive error handling and recovery for manual deployment workflows?

Your team uses `workflow_dispatch` for manual deployments. Currently, only repo collaborators (25+ people) can trigger deployments. You want to restrict it to the on-call engineer only (1 person). GitHub's permission model doesn't natively support this—collaborators are all-or-nothing. How do you restrict manual triggers to specific people?

Implement custom authorization:

1. GitHub's native RBAC doesn't support "only this user can run `workflow_dispatch`," so you need a workaround.
2. Option 1: a GitHub App with custom logic. The app receives the `workflow_dispatch` event and checks whether the triggering user is in the on-call rotation; if not, it posts a comment: "Only the on-call engineer can deploy. Current on-call: [name]."
3. Option 2: check inside the workflow. Add a step at the start: `if [[ "${{ github.actor }}" != "on-call-engineer" ]]; then echo "Not authorized"; exit 1; fi`. This fails the workflow if the wrong person triggers it.
4. Option 3: a Slack command instead of the GitHub UI. Create a Slackbot that (a) receives the `/deploy` command, (b) checks the on-call schedule, (c) if the user is on-call, calls GitHub's API to trigger the workflow; only the bot has permission to trigger it.
5. Option 4: environment protection rules plus branch restrictions. Deployments to the "production" environment require approval from the on-call engineer.
6. For on-call rotations, integrate with PagerDuty or Opsgenie: query their API for the current on-call engineer and use that in your authorization checks.
7. Audit: log all deployment attempts, authorized and unauthorized, for compliance.
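Option 2 from the list can be sketched as a gating job that every deploy job depends on. The hard-coded `ON_CALL` username is a placeholder — in practice it would be fetched from a PagerDuty/Opsgenie API call:

```yaml
jobs:
  authorize:
    runs-on: ubuntu-latest
    steps:
      - name: Check that the triggering user is on-call
        env:
          ON_CALL: alice   # placeholder; look this up from your rotation
        run: |
          if [[ "${{ github.actor }}" != "$ON_CALL" ]]; then
            echo "::error::Only the on-call engineer ($ON_CALL) may deploy"
            exit 1
          fi
  deploy:
    needs: authorize       # never runs unless the gate passed
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh   # placeholder script
```

Note this is advisory rather than true access control — any collaborator can still start the run, it just fails fast — which is why the GitHub App or Slackbot options are stronger.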

Follow-up: Design an on-call integration system that restricts deployments to the current on-call engineer.

Your team uses `workflow_dispatch` for manual database maintenance tasks: backups, schema migrations, cleanup. An engineer accidentally triggers a database cleanup workflow on production instead of staging. The task deletes 10 GB of data. How do you prevent this disaster?

For destructive operations, implement safeguards:

1. Require an explicit typed confirmation. Instead of a simple yes/no, make the engineer type the phrase: `inputs: confirmation: description: 'Type "DELETE PRODUCTION DATA" to proceed'`, then verify it in the workflow: `if [[ "${{ github.event.inputs.confirmation }}" != "DELETE PRODUCTION DATA" ]]; then exit 1; fi`.
2. Use two-person approval: one engineer triggers, another approves; both must agree before the destructive operation proceeds.
3. Add a safety delay: before executing, wait 5 minutes and display "In 5 minutes, 10 GB of data will be deleted from production. Cancel now if this is a mistake."
4. Dry-run first: a `--dry-run` mode simulates the operation and shows what would be deleted without deleting anything; only after review is it re-run with `--confirm`.
5. Back up automatically: before any destructive operation, create a backup so a failed or mistaken run can be restored.
6. Test on a non-critical copy first: run the operation against a read-only replica or staging copy, verify the result, then run it on production.
7. Prefer soft deletion: mark data as "archived" instead of deleting it; if a mistake happens, unarchive it.
8. Have a "break glass" recovery procedure: document the recovery steps and practice them monthly to ensure they work.
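The typed confirmation and automatic pre-backup can be sketched like this (the confirmation phrase matches the text; the backup and cleanup scripts are placeholders):

```yaml
on:
  workflow_dispatch:
    inputs:
      confirmation:
        description: 'Type "DELETE PRODUCTION DATA" to proceed'
        required: true

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Verify confirmation phrase
        run: |
          if [[ "${{ inputs.confirmation }}" != "DELETE PRODUCTION DATA" ]]; then
            echo "::error::Confirmation phrase did not match; aborting"
            exit 1
          fi
      - name: Back up before deleting
        run: ./backup.sh --target production     # placeholder
      - name: Run cleanup
        run: ./cleanup.sh --target production    # placeholder
```

Typing a full phrase forces the engineer to read what they are about to do, which a checkbox or a default value never does.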

Follow-up: Design a multi-stage destructive operation workflow with safeguards and recovery procedures.
