A 2-hour production deployment pipeline is running when the Jenkins controller crashes unexpectedly. The build was at stage 10 of 20 (50% complete). On restart, the build is lost and must start over from the beginning. Implement pipeline durability to preserve execution state.
Implement pipeline durability: (1) Enable Pipeline durability: Manage Jenkins > Configure System > Pipeline Speed/Durability Settings > "Maximum durability" (MAX_SURVIVABILITY; resilience over throughput). It can also be set per job with `options { durabilityHint('MAX_SURVIVABILITY') }`. (2) This persists pipeline program state to disk after every step. (3) On controller restart, Jenkins resumes the build from the last persisted step. (4) For critical builds, use an explicit savepoint: `checkpoint 'pre-deployment'` (CloudBees CI plugin; not available in open-source Jenkins). (5) Store state externally: serialize pipeline variables to S3/Git on each stage completion. (6) Implement step-level recovery: wrap stages in try-catch and resume on failure. (7) Use artifact caching: previous stage outputs are cached and reusable on resume. (8) Make stages idempotent: re-running the same stage must produce the same result (e.g., no duplicate API calls). (9) Use an external state store: Redis/etcd tracks pipeline execution state. (10) Monitor durability: track successful-resume vs. full-re-run frequency. With maximum durability: Jenkins crashes at stage 10 -> restarts -> resumes from stage 10 (saving roughly an hour). Expected: 90%+ of interrupted builds resume successfully.
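The global durability setting can also be applied per pipeline. A minimal declarative sketch (stage names and shell commands are illustrative):

```groovy
// Jenkinsfile — opt a single pipeline into maximum durability
pipeline {
    agent any
    options {
        // Persist program state to disk after every step so the build
        // can resume after a controller crash (resilience over throughput).
        durabilityHint('MAX_SURVIVABILITY')
    }
    stages {
        stage('Build') {
            steps { sh 'make build' }
        }
        stage('Deploy') {
            steps { sh './deploy.sh' }  // keep this idempotent: it may re-run on resume
        }
    }
}
```

The trade-off is extra disk I/O per step; for short throwaway jobs, the default performance-optimized mode is usually preferable.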
Follow-up: Pipeline resumes from checkpoint but external system (e.g., database) has changed state. How do you handle idempotency?
Your pipeline includes a long approval stage (waiting for human input); the build waits up to 24 hours. If the Jenkins controller crashes, the pending approval is lost and the build cannot resume. Implement a durable approval mechanism.
Implement durable approvals: (1) Store the approval request externally: instead of an in-memory queue, use a database/Slack. (2) Use the Jenkins input step: `input(message: 'Approve?', submitterParameter: 'approver', ok: 'Deploy')` — the pending prompt is persisted with the build in JENKINS_HOME. (3) Enable pipeline durability so the surrounding pipeline state is also checkpointed to disk. (4) On controller crash: Jenkins resumes and the approval remains pending. (5) Send approval reminders: email/Slack notification after 1 hour if not approved. (6) Implement an approval timeout: wrap the input in `timeout`; if not approved in 24h, auto-reject. (7) Use an external approval service: integrate with Jira/ServiceNow for approval tracking. (8) Implement a fallback: if Jenkins is down, the approver approves in the external system; when the controller recovers, a webhook submits the still-pending input via the Jenkins REST API. (9) Track approval history: an audit log shows who approved, when, and from which system. (10) Implement escalation: if approval is not completed within the SLA, escalate to a manager. Note: the input step is durable by default — no extra configuration is needed for the prompt itself to survive a restart (the global quiet-period setting is unrelated; it only delays build scheduling).
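A hedged sketch of a durable approval gate combining timeout, submitter capture, and auto-reject (the deploy script is a placeholder):

```groovy
// Jenkinsfile — durable approval gate. The pending input is serialized
// with the build in JENKINS_HOME and survives controller restarts.
pipeline {
    agent any
    stages {
        stage('Approval') {
            steps {
                script {
                    try {
                        timeout(time: 24, unit: 'HOURS') {
                            // With submitterParameter and no other parameters,
                            // input returns the approver's ID for the audit trail.
                            env.APPROVER = input(message: 'Approve production deploy?',
                                                 ok: 'Deploy',
                                                 submitterParameter: 'approver')
                        }
                    } catch (err) {
                        // Timeout or explicit rejection: auto-reject per SLA.
                        currentBuild.result = 'ABORTED'
                        error "Approval not granted within 24h: ${err.message}"
                    }
                }
            }
        }
        stage('Deploy') {
            steps {
                echo "Approved by ${env.APPROVER}"
                sh './deploy.sh'
            }
        }
    }
}
```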
Follow-up: Two approvers approve build simultaneously on different systems (Jenkins + Jira). How do you prevent duplicate approvals?
Your deployment pipeline updates the production database, deploys code, then restarts services. Midway through, Jenkins crashes. On resume, the deployment is half-done (database updated but code not deployed). Services restart with the old code against the new schema and data corruption occurs. Implement transactional deployments.
Implement transactional deployments: (1) Use database transactions: wrap DB updates in a transaction and commit only after the code is deployed. (2) Implement two-phase commit: a prepare phase (validate) and a commit phase (apply). (3) Use blue-green deployment: deploy the new version to a shadow environment and switch only after validation. (4) Implement rollback: if any step fails, roll back all previous steps. (5) Create a pre-deployment checkpoint: snapshot infrastructure state before deploying. (6) Use Infrastructure-as-Code: Terraform applies all changes from one plan; on failure, re-apply the previous known-good configuration. (7) Implement state monitoring: after each step, validate system state; if inconsistent, trigger rollback. (8) Use canary deployment: deploy to 1% of traffic first and monitor for issues. (9) Store a deployment manifest: track which changes were applied to enable selective rollback. (10) Implement safety checks: pre-deployment validation ensures schema/code compatibility. Example Jenkinsfile stage: `stage('Deploy') { steps { sh 'begin-transaction'; sh 'deploy-code'; sh 'restart-services'; sh 'commit-transaction' } post { failure { sh 'rollback-transaction' } } }`. For durability: if Jenkins crashes mid-transaction, it resumes from the checkpoint taken before the transaction started and re-runs the whole deployment; idempotent, transactional steps make that re-run safe.
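The inline example can be expanded into a fuller sketch; the `db-migrate`, `deploy-code`, and `restart-services` scripts below are hypothetical stand-ins for your own tooling:

```groovy
// Jenkinsfile — transactional deploy: the schema change is not committed
// until every step succeeds; any failure triggers a single rollback path,
// so old code never keeps running against a committed new schema.
pipeline {
    agent any
    stages {
        stage('Deploy') {
            steps {
                sh './db-migrate --prepare'       // phase 1: validate and stage schema changes
                sh './deploy-code --target prod'  // ship the new code
                sh './restart-services'
                sh './db-migrate --commit'        // phase 2: apply the schema change
            }
            post {
                failure {
                    // Crash or step failure before commit: revert everything.
                    sh './db-migrate --rollback'
                    sh './deploy-code --rollback'
                }
            }
        }
    }
}
```

Each script must itself be idempotent, since a resumed build may re-run steps that already completed before the crash.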
Follow-up: Rollback itself fails mid-execution. Infrastructure in inconsistent state. Recovery strategy?
Your pipeline uses Groovy closures and dynamic code that don't serialize well. When Jenkins checkpoints the build for durability, those Groovy objects can't be persisted and the pipeline fails to resume. Implement serializable pipeline patterns.
Implement serializable pipelines: (1) Avoid non-serializable objects: prefer primitives, Strings, Lists, and Maps. (2) Use the @Field annotation for script-level Groovy variables: `@Field String var = 'value'` — note the value itself must still be serializable. (3) Use Jenkins shared libraries: pre-compiled Groovy reduces serialization issues. (4) Implement stateless functions: avoid capturing mutable state in closures. (5) Prefer declarative pipeline over scripted: its structure keeps state in serializable form. (6) Isolate non-serializable work in @NonCPS methods: they run outside the CPS interpreter, must not call pipeline steps, and must return only serializable values. (7) Use the `transient` modifier: mark fields of custom classes that don't need persisting. (8) Avoid holding lazy objects (e.g., regex Matchers, JsonSlurper results) across steps: extract plain values first. (9) Test serialization: write pipeline vars to disk and verify readback. (10) Monitor serialization errors: Jenkins logs (`java.io.NotSerializableException`) show which objects fail. Example: declarative `pipeline { stages { stage('Build') { steps { sh 'make' } } } }` serializes cleanly. For dynamic logic: move it to a pre-compiled shared library. If scripted Groovy is required: keep state in an external store (Redis), not in in-memory closures.
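When dynamic Groovy is unavoidable, confine the non-serializable work to a `@NonCPS` method that accepts and returns only serializable values — a sketch (the changelog format is illustrative):

```groovy
// Regex Matcher objects are not serializable, so the parsing happens
// inside a @NonCPS method that runs outside the CPS interpreter.
// Such methods must not call pipeline steps and should return only
// serializable types; the Matcher is discarded before returning.
@NonCPS
List<String> extractVersions(String changelog) {
    def matcher = (changelog =~ /version: (\d+\.\d+\.\d+)/)
    def versions = []
    while (matcher.find()) {
        versions << matcher.group(1)   // String is serializable
    }
    return versions
}

pipeline {
    agent any
    stages {
        stage('Parse') {
            steps {
                script {
                    def versions = extractVersions(readFile('CHANGELOG'))
                    echo "Found versions: ${versions}"
                }
            }
        }
    }
}
```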
Follow-up: A shared library function has non-serializable dependency. How do you refactor safely?
Your deployment pipeline runs 3 stages: build (10 min), test (20 min), deploy (5 min). If agent crashes during deploy, entire 35-minute pipeline must restart. Implement fine-grained stage resumption to skip completed stages.
Implement stage-level resumption: (1) Skip completed stages: record which stages completed and, on resume, jump to the first incomplete one. (2) Use result tracking: `currentBuild.result` tracks overall status; record per-stage status yourself via flags or markers. (3) Guard each stage with a `when { expression { ... } }` that checks its completion marker, so stages that already succeeded are skipped. (4) Store stage artifacts: each stage produces an artifact; on resume, skip the stage if its artifact exists. (5) Use build fingerprinting: Jenkins tracks file hashes; if inputs are unchanged, skip the rebuild. (6) Implement explicit skip logic in scripted pipeline: `if (fileExists('build/app.jar')) { return } // skip build`. (7) Use caching: stage outputs are cached and reused on resume. (8) Implement resume flags: e.g., set `SKIP_BUILD=true` in the environment on resume. (9) Monitor the stage skip rate: track how many stages are skipped on resume. (10) Validate stage outputs: before skipping a stage, verify output integrity. Example: `stage('Deploy') { when { expression { !fileExists('deployment.done') } } steps { sh 'deploy.sh && touch deployment.done' } }`. On resume, if `deployment.done` exists, the stage is skipped and the build finishes in ~5 min instead of 35. Caveat: the marker file lives in the agent workspace, so this assumes the build resumes on the same workspace; otherwise archive the marker or record completion in an external store.
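The marker-file pattern applied to all three stages, so on resume only incomplete stages run. A sketch assuming the build resumes on the same workspace (the `make`/`deploy.sh` commands are placeholders):

```groovy
// Jenkinsfile — each stage is guarded by a completion marker; a stage
// that already wrote its marker is skipped on resume.
pipeline {
    agent any
    stages {
        stage('Build') {       // ~10 min; skipped on resume if already done
            when { expression { !fileExists('.done/build') } }
            steps { sh 'make build && mkdir -p .done && touch .done/build' }
        }
        stage('Test') {        // ~20 min
            when { expression { !fileExists('.done/test') } }
            steps { sh 'make test && mkdir -p .done && touch .done/test' }
        }
        stage('Deploy') {      // ~5 min; only this stage re-runs after a crash here
            when { expression { !fileExists('.done/deploy') } }
            steps { sh './deploy.sh && mkdir -p .done && touch .done/deploy' }
        }
    }
}
```

A fresh (non-resumed) run should start by deleting the `.done/` directory so stale markers from an earlier build can't mask real work.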
Follow-up: A stage produces different output based on input (non-deterministic). Skipping might use stale artifact. How do you ensure correctness?
Your Jenkins instance with durable pipelines suffers a hard restart (power failure). The JENKINS_HOME checkpoint files are partially written and corrupt, and pipelines can't resume. Implement checkpoint validation and recovery.
Implement checkpoint resilience: (1) Use checksums: write a CRC32/SHA-256 checksum alongside each checkpoint. (2) Validate on load: verify checkpoint integrity before resuming. (3) Implement rollback: if the current checkpoint is corrupt, load the previous one. (4) Use write-ahead logging: log state changes before committing them to the checkpoint. (5) Implement atomic writes: write to a temp file, then rename into place (rename is atomic on most filesystems) rather than overwriting. (6) Back up checkpoints: keep a rolling backup of recent checkpoints (e.g., the last 5 versions). (7) Use an external checkpoint store: S3/Consul for durable checkpoint storage. (8) Monitor checkpoint health: alert whenever validation fails. (9) Implement a recovery mode: if a checkpoint is corrupt, restart the pipeline from the last known-good checkpoint. (10) Guard against silent corruption: on-prem, prefer a checksumming filesystem (ZFS/Btrfs) or RAID with regular scrubbing. Example: after the crash, Jenkins finds the latest checkpoint corrupt, loads the previous one (5 min old), and resumes from there, losing only 5 minutes of progress. For extra durability: replicate checkpoints to external storage (e.g., S3) so corruption on the controller's disk cannot destroy every copy.
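The checksum-plus-atomic-rename pattern for externally stored state can be sketched as pipeline stages (the `state.json` snapshot and `checkpoints/` directory are illustrative):

```groovy
// Jenkinsfile fragment — write an external checkpoint safely: checksum
// first, then atomic renames so a crash never leaves a half-written
// file under the final name. Validate before trusting it on restore.
stage('Checkpoint') {
    steps {
        sh '''
            set -e
            mkdir -p checkpoints
            sha256sum < state.json > state.sha256.tmp
            cp state.json state.json.tmp
            # mv within one filesystem is atomic: readers see the old
            # checkpoint or the new one, never a partial write
            mv state.sha256.tmp checkpoints/state.sha256
            mv state.json.tmp checkpoints/state.json
        '''
    }
}
stage('Restore') {
    when { expression { fileExists('checkpoints/state.json') } }
    steps {
        sh '''
            set -e
            have=$(sha256sum < checkpoints/state.json | cut -d" " -f1)
            want=$(cut -d" " -f1 checkpoints/state.sha256)
            [ "$have" = "$want" ] || { echo "checkpoint corrupt, use previous backup"; exit 1; }
        '''
    }
}
```

If the crash lands between the two renames, validation fails and the pipeline falls back to the previous backup, which matches the rollback behavior described above.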
Follow-up: All recent checkpoints are corrupt. Only old checkpoint from 1 hour ago is valid. Recovery options and data loss?
Your multi-stage pipeline checks out Git, builds artifact, deploys to 3 environments (dev, staging, prod). If deploy to dev fails, prod deployment must wait or be skipped. On resume, how do you enforce dependency ordering?
Implement stage dependencies: (1) Rely on declared order: Jenkins declarative stages run sequentially in the order written (there is no `dependsOn` directive), so ordering itself enforces the sequence. (2) Implement conditional execution: `when { expression { env.DEV_DEPLOY_SUCCESS == 'true' } }`. (3) Track success explicitly: each stage sets an environment variable in its `post { success { ... } }` block. (4) Implement explicit gates: `input 'Proceed to prod?'` as a manual gate between stages. (5) Track stage results: Jenkins records failed/skipped/passed per stage. (6) On resume: re-evaluate the conditions and skip stages whose dependencies failed. (7) Use the success flags from (3) to decide whether the next stage runs (there is no `currentStage.result` global). (8) Implement a fallback: if the dev deploy fails, skip staging/prod and notify the teams. (9) Allow multi-path execution: a dev failure can still let independent test stages run in parallel. (10) Monitor dependency violations: alert if a dependent stage executed before its dependency passed. Example: `stage('Deploy Dev') { ... post { success { script { env.DEV_SUCCESS = 'true' } } } }`; `stage('Deploy Staging') { when { expression { env.DEV_SUCCESS == 'true' } } ... }`; `stage('Deploy Prod') { when { expression { env.STAGING_SUCCESS == 'true' } } ... }`. On resume, Jenkins re-evaluates the `when` conditions and skips any stage whose dependency flag is not set.
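The flag-based dependency chain in full; a sketch assuming a `deploy.sh` wrapper script that takes the target environment as an argument:

```groovy
// Jenkinsfile — each environment sets a success flag; the next one runs
// only if its dependency's flag is set. On resume the when-conditions
// are re-evaluated, so a stage never runs ahead of a failed dependency.
pipeline {
    agent any
    stages {
        stage('Deploy Dev') {
            steps { sh './deploy.sh dev' }
            post { success { script { env.DEV_SUCCESS = 'true' } } }
        }
        stage('Deploy Staging') {
            when { expression { env.DEV_SUCCESS == 'true' } }
            steps { sh './deploy.sh staging' }
            post { success { script { env.STAGING_SUCCESS = 'true' } } }
        }
        stage('Deploy Prod') {
            when { expression { env.STAGING_SUCCESS == 'true' } }
            steps {
                input message: 'Promote to prod?', ok: 'Deploy'  // manual gate
                sh './deploy.sh prod'
            }
        }
    }
}
```

Because the flags are only ever set in `post { success }`, a failed or skipped dev deploy automatically skips staging and prod without any extra error handling.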
Follow-up: A developer manually approved staging deploy without dev passing. Enforcement mechanism?