Kubernetes Interview Questions

Secrets Management and Encryption at Rest


Your etcd backup (10GB) containing your entire Kubernetes cluster state was accidentally uploaded to a public S3 bucket. You have 50,000+ Kubernetes Secrets in your cluster (API keys, database passwords, TLS certificates). An attacker has the backup. Answer immediately: Are your secrets compromised? How do you respond? What's your investigation and remediation timeline?

Critical question: Were secrets encrypted at rest in etcd? The answer determines if you have a catastrophic incident or a containable breach.

Phase 1: Immediate triage (0-5 minutes)

1. Check whether etcd encryption was enabled. The EncryptionConfiguration is a file on the control-plane host, referenced by a kube-apiserver flag:

```bash
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# No match = encryption was NOT enabled = secrets are in plaintext in the backup
```

2. Confirm against the running kube-apiserver:

```bash
kubectl describe pod -n kube-system -l component=kube-apiserver | grep -i encryption
```

Look for the --encryption-provider-config flag.

If encryption is NOT enabled (likely scenario): CRITICAL INCIDENT
All secrets are in plaintext in the etcd backup:

  • Database credentials
  • API keys (AWS, GitHub, Stripe)
  • OAuth tokens
  • TLS certificates
  • Service account tokens

Attacker can:

  1. Immediately access all external services (cloud APIs, databases, SaaS)

  2. Impersonate any service account

  3. Decrypt traffic between services

  4. Escalate to full cluster compromise

Phase 2: Emergency response (5-30 minutes)

ASSUME FULL COMPROMISE. Do this in parallel:

1. Build a revocation list covering every secret:

```bash
for secret in $(kubectl get secrets -A -o jsonpath='{.items[*].metadata.name}'); do
  # If it's a credential (api-key, password, token), mark it for revocation
  echo "REVOKE: $secret" >> /tmp/compromised-secrets.txt
done
```

2. Rotate ALL external credentials:

   a) Database passwords:
      • Change all DB passwords
      • Update connection strings in ConfigMaps
      • Restart affected pods
   b) API keys:
      • Revoke old keys in Stripe, AWS, GitHub, etc.
      • Generate new keys
      • Update Kubernetes Secrets
      • Restart pods consuming these keys
   c) OAuth tokens:
      • Invalidate tokens at the OAuth provider
      • Generate new tokens
   d) TLS certificates:
      • If the CA private key was in the backup, every certificate it signed is compromised
      • Issue new certificates from a new CA
      • This is EXPENSIVE but necessary
3. Roll service account tokens:

```bash
kubectl create serviceaccount temp-sa -n production
# Copy permissions from the old SA to temp-sa,
# update all pod specs to use temp-sa,
# then eventually delete the old SAs
```

4. Monitor for suspicious activity:

```bash
kubectl logs -n kube-system -l component=kube-apiserver --tail=500 | grep -E 'unauthorized|forbidden|token.*invalid'
```

Also check the cloud provider (AWS CloudTrail) for unusual API calls and databases for unauthorized access.

Phase 3: Investigation (30 minutes - 2 hours)

1. Analyze the etcd backup to determine exposure (simulating the attacker's view):

```bash
# Restore the snapshot into a scratch directory
etcdctl snapshot restore /backup/etcd-snapshot --data-dir=/tmp/etcd-test

# Start a throwaway etcd against /tmp/etcd-test, then query it for secrets
etcdctl get /registry/secrets/ --prefix --keys-only | wc -l
# Count: how many secrets were exposed?
```

2. Classify exposed secrets:
   • High: database credentials, API keys (slow or costly to rotate)
   • Medium: OAuth tokens, temporary keys (can be rotated quickly)
   • Low: TLS certs (can be replaced with a new CA)

3. Determine impact per service. For each exposed credential:
   • Which services use it?
   • What damage could an attacker do?
   • How quickly can we rotate it?
   • Is it used in a critical path?

4. Check logs for actual compromise:

```bash
kubectl logs -A --all-containers=true --since=2h | grep -E 'auth.*failed|unauthorized.*attempt|suspicious'
```

Did the attacker actually access something, or just obtain the file?
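The risk classification above can be turned into a small triage helper. A sketch follows; the name patterns and risk buckets are illustrative assumptions, not a standard taxonomy:

```python
# Triage sketch: classify exposed secret names by rotation risk.
# The name patterns below are illustrative assumptions, not a standard.
import re

RISK_PATTERNS = [
    ("HIGH",   re.compile(r"db|database|postgres|mysql|aws|api[-_]?key", re.I)),
    ("MEDIUM", re.compile(r"oauth|session|token", re.I)),
    ("LOW",    re.compile(r"tls|cert|ca[-_]", re.I)),
]

def classify(secret_name: str) -> str:
    """Return the first matching risk bucket, or UNKNOWN."""
    for risk, pattern in RISK_PATTERNS:
        if pattern.search(secret_name):
            return risk
    return "UNKNOWN"

def triage(names):
    """Group secret names by risk so rotation can be prioritized."""
    buckets = {}
    for name in names:
        buckets.setdefault(classify(name), []).append(name)
    return buckets
```

Running the triage over the dump of secret names gives the rotation worklist per priority bucket.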

Phase 4: Remediation (2-8 hours)

Priority 1: Rotate HIGH-risk secrets (database creds, production API keys)

1. Create the new secret in Kubernetes:

```bash
kubectl create secret generic db-credentials-new \
  --from-literal=password=$(openssl rand -base64 32) \
  -n production
```

2. Update deployments to use the new secret:

```bash
kubectl patch deployment app -n production -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"app","env":[{"name":"DB_PASSWORD","valueFrom":{"secretKeyRef":{"name":"db-credentials-new","key":"password"}}}]}]}}}}'
```

3. Restart pods:

```bash
kubectl rollout restart deployment/app -n production
```

4. Update the external service (database, AWS account, etc.):
   • DB: ALTER USER app_user IDENTIFIED BY 'new-password';
   • AWS: create a new IAM access key, revoke the old one
   • Stripe: create a new API key, revoke the old one

Priority 2: Rotate MEDIUM-risk secrets (OAuth tokens, session keys)

  • Similar process to Priority 1, but less critical path impact

Priority 3: Replace TLS certificates

  • This is expensive (requires new CA if private key was exposed)
  • Can be done in parallel with other rotations
  • Rolling update of TLS secrets to all affected pods

Phase 5: Long-term hardening

Enable encryption at rest in etcd (SHOULD HAVE BEEN DONE):

1. Generate an encryption key:

```bash
openssl rand -base64 32 > /etc/kubernetes/encryption-key
```

2. Create the encryption config:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64 key from step 1>
      - identity: {}   # Fallback to read unencrypted data (for graceful migration)
```

3. Add the flags to kube-apiserver:

```
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
--encryption-provider-config-automatic-reload=true
```

4. Restart kube-apiserver. It runs as a static pod, not a Deployment, so edit /etc/kubernetes/manifests/kube-apiserver.yaml (or move it out of the manifests directory and back); kubelet recreates the pod.

5. Re-encrypt existing secrets:

```bash
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
```

This rewrites every secret through the API server, which stores it re-encrypted with the new key.

Now, if an etcd backup is exfiltrated again, the secrets inside it are encrypted:

```bash
etcdctl snapshot status /backup/etcd-snapshot
# The snapshot still opens, but secret values are AES-CBC ciphertext
```

The attacker gets only encrypted blobs, not plaintext credentials.

Phase 6: Backup security hardening

1. Encrypt backups:

```bash
# Take the snapshot over TLS
etcdctl snapshot save /backups/etcd-$(date +%s).db \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem

# Encrypt the backup with GPG before it leaves the host
gpg --symmetric --cipher-algo AES256 /backups/etcd-*.db
```

2. Restrict backup storage:
   • Private S3 bucket with encryption-at-rest
   • No public access
   • Restricted IAM roles for access
   • MFA required for bucket access

3. Audit backup access:

```bash
aws s3api get-bucket-logging --bucket etcd-backups
# Enable S3 access logging and monitor for unexpected downloads
```

4. Implement backup versioning and retention:
   • Keep multiple versions (so they can't all be deleted at once)
   • Delete old backups after 90 days (or your retention policy)
   • This limits the exposure window if a backup is found
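The retention rule above can be sketched as a pruning planner; the 90-day window and minimum-keep count are assumptions taken from this checklist:

```python
# Retention sketch: given backup timestamps, decide which to delete under a
# 90-day policy while always keeping the newest `keep_min` copies.
from datetime import datetime, timedelta

def prune_plan(backup_dates, now, retention_days=90, keep_min=3):
    """Return (keep, delete) lists; never delete the newest keep_min backups."""
    ordered = sorted(backup_dates, reverse=True)      # newest first
    cutoff = now - timedelta(days=retention_days)
    keep, delete = [], []
    for i, ts in enumerate(ordered):
        if i < keep_min or ts >= cutoff:
            keep.append(ts)
        else:
            delete.append(ts)
    return keep, delete
```

The `keep_min` floor guards against a misconfigured clock expiring every backup at once.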

Phase 7: Post-incident checklist

- [ ] All HIGH-risk secrets rotated
- [ ] All MEDIUM-risk secrets rotated
- [ ] TLS certificates replaced (if private key was exposed)
- [ ] Encryption at rest enabled in etcd
- [ ] etcd backups now encrypted
- [ ] Backup storage hardened
- [ ] No suspicious activity detected in logs (24h monitoring)
- [ ] Team trained on secret management
- [ ] Post-mortem scheduled (review how the backup got leaked)

Follow-up: How would you detect if an attacker actually used the compromised secrets? Design a monitoring and alerting strategy for anomalous credential usage.

You're implementing EncryptionConfiguration in Kubernetes to encrypt secrets at rest in etcd. A bug in your migration causes the system to mark all secrets as "unencrypted" even after re-encryption. You realize that 30,000 new secrets created post-migration are not actually encrypted. How do you detect this, quantify the damage, and fix it?

This is a subtle but critical bug: encryption configuration exists, but secrets aren't actually being encrypted. The system has no way to know which secrets are encrypted and which aren't.

Phase 1: Detection

1. Check the encryption configuration file referenced by the API server:

```bash
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# Verify the flag points at a valid EncryptionConfiguration file
```

2. Verify kube-apiserver is actually running with encryption:

```bash
ps aux | grep kube-apiserver | grep encryption-provider-config
# The flag should appear in the process arguments
```

3. Check etcd for unencrypted data:

```bash
etcdctl get /registry/secrets/default/my-secret
# If you can read plaintext values, secrets are NOT encrypted
```

Expected encrypted output: a "k8s:enc:aescbc:v1:" prefix followed by binary ciphertext. Actual (unencrypted): readable JSON with plaintext values.
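The plaintext-vs-ciphertext judgement can be automated with a heuristic like the following. The `k8s:enc:` prefix is what Kubernetes encryption providers actually write to etcd; the 0.7 printable-ratio threshold is an illustrative assumption:

```python
# Heuristic sketch: decide whether an etcd value looks like a plaintext
# Secret or an encrypted blob. Values encrypted by Kubernetes start with a
# provider prefix such as "k8s:enc:aescbc:v1:"; plaintext Secrets are
# JSON/protobuf with mostly printable bytes.
def looks_encrypted(raw: bytes) -> bool:
    if raw.startswith(b"k8s:enc:"):              # explicit provider prefix
        return True
    printable = sum(32 <= b < 127 for b in raw)
    # Mostly non-printable bytes -> likely ciphertext (threshold is a guess)
    return printable / max(len(raw), 1) < 0.7
```

Running this over a sample of `etcdctl get` outputs gives a quick per-secret verdict without eyeballing hexdumps.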

Phase 2: Quantify the damage

1. List all secrets with creation timestamps:

```bash
kubectl get secrets -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.metadata.creationTimestamp)"' > /tmp/all-secrets.txt
```

2. Select secrets created after the migration date (ISO-8601 timestamps compare lexicographically, so a string comparison is enough):

```bash
MIGRATION_DATE="2024-04-05T00:00:00Z"
awk -v cutoff="$MIGRATION_DATE" '$2 >= cutoff' /tmp/all-secrets.txt > /tmp/post-migration-secrets.txt
wc -l /tmp/post-migration-secrets.txt
# Count: ~30,000 unencrypted secrets
```
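The same partition is easy to express in a few lines of Python, which avoids the off-by-one traps of grep filtering; the namespace/name pairs and cutover date are illustrative:

```python
# Partition secrets into pre- and post-migration sets by creationTimestamp.
# ISO-8601 timestamps sort lexicographically, so string comparison suffices.
MIGRATION_DATE = "2024-04-05T00:00:00Z"   # assumed migration cutover

def partition(secrets, migration=MIGRATION_DATE):
    """secrets: iterable of (name, creationTimestamp) pairs."""
    pre, post = [], []
    for name, ts in secrets:
        (post if ts >= migration else pre).append(name)
    return pre, post
```

Everything in `post` is suspect and goes on the re-encryption worklist.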

Phase 3: Root cause analysis

1. Check encryption provider logs:

```bash
kubectl logs -n kube-system -l component=kube-apiserver | grep -i encrypt | tail -50
```

2. Common bugs:

   a) Encryption key not loaded properly:
      • Check the key file exists: ls -la /etc/kubernetes/encryption-key
      • Check file permissions: stat /etc/kubernetes/encryption-key

   b) Identity provider listed first (it should be the fallback):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}   # BUG: identity is first, so new writes are NOT encrypted
      - aescbc:
          keys:
            - name: key1
              secret: …
```

   c) Automatic reload not working: kube-apiserver should run with --encryption-provider-config-automatic-reload=true; if reload is broken, the old config persists until restart.

Phase 4: Fix (safely, without data loss)

1. Fix the provider order in the encryption configuration:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: …
      - identity: {}   # Fallback AFTER the encryption provider
```

2. Restart kube-apiserver so it picks up the corrected file. It is a static pod, not a Deployment, so touch its manifest:

```bash
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ && sleep 5 && \
  mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Watch for the pod to come back
kubectl get pods -n kube-system -l component=kube-apiserver -w
```

3. Force re-encryption of all secrets:

```bash
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
# This rewrites every secret, forcing encryption with the fixed config
```

4. Verify the encryption worked:

```bash
etcdctl get /registry/secrets/default/my-secret
# Should now show the k8s:enc: prefix and ciphertext, not plaintext
```

Phase 5: Verification that the fix worked

1. Sample random secrets directly from etcd:

```bash
kubectl get secrets -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
shuf | head -5 | while IFS=/ read -r ns name; do
  echo "=== $ns/$name ==="
  etcdctl get "/registry/secrets/$ns/$name" | hexdump -C | head -3
  # Should show binary/hex ciphertext, not ASCII plaintext
done
```

2. Write a new secret and verify it is encrypted:

```bash
kubectl create secret generic test-secret --from-literal=key=value
etcdctl get /registry/secrets/default/test-secret | od -c
# Should show ciphertext, not the plaintext string "value"
```

3. The API server logs should show encryption being applied:

```bash
kubectl logs -n kube-system -l component=kube-apiserver --tail=100 | grep -iE 'secret.*encrypt|encrypt.*success'
```

Phase 6: Prevention for the future

1. Automated verification test:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: verify-encryption
spec:
  schedule: "0 * * * *"   # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: alpine:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Create a test secret
                  TEST_SECRET_NAME="encryption-test-$(date +%s)"
                  kubectl create secret generic $TEST_SECRET_NAME \
                    --from-literal=test="plaintext-should-not-appear"

                  # Check whether it is encrypted in etcd
                  ETCD_DATA=$(etcdctl get /registry/secrets/default/$TEST_SECRET_NAME)
                  if echo "$ETCD_DATA" | grep -q "plaintext-should-not-appear"; then
                    echo "ALERT: Secrets are NOT encrypted!"
                    exit 1
                  fi

                  # If we get here, encryption is working
                  kubectl delete secret $TEST_SECRET_NAME
                  echo "Encryption verified OK"
```

(In practice the image also needs kubectl, etcdctl, and etcd client credentials; alpine:latest alone does not ship them.)

2. Policy as code (using tools like Kubewarden):

```yaml
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: enforce-encrypted-secrets
spec:
  policyServer: default
  module: ghcr.io/kubewarden/secrets-encryption
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      operations: ["CREATE", "UPDATE"]
```

This policy verifies that secrets are actually encrypted before allowing creation.

Phase 7: Post-incident remediation

- [ ] Fixed encryption configuration
- [ ] Re-encrypted all 30,000 unencrypted secrets
- [ ] Verified sample secrets are now encrypted
- [ ] Enabled hourly verification test
- [ ] Updated runbook for future encryption issues
- [ ] Team training on EncryptionConfiguration gotchas

Follow-up: How would you perform a key rotation for encrypted secrets without downtime? Design a graceful transition from old encryption key to new one.

You're managing secrets for 200+ microservices across staging and production. Services need credentials for: databases, cloud APIs, third-party SaaS, internal services. Currently all secrets are stored as Kubernetes Secrets, but this creates problems: secrets are scattered, hard to audit, and developers keep asking for access. You want to move to a centralized secrets manager (Vault). Design a migration strategy that keeps both working in parallel.

Migrating from Kubernetes Secrets to Vault is a big architectural change. The safest approach is running both in parallel, gradually moving services to Vault.

Phase 1: Design the target architecture

```
┌──────────────────────────────────────────┐
│ HashiCorp Vault                          │
│  /secret/data/prod/db-password           │
│  /secret/data/prod/stripe-api-key        │
│  /secret/data/staging/test-db-pass       │
│ Features: audit logs, MFA, TTL, rotation │
└──────────────────────────────────────────┘
                  ↑
                  │ (pod auth via Kubernetes ServiceAccount)
┌──────────────────────────────────────────┐
│ Kubernetes Pod                           │
│  Vault Agent sidecar pulls secrets       │
│  OR                                      │
│  External Secrets Operator syncs         │
│  Vault → K8s Secrets                     │
└──────────────────────────────────────────┘
```

Phase 2: Deploy Vault in the cluster

1. Install Vault via Helm:

```bash
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install vault hashicorp/vault \
  --set server.dataStorage.size=10Gi \
  --set server.ha.enabled=true \
  --set server.ha.replicas=3 \
  --namespace vault \
  --create-namespace
```

2. Initialize Vault:

```bash
kubectl exec -n vault vault-0 -- vault operator init \
  -key-shares=5 \
  -key-threshold=3
# Save the unseal keys securely!
```

3. Unseal Vault (three of the five key shares):

```bash
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-1>
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-2>
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-3>
```

4. Configure Kubernetes auth:

```bash
kubectl exec -n vault vault-0 -- vault auth enable kubernetes
kubectl exec -n vault vault-0 -- vault write auth/kubernetes/config \
  kubernetes_host=https://kubernetes.default.svc.cluster.local:443 \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
```

Phase 3: Set up External Secrets Operator (for parallel running)

1. Install External Secrets:

```bash
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace
```

2. Create a ClusterSecretStore (the Vault connection):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc.cluster.local:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "default"
```

3. Create an ExternalSecret to sync from Vault to K8s:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials   # Creates a K8s Secret with this name
    creationPolicy: Owner
  data:
    - secretKey: password
      remoteRef:
        key: prod/db-password
```

This automatically syncs the Vault secret → a Kubernetes Secret. Pods can consume the K8s Secret as before, but the data now comes from Vault.
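The sync the operator performs can be modeled in a few lines; in-memory dicts stand in for Vault's KV store and the Kubernetes Secret store, so this is a toy model of the control loop, not the operator's API:

```python
# Toy model of the Vault -> Kubernetes sync an ExternalSecret performs:
# on each refresh tick, read the remote key and upsert the target Secret.
def sync_once(vault, k8s, remote_key, target_name, data_key):
    value = vault.get(remote_key)
    if value is None:
        return False                                 # remote secret missing: no-op
    k8s.setdefault(target_name, {})[data_key] = value  # create or update target
    return True
```

Because the loop is idempotent, re-running it after a Vault-side rotation converges the K8s Secret to the new value on the next tick.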

Phase 4: Migrate secrets incrementally
Wave 1: Non-critical, staging secrets (Week 1)

  1. For each staging service:
    • Create Vault path: /secret/data/staging/service-name/*
    • Populate with secrets from K8s
    • Deploy ExternalSecret
    • Verify pods work with synced secrets
    • Delete old K8s Secret

Wave 2: Non-critical, production secrets (Week 2)

  1. Same process, but with production services
  2. Schedule during low-traffic window
  3. Have rollback plan

Wave 3: Critical production secrets (Week 3-4)

  1. Take more time, extra validation
  2. A/B test: 50% of pods using Vault, 50% using K8s Secret
  3. Monitor error rates before full migration

PARALLEL STATE (safest): some services still use native K8s Secrets while others are synced from Vault via ExternalSecret. Both work simultaneously.

Phase 5: Set up audit logging in Vault

```bash
kubectl exec -n vault vault-0 -- vault audit enable file file_path=/vault/logs/audit.log
```

Now all secret access is logged:

- Who accessed the secret
- When
- From what IP
- What action (read, create, delete)

Phase 6: Secret rotation policy

1. For database credentials (which need coordinated rotation), a rotation policy along these lines:

```yaml
apiVersion: pam.vault.io/v1alpha1
kind: RotationPolicy
metadata:
  name: db-rotation
spec:
  secretPath: secret/data/prod/db-password
  rotationInterval: 30d   # Rotate every 30 days
  rotation:
    databaseConnection:
      engine: postgresql
      connectionUri: "postgres://root:password@db.internal:5432/postgres"
      rotationStatements:
        - "ALTER USER app_user WITH PASSWORD '{{ password }}'"
```

Vault then automatically:

1. Generates a new password
2. Updates the database
3. Updates the secret in Vault
4. Pods pick up the new secret via the ExternalSecret refresh

2. For API keys (which can be rotated independently):

```bash
kubectl exec -n vault vault-0 -- vault kv put secret/prod/stripe-api-key \
  value=$(curl -H "Authorization: Bearer $STRIPE_TOKEN" https://api.stripe.com/v1/api_keys/generate | jq -r '.key')
```

Phase 7: Access control with policies

Create fine-grained Vault policies (loaded with `vault policy write <name> <file>.hcl`):

```hcl
# developer.hcl - developers can only read staging secrets, not production
path "secret/data/staging/*" {
  capabilities = ["read", "list"]
}

# production-reader.hcl - production services can read their secrets
path "secret/data/prod/*" {
  capabilities = ["read"]
}

# sre.hcl - SREs have full access to the secrets mount
path "secret/*" {
  capabilities = ["read", "list", "create", "update", "delete"]
}

# admin.hcl - root-level access, cluster admins only
path "*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
```

Phase 8: Monitoring and alerting

```yaml
- alert: VaultSealed
  expr: vault_unsealed_status == 0
  for: 1m
  annotations:
    summary: "Vault is sealed - pods can't retrieve secrets!"

- alert: UnauthorizedVaultAccess
  expr: rate(vault_audit_log_error_total[5m]) > 0.1
  annotations:
    summary: "Unauthorized access attempts to Vault"

- alert: ExternalSecretSyncFailure
  expr: externalsecret_status_sync_failures_total > 0
  for: 5m
  annotations:
    summary: "ExternalSecret failed to sync from Vault"
```

Final state after migration:

OLD (Kubernetes Secrets only):
  • Secrets scattered across namespaces
  • No audit trail
  • Hard to rotate
  • Developers have broad Secret access via RBAC

NEW (Vault + ExternalSecret):
  • Centralized secret management
  • Full audit trail of all access
  • Automatic rotation
  • Fine-grained access control per service
  • Secrets encrypted at rest in Vault
  • Can be used for non-Kubernetes systems too

Follow-up: How would you handle a situation where a developer needs to access a production secret in an emergency (e.g., to debug a production issue)? Design a "break glass" approval workflow in Vault.

A developer committed a secret (database password) to Git by accident. The commit is in your public GitHub repo. Yes, your pre-commit hooks were supposed to catch this, but they failed. Thousands of clones have happened. What's your response? How do you assess damage? What do you do in hours 0-1, 1-4, and beyond?

This is a critical incident. The secret is now visible to anyone who cloned the repo, and may be indexed by search engines or security scanners.

Hour 0 (immediate, first 5 minutes):

1. Confirm the leak:

```bash
git log --oneline | grep -i secret
git show | grep -E 'password|key|token'
# Identify exactly what was leaked
```

2. Revoke the secret IMMEDIATELY:
   - If a database password: change it NOW
   - If an API key: revoke it NOW
   - If an AWS credential: disable it NOW
   - If an OAuth token: invalidate it NOW

Command example (PostgreSQL):

```sql
ALTER USER app_user WITH PASSWORD 'new-secure-password-12345678';
```

3. Notify stakeholders: "Database password for production has been exposed in GitHub. Password has been rotated. No unauthorized access detected yet. Starting incident response."

Hour 0-1 (first hour):

1. Rewrite Git history (remove the secret from all commits). DO NOT do this lightly - it breaks clones for everyone - but it is necessary to remove the secret from GitHub history.

Option A: BFG Repo-Cleaner (simpler):

```bash
bfg --delete-files production.env repo
git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

Option B: git filter-branch (more control):

```bash
git filter-branch -f --tree-filter 'rm -f production.env' HEAD
```

Force push to the remote (WARNING: breaks all existing clones):

```bash
git push origin --force --all
```

2. Verify the secret is gone from GitHub history:

```bash
curl https://api.github.com/repos/company/repo/commits | jq -r '.[].sha' | \
while read sha; do
  git show $sha | grep -q "password=secret123" && echo "FOUND in $sha"
done
# Should find nothing now
```
3. Monitor for secret use:

a) Database logs - any failed auth attempts from new IPs?

```bash
tail -f /var/log/postgresql/postgresql.log | grep -E 'auth.*fail|invalid.*password'
```

b) Application logs:

```bash
kubectl logs -f -n production -l app=myapp | grep -E 'connection.*refused|auth.*fail'
```

c) Cloud provider (AWS) - check for unusual API calls:

```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAIOSFODNN7EXAMPLE
```
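Once audit events are exported, a first-pass filter for anomalous use of the leaked key can be sketched as follows; the event field names loosely follow CloudTrail's JSON but should be treated as assumptions for this sketch:

```python
# Flag audit events that use a given access key from an IP outside the
# known egress set. Event shape is an assumption modeled on CloudTrail.
def suspicious_events(events, access_key, known_ips):
    """Return events where the leaked key was used from an unknown IP."""
    return [
        e for e in events
        if e.get("accessKeyId") == access_key
        and e.get("sourceIPAddress") not in known_ips
    ]
```

Any hit here upgrades the incident from "file exposed" to "credential actively used".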

4. Check whether the secret was exposed to search engines or scrapers: search GitHub code search and Google for the literal value, and check paste sites. Credentials pushed to public repos are typically harvested by bots within minutes.

Hour 1-4 (first 4 hours):

1. Deep forensic investigation:
   a) Who has access to the repo (collaborators)?
   b) Who cloned the repo in the last 30 days?
   c) When was the secret first exposed?
   d) Is there evidence of the secret being used by attackers?

2. Check GitHub audit logs: GitHub Settings → Security log. Look for unusual clone IPs and API calls from unknown locations.

3. Query cloud provider audit logs:

```bash
aws cloudtrail get-event-selectors --trail-name production-trail
```

Look for unauthorized DB access, IAM policy changes, and newly created users.

4. Database forensics - check connection logs for source IPs using the exposed credentials:

```sql
SELECT * FROM pg_stat_statements WHERE query LIKE '%app_user%';
SHOW log_statement;  -- Verify statement logging is enabled
```

5. Rotate all related secrets:
   • New database password
   • Update the Kubernetes Secret
   • Restart pods that use this secret
   • Verify no connection errors in logs

6. Update pre-commit hooks to PREVENT future incidents. Install git-secrets or similar:

```bash
git secrets --install
git secrets --register-aws
git secrets --add 'password|api[-_]?key|token'
```

Hour 4+ (ongoing):

1. Incident post-mortem:
   - How did the secret get committed?
   - Why did the pre-commit hooks fail?
   - How do we prevent this in the future?
   - What is the cost of this incident?

2. Implement preventive measures:

   a) Secret scanning in GitHub (enable the free tier): Settings → Security → Secret scanning. This alerts when secrets are detected in new commits.

   b) Mandatory secret scanning in CI/CD:
      • truffleHog: scans for secrets
      • detect-secrets: discovers secrets in code
      • gitleaks: finds secrets in git history

      Layer the checks in the pipeline:
      • Pre-commit check (blocks the commit locally)
      • Pre-push check (runs locally)
      • CI check (blocks PR merge)

   c) Access control:
      • Restrict who can push to main/master
      • Require PR reviews (including security review)
      • Allow merges only through protected branches

3. Monitoring going forward:
   • GitHub secret scanning alerts
   • Vault audit logs (if using Vault)
   • Database connection anomalies
   • API key usage patterns

Assessment of damage (depends on findings):

Scenario A: No unauthorized access detected
   • Risk is low if the secret was rotated quickly
   • The attacker likely did not use the secret in time
   • Action: document the incident, implement preventive measures

Scenario B: Evidence of unauthorized access
   • Critical incident: assume full compromise
   • Action: follow Phases 1-5 of the etcd breach playbook
   • Rotate ALL secrets, investigate affected systems

Scenario C: Secret was indexed by search engines
   • Search engines: request removal via Google Search Console
   • Pastebin/mirrors: request removal
   • Risk is higher; assume the secret may have been accessed
   • Rotate more aggressively
Follow-up: How would you design a developer experience where they never need to handle raw secrets in code or config? Design a secrets injection framework.

You're implementing a secrets rotation policy: all API keys rotate every 30 days, all database passwords rotate every 14 days. Your automation works fine, but you hit a problem: the new secret is created in Vault, but old services still use the old secret and can't connect. You have thousands of pods. Manual restart isn't scalable. How do you handle seamless rotation without service disruption?

Seamless secret rotation requires: (1) dual credential support during rotation, (2) rapid secret refresh in pods, (3) health checking to verify connectivity after rotation.

Challenge: Kubernetes doesn't automatically update environment variables or mounted files when a Secret changes. Pods keep using stale values until restarted.

Solution 1: Grace period with dual credentials

1. During rotation, both the old and new credentials work:

a) Create the new credential in the target system:
   - PostgreSQL: CREATE USER app_user_new WITH PASSWORD 'new-password'
   - AWS: create a new access key
   - API provider: generate a new key

b) Update Vault with BOTH credentials:

```bash
vault kv put secret/prod/db-password \
  username=app_user_new \
  password=new-password \
  username_old=app_user_old \
  password_old=old-password
```

c) Pods read the new credential immediately (no restart needed when using a Vault Agent sidecar)

d) Keep the old credential active for a 1-hour grace period

e) After the grace period, revoke the old credential:

```sql
ALTER USER app_user_old NOLOGIN;  -- Disable the old user
```
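The grace-period rule in steps (c)-(e) amounts to a small validity state machine. A sketch with illustrative names; timestamps are plain floats (seconds) to keep it self-contained:

```python
# State sketch of the dual-credential grace period: after cutover both the
# old and new credentials validate; once the grace window passes, only the
# new one does.
class DualCredential:
    def __init__(self, old, new, cutover_ts, grace_seconds=3600):
        self.old, self.new = old, new
        self.cutover_ts = cutover_ts
        self.grace_seconds = grace_seconds

    def is_valid(self, credential, now_ts):
        if credential == self.new:
            return now_ts >= self.cutover_ts          # valid from cutover on
        if credential == self.old:
            return now_ts < self.cutover_ts + self.grace_seconds
        return False                                   # unknown credential
```

Pods still holding the old credential keep working through the window, which is exactly what makes the rotation non-disruptive.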

2. Implementation with Vault Agent injection:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "app"
    vault.hashicorp.com/agent-inject-secret-database: "secret/data/prod/db-password"
    vault.hashicorp.com/agent-cache-enable: "true"
spec:
  serviceAccountName: app
  containers:
    - name: app
      image: myapp:latest
```

The pod runs a vault-agent sidecar that re-fetches secrets on its refresh interval, and the application reads from a templated file that is updated in place.

Vault Agent template (rendered in real time):

```
{{ with secret "secret/data/prod/db-password" -}}
export DB_USER="{{ .Data.data.username }}"
export DB_PASSWORD="{{ .Data.data.password }}"
{{- end }}
```

Solution 2: Automatic pod restart via ExternalSecret rotation detection

1. When the ExternalSecret refreshes, stamp the target Secret with a content-derived label:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 5m   # Check Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
    # IMPORTANT: a label that changes with the secret content
    template:
      metadata:
        labels:
          secret-hash: "{{ .password | sha256sum | trunc 8 }}"
```

2. Deploy a controller that watches for secret changes and restarts consumers:

```python
# watcher.py - polls the Secret and restarts the Deployment when it changes
import hashlib
import subprocess
import time

def get_secret_hash(secret_name, namespace):
    result = subprocess.run(
        ["kubectl", "get", "secret", secret_name, "-n", namespace,
         "-o", "jsonpath={.data.password}"],
        capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()

old_hash = None
while True:
    current_hash = get_secret_hash("db-credentials", "production")
    if old_hash and current_hash != old_hash:
        # Secret changed! Restart the consuming pods
        subprocess.run(["kubectl", "rollout", "restart",
                        "deployment/app", "-n", "production"])
    old_hash = current_hash
    time.sleep(60)
```

3. Deploy the watcher:

```bash
kubectl apply -f secret-rotation-watcher-job.yaml
```

Solution 3: Service mesh integration (most elegant)

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: database-mtls
spec:
  host: db.prod.svc.cluster.local
  trafficPolicy:
    tls:
      mode: MUTUAL   # mTLS
      clientCertificate: /etc/ssl/certs/client-cert.pem
      clientKey: /etc/ssl/certs/client-key.pem
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
        h2UpgradePolicy: UPGRADE
```

With mTLS via the service mesh:

- Certificates are managed by Istio and auto-rotated
- No application code changes needed
- Pods don't restart; Istio handles cert refresh
- The connection pool absorbs transient failures during rotation

Solution 4: Health checks + graceful degradation

During rotation, some pods may temporarily fail:

1. Implement retry logic in the application:

```python
import os
import time
import psycopg2

def connect_to_db(max_retries=3):
    for attempt in range(max_retries):
        try:
            return psycopg2.connect(
                dbname="mydb",
                user=os.getenv("DB_USER"),
                password=os.getenv("DB_PASSWORD"),
                host="db.prod.svc.cluster.local",
            )
        except psycopg2.OperationalError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
```

2. A Kubernetes liveness probe checks connectivity:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        psql -h db.prod.svc.cluster.local -U $DB_USER -d mydb -c "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```

If a pod fails the liveness check (cannot connect with the rotated secret), Kubernetes restarts it automatically. On restart it reads the new value from the K8s Secret (which was updated by the ExternalSecret).

Solution 5: Zero-downtime rotation orchestration

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: secret-rotation-orchestrator
spec:
  schedule: "0 2 * * *"   # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rotator
              image: rotation-orchestrator:latest
              env:
                - name: GRACE_PERIOD_SECONDS
                  value: "3600"   # 1 hour grace period
                - name: BATCH_SIZE
                  value: "10"     # Restart 10 pods at a time
```

The orchestration script the rotator image runs:

```bash
#!/bin/bash

# Step 1: Create the new credential in Vault
vault kv put secret/prod/db-password username_new=$NEW_USER password_new=$NEW_PASS

# Step 2: ExternalSecret automatically syncs the new credential
sleep 60  # Wait for the ExternalSecret refresh

# Step 3: Restart pods in batches (with health checks between)
for deployment in app1 app2 app3; do
  kubectl rollout restart deployment/$deployment -n production
  kubectl rollout status deployment/$deployment -n production --timeout=5m
done

# Step 4: Grace period (keep both credentials active)
sleep "$GRACE_PERIOD_SECONDS"

# Step 5: Revoke the old credential
vault kv patch secret/prod/db-password username_old="" password_old=""
# PostgreSQL: ALTER USER app_user_old NOLOGIN;
```
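The batch-restart-with-health-gate control flow from Step 3 can be factored out and tested on its own. In this sketch `restart` and `healthy` are injected callables (kubectl wrappers in real use), so the logic needs no cluster to verify:

```python
# Batch-restart sketch: restart deployments in batches of `batch_size` and
# stop at the first batch whose health gate fails, so a bad rotation does
# not spread to the whole fleet.
def rolling_restart(deployments, restart, healthy, batch_size=10):
    restarted = []
    for i in range(0, len(deployments), batch_size):
        batch = deployments[i:i + batch_size]
        for d in batch:
            restart(d)
            restarted.append(d)
        if not all(healthy(d) for d in batch):
            return restarted, False   # halt on the first unhealthy batch
    return restarted, True
```

Halting on the first unhealthy batch is what gives the orchestrator its rollback window: the untouched deployments are still running against the grace-period credentials.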

Best practice checklist:

- [ ] Use Vault + ExternalSecret (auto-refresh secrets)
- [ ] Implement health checks for post-rotation verification
- [ ] Grace period: keep old + new credentials during rotation
- [ ] Batch pod restarts: avoid a thundering herd
- [ ] Monitor the rotation process: alert on failures
- [ ] Automatic rollback if rotation fails
- [ ] Test rotation in staging first
- [ ] Document the rotation procedure for manual override
- [ ] Audit all rotation events

Follow-up: What happens if a secret rotation fails midway (e.g., new credential creation fails, but old one was already revoked)? Design a rollback mechanism.

A developer on your team is asking for a way to access production secrets locally for debugging. Currently, it's impossible because secrets are only in Vault/Kubernetes. You want to enable this safely, but with strong guardrails: minimal privileges, temporary access, full audit trail, automatic expiry, and approval. Design a "dev escape hatch" for production secret access.

This is a critical security question: balancing operational agility (developers need to debug) with security (minimize exposure of production secrets).

Solution: Temporary elevated access with approval, TTL, and auditing
Architecture: Developer → Request tool → Approval workflow → Vault issues temporary credential → Automatic expiry

Step 1: Developer requests temporary secret access

```
$ vault-breakglass request secret:prod/db-password \
    --reason "Debug connection pool issue" \
    --duration 1h
Request ID: bkg-2024-04-07-001
Status: PENDING
```

Step 2: Approval workflow (goes to Slack/PagerDuty)

```
@on-call-sre please approve secret access for @alice
Reason: Debug connection pool issue
Secret: prod/db-password
Duration: 1 hour

[APPROVE] [DENY] [APPROVE WITH RESTRICTIONS]
```

Step 3: SRE reviews and approves: `/approve bkg-2024-04-07-001`

Step 4: Vault issues temporary credential

```
$ vault-breakglass retrieve bkg-2024-04-07-001
DB_PASSWORD=temporary-secret-only-valid-for-1-hour
Expires: 2024-04-07T20:00:00Z
```

This credential:

- Works only for 1 hour
- Can only read (not modify)
- Is tagged in audit logs
- Will auto-revoke at expiry

Step 5: Automatic cleanup

After 1 hour, Vault:

- Revokes the credential
- Logs the revocation
- Notifies the developer
- Removes access
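The expiry check that drives this cleanup can be as simple as a timestamp comparison. A minimal sketch, assuming `expiresAt` on the request status is an RFC 3339 timestamp like the one shown above:

```python
from datetime import datetime, timezone

def is_expired(expires_at_iso, now=None):
    """Compare the request's expiresAt (RFC 3339, e.g.
    "2024-04-07T20:00:00Z") against the current time."""
    # Python < 3.11 can't parse a trailing "Z", so normalize it first.
    expires = datetime.fromisoformat(expires_at_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now >= expires
```

A controller loop (or the `retrieve` path of the CLI) calls this before handing out the credential, and the cleanup job calls it to decide when to revoke.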

Implementation using Vault + external tooling:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: breakglass-policy
data:
  breakglass.hcl: |
    # Only used for temporary breakglass access
    path "secret/data/prod/*" {
      capabilities = ["read"]
      # Note: NO write, delete, or admin capabilities
    }
```

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: breakglass-approver
  namespace: vault-admin
```

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: breakglass-approver
rules:
  - apiGroups: ["vault.internal"]
    resources: ["breakglassrequests"]
    verbs: ["get", "list", "watch", "update"]
```

Breakglass request CRD

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: breakglassrequests.vault.internal
spec:
  group: vault.internal
  names:
    kind: BreakglassRequest
    plural: breakglassrequests
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                requesterEmail:
                  type: string
                secretPath:
                  type: string
                reason:
                  type: string
                durationSeconds:
                  type: integer
                maxDurationSeconds:
                  type: integer
                  default: 3600  # Max 1 hour
            status:
              type: object
              properties:
                approved:
                  type: boolean
                approverEmail:
                  type: string
                approvedAt:
                  type: string
                expiresAt:
                  type: string
                credentialId:
                  type: string
```

Approval workflow implementation (CLI tool):

```bash
#!/bin/bash
# vault-breakglass <request|retrieve> ...

case $1 in
  request)
    SECRET=$2
    REASON=$3
    DURATION=${4:-3600}

    # Validate inputs
    if [[ ! $SECRET =~ ^secret/prod/ ]]; then
      echo "ERROR: Can only request prod secrets"
      exit 1
    fi

    if [[ $DURATION -gt 3600 ]]; then
      echo "ERROR: Max duration is 1 hour"
      exit 1
    fi

    # Create a Kubernetes custom resource for the request
    kubectl apply -f - <<EOF
apiVersion: vault.internal/v1
kind: BreakglassRequest
metadata:
  name: bkg-$(date +%s)-$RANDOM
  namespace: vault-admin
spec:
  requesterEmail: $(git config user.email)
  secretPath: $SECRET
  reason: "$REASON"
  durationSeconds: $DURATION
EOF

    # Notify approvers via Slack webhook
    curl -X POST $SLACK_WEBHOOK -d '{
      "text": "New breakglass request",
      "blocks": [
        {"type": "section", "text": {"type": "mrkdwn",
         "text": "Secret: '$SECRET'\nReason: '$REASON'\nDuration: '$DURATION's"}},
        {"type": "actions", "elements": [
          {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
           "value": "approve", "action_id": "breakglass_approve"},
          {"type": "button", "text": {"type": "plain_text", "text": "Deny"},
           "value": "deny", "action_id": "breakglass_deny"}
        ]}
      ]
    }'
    ;;

  retrieve)
    REQUEST_ID=$2

    # Check if approved
    STATUS=$(kubectl get breakglassrequest $REQUEST_ID -o jsonpath='{.status.approved}')
    if [[ $STATUS != "true" ]]; then
      echo "ERROR: Request not approved"
      exit 1
    fi

    # Check if expired
    EXPIRES=$(kubectl get breakglassrequest $REQUEST_ID -o jsonpath='{.status.expiresAt}')
    if [[ $(date +%s) -gt $(date -d $EXPIRES +%s) ]]; then
      echo "ERROR: Request has expired"
      exit 1
    fi

    # Issue a temporary credential from Vault (generating an AppRole
    # secret-id is a write, and the value comes back under .data)
    vault write -format=json auth/approle/role/breakglass/secret-id \
      metadata="request_id=$REQUEST_ID" | jq -r '.data.secret_id'
    ;;
esac
```

Audit and compliance:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-audit-breakglass
data:
  audit-policy.json: |
    {
      "type": "file",
      "options": {
        "file_path": "/var/log/vault/audit.log",
        "log_raw": true
      }
    }
```

All breakglass access is logged:

  • Who requested
  • What secret
  • When
  • For how long
  • Who approved
  • What they actually read/did
  • When it expired

Monthly audit report:

```bash
vault audit logs | jq -s '
  [.[] | select(.auth.metadata.breakglass == "true")]
  | group_by(.auth.display_name)
  | map({user: .[0].auth.display_name, accesses: length})'
```
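The same aggregation can run in Python if jq is not available. A sketch assuming the file audit device writes one JSON object per line (field names mirror the jq pipeline above):

```python
import json
from collections import Counter

def breakglass_report(audit_lines):
    """Count breakglass accesses per user from Vault audit log lines
    (one JSON object per line)."""
    users = Counter()
    for line in audit_lines:
        entry = json.loads(line)
        auth = entry.get("auth") or {}
        if (auth.get("metadata") or {}).get("breakglass") == "true":
            users[auth.get("display_name", "unknown")] += 1
    return [{"user": u, "accesses": n} for u, n in users.most_common()]
```

Feeding it the audit file yields the per-user access counts for the monthly report.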

Compliance and guardrails:
- [ ] Max duration: 1 hour
- [ ] Reason: required and audited
- [ ] Approval: required (no self-approval)
- [ ] Scope: read-only access
- [ ] Audit: all access logged
- [ ] Auto-expiry: credential revoked after TTL
- [ ] Alerting: unusual patterns trigger alerts

Unusual patterns alert:

```yaml
- alert: HighBreakglassUsage
  expr: count(rate(vault_breakglass_requests_total[5m])) > 5
  annotations:
    summary: "More than 5 breakglass requests/min (unusual activity)"

- alert: LongDurationBreakglass
  expr: vault_breakglass_duration_seconds > 3600
  annotations:
    summary: "Breakglass request exceeded max duration"
```

Usage examples:

```bash
# Request prod database password for 30 minutes
vault-breakglass request secret:prod/db-password \
  --reason "Debug slow queries" --duration 1800

# List pending requests (for approvers)
kubectl get breakglassrequests -n vault-admin

# Retrieve the credential (works for 30 min, then auto-expires)
vault-breakglass retrieve bkg-2024-04-07-001
```

Follow-up: How would you detect if a developer abused their temporary breakglass access (e.g., exfiltrated secrets, modified data, or accessed secrets they shouldn't)? Design a detection system.

You're switching from storing database passwords in environment variables to using a secrets manager (Vault). The problem: existing deployments have the old passwords in ConfigMaps and environment variables. You can't delete them until all pods are restarted. But if you restart all pods at once, your cluster goes down. Design a safe migration where you run both old and new secrets in parallel.

Secrets migration requires zero-downtime switchover. You can't cut over abruptly.

Strategy:
Phase 1: Deploy Vault and sync secrets (both old and new active)
Phase 2: Update 10% of pods to read from Vault
Phase 3: Monitor, validate, then expand to 50%, then 100%
Phase 4: Delete old ConfigMaps/env vars once 100% migrated

Implementation:
The original deployment uses env vars:

```yaml
spec:
  containers:
    - name: app
      env:
        - name: DB_PASSWORD
          valueFrom:
            configMapKeyRef:
              name: db-creds
              key: password
```
Step 1: Deploy ExternalSecret alongside:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials-vault
spec:
  secretStoreRef:
    name: vault
  target:
    name: db-credentials-from-vault
  data:
    - secretKey: password
      remoteRef:
        key: prod/db-password
```
Step 2: Canary: 10% of pods use Vault secret
```yaml
spec:
  containers:
    - name: app
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials-from-vault  # NEW: from Vault
              key: password
```
Step 3: Monitor error rates, database connection errors, latency
If issues appear, revert by scaling the canary back
Step 4: Gradually expand to 25%, 50%, 100%
Step 5: Delete old ConfigMaps
kubectl delete configmap db-creds

Validation during each phase:
- Application logs show no connection errors
- Database connection pool healthy
- Vault audit logs show secret access
- No increase in error rate
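These checks can become an automated gate between phases, so the canary only expands when the metrics are clean. A minimal sketch; the metric names and thresholds are illustrative, not a real monitoring API:

```python
def safe_to_expand(metrics, max_error_rate=0.01, max_conn_errors=0):
    """Gate between canary phases: return True only if the canary's
    error rate and DB connection errors are within thresholds.
    `metrics` is a dict as scraped from your monitoring system."""
    return (metrics["error_rate"] <= max_error_rate
            and metrics["db_connection_errors"] <= max_conn_errors)
```

The rollout script calls this before moving from 10% to 25%, 25% to 50%, and so on; a False result triggers the revert path instead.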

Follow-up: What happens if a pod becomes unable to reach Vault during this migration? Design a fallback mechanism.

You manage secrets for 50 microservices. Each service has 5-10 secrets (API keys, database credentials, OAuth tokens, etc.). Developers keep asking for access to specific secrets ("I need the Stripe key to debug payment processing"). You can't give them production access directly, but denying breaks their productivity. Design a secure but frictionless model.

Balance: security without friction. Developers need enough access to be productive, but with guardrails.

Model:
- Developers read staging secrets freely (low security risk)
- Production secrets: read-only access via audit logs
- Sensitive operations (modify secrets): require approval
- Access automatically revokes after TTL

Tiered access:
TIER 1: Developer (Staging)

  • Read all staging secrets
  • No TTL (permanent)
  • No approval needed

TIER 2: Developer (Production, Debug)

  • Read-only access to specific secrets
  • 24-hour TTL
  • Reason required (audit trail)
  • No approval needed (auto-grant)

TIER 3: Developer (Production, Modify)

  • Change production secrets
  • 1-hour duration
  • Requires approval from Ops
  • Fully audited

TIER 4: Ops/SRE (All)

  • Full access to all secrets
  • No TTL
  • Automatic audit logging
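The four tiers can be captured as data so the request tooling enforces them uniformly. A sketch with illustrative names and values mirroring the text, not any real Vault API:

```python
# Tier -> access parameters, as described above (illustrative only).
TIERS = {
    "dev-staging":     {"paths": ["secret/staging/*"], "ttl": None,  "approval": False, "readonly": True},
    "dev-prod-debug":  {"paths": ["secret/prod/*"],    "ttl": "24h", "approval": False, "readonly": True},
    "dev-prod-modify": {"paths": ["secret/prod/*"],    "ttl": "1h",  "approval": True,  "readonly": False},
    "ops":             {"paths": ["secret/*"],         "ttl": None,  "approval": False, "readonly": False},
}

def requires_approval(tier):
    """Look up whether a tier's requests go through the approval flow."""
    return TIERS[tier]["approval"]
```

Keeping the tiers in one table means the CLI, the approval bot, and the audit reports all agree on what each tier allows.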

    Implementation with Vault:

```bash
# Dev reads a staging password (works immediately, no approval):
vault kv get secret/staging/myapp/db-password

# Dev reads a production password: read-only, logged, auto-granted;
# the 24-hour limit comes from the TTL on the issued token
vault kv get secret/prod/myapp/db-password

# Dev wants to rotate a production password: requires approval from
# SRE before the update is allowed (via the request tooling)
vault request change-secret secret/prod/myapp/db-password
```

Vault policy:

```hcl
# Developers can read staging freely
path "secret/staging/*" {
  capabilities = ["read", "list"]
}

# Read-only production (no create/update); the 24h limit is enforced
# via the token TTL, not in the KV policy itself
path "secret/prod/*" {
  capabilities = ["read", "list"]
}

# Modifications gated behind the approval workflow
path "secret/prod/*/modify" {
  capabilities = ["update"]
}
```

Monitoring suspicious patterns:
- Dev reading same secret 100x in 1 hour (exfiltration attempt?)
- Dev reading secrets outside their team
- Dev requesting modifications without reason
- Off-hours access patterns
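The first pattern, repeated reads of one secret, can be flagged with a sliding window over the audit log. A sketch using hypothetical `(timestamp, user, secret)` event tuples:

```python
from collections import defaultdict

def flag_exfiltration(events, threshold=100, window_seconds=3600):
    """Flag (user, secret) pairs read `threshold` or more times within
    the window. `events` is a time-sorted list of
    (timestamp_seconds, user, secret) tuples from the audit log."""
    flagged = set()
    reads = defaultdict(list)  # (user, secret) -> recent timestamps
    for ts, user, secret in events:
        key = (user, secret)
        window = reads[key]
        window.append(ts)
        # Drop reads that fell out of the sliding window.
        while window and window[0] <= ts - window_seconds:
            window.pop(0)
        if len(window) >= threshold:
            flagged.add(key)
    return flagged
```

The same loop extends naturally to the other patterns (cross-team reads, off-hours access) by adding per-event predicates.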

Follow-up: How do you prevent developers from exfiltrating production secrets they're allowed to read?
