Your 10 GB etcd backup containing your entire Kubernetes cluster state was accidentally uploaded to a public S3 bucket. Your cluster holds 50,000+ Kubernetes Secrets (API keys, database passwords, TLS certificates). An attacker has the backup. Answer immediately: are your secrets compromised? How do you respond? What's your investigation and remediation timeline?
Critical question: Were secrets encrypted at rest in etcd? The answer determines if you have a catastrophic incident or a containable breach.
Phase 1: Immediate triage (0-5 minutes)
1. Check whether etcd encryption was enabled. Encryption at rest is configured via a file referenced by a kube-apiserver flag, not a ConfigMap, so check the static pod manifest on a control plane node:
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# If the --encryption-provider-config flag is absent, encryption was NOT enabled = secrets are in plaintext in the backup
If encryption is NOT enabled (likely scenario): CRITICAL INCIDENT
All secrets are in plaintext in the etcd backup:
- Database credentials
- API keys (AWS, GitHub, Stripe)
- OAuth tokens
- TLS certificates
- Service account tokens
Attacker can:
- Immediately access all external services (cloud APIs, databases, SaaS)
- Impersonate any service account
- Decrypt traffic between services
- Escalate to full cluster compromise
Phase 2: Emergency response (5-30 minutes)
ASSUME FULL COMPROMISE. Do this in parallel:
- Inventory all secrets for revocation:
for secret in $(kubectl get secrets -A -o jsonpath='{.items[*].metadata.name}'); do
  # If it's a credential (api-key, password, token), mark it for revocation
  echo "REVOKE: $secret" >> /tmp/compromised-secrets.txt
done
- Rotate ALL external credentials:
  a) Database passwords:
     - Change all DB passwords
     - Update connection strings in ConfigMaps
     - Restart affected pods
  b) API keys:
     - Revoke old keys in Stripe, AWS, GitHub, etc.
     - Generate new keys
     - Update Kubernetes Secrets
     - Restart pods consuming these keys
  c) OAuth tokens:
     - Invalidate tokens at the OAuth provider
     - Generate new tokens
  d) TLS certificates:
     - If the CA private key was in the backup, all certificates are compromised
     - Issue new certificates from a new CA
     - This is EXPENSIVE but necessary
- Roll service account tokens:
kubectl create serviceaccount temp-sa -n production
Copy permissions from the old SA to temp-sa, update all pod specs to use temp-sa, and eventually delete the old SAs.
- Monitor for suspicious activity:
kubectl logs -n kube-system -l component=kube-apiserver --tail=500 | grep -E 'unauthorized|forbidden|token.*invalid'
Check cloud provider (AWS) for unusual API calls
Check databases for unauthorized access
Phase 3: Investigation (30 minutes - 2 hours)
1. Analyze the etcd backup to determine exposure (simulating the attacker's view):
# Restore the snapshot to a scratch data dir and start a local etcd against it
etcdctl snapshot restore /backup/etcd-snapshot --data-dir=/tmp/etcd-test
etcd --data-dir=/tmp/etcd-test --listen-client-urls http://127.0.0.1:2479 --advertise-client-urls http://127.0.0.1:2479 &
# Query the restored etcd for secrets (Kubernetes stores them under /registry/secrets/)
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/secrets/ --prefix --keys-only | wc -l
# Count: how many secrets were exposed?
- Classify exposed secrets:
  - High: database credentials, long-lived API keys (hard to rotate quickly)
  - Medium: OAuth tokens, temporary keys (can be rotated)
  - Low: TLS certs (can be replaced with a new CA)
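For bulk triage across 50,000+ exposed secrets, the tiers above can be approximated by name heuristics. A minimal sketch; the regex patterns and the default tier are illustrative assumptions, not a vetted ruleset:

```python
import re

# Illustrative name patterns for each risk tier (assumption: names hint at content)
RISK_PATTERNS = [
    ("high", re.compile(r"(db|database|postgres|mysql).*(pass|cred)|api[-_]?key", re.I)),
    ("medium", re.compile(r"oauth|session|temp", re.I)),
    ("low", re.compile(r"tls|cert|ca[-_]?bundle", re.I)),
]

def classify(secret_name: str) -> str:
    """Return the first matching tier; unknown names default to medium, never low."""
    for tier, pattern in RISK_PATTERNS:
        if pattern.search(secret_name):
            return tier
    return "medium"
```

Running this over the /tmp/compromised-secrets.txt inventory gives a first-cut rotation order to hand to the response team.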
- Determine impact per service. For each exposed credential:
- Which services use it?
- What damage could attacker do?
- How quickly can we rotate it?
- Is it used in critical path?
- Check logs for actual compromise (kubectl logs cannot query all namespaces at once; use your log aggregator, or loop per namespace):
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl logs -n $ns --all-containers=true --since=2h -l app --prefix 2>/dev/null
done | grep -E 'auth.*failed|unauthorized.*attempt|suspicious'
Did the attacker actually access something, or do they just have the file?
Phase 4: Remediation (2-8 hours)
Priority 1: Rotate HIGH-risk secrets (database creds, production API keys)
- Create the new secret in Kubernetes:
kubectl create secret generic db-credentials-new \
  --from-literal=password=$(openssl rand -base64 32) \
  -n production
- Update deployments to use the new secret:
kubectl patch deployment app -n production -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"app","env":[{"name":"DB_PASSWORD","valueFrom":{"secretKeyRef":{"name":"db-credentials-new","key":"password"}}}]}]}}}}'
- Restart pods:
kubectl rollout restart deployment/app -n production
- Update the external service (database, AWS account, etc.):
DB: ALTER USER app_user IDENTIFIED BY 'new-password';
AWS: Create new IAM access key, revoke old one
Stripe: Create new API key, revoke old one
Priority 2: Rotate MEDIUM-risk secrets (OAuth tokens, session keys)
- Similar process to Priority 1, but less critical path impact
Priority 3: Replace TLS certificates
- This is expensive (requires new CA if private key was exposed)
- Can be done in parallel with other rotations
- Rolling update of TLS secrets to all affected pods
Phase 5: Long-term hardening
Enable encryption at rest in etcd (SHOULD HAVE BEEN DONE):
- Generate an encryption key:
openssl rand -base64 32 > /etc/kubernetes/encryption-key
- Create the encryption config at /etc/kubernetes/encryption-config.yaml:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key from above>
      - identity: {}  # Fallback for reading still-unencrypted data (graceful migration)
- Add the flags to the kube-apiserver static pod manifest:
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
--encryption-provider-config-automatic-reload=true
- Restart kube-apiserver: it runs as a static pod, so editing /etc/kubernetes/manifests/kube-apiserver.yaml causes the kubelet to restart it automatically (there is no Deployment to roll).
- Re-encrypt existing secrets:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
This rewrites every secret through the API server, which stores them re-encrypted with the new key.
Now, if an etcd backup is exfiltrated again, the attacker gets only encrypted blobs:
etcdctl get /registry/secrets/production/db-credentials
# Value starts with k8s:enc:aescbc:v1:key1: - ciphertext, not plaintext credentials
Phase 6: Backup security hardening
1. Encrypt backups:
etcdctl snapshot save /backups/etcd-$(date +%s).db --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem
# Encrypt the backup with GPG
gpg --symmetric --cipher-algo AES256 /backups/etcd-*.db
- Restrict backup storage:
- Private S3 bucket with encryption-at-rest
- No public access
- Restricted IAM roles for access
- MFA required for bucket access
- Audit backup access:
aws s3api get-bucket-logging --bucket etcd-backups
# Enable S3 access logging and monitor for unexpected downloads
- Implement backup versioning and retention:
- Keep multiple versions (can’t delete all at once)
- Delete old backups after 90 days (or your retention policy)
- This limits exposure window if backup is found
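The retention rule can be expressed as a small helper for the pruning job; a sketch, assuming backup names map to known creation times:

```python
from datetime import datetime, timedelta

def expired_backups(backups, now, retention_days=90):
    """backups: mapping of backup name -> creation datetime.
    Returns the names past the retention window, sorted by name."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)
```

Anything this returns is older than the retention policy and should be deleted, which caps how stale a leaked backup can be.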
Phase 7: Post-incident checklist
[] All HIGH-risk secrets rotated
[] All MEDIUM-risk secrets rotated
[] TLS certificates replaced (if private key was exposed)
[] Encryption at rest enabled in etcd
[] etcd backups now encrypted
[] Backup storage hardened
[] No suspicious activity detected in logs (24h monitoring)
[] Team trained on secret management
[] Post-mortem scheduled (review how the backup got leaked)
Follow-up: How would you detect if an attacker actually used the compromised secrets? Design a monitoring and alerting strategy for anomalous credential usage.
You're implementing EncryptionConfiguration in Kubernetes to encrypt secrets at rest in etcd. A bug in your migration causes the system to mark all secrets as "unencrypted" even after re-encryption. You realize that 30,000 new secrets created post-migration are not actually encrypted. How do you detect this, quantify the damage, and fix it?
This is a subtle but critical bug: encryption configuration exists, but secrets aren't actually being encrypted. The system has no way to know which secrets are encrypted and which aren't.
Phase 1: Detection
1. Read a secret's raw value straight from etcd:
etcdctl get /registry/secrets/default/my-secret
# Expected if encrypted: binary gibberish prefixed with k8s:enc:aescbc:v1: (AES-CBC ciphertext)
# Actual if unencrypted: readable serialized object with plaintext values
2. Check the encryption configuration:
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# The flag can be present and secrets still stored in plaintext (e.g. identity listed first)
If you can read plaintext values in etcd, secrets are NOT encrypted.
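This check can be automated: values that kube-apiserver encrypted carry a `k8s:enc:<provider>:v1:<keyname>:` prefix in etcd, so a byte-prefix test on whatever `etcdctl get` returns distinguishes the two cases:

```python
def is_encrypted(raw_value: bytes) -> bool:
    """True if the raw etcd value carries the envelope prefix kube-apiserver
    writes for encrypted resources (e.g. b"k8s:enc:aescbc:v1:key1:...")."""
    return raw_value.startswith(b"k8s:enc:")
```

Looping this over every key under /registry/secrets/ yields an exact count of unencrypted secrets rather than an estimate from timestamps.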
Phase 2: Quantify the damage
1. List all secrets with creation timestamps:
kubectl get secrets -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.metadata.creationTimestamp)"' > /tmp/all-secrets.txt
- Select secrets created after the migration date (RFC 3339 timestamps sort lexically):
MIGRATION_DATE="2024-04-05T00:00:00Z"
awk -v cutoff="$MIGRATION_DATE" '$2 >= cutoff' /tmp/all-secrets.txt > /tmp/post-migration-secrets.txt
wc -l /tmp/post-migration-secrets.txt
# Count: ~30,000 unencrypted secrets
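The same timestamp filter in Python, for scripting the exposure report; RFC 3339 timestamps with a uniform Z suffix compare correctly as plain strings, so no date parsing is needed:

```python
MIGRATION_DATE = "2024-04-05T00:00:00Z"

def post_migration(lines, cutoff=MIGRATION_DATE):
    """lines: "namespace/name creationTimestamp" entries as in /tmp/all-secrets.txt.
    Returns the namespace/name of secrets created at or after the cutoff."""
    out = []
    for line in lines:
        name, _, created = line.strip().rpartition(" ")
        if created >= cutoff:
            out.append(name)
    return out
```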
Phase 3: Root cause analysis
1. Check encryption provider logs:
kubectl logs -n kube-system -l component=kube-apiserver | grep -i encrypt | tail -50
- Common bugs:
a) Encryption key not loaded properly:
   - Check the key file exists: ls -la /etc/kubernetes/encryption-key
   - Check file permissions: stat /etc/kubernetes/encryption-key
b) Identity provider listed first (it should be the fallback):
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}  # BUG: identity is first, so new writes are NOT encrypted
      - aescbc:
          keys:
            - name: key1
              secret: …
c) Automatic reload not working: kube-apiserver should have --encryption-provider-config-automatic-reload=true; if reload is broken, the old config persists until restart.
Phase 4: Fix (safely, without data loss)
1. FIX the encryption configuration order:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}  # Fallback AFTER the encryption provider
2. Update the kube-apiserver config and restart. kube-apiserver is a static pod, not a Deployment: edit /etc/kubernetes/manifests/kube-apiserver.yaml (or rely on automatic reload if enabled) and the kubelet restarts it. Then watch it come back:
kubectl get pods -n kube-system -l component=kube-apiserver -w
3. Force re-encryption of all secrets:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
This rewrites all secrets through the API server, storing them encrypted under the fixed config.
4. Verify encryption worked:
etcdctl get /registry/secrets/default/my-secret
# Should now show the k8s:enc: prefix and ciphertext, not plaintext
Phase 5: Verification that fix worked
1. Sample encrypted secrets:
kubectl get secrets -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | shuf | head -5 | \
while IFS=/ read ns name; do
  echo "=== $ns/$name ==="
  etcdctl get /registry/secrets/$ns/$name | hexdump -C | head -3
  # Should show binary/hex with a k8s:enc: prefix, not ASCII plaintext
done
- Write a new secret and verify it's encrypted:
kubectl create secret generic test-secret --from-literal=key=value
etcdctl get /registry/secrets/default/test-secret | od -c
# Should show ciphertext, not the plaintext string "value"
- The API server logs should show encryption being applied:
kubectl logs -n kube-system -l component=kube-apiserver --tail=100 | grep -iE 'secret.*encrypt|encrypt.*success'
Phase 6: Prevention for the future
1. Automated verification test:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: verify-encryption
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: alpine:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Create a test secret
                  TEST_SECRET_NAME="encryption-test-$(date +%s)"
                  kubectl create secret generic $TEST_SECRET_NAME --from-literal=test="plaintext-should-not-appear"
                  # Check whether it is encrypted in etcd
                  ETCD_DATA=$(etcdctl get /registry/secrets/default/$TEST_SECRET_NAME)
                  if echo "$ETCD_DATA" | grep -q "plaintext-should-not-appear"; then
                    echo "ALERT: Secrets are NOT encrypted!"
                    exit 1
                  fi
                  # If we get here, encryption is working
                  kubectl delete secret $TEST_SECRET_NAME
                  echo "Encryption verified OK"
- Policy as code (using tools like Kubewarden):
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: enforce-encrypted-secrets
spec:
  policyServer: default
  module: ghcr.io/kubewarden/secrets-encryption
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      operations: ["CREATE", "UPDATE"]
This policy gates secret creation; note that admission policies cannot see how etcd stores data, so treat this as a complementary control to the verification CronJob, not a replacement.
Phase 7: Post-incident remediation
[] Fixed encryption configuration
[] Re-encrypted all 30,000 unencrypted secrets
[] Verified sample secrets are now encrypted
[] Enabled hourly verification test
[] Updated runbook for future encryption issues
[] Team training on EncryptionConfiguration gotchas
Follow-up: How would you perform a key rotation for encrypted secrets without downtime? Design a graceful transition from old encryption key to new one.
You're managing secrets for 200+ microservices across staging and production. Services need credentials for: databases, cloud APIs, third-party SaaS, internal services. Currently all secrets are stored as Kubernetes Secrets, but this creates problems: secrets are scattered, hard to audit, and developers keep asking for access. You want to move to a centralized secrets manager (Vault). Design a migration strategy that keeps both working in parallel.
Migrating from Kubernetes Secrets to Vault is a big architectural change. The safest approach is running both in parallel, gradually moving services to Vault.
Phase 1: Design the target architecture
┌─────────────────────────────────────────┐
│ HashiCorp Vault │
│ ┌────────────────────────────────────┐ │
│ │ /secret/data/prod/db-password │ │
│ │ /secret/data/prod/stripe-api-key │ │
│ │ /secret/data/staging/test-db-pass │ │
│ └────────────────────────────────────┘ │
│ Features: audit logs, MFA, TTL, rotation │
└─────────────────────────────────────────┘
↑
│ (pod auth via Kubernetes ServiceAccount)
│
┌──────────────────────────┐
│ Kubernetes Pod │
│ Sidecar proxy: vault-agent pulls secrets from Vault
│ OR
│ External Secrets operator: syncs Vault→K8s Secrets
└──────────────────────────┘
Phase 2: Deploy Vault in the cluster
1. Install Vault via Helm:
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install vault hashicorp/vault \
  --set server.dataStorage.size=10Gi \
  --set server.ha.enabled=true \
  --set server.ha.replicas=3 \
  --namespace vault \
  --create-namespace
2. Initialize Vault and save the unseal keys securely:
kubectl exec -n vault vault-0 -- vault operator init \
  -key-shares=5 \
  -key-threshold=3
3. Unseal Vault (3 of the 5 key shares):
for i in 1 2 3; do
  kubectl exec -n vault vault-0 -- vault operator unseal KEY_$i
done
4. Configure Kubernetes auth:
kubectl exec -n vault vault-0 -- vault auth enable kubernetes
kubectl exec -n vault vault-0 -- vault write auth/kubernetes/config \
  kubernetes_host=https://kubernetes.default.svc.cluster.local:443 \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
Phase 3: Set up External Secrets Operator (for parallel running)
Install External Secrets:
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace
Create a ClusterSecretStore (Vault connection):
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc.cluster.local:8200"
      path: "secret"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "default"
Create an ExternalSecret to sync from Vault to K8s:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials  # Creates a K8s Secret with this name
    creationPolicy: Owner
  data:
    # Illustrative mapping: sync the "password" field of the Vault secret
    - secretKey: password
      remoteRef:
        key: prod/db-password
        property: password
This automatically syncs the Vault secret → Kubernetes Secret.
Pods use the K8s Secret as before, but the data comes from Vault.
Phase 4: Migrate secrets incrementally
Wave 1: Non-critical, staging secrets (Week 1)
- For each staging service:
- Create Vault path: /secret/data/staging/service-name/*
- Populate with secrets from K8s
- Deploy ExternalSecret
- Verify pods work with synced secrets
- Delete old K8s Secret
Wave 2: Non-critical, production secrets (Week 2)
- Same process, but with production services
- Schedule during low-traffic window
- Have rollback plan
Wave 3: Critical production secrets (Week 3-4)
- Take more time, extra validation
- A/B test: 50% of pods using Vault, 50% using K8s Secret
- Monitor error rates before full migration
PARALLEL STATE (safest): some services still use K8s Secrets, others sync from Vault via ExternalSecret. Both work simultaneously.
Phase 5: Set up audit logging in Vault
Audit devices are enabled at runtime via the Vault CLI, not via a ConfigMap:
kubectl exec -n vault vault-0 -- vault audit enable file file_path=/vault/logs/audit.log
Now all secret access is logged:
- Who accessed the secret
- When
- From what IP
- What action (read, create, delete)
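Pulling those fields out of one line of the file audit device looks roughly like this; the field paths follow Vault's JSON audit log format, but verify them against your Vault version before relying on them:

```python
import json

def summarize(entry_json: str) -> dict:
    """Extract who/when/what from a single Vault audit log entry (one JSON line)."""
    e = json.loads(entry_json)
    return {
        "time": e.get("time"),
        "actor": e.get("auth", {}).get("display_name"),
        "path": e.get("request", {}).get("path"),
        "operation": e.get("request", {}).get("operation"),
        "remote_address": e.get("request", {}).get("remote_address"),
    }
```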
Phase 6: Secret rotation policy
1. For database credentials (need coordinated rotation), use Vault's database secrets engine with a static role; rotation is native to the engine, so no custom CRD is needed (flag names below are worth verifying against your Vault version):
vault secrets enable database
vault write database/config/postgres \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/postgres" \
  allowed_roles="app-static" \
  username="root" \
  password="root-password"
# Rotate every 30 days
vault write database/static-roles/app-static \
  db_name=postgres \
  username="app_user" \
  rotation_period="720h" \
  rotation_statements="ALTER USER \"{{name}}\" WITH PASSWORD '{{password}}';"
Vault automatically:
1. Generates new password
2. Updates database
3. Updates secret in Vault
4. Pods pick up new secret via ExternalSecret refresh
- For API keys (can be rotated independently):
NEW_KEY=$(curl -H "Authorization: Bearer $STRIPE_TOKEN" https://api.stripe.com/v1/api_keys/generate | jq -r '.key')
kubectl exec -n vault vault-0 -- vault kv put secret/prod/stripe-api-key value="$NEW_KEY"
Phase 7: Access control with policies
Create fine-grained Vault policies:
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-policies
data:
  developer.hcl: |
    path "secret/data/staging/*" {
      capabilities = ["read", "list"]
    }
    # Developers can only read staging secrets, not production
  production-reader.hcl: |
    path "secret/data/prod/*" {
      capabilities = ["read"]
    }
    # Production services can read their secrets
  sre.hcl: |
    path "secret/*" {
      capabilities = ["read", "list", "create", "update", "delete"]
    }
    # SREs have full access
  admin.hcl: |
    path "*" {
      capabilities = ["create", "read", "update", "delete", "list", "sudo"]
    }
    # Only the cluster admin
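A toy model of how these policies gate a request. Real Vault adds policy templating, `+` segment wildcards, and deny precedence; this sketch only illustrates trailing-glob prefix matching against the patterns above:

```python
def allowed(policies, request_path, capability):
    """policies: {path pattern: [capabilities]}. "*" is treated as a trailing glob only."""
    for pattern, caps in policies.items():
        if pattern.endswith("*"):
            if request_path.startswith(pattern[:-1]) and capability in caps:
                return True
        elif request_path == pattern and capability in caps:
            return True
    return False

# The developer policy from above, as data
developer = {"secret/data/staging/*": ["read", "list"]}
```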
Phase 8: Monitoring and alerting
- alert: VaultSealed
  expr: vault_unsealed_status == 0
  for: 1m
  annotations:
    summary: "Vault is sealed - pods can't retrieve secrets!"
- alert: UnauthorizedVaultAccess
  expr: rate(vault_audit_log_error_total[5m]) > 0.1
  annotations:
    summary: "Unauthorized access attempts to Vault"
- alert: ExternalSecretSyncFailure
  expr: externalsecret_status_sync_failures_total > 0
  for: 5m
  annotations:
    summary: "ExternalSecret failed to sync from Vault"
Final state after migration:
OLD (Kubernetes Secrets only):
- Secrets scattered across namespaces
- No audit trail
- Hard to rotate
- Developers have broad Secret access via RBAC
NEW (Vault + ExternalSecret):
- Centralized secret management
- Full audit trail of all access
- Automatic rotation
- Fine-grained access control per service
- Secrets encrypted at rest in Vault
- Can use for non-Kubernetes systems too
Follow-up: How would you handle a situation where a developer needs to access a production secret in an emergency (e.g., to debug a production issue)? Design a "break glass" approval workflow in Vault.
A developer committed a secret (database password) to Git by accident. The commit is in your public GitHub repo. Yes, your pre-commit hooks were supposed to catch this, but they failed. Thousands of clones have happened. What's your response? How do you assess damage? What do you do in hours 0-1, 1-4, and beyond?
This is a critical incident. The secret is now visible to anyone who cloned the repo, and may be indexed by search engines or security scanners.
Hour 0 (immediate, first 5 minutes):
1. Confirm the leak:
git log --oneline | grep -i secret
git show <commit-sha>
2. Revoke the secret IMMEDIATELY:
- If database password: change it NOW. Example (PostgreSQL):
  ALTER USER app_user WITH PASSWORD 'new-secure-password-12345678';
- If API key: revoke it NOW
- If AWS credential: disable it NOW
- If OAuth token: invalidate it NOW
3. Notify stakeholders:
"Database password for production has been exposed in GitHub.
Password has been rotated. No unauthorized access detected yet.
Starting incident response."
Hour 0-1 (first hour):
1. Rewrite Git history (remove secret from all commits):
DO NOT do this lightly - it breaks clones for everyone
But it’s necessary to remove secret from GitHub history
Option A: BFG Repo-Cleaner (simpler):
bfg --delete-files production.env repo.git
cd repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Option B: git filter-branch (more control):
git filter-branch -f --tree-filter 'rm -f production.env' HEAD
Force push to the remote (WARNING: breaks all existing clones):
git push origin --force --all
- Verify the secret is gone from GitHub history:
curl -s https://api.github.com/repos/company/repo/commits | jq -r '.[].sha' | \
while read sha; do
  git show $sha | grep -q "password=secret123" && echo "FOUND in $sha"
done
# Should find nothing now
- Monitor for secret use:
a) Database logs:
tail -f /var/log/postgresql/postgresql.log | grep -E 'auth.*fail|invalid.*password'
Any failed auth attempts from new IPs?
b) Application logs:
kubectl logs -f -n production -l app=myapp | grep -E 'connection.*refused|auth.*fail'
c) Cloud provider (AWS):
aws cloudtrail lookup-events --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAIOSFODNN7EXAMPLE
# Check for unusual API calls
- Check if the secret was exposed to search engines:
- Google Cache: https://webcache.googleusercontent.com/…
- GitHub API history (check 3rd party services)
- Pastebin, GitLab mirrors, etc.
Hour 1-4 (first 4 hours):
1. Deep forensic investigation:
a) Who has access to the repo (collaborators)?
b) Who cloned the repo in the last 30 days?
c) When was the secret first exposed?
d) Is there evidence of the secret being used by attackers?
- Check GitHub audit logs: GitHub Settings → Security log. Look for unusual clone IPs and API calls from unknown locations.
- Query cloud provider audit logs:
aws cloudtrail get-event-selectors --trail-name production-trail
# Look for: unauthorized DB access, IAM policy changes, new users created
- Database forensics. Check live connections and logs for source IPs using the exposed credentials:
SELECT * FROM pg_stat_activity WHERE usename = 'app_user';
SHOW log_connections;  -- Verify connection logging is enabled
- Rotate all related secrets:
- New database password
- Update Kubernetes Secret
- Restart pods that use this secret
- Verify no connection errors in logs
- Update pre-commit hooks to PREVENT future incidents. Install git-secrets or similar:
git secrets --install
git secrets --register-aws
git secrets --add '(password|api[-_]?key|token)\s*='
Hour 4+ (ongoing):
1. Incident post-mortem:
- How did the secret get committed?
- Why did the pre-commit hooks fail?
- How do we prevent this in the future?
- What is the cost of this incident?
- Implement preventive measures:
a) Secret scanning in GitHub (enable the free tier): GitHub Settings → Security → Secret scanning. This alerts if secrets are detected in new commits.
b) Mandatory secret scanning in CI/CD:
- truffleHog: scans for secrets
- detect-secrets: discovers secrets in code
- gitleaks: finds secrets in git history
In CI/CD pipeline:
- Pre-push check (runs locally)
- Pre-commit check (blocks commit)
- CI check (blocks PR merge)
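At their core these scanners are regex engines over file contents. A minimal sketch with two illustrative rules (nowhere near the rulesets gitleaks or detect-secrets ship, which add entropy checks and hundreds of patterns):

```python
import re

# Illustrative detection rules; names and patterns are assumptions for this sketch
RULES = {
    "aws-access-key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic-assignment": re.compile(r"(?i)(password|api[-_]?key|token)\s*[:=]\s*\S+"),
}

def scan(lines):
    """Return (line number, rule name) for every match in the given lines."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for rule, rx in RULES.items():
            if rx.search(line):
                hits.append((lineno, rule))
    return hits
```

Wired into a pre-commit hook, a non-empty result blocks the commit.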
c) Access control:
- Restrict who can push to main/master
- Require PR reviews (including security review)
- Only protected branches allowed
- Monitoring going forward:
- GitHub secret scanning alerts
- Vault audit logs (if using Vault)
- Database connection anomalies
- API key usage patterns
Assessment of damage (depends on findings):
Scenario A: No unauthorized access detected
- Risk is low if the secret was rotated quickly
- The attacker likely didn't get the secret in time
- Action: document the incident, implement preventive measures
Scenario B: Evidence of unauthorized access
- Critical incident: assume full compromise
- Action: follow Phase 1-5 of the etcd breach playbook
- Rotate ALL secrets, investigate affected systems
Scenario C: Secret was indexed by search engines
- Search engines: request removal via Google Search Console
- Pastebin/mirrors: request removal
- Risk is higher, assume secret may have been accessed
- Rotate more aggressively
Follow-up: How would you design a developer experience where they never need to handle raw secrets in code or config? Design a secrets injection framework.
You're implementing a secrets rotation policy: all API keys rotate every 30 days, all database passwords rotate every 14 days. Your automation works fine, but you hit a problem: the new secret is created in Vault, but old services still use the old secret and can't connect. You have thousands of pods. Manual restart isn't scalable. How do you handle seamless rotation without service disruption?
Seamless secret rotation requires: (1) dual credential support during rotation, (2) rapid secret refresh in pods, (3) health checking to verify connectivity after rotation.
Challenge: Kubernetes doesn't automatically update environment variables or mounted files when a Secret changes. Pods keep using stale values until restarted.
Solution 1: Grace period with dual credentials
1. During rotation, both old and new credentials work:
a) Create the new credential in the target system (database, AWS, etc.)
   - PostgreSQL: CREATE USER app_user_new WITH PASSWORD 'new-password'
   - AWS: create a new access key
   - API provider: generate a new key
b) Update Vault with BOTH credentials:
vault kv put secret/prod/db-password \
  username=app_user_new \
  password=new-password \
  username_old=app_user_old \
  password_old=old-password
c) Pods read the new credential immediately (no restart needed if using a vault-agent sidecar)
d) Keep the old credential active for a 1 hour grace period
e) After the grace period, revoke the old credential:
ALTER USER app_user_old NOLOGIN; -- Disable the old user
Pod runs vault-agent sidecar
Vault-agent auto-refreshes secrets every 60 seconds
Application reads from templated file (updated in real-time)
Vault agent template (updates in real-time):
{{ with secret "secret/data/prod/db-password" -}}
export DB_USER="{{ .Data.data.username }}"
export DB_PASSWORD="{{ .Data.data.password }}"
{{ end }}
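On the client side, the grace period only helps if the application can use either credential. A sketch of the fallback behavior, where `connect` stands in for a real DB driver call that raises on bad credentials:

```python
def connect_with_fallback(connect, new_cred, old_cred):
    """Try the new credential first; during the grace period, fall back to the
    old one so pods work whichever side of the rotation they have seen."""
    try:
        return connect(new_cred)
    except PermissionError:
        # New credential not propagated yet (or rolled back): use the old one
        return connect(old_cred)
```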
Solution 2: Automatic pod restart via ExternalSecret rotation detection
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 5m  # Check Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
    # IMPORTANT: add a label that changes with the secret content
    template:
      metadata:
        labels:
          secret-hash: "{{ .password | sha256sum | trunc 8 }}"
Deploy a controller that watches for label changes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: secret-rotation-watcher
data:
  watcher.py: |
    import hashlib
    import subprocess
    import time

    def get_secret_hash(secret_name, namespace):
        result = subprocess.run(
            ['kubectl', 'get', 'secret', secret_name, '-n', namespace,
             '-o', 'jsonpath={.data.password}'],
            capture_output=True)
        return hashlib.sha256(result.stdout).hexdigest()

    old_hash = None
    while True:
        current_hash = get_secret_hash('db-credentials', 'production')
        if old_hash and current_hash != old_hash:
            # Secret changed! Restart pods so they pick up the new value
            subprocess.run(['kubectl', 'rollout', 'restart',
                            'deployment/app', '-n', 'production'])
        old_hash = current_hash
        time.sleep(60)
Deploy the watcher:
kubectl apply -f secret-rotation-watcher-job.yaml
Solution 3: Service mesh integration (most elegant)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: database-mtls
spec:
  host: db.prod.svc.cluster.local
  trafficPolicy:
    tls:
      mode: MUTUAL  # mTLS
      clientCertificate: /etc/ssl/certs/client-cert.pem
      clientKey: /etc/ssl/certs/client-key.pem
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
        h2UpgradePolicy: UPGRADE
With mTLS via service mesh:
- Certificates are managed by Istio, auto-rotated
- No application code changes needed
- Pods don't restart; Istio handles cert refresh
- Connection pool handles transient failures during rotation
Solution 4: Health checks + graceful degradation
During rotation, some pods may temporarily fail:
- Implement retry logic in the application:
def connect_to_db(max_retries=3):
    for attempt in range(max_retries):
        try:
            conn = psycopg2.connect(
                dbname="mydb",
                user=os.getenv("DB_USER"),
                password=os.getenv("DB_PASSWORD"),
                host="db.prod.svc.cluster.local")
            return conn
        except psycopg2.OperationalError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
- A Kubernetes liveness probe checks connectivity:
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - psql -h db.prod.svc.cluster.local -U $DB_USER -d mydb -c "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
If pod fails liveness check (can’t connect with rotated secret),
Kubernetes automatically restarts it. On restart, it reads new secret from K8s Secret
(which was updated by ExternalSecret)
Solution 5: Zero-downtime rotation orchestration
apiVersion: batch/v1
kind: CronJob
metadata:
  name: secret-rotation-orchestrator
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rotator
              image: rotation-orchestrator:latest
              env:
                - name: GRACE_PERIOD_SECONDS
                  value: "3600"  # 1 hour grace period
                - name: BATCH_SIZE
                  value: "10"  # Restart 10 pods at a time
              command:
                - /bin/bash
                - -c
                - |
                  # Step 1: Create the new credential in Vault
                  vault kv put secret/prod/db-password username_new=$NEW_USER password_new=$NEW_PASS
                  # Step 2: ExternalSecret automatically syncs the new credential
                  sleep 60  # Wait for ExternalSecret to refresh
                  # Step 3: Restart pods in batches (with health checks between)
                  for deployment in app1 app2 app3; do
                    kubectl rollout restart deployment/$deployment -n production
                    kubectl rollout status deployment/$deployment -n production --timeout=5m
                  done
                  # Step 4: Grace period (keep both credentials active)
                  sleep $GRACE_PERIOD_SECONDS
                  # Step 5: Revoke the old credential
                  vault kv patch secret/prod/db-password username_old="" password_old=""
                  # PostgreSQL: ALTER USER app_user_old NOLOGIN;
Best practice checklist:
[] Use Vault + ExternalSecret (auto-refresh secrets)
[] Implement health checks for post-rotation verification
[] Grace period: keep old + new credentials during rotation
[] Batch pod restarts: avoid thundering herd
[] Monitor rotation process: alert on failures
[] Automatic rollback if rotation fails
[] Test rotation in staging first
[] Document rotation procedure for manual override
[] Audit all rotation events
Follow-up: What happens if a secret rotation fails midway (e.g., new credential creation fails, but old one was already revoked)? Design a rollback mechanism.
A developer on your team is asking for a way to access production secrets locally for debugging. Currently, it's impossible because secrets are only in Vault/Kubernetes. You want to enable this safely, but with strong guardrails: minimal privileges, temporary access, full audit trail, automatic expiry, and approval. Design a "dev escape hatch" for production secret access.
This is a critical security question: balancing operational agility (developers need to debug) with security (minimize exposure of production secrets).
Solution: Temporary elevated access with approval, TTL, and auditing Step 1: Developer requests temporary secret access
$ vault-breakglass request secret:prod/db-password --reason "Debug connection pool issue" --duration 1h
Request ID: bkg-2024-04-07-001
Status: PENDING Step 2: Approval workflow (goes to Slack/PagerDuty)
@on-call-sre please approve secret access for @alice
Reason: Debug connection pool issue
Secret: prod/db-password
Duration: 1 hour [APPROVE] [DENY] [APPROVE WITH RESTRICTIONS] Step 3: SRE reviews and approves
/approve bkg-2024-04-07-001 Step 4: Vault issues temporary credential
$ vault-breakglass retrieve bkg-2024-04-07-001
DB_PASSWORD=temporary-secret-only-valid-for-1-hour
Expires: 2024-04-07T20:00:00Z This credential:
- Works only for 1 hour
- Can only read (not modify)
- Is tagged in audit logs
- Will auto-revoke at expiry Step 5: Automatic cleanup
After 1 hour, Vault:
- Revokes the credential
- Logs the revocation
- Notifies the developer
- Removes access
Implementation using Vault + external tooling:
apiVersion: v1
kind: ConfigMap
metadata:
  name: breakglass-policy
data:
  breakglass.hcl: |
    # Only used for temporary breakglass access
    path "secret/data/prod/*" {
      capabilities = ["read"]
      # Note: NO write, delete, or admin capabilities
    }
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: breakglass-approver
  namespace: vault-admin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: breakglass-approver
rules:
# Approvers can read requests and record decisions on their status
- apiGroups: ["vault.internal"]
  resources: ["breakglassrequests", "breakglassrequests/status"]
  verbs: ["get", "list", "update", "patch"]
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: breakglassrequests.vault.internal
spec:
  group: vault.internal
  names:
    kind: BreakglassRequest
    plural: breakglassrequests
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
Architecture:
Developer → Request tool → Approval workflow → Vault issues temporary credential → Automatic expiry
The BreakglassRequest CRD above tracks each request through this flow.
Approval workflow implementation (CLI tool):
#!/bin/bash
# vault-breakglass: request and retrieve temporary prod secret access
case $1 in
request)
  SECRET=$2
  REASON=$3
  DURATION=${4:-3600}

  # Validate inputs
  if [[ ! $SECRET =~ ^secret/prod/ ]]; then
    echo "ERROR: Can only request prod secrets"
    exit 1
  fi
  if [[ $DURATION -gt 3600 ]]; then
    echo "ERROR: Max duration is 1 hour"
    exit 1
  fi

  # Create a Kubernetes object for the request
  kubectl apply -f - <<EOF
apiVersion: vault.internal/v1
kind: BreakglassRequest
metadata:
  name: bkg-$(date +%s)-$RANDOM
  namespace: vault-admin
spec:
  requesterEmail: $(git config user.email)
  secretPath: $SECRET
  reason: "$REASON"
  durationSeconds: $DURATION
EOF

  # Notify approvers via Slack webhook
  curl -X POST $SLACK_WEBHOOK -d '{
    "text": "New breakglass request",
    "blocks": [
      {"type": "section", "text": {"type": "mrkdwn", "text": "Secret: '$SECRET'\nReason: '$REASON'\nDuration: '$DURATION's"}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Approve"}, "value": "approve", "action_id": "breakglass_approve"},
        {"type": "button", "text": {"type": "plain_text", "text": "Deny"}, "value": "deny", "action_id": "breakglass_deny"}
      ]}
    ]
  }'
  ;;
retrieve)
  REQUEST_ID=$2

  # Check if approved
  STATUS=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.status.approved}')
  if [[ $STATUS != "true" ]]; then
    echo "ERROR: Request not approved"
    exit 1
  fi

  # Check if expired
  EXPIRES=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.status.expiresAt}')
  if [[ $(date +%s) -gt $(date -d "$EXPIRES" +%s) ]]; then
    echo "ERROR: Request has expired"
    exit 1
  fi

  # Issue a short-lived, read-only Vault token tagged with the request ID
  DURATION=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.spec.durationSeconds}')
  vault token create -format=json -policy=breakglass -ttl="${DURATION}s" \
    -metadata="request_id=$REQUEST_ID" | jq -r '.auth.client_token'
  ;;
esac
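The /approve step needs a counterpart that records the decision on the BreakglassRequest object, since the retrieve path above reads .status.approved and .status.expiresAt. A hedged sketch; kubectl is stubbed here so the snippet runs standalone, and the status fields are assumptions matching the CLI above:

```shell
#!/bin/sh
# Sketch: the approve step as a status patch on the BreakglassRequest object.
# kubectl is stubbed for illustration; remove the stub to run this against a
# real cluster (kubectl >= 1.24 for --subresource=status).
kubectl() { echo "kubectl $*"; }   # stub: prints the command instead of running it

REQUEST_ID="bkg-2024-04-07-001"
DURATION=3600
# Compute the expiry timestamp (GNU date): now + duration, in RFC 3339 UTC
EXPIRES=$(date -u -d "@$(( $(date +%s) + DURATION ))" +%Y-%m-%dT%H:%M:%SZ)

kubectl patch breakglassrequest "$REQUEST_ID" -n vault-admin \
  --type=merge --subresource=status \
  -p "{\"status\":{\"approved\":true,\"expiresAt\":\"$EXPIRES\"}}"
```

Recording the expiry at approval time (rather than request time) keeps the TTL clock from starting before the SRE has acted.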
Audit and compliance:
Enable a file audit device (Vault audit devices are enabled via the CLI/API, not a Kubernetes ConfigMap):
vault audit enable file file_path=/var/log/vault/audit.log
Note: leave log_raw at its default (false); log_raw=true writes secret values to the audit log in plaintext, which defeats the purpose here.
All breakglass access is logged:
- Who requested
- What secret
- When
- For how long
- Who approved
- What they actually read/did
- When it expired
Monthly audit report (jq over the file audit device's JSON log; -s slurps the entries into an array so group_by works):
jq -s 'map(select(.auth.metadata.breakglass == "true"))
  | group_by(.auth.display_name)
  | map({user: .[0].auth.display_name, accesses: length})' /var/log/vault/audit.log
Compliance and guardrails:
[] Max duration: 1 hour
[] Reason: required and audited
[] Approval: required (no self-approval)
[] Scope: read-only access
[] Audit: all access logged
[] Auto-expiry: credential revoked after TTL
[] Alerting: unusual patterns trigger alerts
Unusual patterns alert (Prometheus rules; the metric names assume the breakglass tooling exports them):
- alert: HighBreakglassUsage
  expr: sum(rate(vault_breakglass_requests_total[5m])) * 60 > 5
  annotations:
    summary: "More than 5 breakglass requests/min (unusual activity)"
- alert: LongDurationBreakglass
  expr: vault_breakglass_duration_seconds > 3600
  annotations:
    summary: "Breakglass request exceeded max duration"
Usage examples:
# Request prod database password for 30 minutes
vault-breakglass request secret:prod/db-password --reason "Debug slow queries" --duration 1800
# List pending requests (for approvers)
kubectl get breakglassrequests -n vault-admin
# Retrieve the credential (works for 30 min, then auto-expires)
vault-breakglass retrieve bkg-2024-04-07-001
Follow-up: How would you detect if a developer abused their temporary breakglass access (e.g., exfiltrated secrets, modified data, or accessed secrets they shouldn't)? Design a detection system.
You're switching from storing database passwords in environment variables to using a secrets manager (Vault). The problem: existing deployments have the old passwords in ConfigMaps and environment variables. You can't delete them until all pods are restarted. But if you restart all pods at once, your cluster goes down. Design a safe migration where you run both old and new secrets in parallel.
Secrets migration requires zero-downtime switchover. You can't cut over abruptly.
Strategy:
Phase 1: Deploy Vault and sync secrets (both old and new active)
Phase 2: Update 10% of pods to read from Vault
Phase 3: Monitor, validate, then expand to 50%, then 100%
Phase 4: Delete old ConfigMaps/env vars once 100% migrated
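The phase percentages above translate into concrete pod counts per rollout step. A small sketch (TOTAL is an assumed fleet size for illustration; the division rounds up so every phase moves at least one pod):

```shell
#!/bin/sh
# Sketch: pods moved to the Vault-backed Secret in each canary phase.
pods_for_phase() {               # usage: pods_for_phase <total> <percent>
  echo $(( ($1 * $2 + 99) / 100 ))   # integer ceiling of total*percent/100
}

TOTAL=40                         # assumed fleet size
for pct in 10 50 100; do
  echo "phase ${pct}%: $(pods_for_phase "$TOTAL" "$pct")/${TOTAL} pods on Vault"
done
```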
Implementation:
Original deployment uses env vars:
spec:
containers:
- name: app
env:
- name: DB_PASSWORD
valueFrom:
configMapKeyRef:
name: db-creds
key: password
Step 1: Deploy ExternalSecret alongside:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials-vault
spec:
secretStoreRef:
name: vault
target:
name: db-credentials-from-vault
data:
- secretKey: password
remoteRef:
key: prod/db-password
Step 2: Canary: 10% of pods use Vault secret
spec:
containers:
- name: app
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-credentials-from-vault # NEW: from Vault
key: password
Step 3: Monitor error rates, database connection errors, latency
If issues arise: revert by scaling the canary back
Step 4: Gradually expand to 25%, 50%, 100%
Step 5: Delete old ConfigMaps
kubectl delete configmap db-creds
Validation during each phase:
- Application logs show no connection errors
- Database connection pool healthy
- Vault audit logs show secret access
- No increase in error rate
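A phase should expand only when these checks pass. A minimal promotion-gate sketch, where error_rate is a hypothetical stand-in for a real metrics query (e.g. against Prometheus):

```shell
#!/bin/sh
# Sketch: promotion gate between canary phases. The rollout expands only
# while the error rate stays under the threshold.
error_rate() { echo "0.2"; }     # stand-in: % failed requests in last 5m

gate_ok() {                      # usage: gate_ok <rate> <threshold>
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r+0 < t+0) }'
}

THRESHOLD="1.0"
rate=$(error_rate)
if gate_ok "$rate" "$THRESHOLD"; then
  echo "gate passed: error rate ${rate}% < ${THRESHOLD}%, expand canary"
else
  echo "gate failed: hold rollout and investigate"
fi
```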
Follow-up: What happens if a pod becomes unable to reach Vault during this migration? Design a fallback mechanism.
You manage secrets for 50 microservices. Each service has 5-10 secrets (API keys, database credentials, OAuth tokens, etc.). Developers keep asking for access to specific secrets ("I need the Stripe key to debug payment processing"). You can't give them production access directly, but denying breaks their productivity. Design a secure but frictionless model.
Balance: security without friction. Developers need enough access to be productive, but with guardrails.
Model:
- Developers read staging secrets freely (low security risk)
- Production secrets: read-only access via audit logs
- Sensitive operations (modify secrets): require approval
- Access automatically revokes after TTL
Tiered access:
TIER 1: Developer (Staging)
- Read all staging secrets
- No TTL (permanent)
- No approval needed
TIER 2: Developer (Production, Debug)
- Read-only, with TTL; logged, no approval (auto-grant policy)
TIER 3: Developer (Production, Modify)
- Requires approval from SRE; only then can they update
TIER 4: Ops/SRE (All)
Implementation with Vault (the --ttl flag and request subcommand are wrapper tooling around vault, not stock vault CLI):
# Dev reads a staging password: works immediately, no approval
vault kv get secret/staging/myapp/db-password
# Dev requests a production password (read-only): works with TTL, logged, no approval (auto-grant policy)
vault kv get secret/prod/myapp/db-password --ttl=24h
# Dev wants to rotate a production password: requires approval from SRE, only then can they update
vault request change-secret secret/prod/myapp/db-password
Vault policy:
# Developers can read staging freely
path "secret/staging/*" {
  capabilities = ["read", "list"]
}

# Read-only production (no create/update); the 24h limit is enforced
# via the token TTL, since ACL policy paths do not carry a ttl attribute
path "secret/prod/*" {
  capabilities = ["read", "list"]
}

# Modifications require approval (granted via a separate, approval-gated policy)
path "secret/prod/*/modify" {
  capabilities = ["update"]
}
Monitoring suspicious patterns:
- Dev reading same secret 100x in 1 hour (exfiltration attempt?)
- Dev reading secrets outside their team
- Dev requesting modifications without reason
- Off-hours access patterns
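The first pattern, repeated reads of one secret by one user, can be checked directly from an access log. A sketch assuming a simple "user secret-path" line format (a sample log is generated so it runs standalone; in practice you would feed it real audit entries):

```shell
#!/bin/sh
# Sketch: flag exfiltration-like patterns, i.e. more than THRESHOLD reads of
# the same secret by the same user within the window the log covers.
# The one-read-per-line "user secret-path" format is an assumption.
cat > /tmp/audit-sample.log <<'EOF'
alice secret/prod/myapp/db-password
alice secret/prod/myapp/db-password
alice secret/prod/myapp/db-password
bob secret/prod/other/api-key
EOF

THRESHOLD=2
awk -v t="$THRESHOLD" '
  { count[$1 " " $2]++ }                 # tally reads per (user, secret) pair
  END { for (k in count) if (count[k] > t) print "ALERT:", k, "reads=" count[k] }
' /tmp/audit-sample.log
```

In production the threshold would be per time window (e.g. 100 reads/hour, matching the first bullet above) and the alert would feed the same channel as the Prometheus rules.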
Follow-up: How do you prevent developers from exfiltrating production secrets they're allowed to read?