Your 10 GB etcd backup containing your entire Kubernetes cluster state was accidentally uploaded to a public S3 bucket. Your cluster holds 50,000+ Kubernetes Secrets (API keys, database passwords, TLS certificates). An attacker has the backup. Answer immediately: are your secrets compromised? How do you respond? What's your investigation and remediation timeline?
Critical question: Were secrets encrypted at rest in etcd? The answer determines if you have a catastrophic incident or a containable breach.
Phase 1: Immediate triage (0-5 minutes)
1. Check whether etcd encryption was enabled. Encryption at rest is configured via a file referenced by a kube-apiserver flag, not a ConfigMap, so check the static pod manifest on a control plane node:
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# If the --encryption-provider-config flag is absent, encryption was NOT enabled = secrets are in plaintext in the backup
If encryption is NOT enabled (likely scenario): CRITICAL INCIDENT
All secrets are in plaintext in the etcd backup:
- Database credentials
- API keys (AWS, GitHub, Stripe)
- OAuth tokens
- TLS certificates
- Service account tokens
Attacker can:
- Immediately access all external services (cloud APIs, databases, SaaS)
- Impersonate any service account
- Decrypt traffic between services
- Escalate to full cluster compromise
Phase 2: Emergency response (5-30 minutes)
ASSUME FULL COMPROMISE. Do this in parallel:
- Inventory all secrets for revocation:
for secret in $(kubectl get secrets -A -o jsonpath='{.items[*].metadata.name}'); do
  # If it's a credential (api-key, password, token), mark it for revocation
  echo "REVOKE: $secret" >> /tmp/compromised-secrets.txt
done
- Rotate ALL external credentials:
  a) Database passwords:
     - Change all DB passwords
     - Update connection strings in ConfigMaps
     - Restart affected pods
  b) API keys:
     - Revoke old keys in Stripe, AWS, GitHub, etc.
     - Generate new keys
     - Update Kubernetes Secrets
     - Restart pods consuming these keys
  c) OAuth tokens:
     - Invalidate tokens at the OAuth provider
     - Generate new tokens
  d) TLS certificates:
     - If the CA private key was in the backup, all certificates are compromised
     - Issue new certificates from a new CA
     - This is EXPENSIVE but necessary
- Roll service account tokens:
kubectl create serviceaccount temp-sa -n production
Copy permissions from the old SA to temp-sa, update all pod specs to use temp-sa, and eventually delete the old SAs.
- Monitor for suspicious activity:
kubectl logs -n kube-system -l component=kube-apiserver --tail=500 | grep -E 'unauthorized|forbidden|token.*invalid'
Check cloud provider (AWS) for unusual API calls
Check databases for unauthorized access
Phase 3: Investigation (30 minutes - 2 hours)
1. Analyze the etcd backup to determine exposure (simulating the attacker's view):
# Restore the snapshot to a scratch data dir and start a local etcd against it
etcdctl snapshot restore /backup/etcd-snapshot --data-dir=/tmp/etcd-test
etcd --data-dir=/tmp/etcd-test --listen-client-urls http://127.0.0.1:2479 --advertise-client-urls http://127.0.0.1:2479 &
# Query the restored etcd for secrets (Kubernetes stores them under /registry/secrets/)
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/secrets/ --prefix --keys-only | wc -l
# Count: how many secrets were exposed?
- Classify exposed secrets:
  - High: database credentials, long-lived API keys (hard to rotate quickly)
  - Medium: OAuth tokens, temporary keys (can be rotated)
  - Low: TLS certs (can be replaced with a new CA)
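For bulk triage across 50,000+ exposed secrets, the tiers above can be approximated by name heuristics. A minimal sketch; the regex patterns and the default tier are illustrative assumptions, not a vetted ruleset:

```python
import re

# Illustrative name patterns for each risk tier (assumption: names hint at content)
RISK_PATTERNS = [
    ("high", re.compile(r"(db|database|postgres|mysql).*(pass|cred)|api[-_]?key", re.I)),
    ("medium", re.compile(r"oauth|session|temp", re.I)),
    ("low", re.compile(r"tls|cert|ca[-_]?bundle", re.I)),
]

def classify(secret_name: str) -> str:
    """Return the first matching tier; unknown names default to medium, never low."""
    for tier, pattern in RISK_PATTERNS:
        if pattern.search(secret_name):
            return tier
    return "medium"
```

Running this over the /tmp/compromised-secrets.txt inventory gives a first-cut rotation order to hand to the response team.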
- Determine impact per service. For each exposed credential:
- Which services use it?
- What damage could attacker do?
- How quickly can we rotate it?
- Is it used in critical path?
- Check logs for actual compromise (kubectl logs cannot query all namespaces at once; use your log aggregator, or loop per namespace):
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl logs -n $ns --all-containers=true --since=2h -l app --prefix 2>/dev/null
done | grep -E 'auth.*failed|unauthorized.*attempt|suspicious'
Did the attacker actually access something, or do they just have the file?
Phase 4: Remediation (2-8 hours)
Priority 1: Rotate HIGH-risk secrets (database creds, production API keys)
- Create the new secret in Kubernetes:
kubectl create secret generic db-credentials-new \
  --from-literal=password=$(openssl rand -base64 32) \
  -n production
- Update deployments to use the new secret:
kubectl patch deployment app -n production -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"app","env":[{"name":"DB_PASSWORD","valueFrom":{"secretKeyRef":{"name":"db-credentials-new","key":"password"}}}]}]}}}}'
- Restart pods:
kubectl rollout restart deployment/app -n production
- Update the external service (database, AWS account, etc.):
DB: ALTER USER app_user IDENTIFIED BY 'new-password';
AWS: Create new IAM access key, revoke old one
Stripe: Create new API key, revoke old one
Priority 2: Rotate MEDIUM-risk secrets (OAuth tokens, session keys)
- Similar process to Priority 1, but less critical path impact
Priority 3: Replace TLS certificates
- This is expensive (requires new CA if private key was exposed)
- Can be done in parallel with other rotations
- Rolling update of TLS secrets to all affected pods
Phase 5: Long-term hardening
Enable encryption at rest in etcd (SHOULD HAVE BEEN DONE):
- Generate an encryption key:
openssl rand -base64 32 > /etc/kubernetes/encryption-key
- Create the encryption config at /etc/kubernetes/encryption-config.yaml:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key from above>
      - identity: {}  # Fallback for reading still-unencrypted data (graceful migration)
- Add the flags to the kube-apiserver static pod manifest:
--encryption-provider-config=/etc/kubernetes/encryption-config.yaml
--encryption-provider-config-automatic-reload=true
- Restart kube-apiserver: it runs as a static pod, so editing /etc/kubernetes/manifests/kube-apiserver.yaml causes the kubelet to restart it automatically (there is no Deployment to roll).
- Re-encrypt existing secrets:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
This rewrites every secret through the API server, which stores them re-encrypted with the new key.
Now, if an etcd backup is exfiltrated again, the attacker gets only encrypted blobs:
etcdctl get /registry/secrets/production/db-credentials
# Value starts with k8s:enc:aescbc:v1:key1: - ciphertext, not plaintext credentials
Phase 6: Backup security hardening
1. Encrypt backups:
etcdctl snapshot save /backups/etcd-$(date +%s).db --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem
# Encrypt the backup with GPG
gpg --symmetric --cipher-algo AES256 /backups/etcd-*.db
- Restrict backup storage:
- Private S3 bucket with encryption-at-rest
- No public access
- Restricted IAM roles for access
- MFA required for bucket access
- Audit backup access:
aws s3api get-bucket-logging --bucket etcd-backups
# Enable S3 access logging and monitor for unexpected downloads
- Implement backup versioning and retention:
- Keep multiple versions (can’t delete all at once)
- Delete old backups after 90 days (or your retention policy)
- This limits exposure window if backup is found
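The retention rule can be expressed as a small helper for the pruning job; a sketch, assuming backup names map to known creation times:

```python
from datetime import datetime, timedelta

def expired_backups(backups, now, retention_days=90):
    """backups: mapping of backup name -> creation datetime.
    Returns the names past the retention window, sorted by name."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)
```

Anything this returns is older than the retention policy and should be deleted, which caps how stale a leaked backup can be.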
Phase 7: Post-incident checklist
[] All HIGH-risk secrets rotated
[] All MEDIUM-risk secrets rotated
[] TLS certificates replaced (if private key was exposed)
[] Encryption at rest enabled in etcd
[] etcd backups now encrypted
[] Backup storage hardened
[] No suspicious activity detected in logs (24h monitoring)
[] Team trained on secret management
[] Post-mortem scheduled (review how the backup got leaked)
Follow-up: How would you detect if an attacker actually used the compromised secrets? Design a monitoring and alerting strategy for anomalous credential usage.
You're implementing EncryptionConfiguration in Kubernetes to encrypt secrets at rest in etcd. A bug in your migration causes the system to mark all secrets as "unencrypted" even after re-encryption. You realize that 30,000 new secrets created post-migration are not actually encrypted. How do you detect this, quantify the damage, and fix it?
This is a subtle but critical bug: encryption configuration exists, but secrets aren't actually being encrypted. The system has no way to know which secrets are encrypted and which aren't.
Phase 1: Detection
1. Read a secret's raw value straight from etcd:
etcdctl get /registry/secrets/default/my-secret
# Expected if encrypted: binary gibberish prefixed with k8s:enc:aescbc:v1: (AES-CBC ciphertext)
# Actual if unencrypted: readable serialized object with plaintext values
2. Check the encryption configuration:
grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# The flag can be present and secrets still stored in plaintext (e.g. identity listed first)
If you can read plaintext values in etcd, secrets are NOT encrypted.
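This check can be automated: values that kube-apiserver encrypted carry a `k8s:enc:<provider>:v1:<keyname>:` prefix in etcd, so a byte-prefix test on whatever `etcdctl get` returns distinguishes the two cases:

```python
def is_encrypted(raw_value: bytes) -> bool:
    """True if the raw etcd value carries the envelope prefix kube-apiserver
    writes for encrypted resources (e.g. b"k8s:enc:aescbc:v1:key1:...")."""
    return raw_value.startswith(b"k8s:enc:")
```

Looping this over every key under /registry/secrets/ yields an exact count of unencrypted secrets rather than an estimate from timestamps.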
Phase 2: Quantify the damage
1. List all secrets with creation timestamps:
kubectl get secrets -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.metadata.creationTimestamp)"' > /tmp/all-secrets.txt
- Select secrets created after the migration date (RFC 3339 timestamps sort lexically):
MIGRATION_DATE="2024-04-05T00:00:00Z"
awk -v cutoff="$MIGRATION_DATE" '$2 >= cutoff' /tmp/all-secrets.txt > /tmp/post-migration-secrets.txt
wc -l /tmp/post-migration-secrets.txt
# Count: ~30,000 unencrypted secrets
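The same timestamp filter in Python, for scripting the exposure report; RFC 3339 timestamps with a uniform Z suffix compare correctly as plain strings, so no date parsing is needed:

```python
MIGRATION_DATE = "2024-04-05T00:00:00Z"

def post_migration(lines, cutoff=MIGRATION_DATE):
    """lines: "namespace/name creationTimestamp" entries as in /tmp/all-secrets.txt.
    Returns the namespace/name of secrets created at or after the cutoff."""
    out = []
    for line in lines:
        name, _, created = line.strip().rpartition(" ")
        if created >= cutoff:
            out.append(name)
    return out
```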
Phase 3: Root cause analysis
1. Check encryption provider logs:
kubectl logs -n kube-system -l component=kube-apiserver | grep -i encrypt | tail -50
- Common bugs:
a) Encryption key not loaded properly:
   - Check the key file exists: ls -la /etc/kubernetes/encryption-key
   - Check file permissions: stat /etc/kubernetes/encryption-key
b) Identity provider listed first (it should be the fallback):
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}  # BUG: identity is first, so new writes are NOT encrypted
      - aescbc:
          keys:
            - name: key1
              secret: …
c) Automatic reload not working: kube-apiserver should have --encryption-provider-config-automatic-reload=true; if reload is broken, the old config persists until restart.
Phase 4: Fix (safely, without data loss)
1. FIX the encryption configuration order:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}  # Fallback AFTER the encryption provider
2. Update the kube-apiserver config and restart. kube-apiserver is a static pod, not a Deployment: edit /etc/kubernetes/manifests/kube-apiserver.yaml (or rely on automatic reload if enabled) and the kubelet restarts it. Then watch it come back:
kubectl get pods -n kube-system -l component=kube-apiserver -w
3. Force re-encryption of all secrets:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
This rewrites all secrets through the API server, storing them encrypted under the fixed config.
4. Verify encryption worked:
etcdctl get /registry/secrets/default/my-secret
# Should now show the k8s:enc: prefix and ciphertext, not plaintext
Phase 5: Verification that fix worked
1. Sample encrypted secrets:
kubectl get secrets -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | shuf | head -5 | \
while IFS=/ read ns name; do
  echo "=== $ns/$name ==="
  etcdctl get /registry/secrets/$ns/$name | hexdump -C | head -3
  # Should show binary/hex with a k8s:enc: prefix, not ASCII plaintext
done
- Write a new secret and verify it's encrypted:
kubectl create secret generic test-secret --from-literal=key=value
etcdctl get /registry/secrets/default/test-secret | od -c
# Should show ciphertext, not the plaintext string "value"
- The API server logs should show encryption being applied:
kubectl logs -n kube-system -l component=kube-apiserver --tail=100 | grep -iE 'secret.*encrypt|encrypt.*success'
Phase 6: Prevention for the future
1. Automated verification test:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: verify-encryption
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: verify
              image: alpine:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Create a test secret
                  TEST_SECRET_NAME="encryption-test-$(date +%s)"
                  kubectl create secret generic $TEST_SECRET_NAME --from-literal=test="plaintext-should-not-appear"
                  # Check whether it is encrypted in etcd
                  ETCD_DATA=$(etcdctl get /registry/secrets/default/$TEST_SECRET_NAME)
                  if echo "$ETCD_DATA" | grep -q "plaintext-should-not-appear"; then
                    echo "ALERT: Secrets are NOT encrypted!"
                    exit 1
                  fi
                  # If we get here, encryption is working
                  kubectl delete secret $TEST_SECRET_NAME
                  echo "Encryption verified OK"
- Policy as code (using tools like Kubewarden):
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: enforce-encrypted-secrets
spec:
  policyServer: default
  module: ghcr.io/kubewarden/secrets-encryption
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      operations: ["CREATE", "UPDATE"]
This policy gates secret creation; note that admission policies cannot see how etcd stores data, so treat this as a complementary control to the verification CronJob, not a replacement.
Phase 7: Post-incident remediation
[] Fixed encryption configuration
[] Re-encrypted all 30,000 unencrypted secrets
[] Verified sample secrets are now encrypted
[] Enabled hourly verification test
[] Updated runbook for future encryption issues
[] Team training on EncryptionConfiguration gotchas
Follow-up: How would you perform a key rotation for encrypted secrets without downtime? Design a graceful transition from old encryption key to new one.
You're managing secrets for 200+ microservices across staging and production. Services need credentials for: databases, cloud APIs, third-party SaaS, internal services. Currently all secrets are stored as Kubernetes Secrets, but this creates problems: secrets are scattered, hard to audit, and developers keep asking for access. You want to move to a centralized secrets manager (Vault). Design a migration strategy that keeps both working in parallel.
Migrating from Kubernetes Secrets to Vault is a big architectural change. The safest approach is running both in parallel, gradually moving services to Vault.
Phase 1: Design the target architecture
┌─────────────────────────────────────────┐
│ HashiCorp Vault │
│ ┌────────────────────────────────────┐ │
│ │ /secret/data/prod/db-password │ │
│ │ /secret/data/prod/stripe-api-key │ │
│ │ /secret/data/staging/test-db-pass │ │
│ └────────────────────────────────────┘ │
│ Features: audit logs, MFA, TTL, rotation │
└─────────────────────────────────────────┘
↑
│ (pod auth via Kubernetes ServiceAccount)
│
┌──────────────────────────┐
│ Kubernetes Pod │
│ Sidecar proxy: vault-agent pulls secrets from Vault
│ OR
│ External Secrets operator: syncs Vault→K8s Secrets
└──────────────────────────┘
Phase 2: Deploy Vault in the cluster
1. Install Vault via Helm:
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install vault hashicorp/vault \
  --set server.dataStorage.size=10Gi \
  --set server.ha.enabled=true \
  --set server.ha.replicas=3 \
  --namespace vault \
  --create-namespace
2. Initialize Vault and save the unseal keys securely:
kubectl exec -n vault vault-0 -- vault operator init \
  -key-shares=5 \
  -key-threshold=3
3. Unseal Vault (3 of the 5 key shares):
for i in 1 2 3; do
  kubectl exec -n vault vault-0 -- vault operator unseal KEY_$i
done
4. Configure Kubernetes auth:
kubectl exec -n vault vault-0 -- vault auth enable kubernetes
kubectl exec -n vault vault-0 -- vault write auth/kubernetes/config \
  kubernetes_host=https://kubernetes.default.svc.cluster.local:443 \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
Phase 3: Set up External Secrets Operator (for parallel running)
Install External Secrets:
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace
Create a ClusterSecretStore (Vault connection):
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.vault.svc.cluster.local:8200"
      path: "secret"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "default"
Create an ExternalSecret to sync from Vault to K8s:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials  # Creates a K8s Secret with this name
    creationPolicy: Owner
  data:
    # Illustrative mapping: sync the "password" field of the Vault secret
    - secretKey: password
      remoteRef:
        key: prod/db-password
        property: password
This automatically syncs the Vault secret → Kubernetes Secret.
Pods use the K8s Secret as before, but the data comes from Vault.
Phase 4: Migrate secrets incrementally
Wave 1: Non-critical, staging secrets (Week 1)
- For each staging service:
- Create Vault path: /secret/data/staging/service-name/*
- Populate with secrets from K8s
- Deploy ExternalSecret
- Verify pods work with synced secrets
- Delete old K8s Secret
Wave 2: Non-critical, production secrets (Week 2)
- Same process, but with production services
- Schedule during low-traffic window
- Have rollback plan
Wave 3: Critical production secrets (Week 3-4)
- Take more time, extra validation
- A/B test: 50% of pods using Vault, 50% using K8s Secret
- Monitor error rates before full migration
PARALLEL STATE (safest): some services still use K8s Secrets, others sync from Vault via ExternalSecret. Both work simultaneously.
Phase 5: Set up audit logging in Vault
Audit devices are enabled at runtime via the Vault CLI, not via a ConfigMap:
kubectl exec -n vault vault-0 -- vault audit enable file file_path=/vault/logs/audit.log
Now all secret access is logged:
- Who accessed the secret
- When
- From what IP
- What action (read, create, delete)
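Pulling those fields out of one line of the file audit device looks roughly like this; the field paths follow Vault's JSON audit log format, but verify them against your Vault version before relying on them:

```python
import json

def summarize(entry_json: str) -> dict:
    """Extract who/when/what from a single Vault audit log entry (one JSON line)."""
    e = json.loads(entry_json)
    return {
        "time": e.get("time"),
        "actor": e.get("auth", {}).get("display_name"),
        "path": e.get("request", {}).get("path"),
        "operation": e.get("request", {}).get("operation"),
        "remote_address": e.get("request", {}).get("remote_address"),
    }
```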
Phase 6: Secret rotation policy
1. For database credentials (need coordinated rotation), use Vault's database secrets engine with a static role; rotation is native to the engine, so no custom CRD is needed (flag names below are worth verifying against your Vault version):
vault secrets enable database
vault write database/config/postgres \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/postgres" \
  allowed_roles="app-static" \
  username="root" \
  password="root-password"
# Rotate every 30 days
vault write database/static-roles/app-static \
  db_name=postgres \
  username="app_user" \
  rotation_period="720h" \
  rotation_statements="ALTER USER \"{{name}}\" WITH PASSWORD '{{password}}';"
Vault automatically:
1. Generates new password
2. Updates database
3. Updates secret in Vault
4. Pods pick up new secret via ExternalSecret refresh
- For API keys (can be rotated independently):
NEW_KEY=$(curl -H "Authorization: Bearer $STRIPE_TOKEN" https://api.stripe.com/v1/api_keys/generate | jq -r '.key')
kubectl exec -n vault vault-0 -- vault kv put secret/prod/stripe-api-key value="$NEW_KEY"
Phase 7: Access control with policies
Create fine-grained Vault policies:
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-policies
data:
  developer.hcl: |
    path "secret/data/staging/*" {
      capabilities = ["read", "list"]
    }
    # Developers can only read staging secrets, not production
  production-reader.hcl: |
    path "secret/data/prod/*" {
      capabilities = ["read"]
    }
    # Production services can read their secrets
  sre.hcl: |
    path "secret/*" {
      capabilities = ["read", "list", "create", "update", "delete"]
    }
    # SREs have full access
  admin.hcl: |
    path "*" {
      capabilities = ["create", "read", "update", "delete", "list", "sudo"]
    }
    # Only the cluster admin
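A toy model of how these policies gate a request. Real Vault adds policy templating, `+` segment wildcards, and deny precedence; this sketch only illustrates trailing-glob prefix matching against the patterns above:

```python
def allowed(policies, request_path, capability):
    """policies: {path pattern: [capabilities]}. "*" is treated as a trailing glob only."""
    for pattern, caps in policies.items():
        if pattern.endswith("*"):
            if request_path.startswith(pattern[:-1]) and capability in caps:
                return True
        elif request_path == pattern and capability in caps:
            return True
    return False

# The developer policy from above, as data
developer = {"secret/data/staging/*": ["read", "list"]}
```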
Phase 8: Monitoring and alerting
- alert: VaultSealed
  expr: vault_unsealed_status == 0
  for: 1m
  annotations:
    summary: "Vault is sealed - pods can't retrieve secrets!"
- alert: UnauthorizedVaultAccess
  expr: rate(vault_audit_log_error_total[5m]) > 0.1
  annotations:
    summary: "Unauthorized access attempts to Vault"
- alert: ExternalSecretSyncFailure
  expr: externalsecret_status_sync_failures_total > 0
  for: 5m
  annotations:
    summary: "ExternalSecret failed to sync from Vault"
Final state after migration:
OLD (Kubernetes Secrets only):
- Secrets scattered across namespaces
- No audit trail
- Hard to rotate
- Developers have broad Secret access via RBAC
NEW (Vault + ExternalSecret):
- Centralized secret management
- Full audit trail of all access
- Automatic rotation
- Fine-grained access control per service
- Secrets encrypted at rest in Vault
- Can use for non-Kubernetes systems too
Follow-up: How would you handle a situation where a developer needs to access a production secret in an emergency (e.g., to debug a production issue)? Design a "break glass" approval workflow in Vault.
A developer committed a secret (database password) to Git by accident. The commit is in your public GitHub repo. Yes, your pre-commit hooks were supposed to catch this, but they failed. Thousands of clones have happened. What's your response? How do you assess damage? What do you do in hours 0-1, 1-4, and beyond?
This is a critical incident. The secret is now visible to anyone who cloned the repo, and may be indexed by search engines or security scanners.
Hour 0 (immediate, first 5 minutes):
1. Confirm the leak:
git log --oneline | grep -i secret
git show <commit-sha>
2. Revoke the secret IMMEDIATELY:
- If database password: change it NOW. Example (PostgreSQL):
  ALTER USER app_user WITH PASSWORD 'new-secure-password-12345678';
- If API key: revoke it NOW
- If AWS credential: disable it NOW
- If OAuth token: invalidate it NOW
3. Notify stakeholders:
"Database password for production has been exposed in GitHub.
Password has been rotated. No unauthorized access detected yet.
Starting incident response."
Hour 0-1 (first hour):
1. Rewrite Git history (remove secret from all commits):
DO NOT do this lightly - it breaks clones for everyone
But it’s necessary to remove secret from GitHub history
Option A: BFG Repo-Cleaner (simpler):
bfg --delete-files production.env repo.git
cd repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
Option B: git filter-branch (more control):
git filter-branch -f --tree-filter 'rm -f production.env' HEAD
Force push to the remote (WARNING: breaks all existing clones):
git push origin --force --all
- Verify the secret is gone from GitHub history:
curl -s https://api.github.com/repos/company/repo/commits | jq -r '.[].sha' | \
while read sha; do
  git show $sha | grep -q "password=secret123" && echo "FOUND in $sha"
done
# Should find nothing now
- Monitor for secret use:
a) Database logs:
tail -f /var/log/postgresql/postgresql.log | grep -E 'auth.*fail|invalid.*password'
Any failed auth attempts from new IPs?
b) Application logs:
kubectl logs -f -n production -l app=myapp | grep -E 'connection.*refused|auth.*fail'
c) Cloud provider (AWS):
aws cloudtrail lookup-events --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIAIOSFODNN7EXAMPLE
# Check for unusual API calls
- Check if the secret was exposed to search engines:
- Google Cache: https://webcache.googleusercontent.com/…
- GitHub API history (check 3rd party services)
- Pastebin, GitLab mirrors, etc.
Hour 1-4 (first 4 hours):
1. Deep forensic investigation:
a) Who has access to the repo (collaborators)?
b) Who cloned the repo in the last 30 days?
c) When was the secret first exposed?
d) Is there evidence of the secret being used by attackers?
- Check GitHub audit logs: GitHub Settings → Security log. Look for unusual clone IPs and API calls from unknown locations.
- Query cloud provider audit logs:
aws cloudtrail get-event-selectors --trail-name production-trail
# Look for: unauthorized DB access, IAM policy changes, new users created
- Database forensics. Check live connections and logs for source IPs using the exposed credentials:
SELECT * FROM pg_stat_activity WHERE usename = 'app_user';
SHOW log_connections;  -- Verify connection logging is enabled
- Rotate all related secrets:
- New database password
- Update Kubernetes Secret
- Restart pods that use this secret
- Verify no connection errors in logs
- Update pre-commit hooks to PREVENT future incidents. Install git-secrets or similar:
git secrets --install
git secrets --register-aws
git secrets --add '(password|api[-_]?key|token)\s*='
Hour 4+ (ongoing):
1. Incident post-mortem:
- How did the secret get committed?
- Why did the pre-commit hooks fail?
- How do we prevent this in the future?
- What is the cost of this incident?
- Implement preventive measures:
a) Secret scanning in GitHub (enable the free tier): GitHub Settings → Security → Secret scanning. This alerts if secrets are detected in new commits.
b) Mandatory secret scanning in CI/CD:
- truffleHog: scans for secrets
- detect-secrets: discovers secrets in code
- gitleaks: finds secrets in git history
In CI/CD pipeline:
- Pre-push check (runs locally)
- Pre-commit check (blocks commit)
- CI check (blocks PR merge)
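At their core these scanners are regex engines over file contents. A minimal sketch with two illustrative rules (nowhere near the rulesets gitleaks or detect-secrets ship, which add entropy checks and hundreds of patterns):

```python
import re

# Illustrative detection rules; names and patterns are assumptions for this sketch
RULES = {
    "aws-access-key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic-assignment": re.compile(r"(?i)(password|api[-_]?key|token)\s*[:=]\s*\S+"),
}

def scan(lines):
    """Return (line number, rule name) for every match in the given lines."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for rule, rx in RULES.items():
            if rx.search(line):
                hits.append((lineno, rule))
    return hits
```

Wired into a pre-commit hook, a non-empty result blocks the commit.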
c) Access control:
- Restrict who can push to main/master
- Require PR reviews (including security review)
- Only protected branches allowed
- Monitoring going forward:
- GitHub secret scanning alerts
- Vault audit logs (if using Vault)
- Database connection anomalies
- API key usage patterns
Assessment of damage (depends on findings):
Scenario A: No unauthorized access detected
- Risk is low if the secret was rotated quickly
- The attacker likely didn't get the secret in time
- Action: document the incident, implement preventive measures
Scenario B: Evidence of unauthorized access
- Critical incident: assume full compromise
- Action: follow Phase 1-5 of the etcd breach playbook
- Rotate ALL secrets, investigate affected systems
Scenario C: Secret was indexed by search engines
- Search engines: request removal via Google Search Console
- Pastebin/mirrors: request removal
- Risk is higher, assume secret may have been accessed
- Rotate more aggressively
Follow-up: How would you design a developer experience where they never need to handle raw secrets in code or config? Design a secrets injection framework.
You're implementing a secrets rotation policy: all API keys rotate every 30 days, all database passwords rotate every 14 days. Your automation works fine, but you hit a problem: the new secret is created in Vault, but old services still use the old secret and can't connect. You have thousands of pods. Manual restart isn't scalable. How do you handle seamless rotation without service disruption?
Seamless secret rotation requires: (1) dual credential support during rotation, (2) rapid secret refresh in pods, (3) health checking to verify connectivity after rotation.
Challenge: Kubernetes doesn't automatically update environment variables or mounted files when a Secret changes. Pods keep using stale values until restarted.
Solution 1: Grace period with dual credentials
1. During rotation, both old and new credentials work:
a) Create the new credential in the target system (database, AWS, etc.)
   - PostgreSQL: CREATE USER app_user_new WITH PASSWORD 'new-password'
   - AWS: create a new access key
   - API provider: generate a new key
b) Update Vault with BOTH credentials:
vault kv put secret/prod/db-password \
  username=app_user_new \
  password=new-password \
  username_old=app_user_old \
  password_old=old-password
c) Pods read the new credential immediately (no restart needed if using a vault-agent sidecar)
d) Keep the old credential active for a 1 hour grace period
e) After the grace period, revoke the old credential:
ALTER USER app_user_old NOLOGIN; -- Disable the old user
Pod runs vault-agent sidecar
Vault-agent auto-refreshes secrets every 60 seconds
Application reads from templated file (updated in real-time)
Vault agent template (updates in real-time):
{{ with secret "secret/data/prod/db-password" -}}
export DB_USER="{{ .Data.data.username }}"
export DB_PASSWORD="{{ .Data.data.password }}"
{{ end }}
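On the client side, the grace period only helps if the application can use either credential. A sketch of the fallback behavior, where `connect` stands in for a real DB driver call that raises on bad credentials:

```python
def connect_with_fallback(connect, new_cred, old_cred):
    """Try the new credential first; during the grace period, fall back to the
    old one so pods work whichever side of the rotation they have seen."""
    try:
        return connect(new_cred)
    except PermissionError:
        # New credential not propagated yet (or rolled back): use the old one
        return connect(old_cred)
```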
Solution 2: Automatic pod restart via ExternalSecret rotation detection
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 5m  # Check Vault every 5 minutes
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials
    creationPolicy: Owner
    # IMPORTANT: add a label that changes with the secret content
    template:
      metadata:
        labels:
          secret-hash: "{{ .password | sha256sum | trunc 8 }}"
Deploy a controller that watches for label changes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: secret-rotation-watcher
data:
  watcher.py: |
    import hashlib
    import subprocess
    import time

    def get_secret_hash(secret_name, namespace):
        result = subprocess.run(
            ['kubectl', 'get', 'secret', secret_name, '-n', namespace,
             '-o', 'jsonpath={.data.password}'],
            capture_output=True)
        return hashlib.sha256(result.stdout).hexdigest()

    old_hash = None
    while True:
        current_hash = get_secret_hash('db-credentials', 'production')
        if old_hash and current_hash != old_hash:
            # Secret changed! Restart pods so they pick up the new value
            subprocess.run(['kubectl', 'rollout', 'restart',
                            'deployment/app', '-n', 'production'])
        old_hash = current_hash
        time.sleep(60)
Deploy the watcher:
kubectl apply -f secret-rotation-watcher-job.yaml
Solution 3: Service mesh integration (most elegant)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: database-mtls
spec:
  host: db.prod.svc.cluster.local
  trafficPolicy:
    tls:
      mode: MUTUAL  # mTLS
      clientCertificate: /etc/ssl/certs/client-cert.pem
      clientKey: /etc/ssl/certs/client-key.pem
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
        h2UpgradePolicy: UPGRADE
With mTLS via service mesh:
- Certificates are managed by Istio, auto-rotated
- No application code changes needed
- Pods don't restart; Istio handles cert refresh
- Connection pool handles transient failures during rotation
Solution 4: Health checks + graceful degradation
During rotation, some pods may temporarily fail:
- Implement retry logic in the application:
def connect_to_db(max_retries=3):
    for attempt in range(max_retries):
        try:
            conn = psycopg2.connect(
                dbname="mydb",
                user=os.getenv("DB_USER"),
                password=os.getenv("DB_PASSWORD"),
                host="db.prod.svc.cluster.local")
            return conn
        except psycopg2.OperationalError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
- A Kubernetes liveness probe checks connectivity:
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - psql -h db.prod.svc.cluster.local -U $DB_USER -d mydb -c "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
If pod fails liveness check (can’t connect with rotated secret),
Kubernetes automatically restarts it. On restart, it reads new secret from K8s Secret
(which was updated by ExternalSecret)
Solution 5: Zero-downtime rotation orchestration
apiVersion: batch/v1
kind: CronJob
metadata:
  name: secret-rotation-orchestrator
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rotator
              image: rotation-orchestrator:latest
              env:
                - name: GRACE_PERIOD_SECONDS
                  value: "3600"  # 1 hour grace period
                - name: BATCH_SIZE
                  value: "10"  # Restart 10 pods at a time
              command:
                - /bin/bash
                - -c
                - |
                  # Step 1: Create the new credential in Vault
                  vault kv put secret/prod/db-password username_new=$NEW_USER password_new=$NEW_PASS
                  # Step 2: ExternalSecret automatically syncs the new credential
                  sleep 60  # Wait for ExternalSecret to refresh
                  # Step 3: Restart pods in batches (with health checks between)
                  for deployment in app1 app2 app3; do
                    kubectl rollout restart deployment/$deployment -n production
                    kubectl rollout status deployment/$deployment -n production --timeout=5m
                  done
                  # Step 4: Grace period (keep both credentials active)
                  sleep $GRACE_PERIOD_SECONDS
                  # Step 5: Revoke the old credential
                  vault kv patch secret/prod/db-password username_old="" password_old=""
                  # PostgreSQL: ALTER USER app_user_old NOLOGIN;
Best practice checklist:
[] Use Vault + ExternalSecret (auto-refresh secrets)
[] Implement health checks for post-rotation verification
[] Grace period: keep old + new credentials during rotation
[] Batch pod restarts: avoid thundering herd
[] Monitor rotation process: alert on failures
[] Automatic rollback if rotation fails
[] Test rotation in staging first
[] Document rotation procedure for manual override
[] Audit all rotation events
Follow-up: What happens if a secret rotation fails midway (e.g., new credential creation fails, but old one was already revoked)? Design a rollback mechanism.
A developer on your team is asking for a way to access production secrets locally for debugging. Currently, it's impossible because secrets are only in Vault/Kubernetes. You want to enable this safely, but with strong guardrails: minimal privileges, temporary access, full audit trail, automatic expiry, and approval. Design a "dev escape hatch" for production secret access.
This is a critical security question: balancing operational agility (developers need to debug) with security (minimize exposure of production secrets).
Solution: Temporary elevated access with approval, TTL, and auditing Step 1: Developer requests temporary secret access
$ vault-breakglass request secret:prod/db-password --reason "Debug connection pool issue" --duration 1h
Request ID: bkg-2024-04-07-001
Status: PENDING Step 2: Approval workflow (goes to Slack/PagerDuty)
@on-call-sre please approve secret access for @alice
Reason: Debug connection pool issue
Secret: prod/db-password
Duration: 1 hour [APPROVE] [DENY] [APPROVE WITH RESTRICTIONS] Step 3: SRE reviews and approves
/approve bkg-2024-04-07-001 Step 4: Vault issues temporary credential
$ vault-breakglass retrieve bkg-2024-04-07-001
DB_PASSWORD=temporary-secret-only-valid-for-1-hour
Expires: 2024-04-07T20:00:00Z This credential:
- Works only for 1 hour
- Can only read (not modify)
- Is tagged in audit logs
- Will auto-revoke at expiry Step 5: Automatic cleanup
After 1 hour, Vault:
- Revokes the credential
- Logs the revocation
- Notifies the developer
- Removes access
Implementation using Vault + external tooling:
apiVersion: v1
kind: ConfigMap
metadata:
  name: breakglass-policy
data:
  breakglass.hcl: |
    # Only used for temporary breakglass access
    path "secret/data/prod/*" {
      capabilities = ["read"]
      # Note: NO write, delete, or admin capabilities
    }
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: breakglass-approver
  namespace: vault-admin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: breakglass-approver
rules:
# Approvers can read requests and record decisions on their status
- apiGroups: ["vault.internal"]
  resources: ["breakglassrequests", "breakglassrequests/status"]
  verbs: ["get", "list", "update", "patch"]
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: breakglassrequests.vault.internal
spec:
  group: vault.internal
  names:
    kind: BreakglassRequest
    plural: breakglassrequests
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
Architecture:
Developer → Request tool → Approval workflow → Vault issues temporary credential → Automatic expiry
The BreakglassRequest CRD above tracks each request through this flow.
Approval workflow implementation (CLI tool):
#!/bin/bash
# vault-breakglass: request and retrieve temporary prod secret access
case $1 in
request)
  SECRET=$2
  REASON=$3
  DURATION=${4:-3600}

  # Validate inputs
  if [[ ! $SECRET =~ ^secret/prod/ ]]; then
    echo "ERROR: Can only request prod secrets"
    exit 1
  fi
  if [[ $DURATION -gt 3600 ]]; then
    echo "ERROR: Max duration is 1 hour"
    exit 1
  fi

  # Create a Kubernetes object for the request
  kubectl apply -f - <<EOF
apiVersion: vault.internal/v1
kind: BreakglassRequest
metadata:
  name: bkg-$(date +%s)-$RANDOM
  namespace: vault-admin
spec:
  requesterEmail: $(git config user.email)
  secretPath: $SECRET
  reason: "$REASON"
  durationSeconds: $DURATION
EOF

  # Notify approvers via Slack webhook
  curl -X POST $SLACK_WEBHOOK -d '{
    "text": "New breakglass request",
    "blocks": [
      {"type": "section", "text": {"type": "mrkdwn", "text": "Secret: '$SECRET'\nReason: '$REASON'\nDuration: '$DURATION's"}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Approve"}, "value": "approve", "action_id": "breakglass_approve"},
        {"type": "button", "text": {"type": "plain_text", "text": "Deny"}, "value": "deny", "action_id": "breakglass_deny"}
      ]}
    ]
  }'
  ;;
retrieve)
  REQUEST_ID=$2

  # Check if approved
  STATUS=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.status.approved}')
  if [[ $STATUS != "true" ]]; then
    echo "ERROR: Request not approved"
    exit 1
  fi

  # Check if expired
  EXPIRES=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.status.expiresAt}')
  if [[ $(date +%s) -gt $(date -d "$EXPIRES" +%s) ]]; then
    echo "ERROR: Request has expired"
    exit 1
  fi

  # Issue a short-lived, read-only Vault token tagged with the request ID
  DURATION=$(kubectl get breakglassrequest $REQUEST_ID -n vault-admin -o jsonpath='{.spec.durationSeconds}')
  vault token create -format=json -policy=breakglass -ttl="${DURATION}s" \
    -metadata="request_id=$REQUEST_ID" | jq -r '.auth.client_token'
  ;;
esac
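The /approve step needs a counterpart that records the decision on the BreakglassRequest object, since the retrieve path above reads .status.approved and .status.expiresAt. A hedged sketch; kubectl is stubbed here so the snippet runs standalone, and the status fields are assumptions matching the CLI above:

```shell
#!/bin/sh
# Sketch: the approve step as a status patch on the BreakglassRequest object.
# kubectl is stubbed for illustration; remove the stub to run this against a
# real cluster (kubectl >= 1.24 for --subresource=status).
kubectl() { echo "kubectl $*"; }   # stub: prints the command instead of running it

REQUEST_ID="bkg-2024-04-07-001"
DURATION=3600
# Compute the expiry timestamp (GNU date): now + duration, in RFC 3339 UTC
EXPIRES=$(date -u -d "@$(( $(date +%s) + DURATION ))" +%Y-%m-%dT%H:%M:%SZ)

kubectl patch breakglassrequest "$REQUEST_ID" -n vault-admin \
  --type=merge --subresource=status \
  -p "{\"status\":{\"approved\":true,\"expiresAt\":\"$EXPIRES\"}}"
```

Recording the expiry at approval time (rather than request time) keeps the TTL clock from starting before the SRE has acted.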
Audit and compliance:
Enable a file audit device (Vault audit devices are enabled via the CLI/API, not a Kubernetes ConfigMap):
vault audit enable file file_path=/var/log/vault/audit.log
Note: leave log_raw at its default (false); log_raw=true writes secret values to the audit log in plaintext, which defeats the purpose here.
All breakglass access is logged:
- Who requested
- What secret
- When
- For how long
- Who approved
- What they actually read/did
- When it expired
Monthly audit report (jq over the file audit device's JSON log; -s slurps the entries into an array so group_by works):
jq -s 'map(select(.auth.metadata.breakglass == "true"))
  | group_by(.auth.display_name)
  | map({user: .[0].auth.display_name, accesses: length})' /var/log/vault/audit.log
Compliance and guardrails:
[] Max duration: 1 hour
[] Reason: required and audited
[] Approval: required (no self-approval)
[] Scope: read-only access
[] Audit: all access logged
[] Auto-expiry: credential revoked after TTL
[] Alerting: unusual patterns trigger alerts
Unusual patterns alert (Prometheus rules; the metric names assume the breakglass tooling exports them):
- alert: HighBreakglassUsage
  expr: sum(rate(vault_breakglass_requests_total[5m])) * 60 > 5
  annotations:
    summary: "More than 5 breakglass requests/min (unusual activity)"
- alert: LongDurationBreakglass
  expr: vault_breakglass_duration_seconds > 3600
  annotations:
    summary: "Breakglass request exceeded max duration"
Usage examples:
# Request prod database password for 30 minutes
vault-breakglass request secret:prod/db-password --reason "Debug slow queries" --duration 1800
# List pending requests (for approvers)
kubectl get breakglassrequests -n vault-admin
# Retrieve the credential (works for 30 min, then auto-expires)
vault-breakglass retrieve bkg-2024-04-07-001
Follow-up: How would you detect if a developer abused their temporary breakglass access (e.g., exfiltrated secrets, modified data, or accessed secrets they shouldn't)? Design a detection system.
You're switching from storing database passwords in environment variables to using a secrets manager (Vault). The problem: existing deployments have the old passwords in ConfigMaps and environment variables. You can't delete them until all pods are restarted. But if you restart all pods at once, your cluster goes down. Design a safe migration where you run both old and new secrets in parallel.
Secrets migration requires zero-downtime switchover. You can't cut over abruptly.
Strategy:
Phase 1: Deploy Vault and sync secrets (both old and new active)
Phase 2: Update 10% of pods to read from Vault
Phase 3: Monitor, validate, then expand to 50%, then 100%
Phase 4: Delete old ConfigMaps/env vars once 100% migrated
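The phase percentages above translate into concrete pod counts per rollout step. A small sketch (TOTAL is an assumed fleet size for illustration; the division rounds up so every phase moves at least one pod):

```shell
#!/bin/sh
# Sketch: pods moved to the Vault-backed Secret in each canary phase.
pods_for_phase() {               # usage: pods_for_phase <total> <percent>
  echo $(( ($1 * $2 + 99) / 100 ))   # integer ceiling of total*percent/100
}

TOTAL=40                         # assumed fleet size
for pct in 10 50 100; do
  echo "phase ${pct}%: $(pods_for_phase "$TOTAL" "$pct")/${TOTAL} pods on Vault"
done
```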
Implementation:
Original deployment uses env vars:
spec:
containers:
- name: app
env:
- name: DB_PASSWORD
valueFrom:
configMapKeyRef:
name: db-creds
key: password
Step 1: Deploy ExternalSecret alongside:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials-vault
spec:
secretStoreRef:
name: vault
target:
name: db-credentials-from-vault
data:
- secretKey: password
remoteRef:
key: prod/db-password
Step 2: Canary: 10% of pods use Vault secret
spec:
containers:
- name: app
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-credentials-from-vault # NEW: from Vault
key: password
Step 3: Monitor error rates, database connection errors, latency
If issues arise: revert by scaling the canary back
Step 4: Gradually expand to 25%, 50%, 100%
Step 5: Delete old ConfigMaps
kubectl delete configmap db-creds
Validation during each phase:
- Application logs show no connection errors
- Database connection pool healthy
- Vault audit logs show secret access
- No increase in error rate
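A phase should expand only when these checks pass. A minimal promotion-gate sketch, where error_rate is a hypothetical stand-in for a real metrics query (e.g. against Prometheus):

```shell
#!/bin/sh
# Sketch: promotion gate between canary phases. The rollout expands only
# while the error rate stays under the threshold.
error_rate() { echo "0.2"; }     # stand-in: % failed requests in last 5m

gate_ok() {                      # usage: gate_ok <rate> <threshold>
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r+0 < t+0) }'
}

THRESHOLD="1.0"
rate=$(error_rate)
if gate_ok "$rate" "$THRESHOLD"; then
  echo "gate passed: error rate ${rate}% < ${THRESHOLD}%, expand canary"
else
  echo "gate failed: hold rollout and investigate"
fi
```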
Follow-up: What happens if a pod becomes unable to reach Vault during this migration? Design a fallback mechanism.
You manage secrets for 50 microservices. Each service has 5-10 secrets (API keys, database credentials, OAuth tokens, etc.). Developers keep asking for access to specific secrets ("I need the Stripe key to debug payment processing"). You can't give them production access directly, but denying breaks their productivity. Design a secure but frictionless model.
Balance: security without friction. Developers need enough access to be productive, but with guardrails.
Model:
- Developers read staging secrets freely (low security risk)
- Production secrets: read-only access via audit logs
- Sensitive operations (modify secrets): require approval
- Access automatically revokes after TTL
Tiered access:
TIER 1: Developer (Staging)
- Read all staging secrets
- No TTL (permanent)
- No approval needed
TIER 2: Developer (Production, Debug)
- Read-only, with TTL; logged, no approval (auto-grant policy)
TIER 3: Developer (Production, Modify)
- Requires approval from SRE; only then can they update
TIER 4: Ops/SRE (All)
Implementation with Vault (the --ttl flag and request subcommand are wrapper tooling around vault, not stock vault CLI):
# Dev reads a staging password: works immediately, no approval
vault kv get secret/staging/myapp/db-password
# Dev requests a production password (read-only): works with TTL, logged, no approval (auto-grant policy)
vault kv get secret/prod/myapp/db-password --ttl=24h
# Dev wants to rotate a production password: requires approval from SRE, only then can they update
vault request change-secret secret/prod/myapp/db-password
Vault policy:
# Developers can read staging freely
path "secret/staging/*" {
  capabilities = ["read", "list"]
}

# Read-only production (no create/update); the 24h limit is enforced
# via the token TTL, since ACL policy paths do not carry a ttl attribute
path "secret/prod/*" {
  capabilities = ["read", "list"]
}

# Modifications require approval (granted via a separate, approval-gated policy)
path "secret/prod/*/modify" {
  capabilities = ["update"]
}
Monitoring suspicious patterns:
- Dev reading same secret 100x in 1 hour (exfiltration attempt?)
- Dev reading secrets outside their team
- Dev requesting modifications without reason
- Off-hours access patterns
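The first pattern, repeated reads of one secret by one user, can be checked directly from an access log. A sketch assuming a simple "user secret-path" line format (a sample log is generated so it runs standalone; in practice you would feed it real audit entries):

```shell
#!/bin/sh
# Sketch: flag exfiltration-like patterns, i.e. more than THRESHOLD reads of
# the same secret by the same user within the window the log covers.
# The one-read-per-line "user secret-path" format is an assumption.
cat > /tmp/audit-sample.log <<'EOF'
alice secret/prod/myapp/db-password
alice secret/prod/myapp/db-password
alice secret/prod/myapp/db-password
bob secret/prod/other/api-key
EOF

THRESHOLD=2
awk -v t="$THRESHOLD" '
  { count[$1 " " $2]++ }                 # tally reads per (user, secret) pair
  END { for (k in count) if (count[k] > t) print "ALERT:", k, "reads=" count[k] }
' /tmp/audit-sample.log
```

In production the threshold would be per time window (e.g. 100 reads/hour, matching the first bullet above) and the alert would feed the same channel as the Prometheus rules.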
Follow-up: How do you prevent developers from exfiltrating production secrets they're allowed to read?