Kubernetes Interview Questions

RBAC and Admission Controllers


Friday 3 PM: A developer accidentally runs `kubectl delete namespace production` with the wrong context. The entire production namespace—all 200 deployments, services, and data—is deleted. By the time you notice (15 minutes later), all pods are terminating. How do you recover, and what guardrails do you put in place to prevent this?

This is a catastrophic human error scenario. Recovery depends on your etcd backup strategy and RBAC controls. Immediate response is critical.

Phase 1: Immediate recovery (first 5 minutes)
1. Check whether the namespace is still terminating:
kubectl get namespace production -o yaml 2>&1
# If the namespace still exists but is Terminating, some content may not be gone yet
2. Check etcd backup recency:
ls -lh /var/lib/etcd/backup/ | tail -5   # Most recent backup?
3. If you have a recent backup, restore the etcd snapshot:
BACKUP_TIME=$(date -d '5 minutes ago' +%Y-%m-%d_%H-%M-%S)
etcdctl snapshot restore /var/lib/etcd/backup/etcd-backup-$BACKUP_TIME.db \
  --data-dir=/var/lib/etcd.bak
# Don't restore in place yet; use a separate directory for testing
4. Test the restore against a test etcd instance:
etcdctl --endpoints=127.0.0.1:2379 get /registry/namespaces/production
# Namespaces live under the default /registry prefix; verify the data is there

Phase 2: Full recovery (5-30 minutes)
Option A: If etcd is still running (and you caught it early):
1. Know the limits: once a namespace has a deletionTimestamp, the API server will not un-delete it. Clearing finalizers (kubectl patch namespace production -p '{"metadata":{"finalizers":[]}}' --type=merge) actually lets the deletion complete faster — it is the standard fix for a namespace *stuck* in Terminating, not a recovery tool.
2. Practical takeaway: Option A buys you nothing for recovery. Move straight to Option B (etcd restore) or Git-based re-apply.

Option B: Restore from etcd backup (safer long-term solution):
1. Stop all kube-apiserver instances. On kubeadm-style clusters the API server is a static pod, not a Deployment, so move its manifest aside on each control-plane node:
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
2. Back up the current (broken) etcd data:
cp -r /var/lib/etcd /var/lib/etcd.broken
3. Restore from backup:
rm -rf /var/lib/etcd
mv /var/lib/etcd.bak /var/lib/etcd
chown -R etcd:etcd /var/lib/etcd
systemctl restart etcd
4. Restart kube-apiserver by putting the static pod manifest back:
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
5. Verify recovery:
kubectl get ns production
kubectl get pods -n production | wc -l

Prevention: Multi-layer guardrails
Layer 1: RBAC - Restrict delete permissions on namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "create", "update", "patch"]
- apiGroups: ["apps"]   # Deployments live in the apps group, not core
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list"]   # Explicitly NO delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: production-admin
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  resourceNames: ["production"]   # Only the production namespace
  verbs: ["*"]                    # Granted to very few people
Layer 2: Admission Controller - Prevent namespace deletion
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prevent-namespace-deletion
webhooks:
- name: prevent-ns-delete.internal
  clientConfig:
    service:
      name: webhook-service
      namespace: kube-system
      path: "/validate-ns-deletion"
    caBundle: ...
  rules:
  - operations: ["DELETE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["namespaces"]
  admissionReviewVersions: ["v1"]
  sideEffects: None
  timeoutSeconds: 2
Webhook logic:
def validate_namespace_deletion(request):
    namespace = request.namespace
    # Block deletion of protected namespaces
    if namespace in ["production", "kube-system", "kube-public"]:
        # Allow only if the user has the "allow-dangerous-namespace-delete" permission
        if not request.user.has_permission("allow-dangerous-namespace-delete"):
            return deny("Production namespace deletion requires explicit approval")
    return allow()
Layer 3: Audit Logging - Track dangerous operations
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse   # Audit at the highest level for namespace deletions
  omitStages:
  - RequestReceived
  resources:
  - group: ""
    resources: ["namespaces"]
  verbs: ["delete", "deletecollection"]
Layer 4: Notification - Alert on dangerous actions
- alert: NamespaceDeletion
  expr: kubernetes_audit_delete_namespace == 1
  for: 0m
  annotations:
    summary: "CRITICAL: Namespace deleted by {{ $labels.user }}"
    runbook: "IMMEDIATE ACTION: Check if this was intentional. If not, trigger recovery."
Layer 5: Context Management - Prevent wrong-cluster accidents
# ~/.kube/config
clusters:
- cluster:
    server: https://prod-api.internal
  name: production-us-east-1   # Cluster-specific context naming
contexts:
- context:
    cluster: production-us-east-1
    namespace: default
    user: production-admin
  name: prod-us-east-1-admin
- context:
    cluster: staging-us-west-1
    namespace: default
    user: staging-admin
  name: staging-us-west-1-admin

Use explicit naming: kubectl --context prod-us-east-1-admin delete ns …

Avoid ambiguous context names

Layer 6: Git-based recovery

Keep all resources in Git

git clone https://github.com/company/kubernetes-manifests
cd kubernetes-manifests/production

Re-apply everything

kubectl apply -R -f .

With GitOps (ArgoCD), recovery is automatic

kubectl apply -f argocd-application.yaml

ArgoCD continuously reconciles desired state from Git

Follow-up: How do you design a recovery procedure that your team can execute under stress (weekend, 3 AM)? Walk through a runbook.

You have three user personas on your cluster: developers (need to deploy to staging), SREs (need cluster-wide observability), and platform engineers (full cluster access). A developer currently has too much power: they can accidentally (or maliciously) read secrets from other namespaces or delete critical infrastructure. Design a least-privilege RBAC model that's maintainable for 50+ teams.

RBAC design at scale requires hierarchy, templating, and clear role separation.

Phase 1: Define role hierarchy
Level 1 (Namespace-scoped): Team developers - Can: deploy, scale, restart pods in their namespace - Cannot: access secrets, change RBAC, delete namespace

Level 2 (Namespace-scoped): Team lead / Platform on-call

  • Can: do everything Level 1 + access logs, view secrets, restart services
  • Cannot: change RBAC, delete namespace, scale to zero

Level 3 (Cluster-scoped): SRE / Cluster Admin

  • Can: do everything
  • Cannot: accidentally delete critical system namespaces (via admission control)

Phase 2: Template-based RBAC
apiVersion: v1
kind: Namespace
metadata:
  name: team-data-platform
  labels:
    team: data-platform
    tier: production
---
# Auto-generate these roles via templating
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-data-platform
rules:
# Deployments: can modify but not create/delete
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments/scale"]
  verbs: ["update", "patch"]
# Pods: can restart via delete, port-forward for debugging
# (note the log subresource is "pods/log", singular)
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["pods/portforward"]
  verbs: ["create"]
# ConfigMaps (but NOT secrets)
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
# No rules at all for secrets, RBAC, or the namespace itself


apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-lead
  namespace: team-data-platform
rules:
# Everything the developer can do (bind leads to both roles), plus:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]   # Read-only access to secrets
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["get", "list"]   # Can see who has access, but not change it

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-observability
rules:
# View everything cluster-wide
- apiGroups: [""]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
# Exec for debugging. Note: RBAC rules have no "namespaces" field — to limit
# exec to kube-system and monitoring, bind a namespaced Role in those two
# namespaces instead of granting pods/exec in a ClusterRole.
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
# Access metrics
- apiGroups: ["metrics.k8s.io"]
  resources: ["*"]
  verbs: ["get", "list"]

Phase 3: Automate role binding via namespace labels
Create a controller that auto-assigns roles based on namespace labels:

for each namespace:
  if labels["team"] == "data-platform":
    ensure RoleBinding exists:
      roleRef: Role/developer
      subjects: Group/data-platform-developers
  if labels["tier"] == "production":
    add constraint: "minimum 2 team leads must approve changes"
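The controller loop above can be sketched in Python. This is a minimal, self-contained illustration of the reconcile step only — the function names, label keys, and group-naming convention (`<team>-developers`, `<team>-leads`) are assumptions, not a real controller framework or client library:

```python
# Sketch of a label-driven RBAC controller's reconcile step (illustrative
# names, no Kubernetes client): map namespace labels to desired RoleBindings.
def desired_bindings(namespace_name, labels):
    """Return the RoleBindings a namespace should have, based on its labels."""
    bindings = []
    team = labels.get("team")
    if team:
        # Convention (assumed): one developer and one lead group per team
        bindings.append({"namespace": namespace_name, "role": "developer",
                         "group": f"{team}-developers"})
        bindings.append({"namespace": namespace_name, "role": "team-lead",
                         "group": f"{team}-leads"})
    return bindings

def reconcile(namespaces, existing):
    """Diff desired vs. existing bindings; return the ones to create."""
    desired = []
    for ns, labels in namespaces.items():
        desired.extend(desired_bindings(ns, labels))
    key = lambda b: (b["namespace"], b["role"], b["group"])
    have = {key(b) for b in existing}
    return [b for b in desired if key(b) not in have]

to_create = reconcile(
    {"team-data-platform": {"team": "data-platform", "tier": "production"}},
    existing=[],
)
print([b["group"] for b in to_create])  # → ['data-platform-developers', 'data-platform-leads']
```

A real implementation would watch Namespace events and apply the diff with a Kubernetes client, but the diff logic is the part worth unit-testing.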

Phase 4: Cross-namespace access patterns
Pattern A: SRE needs to view all pods across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-read-all
rules:
- apiGroups: [""]
  resources: ["pods", "services", "events"]
  verbs: ["get", "list", "watch"]
# Bound via ClusterRoleBinding: no namespace restriction = all namespaces

Pattern B: Developer needs access to a shared services namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: shared-services-consumer
  namespace: shared-services
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-access
  namespace: shared-services
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: shared-services-consumer
subjects:
- kind: Group
  name: data-platform-developers
  apiGroup: rbac.authorization.k8s.io

Phase 5: Audit and validation
1. Verify effective permissions:
kubectl auth can-i create deployments \
  --as=system:serviceaccount:team-data-platform:developer -n team-data-platform
# yes
kubectl auth can-i delete secrets \
  --as=system:serviceaccount:team-data-platform:developer -n team-data-platform
# no
2. List all roles in a namespace:
kubectl get roles,rolebindings -n team-data-platform
3. Find which cluster roles grant access to secrets (ClusterRoles are not namespaced, and bindings have no rules — inspect the roles themselves):
kubectl get clusterroles -o json | \
  jq '.items[] | select(.rules[]? | select(.resources[]? | contains("secrets"))) | .metadata.name'

Phase 6: Common pitfalls and fixes
Pitfall 1: Overly permissive wildcards
BAD:  resources: ["*"], verbs: ["*"]  # Grants everything
GOOD: resources: ["pods", "services"], verbs: ["get", "list", "watch"]
Pitfall 2: Forgetting the resourceNames restriction
BAD:  resources: ["secrets"], verbs: ["get"]  # Can read ANY secret in scope
GOOD: resources: ["secrets"], resourceNames: ["app-config"], verbs: ["get"]  # Only one named secret
Pitfall 3: Implicit admin access
BAD:  Giving cluster-admin to every SRE
GOOD: Grant minimal roles, and use temporary elevated access via sudo-like break-glass tooling

Follow-up: How would you handle a developer who needs temporary elevated access (e.g., to debug production)? Design an approval and auditing workflow.

You're implementing a ValidatingWebhook that enforces: "All production pods must have resource requests and limits." A bug in the webhook causes it to reject ALL pods cluster-wide (including system pods). Your cluster becomes unusable—no new pods can be created. How do you debug and recover?

A broken ValidatingWebhook can completely freeze a cluster. Recovery requires careful steps to restore functionality without triggering the buggy webhook again.

Phase 1: Immediate diagnosis
1. Check webhook status:
kubectl get validatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration enforce-resource-limits
2. Check webhook pod logs:
kubectl logs -n kube-system -l app=resource-webhook --tail=100
# API server responses will contain "failed calling webhook" or similar
3. Check API server logs (the API server is a static pod; select it by label rather than guessing the pod name):
kubectl logs -n kube-system -l component=kube-apiserver | grep -i webhook

Phase 2: Immediate recovery (option A - disable webhook)
1. Delete the webhook configuration:
kubectl delete validatingwebhookconfiguration enforce-resource-limits
2. This usually succeeds even mid-outage: the webhook's rules match pods, not webhook configurations. If an overly broad rule (resources: ["*"]) blocks even this deletion, patch the webhook's failurePolicy to Ignore, or as a last resort remove the object directly from etcd.
3. Verify recovery:
kubectl run test-pod --image=alpine --restart=Never -- sleep 3600   # Should be admitted now

Phase 3: Recover with caution (option B - modify webhook temporarily)
1. Patch the webhook to only apply to opted-in namespaces and to exclude system ones (namespaces carry the well-known kubernetes.io/metadata.name label):
kubectl patch validatingwebhookconfiguration enforce-resource-limits --type='json' -p='[
  {"op": "replace", "path": "/webhooks/0/namespaceSelector", "value": {
    "matchLabels": {"enforce-limits": "true"},
    "matchExpressions": [
      {"key": "kubernetes.io/metadata.name", "operator": "NotIn",
       "values": ["kube-system", "kube-public"]}
    ]
  }}
]'
2. This prevents the webhook from affecting system namespaces, reducing blast radius.
3. Now you can deploy fixes more safely.

Phase 4: Root cause analysis
1. Check the webhook source code for bugs:
grep -A 10 'def validate' webhook.py   # What's the actual validation logic?
2. Look for:
- Incorrect JSON schema matching
- Typos in field names
- Logic inversion (deny when it should allow)
- Unhandled exceptions (defaulting to reject)

Phase 5: Proper webhook design to prevent this
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: enforce-resource-limits
webhooks:
- name: enforce.resources.internal
  clientConfig:
    service:
      name: resource-webhook
      namespace: kube-system
      path: "/validate-resources"
  failurePolicy: Ignore   # KEY: if the webhook fails, don't block the cluster
  sideEffects: None
  timeoutSeconds: 2
  namespaceSelector:
    matchLabels:
      enforce-limits: "true"   # Only apply to opted-in namespaces
  rules:
  - operations: ["CREATE"]
    apiGroups: ["apps", ""]
    apiVersions: ["v1"]
    resources: ["deployments", "pods", "statefulsets"]
  admissionReviewVersions: ["v1"]

      Key safeguards:
      1. `failurePolicy: Ignore` - If webhook crashes, allow the request
      2. `namespaceSelector` - Only apply to specific namespaces
      3. `timeoutSeconds: 2` - Kill slow webhooks
      4. Exclude system namespaces from webhook logic

Phase 6: Webhook implementation best practices
def validate_resources(admission_review):
    try:
        pod = admission_review.request.object

        # Skip validation for system namespaces
        if pod.metadata.namespace in ["kube-system", "kube-public", "kube-node-lease"]:
            return allow()

        # Only validate production namespaces
        if not pod.metadata.namespace.startswith("prod-"):
            return allow()

        # Check for resource requests/limits on every container
        for container in pod.spec.containers:
            if not container.resources.requests or not container.resources.limits:
                return deny("Missing resource requests/limits")

        return allow()
    except Exception as e:
        # Log but don't block on error
        logger.error(f"Webhook error: {e}")
        return allow()  # IMPORTANT: fail open, not fail closed

Phase 7: Testing before deployment
1. Unit tests for webhook logic:

def test_allows_pod_with_resources():
    pod = Pod(resources={"requests": {}, "limits": {}})
    assert validate_resources(pod) == allow()

def test_allows_system_namespace():
    pod = Pod(namespace="kube-system")
    assert validate_resources(pod) == allow()

def test_denies_pod_without_resources():
    pod = Pod(resources={})
    assert validate_resources(pod) == deny()

2. Integration tests in a staging cluster:
kubectl create namespace test-webhook-staging
# Deploy the webhook to staging
# Try various pod types and verify behavior
3. Canary deployment:
- Deploy the webhook to only 10% of namespaces first
- Monitor for errors
- Gradually increase scope

Follow-up: How would you deploy an admission webhook update that changes behavior in a backward-incompatible way? Design a safe migration strategy.

Your MutatingWebhook (for sidecar injection) modifies 90% of pod requests. It's working correctly, but you notice in metrics that API server latency spiked 40% after deployment. The webhook is slow. How do you troubleshoot, optimize, and avoid performance degradation?

Admission webhooks run synchronously in the API server critical path. A slow webhook blocks cluster operations. This is a performance emergency.

Phase 1: Identify the bottleneck
1. Measure webhook latency:
kubectl logs -n kube-system -l component=kube-apiserver | grep webhook | grep -E 'latency|duration'
# Better: chart the apiserver_admission_webhook_admission_duration_seconds metric
2. Check webhook service endpoints:
kubectl get endpoints -n kube-system resource-webhook # Is it ready? How many replicas?
3. Check webhook pod logs for errors/slowness:
kubectl logs -n kube-system -l app=webhook --all-containers=true --tail=500 | tail -50
Look for: network timeouts, database queries, API calls

Phase 2: Quick fixes (immediate, <5 minutes)
1. Increase webhook replicas:
kubectl scale deployment resource-webhook -n kube-system --replicas=5
# More parallelism = faster overall
2. Increase the webhook's resource limits:
kubectl set resources deployment resource-webhook -n kube-system \
  --limits=cpu=2,memory=1Gi --requests=cpu=500m,memory=256Mi
3. Adjust timeoutSeconds to fail faster:
kubectl patch validatingwebhookconfiguration enforce-resource-limits --type='json' \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 1}]'
# The admissionregistration.k8s.io/v1 default is 10s; combined with
# failurePolicy: Ignore this caps the worst-case stall
4. Measure improvement:
watch -n 1 'kubectl top pods -n kube-system -l app=webhook'   # Monitor CPU/memory usage
kubectl top nodes                                             # Is the bottleneck node-level?

Phase 3: Root cause analysis
1. Profile the webhook with pprof (for Go webhooks):
kubectl port-forward -n kube-system svc/resource-webhook 6060:6060 &
curl 'http://localhost:6060/debug/pprof/profile?seconds=30' > webhook.prof
go tool pprof webhook.prof   # `top10` shows the hottest functions
2. Check for external dependencies:
- Database queries? (Add connection pooling)
- API calls? (Add caching)
- File I/O? (Use a ramdisk or local cache)
- Regex evaluation? (Pre-compile patterns)

Phase 4: Optimize webhook logic
BEFORE (slow):
def validate_pod(request):
    # Hits the API server once per label
    for label in pod.metadata.labels:
        user = apiserver.get_user(label)  # SLOW: API call per label
        if not user.is_authorized():
            return deny()
    return allow()

AFTER (fast):
def validate_pod(request):
    # Cache authorization data with a 60-second TTL
    authorized_labels = cache.get("authorized-labels")
    if not authorized_labels:
        authorized_labels = apiserver.list_users()
        cache.set("authorized-labels", authorized_labels, ttl=60)

    for label in pod.metadata.labels:
        if label not in authorized_labels:
            return deny()
    return allow()
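The fast path above leans on a small TTL cache. The `cache` object is not specified in the snippet, so here is one self-contained way to build it (a sketch, not a production cache — no locking, no size bound):

```python
import time

class TTLCache:
    """Tiny TTL cache for webhook lookups: entries expire after `ttl` seconds.

    `now` parameters allow injecting a clock for deterministic tests;
    production callers just omit them.
    """
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._store[key]  # Lazy eviction on read
            return None
        return value

    def set(self, key, value, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now + ttl)

cache = TTLCache()
cache.set("authorized-labels", {"team-a", "team-b"}, ttl=60, now=0.0)
print(cache.get("authorized-labels", now=30.0))  # → {'team-a', 'team-b'} (fresh)
print(cache.get("authorized-labels", now=61.0))  # → None (expired)
```

A 60-second TTL trades at most a minute of staleness for removing an API round-trip from the admission hot path.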

Phase 5: Use objectSelector to reduce webhook invocations
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: enforce-resource-limits
webhooks:
- name: enforce.resources.internal
  clientConfig:
    service:
      name: resource-webhook
      namespace: kube-system
      path: "/validate-resources"
  # Only trigger for pods that carry this label
  objectSelector:
    matchLabels:
      validate-resources: "true"

Instead of validating 90% of all pods, only validate opt-in ones. This can reduce webhook invocations from 90% to perhaps 5-10%.

Phase 6: Use webhooks sparingly
1. Prefer a MutatingWebhook for defaulting: MutatingWebhooks run first and can fix problems (e.g. inject default resource limits) before validation would reject them.
2. Use namespace-level filtering:
namespaceSelector:
  matchLabels:
    webhook: enabled
# Only run the webhook in production namespaces, not dev/staging
3. Defer work to a background controller:
- The webhook does minimal validation (fast path)
- A background controller reconciles the rest asynchronously

Phase 7: Caching and performance tuning
1. Cache admission decisions: if pod A was admitted and pod B has an identical structure, reuse the decision. Cache key: hash(pod.spec).

cache = {}
def validate_pod(pod):
    key = hash(pod.spec)
    if key in cache:
        return cache[key]  # Fast hit
    decision = expensive_validation(pod)
    cache[key] = decision
    return decision

2. Batch backend work: admission requests arrive one at a time, but the backend lookups they trigger can be batched or pre-fetched instead of issued per request.
3. Prefer background reconciliation over blocking validation for non-critical checks:
- Synchronous (webhook): block pod creation until the decision
- Asynchronous (controller): admit the pod, validate in the background, alert if invalid

Monitoring to prevent future incidents:
- alert: WebhookLatency
  expr: histogram_quantile(0.95, rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])) > 1
  for: 5m
  annotations:
    summary: "Webhook p95 latency is {{ $value }}s"
    runbook: "Troubleshoot webhook performance"
- alert: WebhookRejections
  expr: rate(apiserver_admission_webhook_rejection_count[5m]) > 0.01
  annotations:
    summary: "Webhook rejection rate {{ $value }}/sec"

Follow-up: Design a webhook that doesn't block the cluster even if it fails or becomes slow. What's the architecture?

You're implementing RBAC for a multi-team cluster. Developer Alice from Team A should NOT see Secrets in Team B's namespace. But she found a way to view them: `kubectl exec` into a Team B pod, then read the mounted secret from inside the container. Your RBAC doesn't prevent this. How do you secure it?

This is a fundamental RBAC gap: RBAC controls Kubernetes API access, but doesn't control what happens inside containers. Once a pod is running, the developer can access whatever is mounted or exposed via environment variables.

Problem analysis:
1. Current RBAC (insufficient):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "create"]
# Combined with a pods/exec grant somewhere, this lets Alice kubectl exec
# into pods. Once inside, no RBAC prevents reading mounted secrets.

Solution Layer 1: Restrict pods/exec access via RBAC
RBAC is purely additive — there is no deny rule, and an empty verbs list grants nothing rather than "explicitly denying." The fix is simply to never grant pods/exec to developers:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "create"]
# No pods/exec rule at all: exec is simply not granted
---
# Only SREs can exec into pods
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-pod-exec
rules:
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]

Solution Layer 2: Service mesh for credentials
Don't mount long-lived credentials as environment variables or files. Let the mesh manage them:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT   # Enforce mTLS
# mTLS certificates are managed and rotated by the service mesh, not mounted
# as Kubernetes secrets. Even if Alice execs into the app container, the
# certificates live in the sidecar's memory, not on a readable volume.

Solution Layer 3: Use an admission controller to gate exec
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: restrict-exec
webhooks:
- name: restrict-exec.internal
  clientConfig:
    service:
      name: exec-restrictor
      namespace: kube-system
  rules:
  # exec/attach are CONNECT operations on the pods/exec subresource
  - operations: ["CONNECT"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods/exec"]
  namespaceSelector:
    matchLabels:
      tier: production
Webhook logic:
def validate_exec(request):
    pod = request.object   # The pod being exec'd into
    user = request.user

    # Allow exec only for users with explicit permission
    if not user.has_permission("pods/exec", pod.metadata.namespace):
        return deny("exec requires explicit RBAC permission")

    # Further: allow exec only into debug pods (labels live on metadata)
    if "debug" not in (pod.metadata.labels or {}):
        return deny("exec only allowed on debug pods")

    return allow()

Solution Layer 4: Encrypt secrets at rest + in transit
1. Encrypt secrets at rest in etcd:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: ...
  - identity: {}
# Secrets are encrypted in etcd, so even with access to the node's storage
# they cannot be read directly.

2. Use an external secrets manager (External Secrets Operator syntax):
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: team-a
spec:
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: db-credentials-secret
    creationPolicy: Owner
  data:
  - secretKey: password
    remoteRef:
      key: secret/data/team-a/db-password
# Secrets live in HashiCorp Vault, not in Kubernetes;
# Vault handles access control and auditing.

Solution Layer 5: Pod Security Admission (PSA)
PodSecurityPolicy was removed in Kubernetes 1.25; use the built-in Pod Security Admission instead. Enforce the restricted profile per namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted
# "restricted" forbids privileged pods and requires runAsNonRoot, preventing
# Alice from running a container as root and tampering with mounts.

Solution Layer 6: Network policies to isolate teams
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: team-a-isolation
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}   # Only from team-a pods
  egress:
  - to:
    - podSelector: {}   # Only to team-a pods
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53          # DNS only

# Even if Alice execs into a Team B pod, she can't reach Team A services.

Comprehensive fix (all layers):
1. RBAC: No pods/exec permission for developers
2. Admission: Validate exec attempts against policy
3. Secrets: Use an external secrets manager (Vault, Sealed Secrets)
4. Network: Isolate teams with NetworkPolicies
5. Pod Security: Restrict pod capabilities
6. Encryption: Encrypt secrets at rest and in transit
7. Audit: Log all exec attempts and secret accesses

Follow-up: Design a "break glass" access procedure where Alice can get temporary access to Team B's secrets in emergencies. How do you audit and revoke this?

You're migrating from RBAC-based access control to OIDC + attribute-based access control (ABAC). Your goal: developers from different teams automatically get the right permissions based on their OIDC claims (team: data-platform, role: engineer). How do you design and test this migration without breaking cluster access?

Migration from static RBAC to dynamic OIDC-based access requires careful planning. One wrong move and you lock out all users.

Phase 1: Architecture design
The OIDC provider sends claims:
{
  "sub": "alice@company.com",
  "groups": ["data-platform-engineers"],
  "team": "data-platform",
  "role": "engineer"
}

Kubernetes reads these claims and maps them to RBAC subjects:
group "data-platform-engineers" → RoleBinding in the team-data-platform namespace

ClusterRoleBinding in Kubernetes:
roleRef: ClusterRole/engineer
subjects:
- kind: Group
  name: "oidc:data-platform-engineers"   # note the configured oidc: prefix
  apiGroup: rbac.authorization.k8s.io

Phase 2: Set up OIDC on kube-apiserver
kube-apiserver \
  --oidc-issuer-url=https://auth.company.com \
  --oidc-client-id=kubernetes \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups \
  --oidc-groups-prefix=oidc: \
  --oidc-ca-file=/etc/kubernetes/oidc-ca.pem
# With --oidc-groups-prefix=oidc:, every RBAC group subject must use the
# "oidc:" prefix, as above.

Phase 3: Create role mappings
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: data-platform-engineers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: developer
subjects:
- kind: Group
  name: "oidc:data-platform-engineers"   # matches the prefixed groups claim
  apiGroup: rbac.authorization.k8s.io
---
# Create the ClusterRole for engineers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Phase 4: Test the migration (pilot with one user)
1. Configure kubectl to use OIDC (the exec plugin and its flags depend on your OIDC helper, e.g. kubelogin; shown here illustratively):
kubectl config set-cluster k8s --server=https://api.internal
kubectl config set-credentials alice-oidc \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl-oidc-login \
  --exec-arg=--token-url=https://auth.company.com/token \
  --exec-arg=--client-id=kubernetes
2. Test with pilot user Alice:
kubectl --user=alice-oidc get pods
# Should work if OIDC is configured correctly
3. Verify Alice's groups are recognized:
kubectl auth can-i list deployments \
  --as=alice@company.com --as-group=oidc:data-platform-engineers
# yes

Phase 5: Parallel running (old and new simultaneously)
Keep the existing certificate-based RBAC running while testing OIDC:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-old-cert   # Old cert-based identities
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: developer
subjects:
- kind: Group
  name: "developers"          # group from the client cert's O= field
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-new-oidc   # New OIDC-based identities
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: developer
subjects:
- kind: Group
  name: "oidc:data-platform-engineers"
  apiGroup: rbac.authorization.k8s.io

# Both cert and OIDC identities now work simultaneously;
# developers can test OIDC without breaking their cert-based access.

Phase 6: Cutover with an escape hatch
1. Week 1-2: Announce OIDC availability; teams test voluntarily
2. Week 3: Require OIDC for all new access (onboard new hires with OIDC only)
3. Week 4: Deprecate cert-based access
4. Week 5: Full cutover; remove cert-based RoleBindings

But ALWAYS keep an escape hatch:
- One static ServiceAccount (or break-glass client cert) with admin permissions
- Used only in emergencies (all OIDC broken)

Phase 7: Group synchronization from OIDC
Use a controller to keep RBAC in sync with OIDC groups:

apiVersion: v1
kind: ConfigMap
metadata:
  name: oidc-group-mappings
data:
  mappings: |
    data-platform-engineers:
      role: developer
      namespace: team-data-platform
    sre-oncall:
      role: sre
      clusterwide: true

Controller logic:
for each OIDC group in mappings:
  if role != current_rolebinding:
    update_rolebinding(group, role)
  if namespace != current_target:
    update_namespace(group, namespace)
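The mapping step of that controller can be sketched concretely. This is illustrative only — the mappings are passed as a plain dict (a real controller would parse the ConfigMap YAML), and `bindings_for` is an assumed name, not a real library call:

```python
# Sketch of the sync controller's mapping step: given the group mappings from
# the ConfigMap, compute the binding objects that should exist.
OIDC_GROUPS_PREFIX = "oidc:"  # must match --oidc-groups-prefix on the apiserver

def bindings_for(mappings):
    out = []
    for group, spec in mappings.items():
        subject = OIDC_GROUPS_PREFIX + group
        if spec.get("clusterwide"):
            out.append({"kind": "ClusterRoleBinding",
                        "role": spec["role"], "group": subject})
        else:
            out.append({"kind": "RoleBinding",
                        "namespace": spec["namespace"],
                        "role": spec["role"], "group": subject})
    return out

mappings = {
    "data-platform-engineers": {"role": "developer",
                                "namespace": "team-data-platform"},
    "sre-oncall": {"role": "sre", "clusterwide": True},
}
for b in bindings_for(mappings):
    print(b)
```

Centralizing the prefix in one constant matters: if it drifts from the apiserver flag, every group-based binding silently stops matching.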

Phase 8: Audit and monitoring
- alert: OIDCAuthFailure
  expr: rate(apiserver_authentication_failure_total{type="oidc"}[5m]) > 0.01
  annotations:
    summary: "OIDC auth failures detected"

API server audit logs should capture the mapped identity:
{
  "user": {
    "username": "alice@company.com",
    "uid": "oidc-uuid",
    "groups": ["oidc:data-platform-engineers"]
  },
  "verb": "get",
  "objectRef": { "resource": "pods" }
}

Common pitfalls:
1. Lockout: If OIDC is misconfigured, all users get locked out. Always test with pilot first.
2. Group claim missing: If OIDC provider doesn’t send groups claim, group-based RBAC fails.
3. Certificate still preferred: If both cert and OIDC are configured, certificate takes precedence. Explicitly remove old certs during cutover.
4. Custom claims: Different OIDC providers use different claim names (groups, team, org). Map them in kube-apiserver config.

Follow-up: How would you handle a situation where an OIDC provider goes down? Design a failover mechanism that keeps the cluster accessible.

You have 200 microservices on your cluster. A security audit requires: "All ServiceAccounts must have minimal permissions." Currently many use wildcard permissions. Manually fixing each one would take months. Design an automated approach to convert your cluster to least-privilege.

Large-scale RBAC migration requires automation and careful verification. Manual approach won't scale.

Strategy:
1. Audit current permissions: Export all roles/rolebindings to inventory
2. Identify highest-risk ServiceAccounts (those with wildcards, secrets access)
3. Test least-privilege changes in staging first
4. Gradual rollout with canary approach
5. Monitor for permission errors, iterate

Implementation:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rbac-migration
spec:
  schedule: "0 0 * * 1"   # Weekly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: migrate
            image: rbac-migrator:latest
            env:
            - name: BATCH_SIZE
              value: "10"   # Migrate 10 at a time
            command:
            - /bin/sh
            - -c
            - |
              # Find ServiceAccounts bound to admin-ish roles
              kubectl get clusterrolebindings -o json | \
                jq -c '.items[]
                       | select(any(.subjects[]?; .kind=="ServiceAccount"))
                       | select(.roleRef.name | contains("admin"))' | \
                head -n "$BATCH_SIZE" | \
                while read -r binding; do
                  # If the migration test passed, apply the new least-privilege binding
                  # If not, skip for manual review
                  :
                done
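The audit step inside that job — finding roles with wildcard permissions — is worth having as testable logic rather than a jq one-liner. A sketch that consumes the JSON shape produced by `kubectl get clusterroles -o json` (the sample data below is made up for illustration):

```python
# Flag ClusterRoles whose rules use wildcards in resources or verbs.
def has_wildcard(rule):
    return "*" in (rule.get("resources") or []) or "*" in (rule.get("verbs") or [])

def wildcard_roles(clusterroles_json):
    """Return sorted names of roles with at least one wildcard rule."""
    return sorted(
        item["metadata"]["name"]
        for item in clusterroles_json.get("items", [])
        if any(has_wildcard(r) for r in item.get("rules") or [])
    )

sample = {"items": [
    {"metadata": {"name": "app-reader"},
     "rules": [{"resources": ["configmaps"], "verbs": ["get"]}]},
    {"metadata": {"name": "legacy-admin"},
     "rules": [{"resources": ["*"], "verbs": ["*"]}]},
]}
print(wildcard_roles(sample))  # → ['legacy-admin']
```

Running this against the live export first gives you the inventory the strategy's step 1 calls for, before any binding is touched.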

Least-privilege template library:
Create pre-built minimal roles for common services (databases, logging, monitoring):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: app-reader
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]

Verification and monitoring:
- Track permission-denied errors after each batch
- Audit effective permissions daily
- Keep rollback possible for failed migrations

Follow-up: How would you test least-privilege permissions before deploying them to production without breaking existing services?

You're implementing RBAC for a platform serving 100+ teams. Each team needs: namespace isolation, role inheritance (Lead can do what Developer can + more), and audit trails. Manual RBAC config per team is unmaintainable. Design a templated, scalable RBAC system.

Scalable RBAC requires templates, automation, and clear hierarchy. Can't manually create 300+ roles (100 teams x 3 roles each).

Design:
1. Define role hierarchy: Developer < Lead < Admin
2. Template RBAC generation (using Helm, Kustomize, or custom controller)
3. Bootstrap per-team namespace with standard roles
4. Use labels and selectors to manage at scale

Implementation with Helm template:
apiVersion: v1
kind: Namespace
metadata:
  name: team-{{ .Values.teamName }}
  labels:
    team: {{ .Values.teamName }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-{{ .Values.teamName }}
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-lead
  namespace: team-{{ .Values.teamName }}
rules:
# Team lead: read everything in the namespace; bind the developer Role
# as well to inherit its write access
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers
  namespace: team-{{ .Values.teamName }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: developer
subjects:
- kind: Group
  name: "team-{{ .Values.teamName }}-developers"
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-leads
  namespace: team-{{ .Values.teamName }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: team-lead
subjects:
- kind: Group
  name: "team-{{ .Values.teamName }}-leads"
  apiGroup: rbac.authorization.k8s.io
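The value of templating is that 100 teams means one template, not 300 hand-written roles. A language-agnostic sketch of the same idea without Helm (the manifest shape here is deliberately abbreviated — it shows the naming convention, not a full YAML document):

```python
from string import Template

# Illustrative stand-in for the Helm chart: render per-team RBAC identifiers
# from one template, following the team-<name>-<role> group convention.
MANIFEST = Template(
    "namespace: team-$team\n"
    "roles: [developer, team-lead]\n"
    "groups: [team-$team-developers, team-$team-leads]\n"
)

def render_team(team):
    """Render the per-team summary from the shared template."""
    return MANIFEST.substitute(team=team)

teams = ["data-platform", "payments", "search"]
rendered = [render_team(t) for t in teams]
print(rendered[0].splitlines()[0])  # → namespace: team-data-platform
```

Whether the renderer is Helm, Kustomize, or a controller, the key property is the same: onboarding team 101 is a one-line values change, and a template fix propagates to every team on the next sync.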

Automation: Controller that creates RBAC on namespace creation
When namespace with label team=X is created:
- Instantiate role template for team-X
- Create role bindings linking OIDC groups
- Sync audit logging

Cross-team sharing:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: shared-services-consumer
  namespace: shared-services
rules:
- apiGroups: [""]
  resources: ["services", "endpoints"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: all-teams-read
  namespace: shared-services
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: shared-services-consumer
subjects:
- kind: Group
  name: "all-developers"   # An umbrella group the IdP populates with every team's developers
  apiGroup: rbac.authorization.k8s.io

Audit and compliance reporting:
- Monthly report: who has access to what
- Permission drift detection
- Unused roles cleanup
- Quarterly review for least-privilege validation

Follow-up: How would you onboard a new team quickly while ensuring they follow security best practices?
