You're a platform engineer onboarding 50 teams (Finance, HR, Marketing, Engineering) onto a shared Kubernetes cluster. Each team needs independent deployments, but you must prevent: team A's workload from seeing team B's secrets, team A consuming all cluster CPU, team A modifying team B's deployments. Design the multi-tenancy isolation model from scratch. You have 1 week.
Kubernetes multi-tenancy has three layers: logical (namespaces), policy (RBAC/NetworkPolicy), and resource (quotas/limits). A robust model uses all three. Architecture: One namespace per team (team-finance, team-marketing, etc.). Within namespace: Deployments, Services, Secrets, ConfigMaps all isolated by default (no cross-namespace access without explicit permission).
Layer 1 - Namespace isolation: Create namespaces: kubectl create namespace team-finance. Namespaces isolate API objects, not network traffic: Secrets, ConfigMaps, and Deployments in one namespace are invisible to another (etcd is a single store, but RBAC scopes reads to the namespace). Network traffic is NOT isolated by default—a pod in team-finance can reach a service in team-hr until you apply a NetworkPolicy (Layer 4).
Layer 2 - RBAC: Each team gets a service account and role: kubectl create sa team-finance -n team-finance. Bind a role allowing only that namespace: kind: Role, metadata.namespace: team-finance, rules: [{apiGroups: [""], resources: ["pods", "services"], verbs: ["get", "list", "create", "update"]}]. Bind with RoleBinding (namespace-scoped). This ensures team member can't exec into pods in other namespaces or view team-hr deployments.
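Written out as manifests, the Role and RoleBinding look roughly like this—a minimal sketch; the team-finance-devs group name is an assumption about how your identity provider maps users to groups:

```yaml
# Namespace-scoped Role: grants access only within team-finance
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-finance-edit
  namespace: team-finance
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
# RoleBinding is namespace-scoped, so these permissions never leak to other namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-finance-edit
  namespace: team-finance
subjects:
- kind: ServiceAccount
  name: team-finance
  namespace: team-finance
- kind: Group
  name: team-finance-devs        # assumption: your IdP puts team members in this group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-finance-edit
  apiGroup: rbac.authorization.k8s.io
```

Sanity check: kubectl auth can-i get pods -n team-hr --as=system:serviceaccount:team-finance:team-finance should answer no.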
Layer 3 - Resource quotas: Apply a namespace quota: kind: ResourceQuota, spec: {hard: {requests.cpu: "50", requests.memory: "100Gi", pods: "200"}} on the team-finance namespace. This caps the team's total CPU requests at 50 cores and memory requests at 100Gi—if exceeded, new pods are rejected at admission (note this bounds requests, not actual usage; limits bound usage). Also set a LimitRange: limits: [{max: {cpu: "4", memory: "8Gi"}, min: {cpu: "100m", memory: "128Mi"}, type: "Pod"}] to prevent one pod from claiming all 50 cores.
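As full manifests, the quota and per-pod ceiling look like this (numbers from the text; tune per team):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-finance-quota
  namespace: team-finance
spec:
  hard:
    requests.cpu: "50"        # total CPU requests across the namespace
    requests.memory: 100Gi
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-finance-limits
  namespace: team-finance
spec:
  limits:
  - type: Pod                 # per-pod ceiling: no single pod can consume the whole quota
    max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: 100m
      memory: 128Mi
```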
Layer 4 - NetworkPolicy: By default, Kubernetes is all-open (pods can reach any pod). Apply deny-all egress: kind: NetworkPolicy, spec: {podSelector: {}, policyTypes: ["Ingress", "Egress"], egress: []} in each namespace to deny all inter-namespace traffic. Then allow only necessary: egress: [{to: [{namespaceSelector: {matchLabels: {name: "kube-system"}}}], ports: [{port: 53, protocol: UDP}]}]—allow only DNS queries to kube-system. This prevents team-finance pods from reaching team-hr pods.
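The two policies as manifests—a sketch; the kube-dns pod label and the kubernetes.io/metadata.name namespace label (set automatically since Kubernetes 1.21) are assumptions to verify against your cluster:

```yaml
# Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: team-finance
spec:
  podSelector: {}             # empty selector = all pods in this namespace
  policyTypes: ["Ingress", "Egress"]
---
# Re-allow DNS so service discovery keeps working
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-finance
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns   # assumption: verify your DNS pods carry this label
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP           # DNS falls back to TCP for large responses
```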
Implementation: Use a template for every team: 1 Namespace + 1 ServiceAccount + 1 Role + 1 RoleBinding + 1 ResourceQuota + 1 LimitRange + 2 NetworkPolicies (deny-all + allow-dns). Automate with Helm or Kustomize. Provide teams self-service: publish template, teams submit PR with team name, platform team merges (runs terraform or kustomize to apply).
Gotcha prevention: (1) Don't use NetworkPolicy alone—it's not enforced by all CNIs (flannel doesn't support it; Calico and Cilium do). Verify which CNI you run by inspecting its daemonset: kubectl get daemonset -n kube-system (look for calico-node, cilium, or kube-flannel). (2) Secrets are readable by cluster admins (and sit in etcd in plaintext unless you enable encryption at rest). Use a secrets management system (HashiCorp Vault, AWS Secrets Manager) and inject secrets via an init container or sidecar, mounted read-only. (3) PersistentVolumes are cluster-scoped—control which StorageClass each namespace may use via admission policy, since a PVC can otherwise reference any class.
Follow-up: Team Finance accidentally creates a ServiceAccount with cluster-admin role. How would you detect this and what automated guardrails would you put in place?
A developer from team-marketing accidentally (or maliciously) runs: kubectl get secrets -A and sees all 50 teams' database passwords. RBAC should have blocked this, but didn't. Audit log shows they used their personal kubeconfig. Your CEO is asking what happened.
RBAC failed because the user's kubeconfig likely carries cluster-admin (a common mistake: a dev gets full access to debug, then retains it forever). Note that kubectl get secrets -A is a cluster-scoped list—a namespace-scoped Role cannot grant it; the user must hold a ClusterRole bound via ClusterRoleBinding. Even a seemingly narrow ClusterRole with rules: [{apiGroups: [""], resources: ["secrets"], verbs: ["get", "list"]}] and no resourceNames restriction allows reading ANY secret in ANY namespace.
Audit the incident: Enable audit logging on the API server: --audit-policy-file=/etc/kubernetes/audit-policy.yaml --audit-log-path=/var/log/kubernetes/audit.log. Check the logs—audit entries are JSON, and kubectl get secrets -A appears as a list with no namespace in the requestURI, not as a literal -A: grep '"resource":"secrets"' /var/log/kubernetes/audit.log | grep '"verb":"list"'. Look for the user identity: "user": {"username": "dev@company.com"}. Then check RBAC: kubectl get rolebindings -A -o wide | grep dev@company.com and kubectl get clusterrolebindings -o wide | grep dev@company.com. If they have cluster-admin, that's the root cause.
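A minimal audit policy that captures who touches secrets without logging their contents—a sketch; rules are evaluated in order and the first match wins:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log secret access at Metadata level: records user, verb, and object name,
# but never the secret payload itself
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Keep noise down: don't log routine reads of everything else
- level: None
  verbs: ["get", "list", "watch"]
# Catch-all for writes and other verbs
- level: Metadata
```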
Investigation steps: (1) Who provisioned this access? Check git history for kubeconfig distribution or IAM assignments. (2) How long have they had access? Audit logs will show—if 6 months, they might have accessed other secrets before detection. (3) Did they exfiltrate? Check egress logs (network flow logs) to see if they scp'd secrets to external server.
Immediate containment: (1) Revoke the access: kubectl delete clusterrolebinding dev-admin or remove the subject from the RoleBinding. (2) Rotate all secrets they accessed: database passwords, API keys, certificates. (3) Invalidate their credential. Caveat: Kubernetes has no client-certificate revocation (the API server checks no CRL/OCSP), so if the kubeconfig uses a client cert, removing the user's RBAC bindings is the only practical revocation short of rotating the CA; if it uses a static token, delete it from the API server's token file (--token-auth-file) and restart the API server. (kubectl config unset users.dev-user only edits a local kubeconfig—it revokes nothing server-side.) Prefer short-lived OIDC tokens going forward.
Prevention architecture: (1) Use role-based kubeconfig generation. Instead of distributing static kubeconfig, use kubectl auth can-i get secrets to test permissions before issuing access. (2) Implement namespace-scoped roles: kind: Role, rules: [{apiGroups: [""], resources: ["secrets"], verbs: ["get"], resourceNames: ["my-secret-only"]}]—this allows reading only a specific secret. (3) Use API server webhook for audit: --audit-webhook-config-file sends audit logs to external SIEM (Datadog, Splunk). Alert on get secrets -A: if any user runs this, immediate notification. (4) Encrypt secrets at rest: --encryption-provider-config with AES encryption—even if someone steals etcd, secrets are encrypted.
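The at-rest piece is an EncryptionConfiguration passed to the API server via --encryption-provider-config—a sketch; the key below is a placeholder (generate your own, e.g. head -c 32 /dev/urandom | base64):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  # aescbc encrypts secrets on write; prefer a kms provider in production
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded-32-byte-key>   # placeholder, never commit a real key
  # identity last: existing unencrypted secrets stay readable until rewritten
  - identity: {}
```

After enabling, rewrite existing secrets so they are stored encrypted: kubectl get secrets -A -o json | kubectl replace -f -.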
Follow-up: You've encrypted secrets at rest and implemented audit logging. Now you need to ensure developers can rotate their own secrets without needing cluster-admin. What mechanism would you use?
Team A's batch job runs for i in $(seq 1 10000); do kubectl run pod-$i --image=batch-worker; done (there is no kubectl create pod subcommand; the image name is illustrative) and consumes 90% of cluster CPU in 60 seconds, causing all other teams' apps to become unresponsive. Your ResourceQuota wasn't effective. Why?
ResourceQuota limits total resource requests, but CPU throttling is orthogonal. If team A's pods have requests.cpu: 1m (trivial), 10K pods only consume 10 CPU requests. But actual CPU usage (not requests) can spike. ResourceQuota doesn't enforce actual usage, only requests+limits.
Root cause: Pod created without CPU limits. Check pod: kubectl get pods -n team-a -o yaml | grep -A 5 resources—if no limits.cpu, the pod is unbounded. When it runs CPU-intensive workload, it grabs all available cores (Linux scheduling gives leftover CPU to any process that wants it). With 10K pods running concurrently, cluster CPU is saturated.
Verify: Run kubectl top pods -n team-a --sort-by=cpu—you'll see each pod consuming massive CPU (e.g., 2 cores each). Sum them: 10K pods * 2 cores = 20K cores of demand on a 64-core cluster—the kernel time-slices the available cores across all that demand, every workload is throttled, and not-yet-scheduled pods pile up pending.
Quick fix: Add LimitRange to team-a namespace with limits.cpu: 500m per pod max: kind: LimitRange, spec: {limits: [{max: {cpu: "500m"}, type: "Pod"}]}. This caps each pod's CPU at 0.5 cores. Now 10K pods use max 5K cores total, still saturating cluster but more predictable. Existing pods must be deleted and recreated to inherit new limit.
Long-term: Implement CPU limit enforcement. Require every pod to declare requests (e.g., requests.cpu: 100m) and limits, and validate with an admission webhook: reject pods without limits. Use Kyverno or OPA Gatekeeper—in Gatekeeper Rego, roughly: violation { container := input.review.object.spec.containers[_]; not container.resources.limits.cpu } (Gatekeeper wraps the admission request as input.review, not input.request)—blocks any pod with a container lacking a CPU limit.
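The same check packaged as a Gatekeeper ConstraintTemplate—a sketch, assuming Gatekeeper v3-style templates; pair it with a RequireCPULimits constraint scoped to the namespaces you care about:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requirecpulimits
spec:
  crd:
    spec:
      names:
        kind: RequireCPULimits
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package requirecpulimits

      # Flag every container that does not declare a CPU limit
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        not container.resources.limits.cpu
        msg := sprintf("container %v has no CPU limit", [container.name])
      }
```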
Also add QoS class awareness: pods with Requests == Limits get Guaranteed QoS (highest priority, last to evict). Pods with Requests < Limits get Burstable (medium). Pods with no Requests get BestEffort (evicted first). Configure kubelet eviction thresholds: --eviction-hard=memory.available<100Mi—the eviction ordering itself is built into the kubelet (BestEffort pods and pods exceeding their requests go first; there is no --eviction-sort-order flag). Note that eviction covers incompressible resources (memory, disk); CPU contention is handled by CFS throttling instead. Either way, team A's unbounded pods are throttled or killed first when the cluster is under pressure.
Follow-up: You've added LimitRange and admission webhooks. Team A creates 10K pods with 500m CPU each, consuming exactly their quota. But they're still starving other teams. What's not accounted for?
Your 50-team cluster uses NetworkPolicy to prevent cross-namespace traffic. Team Finance needs to pull a Docker image from an internal registry (team-infra maintains it). The registry is in kube-system namespace. Finance team's NetworkPolicy is blocking the registry access. You get 50 support tickets.
The deny-all policy is doing exactly what it says—first pin down who is actually blocked. Image pulls are performed by the kubelet and container runtime on the host network, so pod-level NetworkPolicy does not block kubelet pulls (see the gotcha below). What deny-all egress does block is traffic from the pods themselves: in-cluster build jobs pushing to the registry, apps calling the registry API, and, unless explicitly re-allowed, even cluster DNS.
Clarify the issue: Is the registry inside the cluster (in kube-system) or external (e.g., docker.io)? If internal: DNS query for registry.kube-system.svc.cluster.local succeeds (DNS uses port 53), but TCP 443 (HTTPS) to registry pod is blocked by NetworkPolicy. If external: egress to external IPs is blocked entirely.
Solution for internal registry: Allow egress from team-finance to kube-system registry service: egress: [{to: [{namespaceSelector: {matchLabels: {name: "kube-system"}}, podSelector: {matchLabels: {app: "registry"}}}], ports: [{port: 443, protocol: TCP}]}]. This explicitly allows team-finance pods to reach registry pods in kube-system on port 443. Apply to all team namespaces.
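As a full manifest for team-finance—a sketch; the app: registry pod label and the namespace label are assumptions to match against whatever team-infra actually deploys:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-registry-egress
  namespace: team-finance
spec:
  podSelector: {}             # applies to every pod in team-finance
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          app: registry       # assumption: registry pods carry this label
    ports:
    - port: 443
      protocol: TCP
```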
Solution for external registry: Allow egress to external IP ranges: egress: [{to: [{ipBlock: {cidr: "0.0.0.0/0", except: ["10.0.0.0/8", "169.254.0.0/16"]}}], ports: [{port: 443}, {port: 80}]}]—allows external internet (0.0.0.0/0) except private IPs and link-local. This lets teams pull from docker.io but blocks internal IPs.
Automate: Create a shared NetworkPolicy template for all teams that includes: (1) Allow DNS to kube-system: {to: [{namespaceSelector: {matchLabels: {name: "kube-system"}}, podSelector: {matchLabels: {k8s-app: "kube-dns"}}}]}. (2) Allow metrics scraping from monitoring namespace: {to: [{namespaceSelector: {matchLabels: {name: "monitoring"}}}]} for Prometheus. (3) Allow external egress for docker.io and other approved registries. Provide as base template so all teams inherit these rules automatically.
Gotcha: Image pulls are driven by the kubelet/container runtime directly (not from inside a pod). The runtime runs on the host network and isn't constrained by pod NetworkPolicy, so it can always reach registries regardless of pod policy. But nodes may sit behind corporate proxies—set HTTP_PROXY/HTTPS_PROXY in the container runtime's environment (e.g., a systemd drop-in for containerd); the kubelet has no --env flag for this.
Follow-up: You've allowed egress to kube-system. Now team Finance is accessing team HR's services through an exposed service in kube-system. How would you tighten the policy?
You're monitoring cluster for multi-tenancy violations. A pod in team-marketing somehow calls into team-finance's database pod (they're in different namespaces). Your NetworkPolicy should block this but doesn't. Upon inspection, the pod labels don't match your policy rules. How would you debug and prevent this?
NetworkPolicy is label-based, and selection must hold on the protected end. The policy guarding team-finance selects the database pod via podSelector—if the db pod's labels don't match, no policy applies to it, and a pod selected by no NetworkPolicy is non-isolated (ingress wide open). Check the destination pod first: kubectl get pods -n team-finance db-pod -o yaml | grep -A 5 labels, then compare against the policy's podSelector; do the same for any egress policy in team-marketing.
Example: NetworkPolicy says podSelector: {matchLabels: {app: "marketing-app"}}. But pod was created with --labels="tier=frontend" (no app label). The pod doesn't match the selector, so the policy doesn't apply. This is a common operational mistake: deployment doesn't set labels, or labels are typos.
Diagnosis: Test policy matching manually: kubectl -n team-marketing get pods -L app shows which pods carry the app label. Check the deployment spec: kubectl get deployment team-app -n team-marketing -o yaml | grep -A 10 labels. Verify spec.template.metadata.labels includes the labels the NetworkPolicy selects on.
Prevent recurrence: Implement admission control. Use OPA Gatekeeper to enforce label rules, one Rego rule per required label: violation { not input.review.object.metadata.labels.app } plus a second rule violation { not input.review.object.metadata.labels.version } (Gatekeeper exposes the admission request as input.review, and each violation body must be its own rule). This rejects any pod missing app or version labels at creation.
Also run a periodic compliance scan (a CronJob, not a DaemonSet—this is one cluster-wide check, not a per-node one): kubectl get pods -A -o yaml | check_labels.sh, alerting when pods lack required labels or don't match namespace policy. For remediation: automatically tag non-conforming pods: kubectl patch pod pod-name -n namespace -p '{"metadata":{"labels":{"quarantine":"true"}}}', then track them for manual review.
NetworkPolicy testing: Create test pods in team-marketing and verify they can't reach team-finance pods. Use kubectl run test-pod -n team-marketing --image=curlimages/curl -- sleep 3600, then kubectl exec -n team-marketing test-pod -- curl --max-time 5 http://team-finance-db.team-finance.svc.cluster.local:5432 (cross-namespace access needs the FQDN)—it should time out or be refused. If it succeeds, the policy is not working (or the CNI doesn't enforce it).
Follow-up: You've enforced labels and policies work in testing. But in prod, a pod with labels still reaches another namespace's service. The CNI is Calico. What would you check next?
You've successfully isolated 50 teams for 6 months. Now you need to enable cross-team collaboration: team-finance wants to use a shared ML inference service run by team-infra. Design how to allow this without breaking isolation.
Add explicit allow-list rules to NetworkPolicy instead of deny-all. Create a shared service in a dedicated namespace (team-infra-shared or platform namespace). Then allow ingress from team-finance pods to that service only.
Architecture: (1) Deploy ML service in team-infra-shared namespace. (2) Label the service pod: app: ml-inference, shared: "true". (3) In team-infra-shared, create NetworkPolicy allowing ingress: kind: NetworkPolicy, ingress: [{from: [{namespaceSelector: {matchLabels: {allow-shared: "true"}}}], ports: [{port: 5000}]}]. (4) Label team-finance namespace: kubectl label namespace team-finance allow-shared=true. (5) In team-finance, update NetworkPolicy to allow egress to team-infra-shared on port 5000.
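Steps (3)-(5) on the provider side as a manifest (port 5000 and the labels come from the text; allow-shared is the opt-in namespace label consumers must carry):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-shared-consumers
  namespace: team-infra-shared
spec:
  podSelector:
    matchLabels:
      app: ml-inference       # only the shared service, not everything in the namespace
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          allow-shared: "true"   # consumer namespaces opt in via this label
    ports:
    - port: 5000
      protocol: TCP
```

Revoking a consumer is then a single command: kubectl label namespace team-finance allow-shared-.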
Alternatively, use service mesh (Istio, Linkerd) to manage access declaratively: Instead of NetworkPolicy, define PeerAuthentication (mTLS) and AuthorizationPolicy. Example: kind: AuthorizationPolicy, spec: {selector: {matchLabels: {app: "ml-inference"}}, rules: [{from: [{source: {principals: ["cluster.local/ns/team-finance/sa/app"]}}], to: [{operation: {methods: ["POST"], paths: ["/infer"]}}]}]}. This allows team-finance's service account to POST to /infer only. Much more flexible than NetworkPolicy.
Audit cross-team access: Create metrics: network_policy_cross_namespace_packets{source_namespace: "team-finance", dest_namespace: "team-infra-shared"}. Alert if unexpected cross-namespace traffic appears. Log every cross-team request in audit webhook for compliance.
Billing/cost allocation: Since team-finance now uses team-infra's ML service, allocate costs. Use a chargeback model: measure requests/sec from team-finance to ML service, multiply by unit cost, bill team-finance. Implement via custom metrics: ml_inference_requests_total{tenant: "team-finance"}. This incentivizes teams to use shared services efficiently.
Gotcha: If you later want to revoke team-finance's access to the ML service, plan for established connections. Depending on the CNI, removing the allow rule may or may not cut existing flows immediately, so don't rely on the policy change alone: remove the allow rule, then roll the client pods (prefer kubectl rollout restart over kubectl delete pods -n team-finance --all). Better: implement connection draining in the ML service (gracefully close connections over ~30s) so revocation is clean.
Follow-up: You've enabled cross-team access with explicit allow-lists. But now you need billing. How would you track that team-finance made 1M requests to team-infra's ML service and charge them appropriately?
Your 50-team cluster has node affinity policies set per team (team-finance uses premium nodes, team-marketing uses standard nodes). A developer mistakenly sets their pod to use premium nodes even though they're in the marketing team namespace. No RBAC rule explicitly forbids this. The pod schedules on expensive infrastructure, costing 10x more than intended.
Node affinity is set per pod spec: spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchExpressions[].key: "node-tier", values: ["premium"]. There's no namespace-level enforcement—each pod independently specifies where it can run. RBAC doesn't control pod affinity; it only controls who can create pods.
Prevent this with admission control, e.g. a Gatekeeper policy that rejects any pod whose affinity requests premium (Rego sketch): violation { input.review.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[_].matchExpressions[_].values[_] == "premium" }. On violation (a team-marketing pod requesting premium), the webhook rejects the pod: "team-marketing pods can only use standard nodes".
Implementation: (1) Label nodes: kubectl label nodes node1 node-tier=premium node2 node-tier=standard. (2) Set default pod affinity via MutatingWebhook: automatically add affinity to all pods based on namespace: if namespace is team-marketing, add nodeAffinity {node-tier: standard}. (3) Use ValidatingWebhook to reject non-conforming pods: if pod.namespace == "team-marketing" and pod.affinity.nodeAffinity.nodeSelector has "premium", reject.
Audit and chargeback: Track which pods ran on which nodes. Use kubelet metrics: kubelet_running_pods{node_name="premium-node-1"}. Aggregate by pod namespace and sum resource costs. Bill team-marketing for the pods that ran on premium nodes (and overcharge to discourage non-compliance).
Better approach: Use Kyverno (policy-as-code) instead of raw webhooks. Define a ClusterPolicy that matches pods in team-marketing and denies any nodeAffinity term selecting node-tier: premium. Kyverno policies are declarative YAML—more maintainable and auditable than hand-rolled webhooks—and validationFailureAction can start in Audit mode before you flip it to Enforce.
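A working sketch of that Kyverno policy (node-tier label from the text; the JMESPath expression is an assumption—test it against your Kyverno version before enforcing):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-premium-nodes
spec:
  validationFailureAction: Audit    # flip to Enforce once existing workloads are clean
  rules:
  - name: no-premium-for-marketing
    match:
      any:
      - resources:
          kinds: ["Pod"]
          namespaces: ["team-marketing"]
    validate:
      message: "team-marketing pods cannot target node-tier=premium"
      deny:
        conditions:
          any:
          # Deny if "premium" appears among all nodeAffinity matchExpressions values
          - key: "premium"
            operator: AnyIn
            value: "{{ request.object.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchExpressions[].values[] || `[]` }}"
```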
Follow-up: You've prevented premium node affinity for team-marketing. Now they request permission to run a batch job on premium nodes for 1 day (special event). How would you grant temporary exception without opening permanent loophole?
Your cluster spans 3 AWS availability zones. Team Finance requires data residency: all pods and volumes must stay in AZ-1 only. Other teams use all 3 AZs. How would you enforce this without breaking the multi-tenancy isolation model?
Implement zone-scoped node pools and pod affinity rules. Create separate node pools: zones-pool-az1 (only nodes in AZ-1), zones-pool-all (nodes in all 3 AZs). Label nodes: topology.kubernetes.io/zone=us-east-1a (automatic by kubelet). For team-finance namespace, enforce pod affinity: spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchExpressions[]: {key: "topology.kubernetes.io/zone", operator: "In", values: ["us-east-1a"]}.
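What the injected affinity looks like on a pod spec (pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: finance-app            # illustrative
  namespace: team-finance
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]   # hard requirement: scheduler will never place this pod elsewhere
  containers:
  - name: app
    image: finance-app:latest        # illustrative
```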
Automate with MutatingWebhook: For any pod created in team-finance namespace, inject zone affinity: if pod.namespace == "team-finance", add spec.affinity.nodeAffinity {topology.kubernetes.io/zone: us-east-1a}. Pod creator doesn't have to specify it; it's added automatically. This prevents accidental multi-zone scheduling.
Volume enforcement: PersistentVolumes must also stay in AZ-1. Create a StorageClass for team-finance pinned to the zone with allowedTopologies on topology.kubernetes.io/zone (this—not a parameters.availability-zones key—is how the EBS CSI driver is scoped to a zone). When team-finance creates a PVC against this class, the provisioner only creates volumes in us-east-1a. To force the team onto its class, validate storageClassName at admission—RBAC resourceNames cannot restrict which StorageClass a PVC references.
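The zone-pinned StorageClass as a manifest (name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: team-finance-az1       # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # bind after scheduling so pod and volume land in the same zone
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values: ["us-east-1a"]     # provisioner may only create volumes here
```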
Audit: Implement a continuous policy check. Periodically list where team-finance pods actually landed: kubectl get pods -n team-finance -o wide, and compare each pod's node against the AZ-1 node set (kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a). Alert on any pod outside AZ-1. If found, evict it: kubectl delete pod -n team-finance pod-name --grace-period=30.
For disaster recovery, team-finance may want failover to AZ-2. The zone constraint lives in your mutating/validating admission policy (not in a NetworkPolicy—NetworkPolicy has no affinity field), so implement the override there: temporarily patch the policy (or a namespace annotation the webhook reads) to permit ["us-east-1a", "us-east-1b"] during the incident, then revert to ["us-east-1a"] once it's over. Log all overrides for compliance audit.
Cost optimization: Offer cost incentive for multi-zone. Charge team-finance 1.5x for single-zone (redundancy cost). If they agree to multi-zone, reduce to 1.0x. This encourages flexibility while respecting residency compliance.
Follow-up: Team Finance's pod was scheduled in AZ-1 but the zone affinity rule should have enforced it. The MutatingWebhook didn't fire. Why would this happen and how would you debug?