You deploy a default-deny NetworkPolicy (`ingress: []` and `egress: []`) to a namespace running 40 microservices. Within minutes, 18 services start failing with connection timeouts. Your monitoring shows DNS queries failing. What's the root cause and how do you fix it without rolling back?
The issue: a default-deny egress policy blocks pods' DNS queries to kube-dns (CoreDNS) in the kube-system namespace. Name resolution fails, causing cascading failures across every service that dials its peers by name.
Root cause diagnosis:
kubectl get netpol -A to list all policies
kubectl describe netpol default-deny -n production to inspect your policy
kubectl exec -it pod-name -n production -- nslookup kubernetes.default to test DNS directly
Check logs: kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
Verify coredns pods aren't blocked: kubectl get pods -n kube-system -o wide | grep coredns
Fix without rollback: create explicit egress rules allowing DNS:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system   # automatic label; a custom "name:" label would have to be added by hand
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53   # DNS falls back to TCP for large responses
  # other egress (e.g. in-namespace Postgres) still needs its own allow rules:
  - to:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 5432
Verify immediately: kubectl apply -f allow-dns.yaml && kubectl exec pod-name -- nslookup google.com
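The nslookup check can be scripted for repeated verification from inside any pod that ships Python (a minimal sketch; the hostname is an example, and the resolver timeout comes from the pod's /etc/resolv.conf, not from this code):

```python
import socket

def dns_ok(name: str) -> bool:
    """Return True if `name` resolves from this pod, False otherwise."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# After applying allow-dns-egress this should flip from False to True:
# dns_ok("kubernetes.default.svc.cluster.local")
```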
Follow-up: How would you enforce DNS egress to only your internal DNS resolver (10.0.0.10) and block external DNS queries? Design a policy that logs violations.
Your payment service needs to call a third-party API at 52.48.12.15:443, but it's failing intermittently. Devs say the connection sometimes works, sometimes doesn't. Packet traces show the traffic is leaving the pod but responses aren't arriving. What's happening and how do you debug this?
This is likely asymmetric routing or a return path that escapes the CNI's connection tracking. NetworkPolicy enforcement is stateful: replies to an allowed outbound connection are admitted automatically. So when packet traces show requests leaving but replies never arriving, suspect node-level routing or SNAT, or a provider whose DNS round-robins across several IPs while your egress rule pins a single /32 (some attempts then target non-whitelisted addresses and die silently).
Debug steps:
1. Check the NetworkPolicy egress rules:
kubectl get netpol payment-service -o yaml | grep -A 20 egress
2. Verify the rule actually targets 52.48.12.15/32 (not a broader CIDR that's being restricted elsewhere):
kubectl describe netpol payment-service
3. Test from inside pod with verbose output:
kubectl exec -it payment-pod -- curl -v https://52.48.12.15:443
4. Check CNI plugin constraints:
kubectl get ds -n kube-system -o wide | grep -E 'calico|cilium|flannel'
5. Inspect the dataplane on the node hosting the pod (the CNI programs policy rules into the host's network namespace or eBPF, not inside the pod, and most pod images don't even ship iptables):
ssh node-ip sudo iptables-save | grep 52.48.12.15
6. Look at verdicts more broadly on that node:
ssh node-ip sudo iptables-save | grep -E 'ACCEPT|DROP' | tail -20
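Step 2's check, whether 52.48.12.15 is actually covered by the rule's CIDRs, is easy to get wrong by eye; Python's stdlib ipaddress module settles it (a quick sketch using the addresses from this scenario):

```python
from ipaddress import ip_address, ip_network

def covered(ip: str, cidrs: list[str]) -> bool:
    """True if `ip` falls inside any of the egress rule's CIDR blocks."""
    addr = ip_address(ip)
    return any(addr in ip_network(c) for c in cidrs)

print(covered("52.48.12.15", ["52.48.12.15/32"]))  # True
print(covered("52.48.12.16", ["52.48.12.15/32"]))  # False: a nearby LB IP slips past the rule
```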
Common fix: if the external service sits behind a NAT/load balancer or round-robin DNS, connections may target IPs outside your whitelisted /32. Widen the egress rule to the provider's published CIDR range, or, with Calico Enterprise, use domain-based rules so policy follows DNS instead of fixed IPs:
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: external-api-egress
  namespace: production
spec:
  selector: app == 'payment-service'
  egress:
  - action: Allow
    protocol: TCP
    destination:
      domains:
      - 'api.thirdparty.com'
      ports:
      - 443
(Note: Calico has no `EgressPolicy` kind; domain-based destinations live on `NetworkPolicy`/`GlobalNetworkPolicy` and are a Calico Enterprise feature.)
Follow-up: How do you handle external service failover if the API moves to a different IP range? What's the operational cost of domain-based egress policies?
Your frontend pods can't reach backend pods across a NetworkPolicy boundary. You check the policy and it looks correct—`podSelector` and `namespaceSelector` are set. But traffic is still blocked. You test with `kubectl run debug-pod` and that works. What's different?
The difference is coverage, not correctness. `kubectl run debug-pod` creates a pod whose only label is `run=debug-pod`, so it typically isn't selected by any isolating policy and its traffic flows unrestricted, while your real frontend and backend pods are selected and must satisfy every label in the policy's matchLabels to be allowed. A single missing or misspelled label means no match.
Diagnosis:
1. Get the labels on your blocked frontend pod:
kubectl get pod frontend-1 -o jsonpath='{.metadata.labels}'
2. Get the NetworkPolicy and inspect its selectors:
kubectl get netpol allow-backend -o yaml
3. Check whether the pod labels satisfy the policy's selectors. matchLabels is a logical AND: every listed key must be present with the exact value. If your policy says:
to:
- podSelector:
    matchLabels:
      app: backend
      tier: api
...but your backend pods only carry `app: backend`, they won't match, because `tier: api` is missing.
4. Verify the backend namespace has the required label:
kubectl get namespace backend -o jsonpath='{.metadata.labels}'
Fix: update your NetworkPolicy selector to match actual labels:
to:
- podSelector:
    matchLabels:
      app: backend
Or add missing labels to pods:
kubectl label pods -l app=backend tier=api
Test with netcat:
kubectl exec frontend-pod -- nc -zv backend-service 8080
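The matchLabels semantics above (every key must match, not any) can be sketched in a few lines of Python, using the labels from this example:

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """matchLabels semantics: every key/value pair must be present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())

selector = {"app": "backend", "tier": "api"}
print(selector_matches(selector, {"app": "backend"}))                 # False: tier missing
print(selector_matches(selector, {"app": "backend", "tier": "api"}))  # True
```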
Follow-up: How would you design a label strategy to make NetworkPolicies maintainable at scale (200+ services)? What's the difference between label-based and namespace-based policies?
You're running a multi-tenant cluster. Tenant A's workload is exfiltrating data to a suspicious external IP (103.21.4.50). You want to immediately block ALL egress from Tenant A to external networks, but allow internal cluster traffic and kube-system access. Your NetworkPolicy blocks it, but Tenant A reports that their apps can still reach external IPs. Why?
Most likely issue: the pods bypass the CNI entirely. NetworkPolicies are enforced by the CNI on the pod's veth interface; a pod running with `hostNetwork: true` shares the node's network namespace, so its traffic never traverses the CNI dataplane and no NetworkPolicy applies to it.
Diagnosis:
1. Check if pods are using host networking:
kubectl get pods -n tenant-a -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\n"}{end}'
2. Check for node routes that bypass the CNI:
ssh node-ip ip route | grep -E '103.21|default'
3. Verify the NetworkPolicy is actually applied:
kubectl get netpol -n tenant-a
kubectl describe netpol deny-external-egress -n tenant-a
4. Check pod labels match the policy selector:
kubectl get pods -n tenant-a -o wide
Fix:
1. If hostNetwork is enabled, remove it from the workload spec unless absolutely necessary. The field is immutable on running pods, so edit the controller and let it roll replacement pods:
kubectl edit deployment workload-name -n tenant-a   # delete the hostNetwork: true line from the pod template
2. Apply strict NetworkPolicy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  # an empty namespaceSelector matches every namespace in the cluster,
  # including kube-system for DNS, but never external IPs; note that a
  # bare podSelector: {} here would only match pods inside tenant-a
  - to:
    - namespaceSelector: {}
3. Verify with egress test:
kubectl exec -it tenant-a-pod -- curl -I http://103.21.4.50 --connect-timeout 5   # should time out
Follow-up: How would you detect data exfiltration attempts across your cluster? What's the monitoring/alerting strategy for suspicious egress patterns?
You're building a shared cluster for 15 teams. Each team's namespace needs to communicate only with its own services and shared services (logging, monitoring). A developer in Team A accidentally applied a broad NetworkPolicy that blocks Team B. How do you design policies that prevent this kind of accident while maintaining security?
Design a hierarchical NetworkPolicy strategy with namespace defaults, ingress whitelists, and egress tiers.
Step 1: Create a default-deny for each namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Stamp this into every new namespace automatically (via a mutating admission webhook or your GitOps templating) so no namespace ever starts open.
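Stamping the default-deny across namespaces is easy to script; a sketch that renders the manifest per namespace (names are illustrative, and in practice you would wire this into your GitOps pipeline rather than apply by hand):

```python
DEFAULT_DENY = """\
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: {ns}
spec:
  podSelector: {{}}
  policyTypes:
  - Ingress
  - Egress
"""

def render_default_deny(namespaces: list[str]) -> str:
    """One multi-document YAML stream, ready for `kubectl apply -f -`."""
    return "---\n".join(DEFAULT_DENY.format(ns=ns) for ns in namespaces)

manifest = render_default_deny(["team-a", "team-b"])
```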
Step 2: Create allow-policies for expected traffic patterns:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-team-and-shared
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}
  - from:
    - namespaceSelector:
        matchLabels:
          shared: 'true'
    ports:
    - protocol: TCP
      port: 9090   # prometheus
  egress:
  - to:
    - podSelector: {}
  - to:
    - namespaceSelector:
        matchLabels:
          shared: 'true'
    ports:
    - protocol: TCP
      port: 5140   # syslog
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
Step 3: Prevent cross-team accidents with policy audit mode. Use Kubernetes audit logs:
kubectl get events -n team-a --sort-by='.lastTimestamp' | grep 'NetworkPolicy'
Or integrate with Cilium Network Policy observability:
kubectl logs -n cilium -l k8s-app=cilium --tail=500 | grep -i denied | tail -50   # kubectl logs has no --grep flag, so pipe through grep
Step 4: Kubernetes ships no usable NetworkPolicy status subresource, so "which policy is blocking this?" has to come from the CNI. With Cilium:
kubectl exec -n cilium cilium-xxxxx -- cilium policy get
Monitoring: export metrics from your CNI. The Cilium agent's Prometheus port depends on your install; recent defaults use 9962:
kubectl port-forward -n cilium ds/cilium 9962:9962 &
curl http://localhost:9962/metrics | grep policy
Follow-up: How do you version control NetworkPolicies at scale? What's your rollback strategy if a policy breaks 5 teams at once?
Your cluster uses Cilium for network policies. You've configured `policy: strict` in the CNI but observability is blind—you don't know which traffic is being denied or why. Your on-call gets paged 3 times a week with "connection refused" errors that disappear on retry. How do you build visibility and confidence?
Build a multi-layer observability stack: Cilium network policy logs, metrics, and tracing.
Layer 1: Enable Cilium policy observability mode (no-drop logging):
kubectl exec -n cilium cilium-xxxxx -- cilium config PolicyAuditMode=Enabled
kubectl exec -n cilium cilium-xxxxx -- cilium config MonitorAggregationLevel=medium
Audit mode lets would-be-denied connections through but records them with an AUDIT verdict, so you see what a policy would break without dropping real traffic. (Runtime option names vary by Cilium version; list what your agents accept with `cilium config`.)
Layer 2: Export metrics to Prometheus (again, 9962 is the recent default agent metrics port; confirm against your install):
kubectl port-forward -n cilium ds/cilium 9962:9962 &
# Scrape config in Prometheus:
# - job_name: cilium
#   static_configs:
#   - targets: ['localhost:9962']
Query denied policies:
cilium_drop_count_total{reason="Policy denied"}
cilium_policy_l7_total{rule="denied"}
(Exact metric names shift between Cilium versions; grep your agent's /metrics output for `drop` and `policy`.)
Layer 3: Use Hubble for flow-level visibility:
hubble observe --namespace team-a --verdict DROPPED
hubble observe --from-namespace team-a --to-namespace kube-system
(Hubble reports policy denials with the DROPPED verdict, not DENIED.)
Or watch in real-time via web UI:
kubectl port-forward -n cilium svc/hubble-ui 12000:80 &
# Then visit localhost:12000
Layer 4: Root-cause intermittent connection issues with packet-level debugging:
kubectl exec -n cilium cilium-xxxxx -- cilium monitor -t policy-verdict | grep -A 5 "team-a-pod"
Check for policy race conditions:
kubectl logs -n cilium -l k8s-app=cilium --tail=200 | grep -E 'policy|state-change'
Recommended setup for on-call:
Alert on: rate(cilium_drop_count_total{reason="Policy denied"}[5m]) > 10
Dashboard: Grafana with the Cilium/Hubble dashboards showing top-denied pod pairs
Runbook: "When paged for connection refused, check `hubble observe --verdict DROPPED --last 20` for the specific pod pair, then validate the NetworkPolicy rules."
Follow-up: How do you test NetworkPolicies before deploying to production? Design a validation pipeline that catches misconfigurations.
You're running a Kubernetes cluster with multi-cloud failover. A pod in GCP needs to call a microservice running in AWS. The AWS service draws from a dynamic IP pool nominally inside 52.48.0.0/16. Your NetworkPolicy allows that CIDR, but traffic randomly fails. When you widen it to the full 52.0.0.0/8, it works, but that feels risky. What's the production-grade solution?
The problem: static CIDR policies don't handle dynamic IPs well. Multiple solutions exist, each with tradeoffs.
Option 1: Service mesh (Istio/Linkerd) with authorization policies
Decouples from IPs entirely:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: cross-cloud-access
spec:
  selector:
    matchLabels:
      app: gcp-pod
  rules:
  - to:
    - operation:
        hosts: ["aws-service.aws-ns.svc.cluster.local"]
        ports: ["8080"]
Install Istio sidecar proxies, which handle routing logic.
Option 2: DNS-based Calico policy (requires Calico Enterprise; the kind is NetworkPolicy/GlobalNetworkPolicy):
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: aws-service-egress
spec:
  selector: cluster == 'gcp'
  egress:
  - action: Allow
    protocol: TCP
    destination:
      domains:
      - 'aws-service.internal.aws-region.amazonaws.com'
      ports:
      - 8080
Calico doesn't resolve the domain once at apply time; it learns IPs from the workloads' own DNS responses and programs them into the dataplane, so the policy tracks IP changes automatically. Check resolution:
kubectl exec -it gcp-pod -- nslookup aws-service.internal.aws-region.amazonaws.com
Option 3: Tunnel + NetworkPolicy (if you can't use mesh/egress policies)
Deploy a stable proxy (a small Deployment or StatefulSet running e.g. nginx or Envoy) in front of the AWS service, exposing one fixed internal endpoint, then point a NetworkPolicy at the proxy's labels:
to:
- podSelector:
    matchLabels:
      app: aws-proxy
ports:
- protocol: TCP
  port: 8080
Option 4: Accept the /8 CIDR if it's operationally acceptable:
to:
- ipBlock:
    cidr: 52.0.0.0/8
ports:
- protocol: TCP
  port: 8080
Document the exception and alert if egress drops persist, which would indicate traffic to unexpected ranges (the metric shown is Cilium's; adjust for your CNI):
rate(cilium_drop_count_total{reason="Policy denied"}[5m]) > 0
Recommendation: if you run Calico Enterprise, use Option 2 (DNS-based policies) for production multi-cloud; it's secure, maintainable, and needs no application-level changes. Otherwise Option 1 (a service mesh) is the closest substitute.
Follow-up: If DNS-based egress policies can't resolve external domains because your egress policy denies external DNS, how do you bootstrap this chicken-and-egg problem?
Your cluster has a security incident: a compromised pod in namespace A was able to reach and exfiltrate data from namespace B (which should have been blocked). Your NetworkPolicy looks correct on paper, but the incident shows it wasn't actually enforced. Where are the gaps and how do you audit them?
NetworkPolicies are only enforced if the CNI plugin implements them. Common gaps: a CNI with no NetworkPolicy support at all (plain flannel accepts the objects and silently ignores them), the policy controller failing to reconcile, or enforcement agents running on only some nodes.
Comprehensive audit checklist:
1. Verify CNI is running and enforcing policies:
kubectl get daemonset -A | grep -E 'calico|cilium|flannel'
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
Check node status:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
2. Verify the enforcement agent runs on every node (a node missing its DaemonSet pod is a node whose workloads are unpoliced):
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
Compare the list against kubectl get nodes: every node needs one Ready agent pod.
On the node itself (Calico programs iptables chains prefixed cali-):
ssh node-1 sudo iptables-save | grep cali- | head -20
3. Check if policies are actually in the CNI plugin's state:
For Calico (calicoctl runs as a standalone binary or kubectl plugin, not inside the calico-node container):
calicoctl get networkpolicy -n namespace-a
calicoctl get networkpolicy namespace-a-deny -n namespace-a -o yaml
For Cilium:
kubectl exec -n cilium cilium-xxxxx -- cilium policy get
kubectl exec -n cilium cilium-xxxxx -- cilium endpoint list   # POLICY columns show per-endpoint enforcement
4. Simulate the attack path and check if it's blocked:
kubectl run attacker --image=busybox -it --rm --namespace=namespace-a -- sh
# Inside the pod (busybox ships nc, curl images don't; and target the Service name, since individual pods aren't addressable under .svc):
nc -zv -w 3 victim-service.namespace-b.svc.cluster.local 5432
If it succeeds when it shouldn't, the policy isn't enforced.
5. Check policy selector labels on the compromised pod:
kubectl get pod compromised-pod -n namespace-a -o yaml | grep -A 5 labels
Compare to your NetworkPolicy selectors.
6. Validate NetworkPolicy YAML syntax and references:
kubectl apply -f policy.yaml --dry-run=client -o yaml
Check for typos in namespace or pod selectors.
7. Review audit logs for policy changes (the second command assumes API-server audit logging is enabled and routed to stdout):
kubectl get events -n namespace-a --field-selector involvedObject.kind=NetworkPolicy
kubectl logs -n kube-system -l component=kube-apiserver | grep NetworkPolicy | tail -50
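The manual nc probe from step 4 can be wrapped into a reusable check for CI-style enforcement tests (a minimal sketch; host and port are placeholders for your actual victim service):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection succeeds; False on refusal or timeout.
    A True result on a path your policy should block means enforcement is broken."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. tcp_reachable("victim-service.namespace-b.svc.cluster.local", 5432)
```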
Post-incident fix:
1. Immediately apply policies that block the attack path:
kubectl apply -f deny-namespaceA-to-namespaceB.yaml
2. Test enforcement:
kubectl run test-pod -n namespace-a --image=alpine -- sleep 1000
kubectl exec test-pod -n namespace-a -- nc -zv -w 3 namespace-b-service.namespace-b 5432   # should time out
(Don't test with ping: ICMP to a ClusterIP gets no reply even with no policies in place.)
3. Review and audit all existing policies:
kubectl get netpol -A -o yaml | grep -E 'name:|podSelector:|namespaceSelector:' | head -50
4. Enable continuous policy testing in your CI/CD or use tools like Kyverno to enforce policy rules.
Follow-up: How would you design a compliance framework that ensures NetworkPolicies are always enforced? What's your audit and remediation workflow?
Your compliance team demands: "All egress from production must be explicitly whitelisted. Anything not in the whitelist must be denied and logged." You have 150+ services with diverse egress requirements. How do you implement and maintain this without causing 24/7 on-call hell?
Build a policy-as-code infrastructure with progressive enforcement and observability.
Phase 1: Discovery and audit (2 weeks)
Vanilla NetworkPolicies have no audit mode, so applying the baseline below enforces immediately. Use your CNI's audit facility instead (Cilium PolicyAuditMode, Calico staged network policies) to log what the baseline would deny without dropping anything:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: audit-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector: {}
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
Export denied traffic from logs:
kubectl logs -n cilium -l k8s-app=cilium | grep -i denied > denied-traffic.log   # kubectl logs has no --grep flag
# Parse into egress rules template
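Turning the denied-traffic log into whitelist candidates is mostly aggregation; a sketch assuming each parsed line yields a (source app, destination, port) tuple (adapt the parsing to your CNI's actual log format, and treat rare destinations as security review items rather than whitelist entries):

```python
from collections import Counter

def whitelist_candidates(flows, min_count=5):
    """Count denied (app, dst, port) tuples. Frequent tuples are likely
    legitimate egress the whitelist is missing; rare ones deserve a
    security look before being allowed."""
    counts = Counter(flows)
    return [flow for flow, n in counts.most_common() if n >= min_count]

flows = [("payments", "api.stripe.com", 443)] * 9 + [("payments", "203.0.113.7", 8443)]
print(whitelist_candidates(flows))  # [('payments', 'api.stripe.com', 443)]
```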
Phase 2: Tiered whitelist (per service type)
Create reusable policies for common patterns:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-database-tier
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: database
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          tier: cache
    ports:
    - protocol: TCP
      port: 6379
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-api-tier
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          tier: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - podSelector:
        matchLabels:
          tier: cache
    ports:
    - protocol: TCP
      port: 6379
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32   # cloud metadata endpoint
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 80
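The `ipBlock` with `except` in the api-tier policy reads as "inside the CIDR and inside none of the exceptions"; the evaluation can be sketched with the stdlib ipaddress module, using the addresses from this example:

```python
from ipaddress import ip_address, ip_network

def ip_block_allows(ip: str, cidr: str, except_cidrs: list[str]) -> bool:
    """NetworkPolicy ipBlock semantics: in cidr, outside every except entry."""
    addr = ip_address(ip)
    if addr not in ip_network(cidr):
        return False
    return not any(addr in ip_network(e) for e in except_cidrs)

print(ip_block_allows("93.184.216.34", "0.0.0.0/0", ["169.254.169.254/32"]))    # True
print(ip_block_allows("169.254.169.254", "0.0.0.0/0", ["169.254.169.254/32"]))  # False
```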
Phase 3: Automate policy generation
Create a template system for teams to request egress rules:
cat policy-request.yaml
---
app: payment-service
tier: api
required-egress:
- destination: stripe.com
  port: 443
  reason: "Payment processing"
- destination: kube-system:53
  port: 53
  reason: "DNS"
Use Kustomize or Helm to generate policies:
helm template egress-policies ./egress-policy-chart -f policy-request.yaml | kubectl apply -f -   # chart path is illustrative
Phase 4: Monitoring and alerting
Alert on unknown egress attempts:
- alert: UnauthorizedEgress
  expr: rate(cilium_drop_count_total{reason="Policy denied"}[5m]) > 5
  annotations:
    summary: "Production egress drops above threshold"
    runbook: "Check /runbooks/unauthorized-egress.md"
Create a self-service dashboard:
Grafana query: Hubble flow metrics (e.g. hubble_drop_total, broken out by source/destination labels if your Hubble metrics context is configured to emit them)
filter by: egress direction, namespace="production"
Phase 5: Change management
Require approval for new egress rules:
1. Developer submits PR with new policy
2. Security team reviews IP/domain/port against vulnerability database
3. Auto-check: "Is this internal service? Is this a known CDN?"
4. Once approved, policy is deployed with 24-hour rollback window
Rollback automation: NetworkPolicies aren't Deployments, so `kubectl rollout undo` doesn't apply. Keep the policies in Git and revert:
git revert <bad-commit> && kubectl apply -f policies/
Follow-up: What happens when a service legitimately needs to call a new external API? Walk me through your request, approval, and rollback workflow.