Your cluster runs Flannel VXLAN overlay. You're scaling to 500 nodes across AWS regions. Bandwidth costs are spiraling—$12K/month in inter-region traffic. Your network team says "Why are you encapsulating everything?" You realize Flannel encapsulation is adding 50+ bytes to every packet. Can you switch to Calico/BGP routing without rebuilding the cluster?
Yes, but it requires careful planning. Flannel VXLAN vs Calico BGP represent different architectural philosophies: overlay vs underlay. The migration has network implications.
Phase 1: Understand current state
kubectl get daemonset -n kube-system -o wide | grep flannel
kubectl describe daemonset kube-flannel -n kube-system | grep Image
kubectl exec -n kube-system kube-flannel-xxxxx -- ip route show
kubectl exec -n kube-system kube-flannel-xxxxx -- cat /etc/kube-flannel/net-conf.json
Confirm VXLAN is active:
ssh node-1 ip link show | grep vxlan
ssh node-1 bridge fdb show | head -10
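To check many nodes at once, a small parser over `ip -d link show` output confirms the VXLAN device and reads its MTU. The heredoc below is an assumed sample of that output format; in practice pipe in `ssh node-N ip -d link show` instead.

```shell
# Sketch: detect an active VXLAN device and report the flannel.1 MTU.
parse_vxlan() {
  awk '/vxlan/ { found = 1 }
       /flannel\.1/ && /mtu/ { for (i = 1; i <= NF; i++) if ($i == "mtu") mtu = $(i + 1) }
       END { if (found) print "vxlan active, flannel.1 mtu=" mtu; else print "no vxlan device" }'
}
# Sample input (assumed format); replace with: ssh node-1 ip -d link show | parse_vxlan
parse_vxlan <<'EOF'
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    link/ether 0a:1b:2c:3d:4e:5f brd ff:ff:ff:ff:ff:ff
    vxlan id 1 local 10.0.1.5 dev eth0 srcport 0 0 dstport 8472
EOF
```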
Phase 2: Plan Calico migration
Option A: Rolling replacement (preferred for large clusters)
- Install Calico components alongside Flannel (Calico as policy controller, Flannel continues routing)
- Drain nodes one by one, uninstall Flannel, install Calico CNI
- Validate pod-to-pod connectivity after each node
Option B: Create a new cluster, migrate via service mesh (safest but expensive)
Step 1: Install Calico operator and resources in monitoring mode:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/custom-resources.yaml
Step 2: Configure Calico to coexist (CNI chaining):
The operator's Installation resource (operator.tigera.io/v1) controls the Calico CNI; Calico also ships a flannel-migration controller that automates much of this cutover. A minimal Installation that keeps BGP off until you are ready:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled   # keep Flannel routing until cutover
Verify coexistence:
kubectl get daemonset -n calico-system
kubectl get pods -n calico-system -o wide
Step 3: Drain and migrate nodes (one per hour to monitor impact):
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
ssh node-1 sudo systemctl stop kubelet
ssh node-1 sudo rm -rf /var/lib/cni/flannel /etc/cni/net.d/*flannel*
ssh node-1 sudo systemctl start kubelet
# Wait for the calico-node pod to start and write the new CNI config
sleep 30
kubectl uncordon node-1
kubectl describe node node-1 | grep -E 'Ready|network'
Test pod connectivity:
kubectl run test-pod-2 --image=alpine -- sleep 3600
POD2_IP=$(kubectl get pod test-pod-2 -o jsonpath='{.status.podIP}')
kubectl run test-pod-1 --image=alpine -it --rm -- ping -c 3 "$POD2_IP"
# Pod names aren't DNS-resolvable, so ping the pod IP (or put a Service in front)
Phase 3: Switch Calico to BGP (underlay) mode
Edit the BGP configuration (at this scale, disable the full node-to-node mesh and peer with your routers instead; AS numbers below are placeholders):
kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 64512
EOF
Configure BGP peers (your network routers; peer IP and AS are placeholders):
kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack-router
spec:
  peerIP: 10.0.0.1
  asNumber: 64512
EOF
Verify BGP peering:
kubectl exec -n calico-system calico-node-xxxxx -- calicoctl node status
Expected output: "Calico process is running" and "BGP status: up"
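With many peers, eyeballing the status table doesn't scale. A sketch that counts established sessions from `calicoctl node status` output (the table format in the heredoc is a sample; pipe in the real command output instead):

```shell
# Count BGP sessions in the Established state.
count_established() {
  awk -F'|' '/Established/ { n++ } END { print n + 0 }'
}
# Sample input; replace with: kubectl exec ... calicoctl node status | count_established
count_established <<'EOF'
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.0.1.6     | node-to-node mesh | up    | 08:32:10 | Established |
| 10.0.1.7     | node-to-node mesh | start | 08:32:12 | Connect     |
+--------------+-------------------+-------+----------+-------------+
EOF
```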
Phase 4: Monitor and validate
Compare bandwidth before/after:
ssh node-1 sar -n DEV 1 5 | grep -E 'eth0|vxlan'
# Check packet overhead reduction
ping -c 100 -s 1472 -M do pod-ip # max unfragmented payload at 1500 MTU
Expected: inter-node bandwidth drops roughly 5-10%. The 50-byte VXLAN header is only ~3% of a full-size packet, but savings compound from avoided fragmentation and from small packets, where the header is a much larger fraction.
Cost savings: ~10% of the $12K/month inter-region bill ≈ $1.2K/month.
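A back-of-envelope check of those numbers (assumptions: $12K/month inter-region bill, full-size 1500-byte frames, ~10% total bandwidth reduction):

```shell
BILL=12000            # current monthly inter-region cost in dollars (assumed)
MTU=1500              # physical MTU
OVERHEAD=50           # VXLAN: 14 eth + 20 IP + 8 UDP + 8 VXLAN
# Per-packet overhead as a percentage of the wire size
pct=$(awk -v o="$OVERHEAD" -v m="$MTU" 'BEGIN { printf "%.1f", o * 100 / m }')
echo "per-packet overhead: ${pct}%"
# Savings if total bandwidth drops ~10% (header removal plus fewer fragments)
savings=$(awk -v b="$BILL" 'BEGIN { printf "%.0f", b * 0.10 }')
echo "estimated monthly savings: \$${savings}"
```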
Rollback plan: Keep Flannel manifests in GitOps repo with version tag. If issues occur:
git checkout flannel-v0.21.0
kubectl apply -f flannel-daemonset.yaml
kubectl drain node-1 --ignore-daemonsets
ssh node-1 sudo systemctl stop kubelet && sudo rm -rf /var/lib/cni/calico && sudo systemctl start kubelet
Follow-up: BGP requires your infrastructure team to configure the routers. What happens if they refuse? Design a hybrid approach that reduces costs without requiring router changes.
You've just switched from Flannel to Cilium. Pod-to-pod connectivity works fine, but now kube-proxy is gone and some services are broken. Your NodePort services don't respond. What happened and how do you debug?
Cilium replaces kube-proxy entirely, but the replacement isn't automatic. Cilium needs specific configuration to handle LoadBalancer services and NodePorts. If services are broken, you likely have a Cilium service proxy misconfiguration or a mismatch between Cilium's eBPF and your service topology.
Debug flow:
1. Verify Cilium replaced kube-proxy:
kubectl get pods -n kube-system -l app=kube-proxy
# Should return nothing
Verify Cilium agents are running:
kubectl get daemonset -n cilium
kubectl get pods -n cilium -o wide
2. Check Cilium configuration for services:
kubectl get configmap cilium-config -n cilium -o yaml | grep -E 'kube-proxy|service-proxy-name|bpf-map-dynamic-size-ratio'
Ensure key configs are set (names vary slightly by Cilium version; check your cilium-config):
bpf-map-dynamic-size-ratio: "0.0025" # fraction of system memory for dynamically sized eBPF maps
enable-node-port: "true" # NodePort handling in eBPF
enable-host-port: "true" # hostPort handling
3. Test NodePort directly:
kubectl get svc | grep NodePort
kubectl exec -it debug-pod -- curl http://node-ip:node-port
# If it fails, the service proxy isn't working
4. Check Cilium service map:
kubectl exec -n cilium cilium-xxxxx -- cilium service list
kubectl exec -n cilium cilium-xxxxx -- cilium service get 1234
If the service isn't listed, it wasn't programmed into eBPF.
5. Inspect eBPF maps directly:
kubectl exec -n cilium cilium-xxxxx -- bpftool map show
kubectl exec -n cilium cilium-xxxxx -- bpftool map dump name cilium_lb_services_v4 | head -20
Verify your service IP is in the map.
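To script that verification, a sketch that checks whether a ClusterIP appears in the service table (the heredoc mimics an assumed `cilium service list` layout; feed in the real output from the agent pod):

```shell
# Report whether a given ClusterIP was programmed into Cilium's service table.
service_programmed() {  # usage: ... | service_programmed <cluster-ip>
  grep -q "$1" && echo "programmed" || echo "MISSING"
}
# Sample input; replace with: kubectl exec ... cilium service list | service_programmed 10.96.0.10
service_programmed "10.96.0.10" <<'EOF'
ID   Frontend          Service Type   Backend
1    10.96.0.1:443     ClusterIP      1 => 10.0.1.5:6443
2    10.96.0.10:53     ClusterIP      1 => 10.244.1.3:53
EOF
```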
6. Check Cilium logs for errors:
kubectl logs -n cilium -l k8s-app=cilium --tail=100 | grep -i 'service\|lb\|proxy'
Common error: "failed to create service entry" or "eBPF map full"
Fix: If an eBPF map is full, raise the map sizing ratio in the cilium-config ConfigMap (the default is a small fraction of system memory; key name per the Cilium docs) and restart the agents:
kubectl patch configmap cilium-config -n cilium --type merge -p '{"data":{"bpf-map-dynamic-size-ratio":"0.005"}}'
kubectl rollout restart daemonset/cilium -n cilium
# Monitor for completion
kubectl rollout status daemonset/cilium -n cilium
7. Validate NodePort again:
kubectl exec -it debug-pod -- curl http://node-ip:node-port -v
# Should succeed now
Prevention: When migrating from kube-proxy to Cilium, always:
1. Run the connectivity check first: cilium connectivity test (from the cilium CLI)
2. Enable Cilium's monitoring: hubble observe --verdict DROPPED
3. Test all service types (ClusterIP, NodePort, LoadBalancer) in audit mode before cutover
Follow-up: How would you handle session affinity (sticky sessions) without kube-proxy? Design a solution that works for gRPC and WebSocket traffic.
You're running Calico on a 100-node cluster. Monitoring shows high CPU on calico-node pods and slow pod startup times (45 seconds vs normal 5 seconds). The calico-node pods are consuming 800m CPU each. What's the bottleneck and how do you investigate?
High CPU in calico-node typically indicates: policy reconciliation storms, BGP churn, or eBPF map contention. Pod startup slowness suggests the CNI plugin is blocking on IP allocation or policy programming.
Debug sequence:
1. Correlate CPU spike with events:
kubectl top pods -n calico-system -l k8s-app=calico-node --containers
kubectl describe pod calico-node-xxxxx -n calico-system | grep -A 10 Events
Check if spikes correlate with pod deployments, node additions, or policy updates.
2. Check BGP stability:
kubectl exec -n calico-system calico-node-xxxxx -- calicoctl node status
# Expected: BGP status: up
# If showing "down" or frequent changes, BGP is thrashing
Monitor BGP peering flaps:
kubectl logs -n calico-system -l k8s-app=calico-node | grep -E 'bgp.*state|Peer.*Up|Peer.*Down' | tail -50
High volume of Up/Down events = peering instability.
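To quantify the thrashing, a sketch that counts state transitions per peer from the log stream (the log line format in the heredoc is an assumption; adjust the regex to what your calico-node/bird version actually emits):

```shell
# Count BGP peer Up/Down transitions per peer address.
count_flaps() {
  grep -oE 'Peer [0-9.]+ (Up|Down)' | awk '{ flap[$2]++ } END { for (p in flap) print p, flap[p] " transitions" }'
}
# Sample input; replace with: kubectl logs -n calico-system -l k8s-app=calico-node | count_flaps
count_flaps <<'EOF'
2024-05-01 08:00:01 bird: Peer 10.0.1.7 Up
2024-05-01 08:00:14 bird: Peer 10.0.1.7 Down
2024-05-01 08:00:20 bird: Peer 10.0.1.7 Up
2024-05-01 08:01:02 bird: Peer 10.0.2.9 Up
EOF
```

Peers with many transitions in a short window are the ones to investigate first.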
3. Check policy reconciliation load:
kubectl exec -n calico-system calico-node-xxxxx -- calicoctl get globalnetworkpolicies | wc -l
# Count Calico policies
kubectl get networkpolicies --all-namespaces --no-headers | wc -l
# Count Kubernetes NetworkPolicies
If you have 1000+ policies, reconciliation becomes expensive.
Profile policy processing:
kubectl logs -n calico-system -l k8s-app=calico-node --tail=500 | grep -E 'Reconcile|ProcessUpdate' | wc -l
4. Monitor IP allocation performance:
kubectl describe daemonset calico-node -n calico-system | grep -A 5 "Limits\|Requests"
# Check memory and CPU limits
Run a deployment spike and measure pod startup time:
# kubectl run creates a single pod, so launch the spike in a loop
time for i in $(seq 1 10); do kubectl run "test-$i" --image=alpine --restart=Never --overrides='{"spec":{"terminationGracePeriodSeconds":0}}' -- sleep 300; done
# Then watch how long the pods take to reach Running
kubectl get pods -o wide | grep -c '^test-'
5. Check eBPF map usage:
ssh node-1 sudo bpftool map show | grep -E 'cali|felix'
# Pick a Calico map name from the previous command's output (names vary by version), then:
ssh node-1 sudo bpftool map dump name cali_v4_routes | wc -l
If maps are at 95%+ capacity, Calico can't program new routes efficiently.
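The capacity check is simple arithmetic over the entry count (from the dump) and max_entries (from `bpftool map show`); the numbers below are illustrative:

```shell
# Compute eBPF map utilization as a percentage.
map_usage_pct() {  # usage: map_usage_pct <entries> <max_entries>
  awk -v e="$1" -v m="$2" 'BEGIN { printf "%.0f\n", e * 100 / m }'
}
map_usage_pct 62000 65536   # prints 95
```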
Common fixes:
Fix 1: Increase calico-node resource limits
kubectl set resources daemonset calico-node -n calico-system --limits=cpu=1,memory=512Mi --requests=cpu=500m,memory=256Mi
Fix 2: Reduce reconciliation churn by relaxing Felix's dataplane refresh interval (FelixConfiguration field names per the Calico docs):
calicoctl patch felixconfiguration default --patch '{"spec":{"iptablesRefreshInterval":"60s"}}'
Fix 3: Use Felix's Prometheus metrics to find the hotspots:
calicoctl patch felixconfiguration default --patch '{"spec":{"prometheusMetricsEnabled":true}}'
sleep 120
kubectl exec -n calico-system calico-node-xxxxx -- wget -qO- http://localhost:9091/metrics | grep felix_
# Dataplane apply/resync timings (e.g. felix_int_dataplane_apply_time_seconds) show where reconciliation time goes; exact metric names vary by version
Fix 4: Split policies into smaller, more specific rules
Instead of:
spec:
  podSelector: {}        # matches all pods
  ingress:
  - from:
    - podSelector: {}    # every pod evaluated against every rule
Use labeled tiers:
spec:
  podSelector:
    matchLabels:
      tier: api          # narrower scope, fewer rules to evaluate
Follow-up: How do you scale a single Calico deployment to handle 1000+ nodes? At what point do you need to switch architectures?
Your cluster spans 3 availability zones in the same region. You're using Flannel with VXLAN overlay. Pod A in AZ1 pings Pod B in AZ3—latency is 35ms instead of expected 2-3ms. Network engineers say the underlay is fine. Why is the overlay adding so much latency and how do you fix it?
VXLAN overlay encapsulation can introduce latency through multiple mechanisms: increased packet size causing fragmentation, MTU mismatches, or additional processing in the VXLAN tunnel endpoints.
Investigate latency source:
1. Verify underlay latency is good:
ssh node-az1 ping -c 100 node-az3-ip | grep avg
# Expected: 1-3ms
2. Check pod-to-pod latency in detail:
kubectl run latency-test-az1 --image=nicolaka/netshoot \
  --overrides='{"spec":{"nodeSelector":{"topology.kubernetes.io/zone":"us-east-1a"}}}' -- sleep 3600
kubectl run latency-test-az3 --image=nicolaka/netshoot \
  --overrides='{"spec":{"nodeSelector":{"topology.kubernetes.io/zone":"us-east-1c"}}}' -- sleep 3600
AZ3_IP=$(kubectl get pod latency-test-az3 -o jsonpath='{.status.podIP}')
# Pod names aren't DNS-resolvable, so ping the pod IP:
kubectl exec latency-test-az1 -- sh -c "for i in \$(seq 1 100); do ping -c 1 $AZ3_IP; done" | tee latency.txt
grep time= latency.txt | awk -F'time=' '{print $2}' | awk -F' ' '{print $1}' | sort -n | tail -1
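Beyond the maximum, min/avg/max together reveal whether latency is uniformly high or bimodal (a few slow outliers). A sketch over ping output (the heredoc lines mimic ping's format; feed the real latency.txt instead):

```shell
# Summarize ping RTTs (ms) from ping output on stdin.
rtt_stats() {
  grep -oE 'time=[0-9.]+' | cut -d= -f2 | sort -n | \
    awk '{ a[NR] = $1; sum += $1 } END { printf "min=%s avg=%.1f max=%s n=%d\n", a[1], sum / NR, a[NR], NR }'
}
# Sample input; replace with: rtt_stats < latency.txt
rtt_stats <<'EOF'
64 bytes from 10.244.3.8: icmp_seq=1 ttl=62 time=34.8 ms
64 bytes from 10.244.3.8: icmp_seq=2 ttl=62 time=35.2 ms
64 bytes from 10.244.3.8: icmp_seq=3 ttl=62 time=2.1 ms
EOF
```

A min near 2ms with a max near 35ms points at intermittent queuing or fragmentation rather than a uniformly slow path.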
Isolate latency: measure pod-to-node, node-to-node, node-to-pod to find bottleneck.
3. Check MTU and fragmentation:
AZ3_IP=$(kubectl get pod latency-test-az3 -o jsonpath='{.status.podIP}')
kubectl exec latency-test-az1 -- ping -c 3 -M do -s 1472 "$AZ3_IP"
# If "Frag needed but DF set", MTU is too small
Check current MTU on nodes:
ssh node-az1 ip link show | grep mtu
VXLAN adds 50-byte overhead (14 + 20 + 8 + 8 = 50). If physical MTU is 1500, VXLAN MTU should be 1450.
ssh node-az1 sudo ip link set dev flannel.1 mtu 1450   # temporary; persist via the "MTU" field in Flannel's net-conf.json
Verify the Flannel config:
kubectl get configmap kube-flannel-cfg -n kube-system -o yaml | grep -A 5 net-conf.json
4. Enable Flannel DirectRouting:
kubectl get configmap kube-flannel-cfg -n kube-system -o yaml | grep -E 'Backend|Type|DirectRouting'
If DirectRouting is disabled, enable it (hosts on the same L2 subnet then exchange direct routes instead of encapsulating):
kubectl edit configmap kube-flannel-cfg -n kube-system
# In net-conf.json:
# "Backend": {
#   "Type": "vxlan",
#   "DirectRouting": true
# }
Then restart the kube-flannel pods to pick up the change.
5. Measure VXLAN processing overhead:
ssh node-az1 ethtool -S eth0 | grep -E 'rx_csum|tx_csum|rx_packets|tx_packets'
# Compare before/after enabling hardware offload
Enable TSO (TCP Segmentation Offload) and GSO (Generic Segmentation Offload) if supported:
ssh node-az1 ethtool -K eth0 tso on gso on
6. Check if cross-AZ traffic is being unnecessarily routed through a NAT/gateway:
traceroute latency-test-az3-ip
# Verify direct node-to-node path, not through a gateway
7. Consider switching to Calico BGP (underlay) if latency is critical
With BGP, packets aren't encapsulated—they're routed directly by the underlay network. Latency drops to underlay baseline (1-3ms).
Quick fix ranking by impact:
1. Enable DirectRouting (immediate, ~5ms reduction where nodes share an L2 subnet)
2. Fix MTU mismatch (immediate, if fragmentation is happening)
3. Enable hardware offload (2-3ms reduction)
4. Migrate to BGP/underlay (5-10ms reduction, but requires architecture change)
Follow-up: Your latency-sensitive trading application needs sub-1ms pod-to-pod latency. Which CNI would you choose and why? Design the network architecture.
You're choosing between Calico, Cilium, and Flannel for a new production cluster. Your requirements: 300 nodes, multi-region, policy enforcement, load balancing, and cost control. You have 2 weeks to decide and 4 weeks to deploy. Which do you pick and why? Walk through your evaluation criteria.
Evaluation framework (score each on scale 1-10):
Criterion 1: Operational Complexity
Flannel: 9/10 (simple, fewer moving parts)
Calico: 6/10 (more config, especially for BGP)
Cilium: 3/10 (eBPF learning curve, requires kernel expertise)
Winner: Flannel if you want low ops burden; Cilium if you're willing to invest.
Criterion 2: Policy Enforcement & Observability
Flannel: 3/10 (no native policies; needs Calico alongside it — the Canal combination)
Calico: 8/10 (rich policy language, but limited east-west flow observability)
Cilium: 10/10 (Hubble gives packet-level flow visibility, plus L7 policies)
Winner: Cilium for security/compliance; Calico for policy-heavy workloads.
Criterion 3: Multi-Region Support
Flannel: 5/10 (VXLAN works, but high bandwidth cost across regions)
Calico: 9/10 (BGP with route reflection, designed for multi-region)
Cilium: 7/10 (Cilium Mesh exists, still maturing)
Winner: Calico for cost-efficient multi-region.
Criterion 4: Load Balancing (replacing kube-proxy)
Flannel: 0/10 (requires kube-proxy)
Calico: 4/10 (iptables mode requires kube-proxy; the newer eBPF dataplane can replace it)
Cilium: 9/10 (eBPF-based kube-proxy replacement, supports session affinity)
Winner: Cilium if you want modern service proxy.
Criterion 5: Resource Overhead
Flannel: 9/10 (20-50m CPU, 100-200m memory per node)
Calico: 6/10 (150-300m CPU, 300-500m memory)
Cilium: 4/10 (500-800m CPU, 1Gi memory, but it replaces kube-proxy)
Winner: Flannel for resource-constrained; Cilium competitive if you discount kube-proxy savings.
Criterion 6: Community & Production Maturity
Flannel: 9/10 (stable for years)
Calico: 10/10 (widely deployed, mature)
Cilium: 8/10 (growing adoption, still some instability reports)
Winner: Calico; Flannel is safe; Cilium is modern.
For a 300-node multi-region cluster, I'd recommend:
Primary choice: Calico (BGP mode) if cost and stability are priorities
Alternative: Cilium if you need advanced observability and want kube-proxy replacement
Skip: Flannel for multi-region (high bandwidth costs)
Decision logic:
IF (policy_enforcement == "critical" AND observability == "high") THEN Cilium
ELSE IF (multi_region == TRUE AND cost == "critical") THEN Calico
ELSE IF (simplicity == "priority") THEN Flannel (but single-region only)
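That decision logic can be written down as a small shell function (a sketch; the four yes/no inputs collapse the criteria above, and the thresholds are judgment calls, not hard rules):

```shell
# Pick a CNI from coarse yes/no answers to the evaluation criteria.
choose_cni() {  # args: policy_critical observability_high multi_region cost_critical
  if [ "$1" = "yes" ] && [ "$2" = "yes" ]; then
    echo "Cilium"
  elif [ "$3" = "yes" ] && [ "$4" = "yes" ]; then
    echo "Calico"
  else
    echo "Flannel"
  fi
}
choose_cni no no yes yes   # prints Calico
```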
Deployment timeline for Calico:
Week 1: Lab testing (3 nodes, 2 regions)
Week 2: BGP peer config with network team, policy design
Week 3: Staging deployment (50 nodes, shadow traffic)
Week 4: Production rollout (rolling 50 nodes/week)
Risk mitigation:
- Keep kube-proxy as fallback (don't remove immediately)
- Test policy updates in canary namespace first
- Monitor BGP stability during first 2 weeks
- Maintain Flannel manifests for rollback
Follow-up: Your cluster has mixed workloads: latency-sensitive services (1ms requirement) and batch jobs (cost-optimized). Can you use different CNIs for different workload types? Design this hybrid architecture.
You've deployed Cilium with eBPF on a cluster running older Linux kernels (4.9). Services work sometimes. You're seeing random packet drops and sporadic connection resets. Cilium logs show "bpf verifier error." What's happening and how do you recover?
eBPF programs require specific Linux kernel features. Older kernels (pre-5.x) have incomplete eBPF support, missing helpers, and verifier limitations. This causes runtime failures and unpredictable packet loss.
Diagnosis:
1. Check kernel version on nodes:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'
If you see 4.9.x or 4.14.x, you've identified the problem.
2. Verify eBPF verifier errors:
kubectl logs -n cilium -l k8s-app=cilium | grep -i 'verifier'
# Look for: "invalid memory size" or "unreachable instructions"
3. Check eBPF program load status:
kubectl exec -n cilium cilium-xxxxx -- cilium status --verbose
# Shows which eBPF features probed successfully on this kernel
If features are reported unavailable or programs failed to load, eBPF isn't fully active.
4. Confirm kernel capabilities:
ssh node-1 cat /boot/config-$(uname -r) | grep -E 'CONFIG_BPF|CONFIG_HAVE_EBPF_JIT|CONFIG_BPF_EVENTS'
# Should all be =y
On older kernels, many of these will be missing or =m (module).
Recovery options:
Option A: Shrink Cilium's eBPF footprint to what the kernel supports (immediate, but loses performance benefits). Cilium's datapath always uses eBPF, so there is no fully eBPF-free mode; instead, disable the advanced features that need newer kernels and let iptables handle those paths:
helm upgrade cilium cilium/cilium \
  --set kubeProxyReplacement=disabled \
  --set bpf.masquerade=false
kubectl rollout restart daemonset/cilium -n cilium
Re-deploy kube-proxy if it was removed. This falls back to iptables-based service handling; expect roughly 10-15% throughput loss versus the full eBPF datapath.
Option B: Upgrade nodes to a newer kernel (plan 2-3 hours per node, including drain):
kubectl drain node-1 --ignore-daemonsets
ssh node-1 "sudo apt-get update && sudo apt-get install -y linux-generic-hwe-20.04"   # exact package varies by distro; target a 5.x kernel
ssh node-1 sudo reboot
# Wait for the node to rejoin the cluster
kubectl wait --for=condition=Ready node/node-1 --timeout=10m
kubectl uncordon node-1
After all kernels are upgraded, re-enable the full feature set (e.g. --set kubeProxyReplacement=strict via helm) and restart Cilium:
kubectl rollout restart daemonset/cilium -n cilium
Option C: Replace with Calico (safer, but requires CNI switch)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
# See Network Policies question for full migration steps
Immediate fix (stabilize cluster):
1. Disable the failing eBPF features NOW (Option A) to restore stability
2. Plan kernel upgrades for this weekend
3. Test eBPF mode in staging with new kernels
4. Gradually migrate nodes: drain node → upgrade kernel → rejoin → enable eBPF
Prevention for future:
- Document the minimum kernel requirement in the runbook (5.10+ for recent Cilium releases; check the official system-requirements page)
- Add a pre-flight check to provisioning: verify uname -r and review the feature probes in cilium status
- Include a kernel version gate in the node provisioning script:
#!/bin/bash
# Abort provisioning if the kernel can't support Cilium eBPF.
MIN_KERNEL="5.10"
CURRENT=$(uname -r | cut -d. -f1,2)
# sort -V puts the older version first ([ -lt ] can't compare "5.10"-style strings)
if [ "$(printf '%s\n' "$MIN_KERNEL" "$CURRENT" | sort -V | head -n1)" != "$MIN_KERNEL" ]; then
  echo "ERROR: kernel $CURRENT too old for Cilium eBPF (need >= $MIN_KERNEL)" >&2
  exit 1
fi
Follow-up: You're pinned to old kernel due to legacy workload dependencies. How would you run Cilium alongside a kernel that doesn't support eBPF? Design a workaround.
You have a legacy application that requires IP spoofing capability (custom network stacks, real-time packet shaping). Your CNI plugin (Calico) normally prevents this for security. How do you safely enable IP spoofing for specific pods while keeping default deny for others?
IP spoofing is a privileged capability. CNIs block it by default via reverse-path filtering (rp_filter) and network namespacing. To allow selective spoofing, you need to bypass the CNI's restrictions at the pod level while maintaining cluster security.
Approach: Use pod security policies + custom eBPF rules + network namespace overrides.
Step 1: Create a SecurityPolicy for spoofing-enabled pods
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-network-apps
  labels:
    require-spoofing: "true"
---
# Note: PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25;
# on newer clusters, express the same constraints via Pod Security Admission
# or a validating admission policy. PSPs are cluster-scoped (no namespace).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allow-spoof
spec:
  privileged: false
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - NET_RAW    # required for crafting packets with arbitrary source IPs
  - NET_ADMIN
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
Step 2: Create RBAC to restrict which pods can get NET_RAW
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spoof-pod-creator
  namespace: legacy-network-apps
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spoof-pod-binding
  namespace: legacy-network-apps
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spoof-pod-creator
subjects:
- kind: ServiceAccount
  name: spoof-app
  namespace: legacy-network-apps
Step 3: Configure reverse-path filter bypass for these pods
Note that Calico enforces its own anti-spoofing beyond rp_filter; recent versions expose a per-pod allowance (the cni.projectcalico.org/allowedSourcePrefixes annotation, gated by Felix's workloadSourceSpoofing setting — check your version's docs). Inside the pod, use an init container to disable rp_filter in the pod's network namespace:
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
  namespace: legacy-network-apps
  annotations:
    requires-spoofing: "true"
spec:
  serviceAccountName: spoof-app
  initContainers:
  - name: disable-rp-filter
    image: busybox:latest
    command:
    - /bin/sh
    - -c
    - |
      sysctl -w net.ipv4.conf.all.rp_filter=0
      sysctl -w net.ipv4.conf.default.rp_filter=0
    securityContext:
      privileged: true
  containers:
  - name: app
    image: your-legacy-app:latest
    securityContext:
      capabilities:
        add:
        - NET_RAW
        - NET_ADMIN
      runAsUser: 1000
Step 4: Network policy: Isolate spoofing pods
Even with spoofing enabled, restrict their network access to prevent lateral movement:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-spoof-pod
  namespace: legacy-network-apps
spec:
  podSelector:
    matchLabels:
      app: legacy-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9090
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: external-networks
    ports:
    - protocol: UDP
      port: 53    # DNS only
  - to:
    - ipBlock:
        cidr: 10.20.0.0/16    # specific destination for packet shaping
Step 5: Verify spoofing capability
kubectl exec legacy-app -- cat /proc/sys/net/ipv4/conf/all/rp_filter
# Should return: 0 (disabled)
Test IP spoofing:
kubectl exec -i legacy-app -- python3 - <<'EOF'
# Assumes scapy is installed in the legacy-app image
from scapy.all import IP, ICMP, send
send(IP(src="192.168.1.100", dst="10.0.0.1")/ICMP())
EOF
# Packets should leave with the spoofed source IP
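To verify from the capture side, a sketch that flags packets whose source IP differs from the pod's own address. It assumes tcpdump's ICMP line format (source address without a port suffix) and an illustrative pod IP:

```shell
POD_IP=10.244.2.15   # the pod's real IP (illustrative)
# Flag tcpdump lines whose source address is not the pod's own IP.
flag_spoofed() {
  awk -v self="$POD_IP" '$2 == "IP" && $3 != self { print "SPOOFED src=" $3 }'
}
# Sample input; replace with: tcpdump -i eth0 -n -l icmp | flag_spoofed
flag_spoofed <<'EOF'
08:00:01.000001 IP 10.244.2.15 > 10.0.0.1: ICMP echo request, id 1, seq 1, length 64
08:00:02.000001 IP 192.168.1.100 > 10.0.0.1: ICMP echo request, id 1, seq 2, length 64
EOF
```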
Step 6: Monitoring & Alerting
Log spoofing activity for compliance:
kubectl exec legacy-app -- sh -c 'tcpdump -i eth0 -l "not src host $(hostname -i)"' | tee /var/log/spoofed-packets.log
# BPF capture filter syntax; double quotes so $(hostname -i) expands inside the pod
Alert if spoofing pod sends traffic to unexpected destinations:
- alert: UnauthorizedSpoofedTraffic
  # metric name is illustrative; use whatever your flow exporter provides
  expr: rate(egress_packets{pod_label_requires_spoofing="true",destination_namespace!="external-networks"}[5m]) > 100
  annotations:
    summary: "Spoofing pod {{ $labels.pod_name }} sending traffic outside approved range"
Security audit trail:
kubectl get events -n legacy-network-apps | grep -E 'NET_RAW|privileged'
# Query the API server audit log (path depends on your audit configuration):
sudo grep 'legacy-network-apps' /var/log/kubernetes/audit.log | jq '.requestObject.spec.securityContext'
Follow-up: How would you monitor for unauthorized IP spoofing attempts across your cluster? Design a detection system that flags suspect network activity.
You've deployed Cilium in a cluster with thousands of pods. After a week, you notice pod-to-pod DNS queries are failing intermittently (1-2% of requests). The issue is DNS resolution timing out. Your infrastructure team says "Network is fine, check your CNI." Cilium's DNS proxy might be the culprit. How do you debug and fix this?
Cilium includes a DNS proxy for security and observability. If it's misconfigured or overloaded, DNS queries timeout and pods can't reach services by hostname.
Diagnosis:
1. Verify DNS is failing:
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- nslookup kubernetes.default
# Run it several times: do some succeed while others time out?
2. Confirm the DNS proxy is active and check its counters:
kubectl exec -n cilium cilium-xxxxx -- cilium status --verbose | grep -i dns
# Should show the DNS proxy enabled
kubectl exec -n cilium cilium-xxxxx -- cilium metrics list | grep -i dns
# Look for query/failure counters such as cilium_dns_queries_total and cilium_dns_failures_total (names vary by version)
Root causes (most common):
1. DNS proxy is CPU-saturated (high load, small replicas)
2. Upstream DNS server (kube-dns/coredns) is slow
3. DNS query caching is misconfigured
4. DNS proxy pod is overloaded with too many queries
Fix 1: Increase Cilium agent resources so the DNS proxy isn't CPU-starved:
kubectl set resources daemonset/cilium -n cilium \
  --limits=cpu=1000m,memory=1Gi \
  --requests=cpu=500m,memory=512Mi
Fix 2: Raise the minimum TTL Cilium honors for DNS responses so repeat queries are served from cache (option name per the Cilium agent docs; verify against your version):
kubectl patch configmap cilium-config -n cilium --type merge -p '{"data":{"tofqdns-min-ttl":"300"}}'
kubectl rollout restart daemonset/cilium -n cilium
Fix 3: Monitor upstream DNS performance:
kubectl run dns-perf-test --image=alpine -it --rm -- \
  time nslookup kubernetes.default
# Query time should be well under 100ms
If upstream is slow, scale coredns:
kubectl scale deployment coredns -n kube-system --replicas=3
kubectl get deployment coredns -n kube-system
Prevention:
- alert: DNSQueryLatency
  expr: histogram_quantile(0.95, cilium_dns_query_latency_seconds_bucket) > 0.5
  for: 5m
  annotations:
    summary: "DNS queries p95 latency > 500ms"
- alert: DNSProxyErrors
  expr: rate(cilium_dns_failures_total[5m]) > 0.01
  annotations:
    summary: "DNS proxy error rate {{ $value }}/sec"
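The 1-2% failure rate in the scenario is easy to sanity-check against raw counters; the counts below are illustrative (e.g. from query/failure metrics, where your version exposes them):

```shell
# Failure rate as a percentage of total queries.
dns_failure_pct() {  # usage: dns_failure_pct <failures> <total>
  awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", f * 100 / t }'
}
pct=$(dns_failure_pct 150 10000)   # pct is 1.50
echo "failure rate: ${pct}%"
# Flag anything above a 1% threshold
awk -v p="$pct" 'BEGIN { exit !(p > 1.0) }' && echo "ALERT: above 1% threshold"
```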
Follow-up: How would you troubleshoot DNS resolution if you suspect the problem is with the application's DNS client (retry behavior, timeout settings) vs. the CNI?