You migrated from kube-proxy (iptables mode) to Cilium as your cluster's CNI. Service connectivity works, but you need to explain to your infrastructure team what changed at the kernel level. Specifically: you're concerned about performance, latency, and debugging. Walk through how a packet from pod A to service B flows through each system.
kube-proxy (iptables mode) flow: Pod A sends a packet to the service IP (e.g., 10.0.0.1:443). The kernel's netfilter hooks evaluate kube-proxy's pre-programmed iptables rules (present on every node): iptables -t nat -A KUBE-SERVICES -d 10.0.0.1/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-xxx. This rule matches the service IP and jumps to a per-service chain (KUBE-SVC-xxx). That chain has one jump per endpoint (-j KUBE-SEP-yyy), selected pseudo-randomly via the iptables statistic module (not strict round-robin). KUBE-SEP-yyy DNATs the packet to the actual pod IP (e.g., 10.1.2.3:443). The packet goes out; the return packet is translated back by conntrack rather than re-traversing the NAT chains. Latency: every new connection traverses several chains (PREROUTING -> KUBE-SERVICES -> KUBE-SVC -> KUBE-SEP) plus conntrack, and the KUBE-SERVICES chain is scanned linearly, so the per-packet cost (roughly 50-100 microseconds in large rule sets) grows with the number of services.
Cilium (eBPF mode) flow: Pod A sends a packet to the service IP. Instead of iptables rules, Cilium has loaded eBPF programs into the kernel, attached at tc ingress (conceptually: tc filter add dev eth0 ingress bpf da object cilium.o section from-container). The eBPF program runs as soon as the packet enters the kernel datapath. The program: (1) Parses packet headers (L3/L4). (2) Looks up the service in a BPF map (an in-kernel hash table). (3) Selects an endpoint (pseudo-random by default, or Maglev consistent hashing if configured). (4) Rewrites the destination IP and port (DNAT) in-kernel, with no trip through userspace. (5) Returns the packet to the forwarding pipeline. Total latency: on the order of 5-10 microseconds, roughly an order of magnitude faster. There is no netfilter conntrack overhead — Cilium keeps its own connection-tracking state in a BPF map.
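The two flows above can be contrasted with a toy model (this is illustrative Python, not kube-proxy or Cilium code): iptables scans the KUBE-SERVICES chain linearly, so the worst-case cost grows with service count, while an eBPF-style hash map resolves the service in one lookup regardless of how many services exist. The IPs and rule layout below are made up.

```python
import random

# 100 services, each with 2 endpoints; iptables-style flat rule list.
iptables_rules = [(("10.0.0.%d" % i, 443), ["10.1.%d.1" % i, "10.1.%d.2" % i])
                  for i in range(1, 101)]

def iptables_lookup(dst_ip, dst_port):
    """Linear scan of the KUBE-SERVICES chain, top to bottom."""
    comparisons = 0
    for svc, endpoints in iptables_rules:
        comparisons += 1
        if svc == (dst_ip, dst_port):
            # statistic-module style pseudo-random endpoint choice
            return random.choice(endpoints), comparisons
    return None, comparisons

# eBPF-style: one hash-map lookup, independent of service count.
bpf_service_map = {svc: eps for svc, eps in iptables_rules}

def ebpf_lookup(dst_ip, dst_port):
    endpoints = bpf_service_map.get((dst_ip, dst_port))
    return (random.choice(endpoints), 1) if endpoints else (None, 1)

ep_a, cost_a = iptables_lookup("10.0.0.100", 443)  # worst case: last rule
ep_b, cost_b = ebpf_lookup("10.0.0.100", 443)
print(cost_a, cost_b)  # 100 1
```

The gap widens as services are added: the linear scan doubles in cost when the service count doubles; the map lookup does not.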
Key differences: (1) Processing location: both run in the kernel, but iptables rules are evaluated by netfilter hooks deep in the network stack (the kube-proxy userspace process only programs them), while Cilium's eBPF programs run earlier in the datapath (tc/XDP) and are JIT-compiled to native CPU instructions. (2) Lookup speed: iptables traverses chains linearly (and the KUBE-SERVICES chain grows with service count), Cilium uses a BPF map (O(1) hash table). (3) Debugging: iptables rules are visible with iptables -L -n; Cilium requires cilium bpf service list or bpftool map show. (4) Connection tracking: iptables uses netfilter conntrack (each entry costs a few hundred bytes, and the table is bounded by nf_conntrack_max, which can be exhausted under load); Cilium keeps its own conntrack state in fixed-size BPF maps in the kernel, with more predictable memory use and scaling.
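The conntrack memory figures are worth a back-of-envelope check (the ~320 bytes/entry figure below is an assumption covering the entry plus hashtable overhead, not a measured value): even at a million tracked connections, netfilter conntrack costs hundreds of megabytes, not megabytes per connection.

```python
# Assumed per-entry cost for a netfilter conntrack entry, including
# hashtable overhead. Real values vary by kernel version and config.
ENTRY_BYTES = 320

for connections in (10_000, 100_000, 1_000_000):
    mb = connections * ENTRY_BYTES / (1024 * 1024)
    print(f"{connections:>9} connections ~ {mb:7.1f} MiB")
```

The practical limit is usually nf_conntrack_max (table full, packets dropped), not raw memory.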
Practical debugging: To trace the packet path in iptables: iptables -t nat -L -v shows packet counters per rule; an increment means a packet matched that rule. For Cilium: cilium bpf service list shows all services, and bpftool map dump can dump the in-kernel service map (the exact pinned map name, e.g. under /sys/fs/bpf/tc/globals/, varies by Cilium version). To see loaded eBPF programs: bpftool prog list. To trace packets through Cilium's datapath, cilium monitor streams per-packet trace and drop events. For kernel-level tracing you can place a kprobe, e.g.: echo 'p:rxprobe netif_receive_skb' > /sys/kernel/debug/tracing/kprobe_events; echo 1 > /sys/kernel/debug/tracing/events/kprobes/rxprobe/enable; cat /sys/kernel/debug/tracing/trace (shows the packet processing trace).
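The counter-diffing technique for iptables can be automated: snapshot the counters, send the test traffic, snapshot again, and report which rules' packet counts rose. This is a hypothetical parser — the sample output is fabricated but follows the real column layout of iptables -t nat -L -v -n -x (pkts, bytes, target, ...).

```python
def rule_counters(output):
    """Map target-chain name -> packet count from `iptables -L -v -n -x` output."""
    counters = {}
    for line in output.strip().splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():          # skip header lines
            pkts, target = int(parts[0]), parts[2]
            counters[target] = counters.get(target, 0) + pkts
    return counters

before = """\
    pkts      bytes target     prot opt in out source   destination
      10     600 KUBE-SVC-AAA  tcp  --  *  *   0.0.0.0/0 10.0.0.1
       0       0 KUBE-SVC-BBB  tcp  --  *  *   0.0.0.0/0 10.0.0.2
"""
after = """\
    pkts      bytes target     prot opt in out source   destination
      13     780 KUBE-SVC-AAA  tcp  --  *  *   0.0.0.0/0 10.0.0.1
       0       0 KUBE-SVC-BBB  tcp  --  *  *   0.0.0.0/0 10.0.0.2
"""
b, a = rule_counters(before), rule_counters(after)
matched = {t: a[t] - b.get(t, 0) for t in a if a[t] > b.get(t, 0)}
print(matched)  # {'KUBE-SVC-AAA': 3}
```

Here the three test packets matched KUBE-SVC-AAA and nothing else, pinpointing the service chain that handled them.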
Follow-up: Cilium is faster, but you notice a spike in TCP retransmits after migration. Latency is now 100 microseconds (not 10). What would cause this and how would you identify if it's eBPF-related?
After deploying Cilium, your application sees occasional packet drops (0.1%) during network policy enforcement. TCP connections timeout and retry. The drops only happen for inter-pod traffic, not external traffic. Your team suspects eBPF is dropping packets incorrectly.
eBPF packet drop in Cilium usually means: (1) NetworkPolicy is rejecting traffic, or (2) eBPF program hit a limit (memory, verifier complexity) and fell back to slow path, or (3) Kernel bug in eBPF JIT (rare).
Diagnosis step 1: Check if NetworkPolicy is causing drops. Cilium enforces both Kubernetes NetworkPolicy and CiliumNetworkPolicy. See active policies: cilium policy get or kubectl get cnp -A. For each policy, count drops: cilium metrics list | grep policy_verdict (look for entries with a denied verdict label; exact metric names vary by version). If the denied counter is high, a policy is rejecting traffic.
Step 2: Identify which policy. Check policy selectors: kubectl get cnp -n namespace -o yaml | grep -A 20 "fromEndpoints\|toEndpoints". Use cilium endpoint list to see which pods match the policy selectors. Use cilium policy trace to trace a packet: cilium policy trace -s pod1 -d pod2. Output shows: "packet allowed/denied by policy X".
Step 3: Verify packet actually dropped vs. delayed. Use ethtool -S eth0 | grep drop on node to check kernel-level drops. If counter doesn't increase, drops are ephemeral (maybe retransmits on timeout). Check application logs for connection timeouts. If timeout happens but no packet drop, issue is latency (eBPF processing overhead), not policy.
If latency: Profile the eBPF datapath. Enable Cilium's policy/drop/trace metrics (set the metrics option in the cilium-config ConfigMap or via Helm values; the exact key varies by version), then query Prometheus for the eBPF execution-time metric your version exports. Alternatively, enable kernel BPF stats (echo 1 > /proc/sys/kernel/bpf_stats_enabled) and read per-program run_time_ns/run_cnt from bpftool prog show. If average execution time approaches a millisecond, the program is doing too much work. Check program size: bpftool prog list shows translated/JITed instruction counts for the cilium programs — unusually large programs may be close to kernel verifier limits.
Solution: (1) Simplify NetworkPolicy: remove overlapping or redundant policies. (2) Raise the policy map capacity if it is overflowing: set bpf-policy-map-max in the Cilium config (check your version's default before changing it). (3) Update Cilium: older versions had performance bugs. Check release notes: cilium version and compare to the latest stable. (4) The kernel verifier's complexity limit is not tunable from userspace — if programs hit it, upgrade the kernel (the instruction limit was raised to 1M in 5.2) or rely on Cilium's own tail-call splitting. Note that echo 1 > /proc/sys/kernel/bpf_stats_enabled does not raise any limit; it only enables the per-program runtime statistics used for profiling.
Follow-up: You've simplified policies and updated Cilium. Drops still occur (0.05%, lower but persistent). You suspect memory allocation in eBPF. How would you trace BPF memory allocations and find leaks?
Your cluster runs Cilium with BPF-based service load balancing (kube-proxy replacement). You observe that service endpoints aren't being updated: a pod is deleted but the service still routes traffic to it (which fails). Cilium is not picking up the pod deletion for 3+ minutes after kubectl delete pod.
Cilium watches Kubernetes API for pod changes and updates its BPF maps dynamically. If pod deletion is delayed in Cilium, there's a lag in the watch -> update -> BPF sync pipeline. Check Cilium's sync status: cilium service list and manually verify against kubectl get endpoints service-name. If Cilium service endpoints don't match kubectl get endpoints, Cilium's cache is stale.
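The manual comparison between Cilium's view and the Kubernetes endpoints can be scripted as a set diff (a sketch — it assumes you have already parsed both command outputs into sets of "ip:port" strings elsewhere; the addresses below are made up):

```python
def diff_endpoints(cilium_view, k8s_view):
    """Compare Cilium's cached endpoints against the Kubernetes truth."""
    return {
        # In Cilium's BPF-backed view but no longer in Kubernetes: stale.
        "stale_in_cilium": sorted(cilium_view - k8s_view),
        # In Kubernetes but not yet programmed by Cilium: lagging sync.
        "missing_in_cilium": sorted(k8s_view - cilium_view),
    }

cilium_view = {"10.1.2.3:443", "10.1.2.4:443"}   # includes a deleted pod
k8s_view = {"10.1.2.4:443", "10.1.2.5:443"}      # current truth
print(diff_endpoints(cilium_view, k8s_view))
```

A non-empty stale_in_cilium set confirms the cache is behind; a persistently non-empty missing_in_cilium set points at the watch -> update -> BPF sync pipeline lagging.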
Root cause analysis: (1) Cilium agent not receiving pod deletion event. Check cilium logs: kubectl logs -n kube-system -l k8s-app=cilium -c cilium-agent | grep -i "delete\|endpoint\|update" | tail -50. Look for errors like "unable to sync endpoint" or "watch timeout". (2) API server isn't sending watch events. Check kube-apiserver logs for errors: kubectl logs -n kube-system -l component=kube-apiserver | grep -i "watch\|endpoint" | tail -20. If many "watch connection reset", API server connectivity is flaky.
Verify connectivity: From cilium-agent pod, test API connectivity: kubectl exec -it -n kube-system cilium-xxxxx -c cilium-agent -- bash -c "curl -k https://kubernetes.default.svc:443/api/v1/watch/pods". Should get stream of events. If connection refused or timeout, cluster network issue (CNI itself is broken, circular dependency).
Solution: (1) Restart cilium-agent: kubectl rollout restart daemonset/cilium -n kube-system. This re-establishes watch connections and re-syncs all endpoints. (2) Force resync: cilium bpf endpoint list | wc -l should roughly match the pod count on that node; if not, restart the agents for a full sync (kubectl delete pod -n kube-system -l k8s-app=cilium). (3) If still lagging, shorten the stale-endpoint GC interval: set cilium-endpoint-gc-interval (default 5m) to something smaller, e.g. 1m — note this is operator-side garbage collection of stale CiliumEndpoint objects. (4) Check for client-side API rate limiting: the agent's Kubernetes client QPS and burst are configurable (k8s-client-qps / k8s-client-burst in the Cilium config); if the defaults are too low for your cluster's event rate, raise them.
Debugging the BPF map directly: bpftool map dump can show the cached endpoints (bpftool map show lists the Cilium maps; names vary by version). Compare entries to kubectl get endpoints. If BPF holds stale entries, Cilium is not issuing the BPF delete. Check for error logs in cilium-agent showing BPF update failures: kubectl logs -n kube-system -l k8s-app=cilium -c cilium-agent | grep -i "bpf.*error".
Follow-up: You've restarted Cilium agents and endpoints sync correctly now. But during peak traffic, you see stale endpoints come back (deleted pods receiving traffic). Why would this happen intermittently?
Your Cilium cluster has NetworkPolicy enforcing strict ingress rules. But you notice a pod that should be blocked is somehow able to send traffic outbound (egress) to pods in another namespace. The NetworkPolicy has explicit deny-all egress, but the pod violates it. No Cilium errors in logs.
NetworkPolicy enforcement can be bypassed if: (1) Policy is not applied to the pod (label mismatch), (2) Policy is applied but eBPF has stale rules, or (3) Egress is happening through a different network interface (host network, service mesh bypass).
Check pod labels: kubectl get pod pod-name --show-labels. Compare to the NetworkPolicy selector: kubectl get networkpolicy policy-name -o yaml | grep -A 3 podSelector. If the pod labels don't match the selector, the policy won't apply. Example: the pod has tier: frontend but the policy expects app: web — the pod is exempt from the policy.
Verify policy is enforced: cilium endpoint list -o wide | grep pod-name. Shows which policies are active on the pod. If the denying policy is not listed, it's not applied. Update policy selectors: kubectl patch networkpolicy policy-name -p '{"spec":{"podSelector":{"matchLabels":{"tier":"frontend"}}}}' (add matching labels).
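The selector-mismatch case is easy to reason about once you remember the matching rule: matchLabels is an AND over key/value pairs, and a pod matches only if every selector label is present with the same value. A minimal sketch (label names are the examples from this scenario):

```python
def selector_matches(pod_labels, match_labels):
    """NetworkPolicy podSelector.matchLabels semantics: AND over all pairs."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())

pod = {"tier": "frontend", "version": "v2"}
policy_selector = {"app": "web"}       # mismatch: the policy never applies
fixed_selector = {"tier": "frontend"}  # matches once the selector is patched
print(selector_matches(pod, policy_selector),
      selector_matches(pod, fixed_selector))  # False True
```

An empty matchLabels matches every pod in the namespace, which is why an all-pods deny policy uses podSelector: {}.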
Check for stale eBPF rules: eBPF rules are compiled into the kernel. If policy is updated but eBPF cache not invalidated, old rules stick around. Restart cilium-agent to flush: kubectl delete pod -n kube-system -l k8s-app=cilium. Or manually flush: cilium bpf policy flush (if available in your version).
Check pod networking mode: kubectl get pod pod-name -o yaml | grep hostNetwork. If hostNetwork: true, the pod is on the host network and NetworkPolicy doesn't apply (traffic goes through host networking, not Cilium's per-pod datapath). hostNetwork is a pod-spec field, not an environment variable: if it isn't needed, remove hostNetwork: true from spec.template.spec in the Deployment and redeploy.
Check for service mesh bypass: If the pod runs an Envoy sidecar (Istio), the sidecar intercepts traffic, so the path you observe at the application differs from what Cilium sees. Check sidecar injection: kubectl get pod pod-name -o yaml | grep -i istio-proxy. Note that Istio AuthorizationPolicy is enforced in addition to, not instead of, Cilium NetworkPolicy — check both layers: kubectl get authorizationpolicies -A, and make sure the two policy sets agree.
Egress observation: To confirm the egress violation, trace packets from the pod: kubectl exec pod-name -- tcpdump -i eth0 -n "dst net 10.1.0.0/16" (shows outbound traffic toward the other namespace). If traffic appears but the policy denies it, confirm it was actually delivered: check the destination pod for the corresponding inbound packets: kubectl exec dest-pod -- tcpdump -i eth0 -n "src host <pod-ip>" (src host, not src net, for a single IP). If the destination pod doesn't see it, the traffic was dropped at some other layer.
Follow-up: Pod has correct labels, policy is active, but egress still works. You trace and confirm traffic reaches the destination pod. The only thing different: pod is in DaemonSet (not Deployment). Could this matter?
You're using Cilium's identity-based security (instead of IP-based rules). Cilium assigns each pod a security identity and policies match on identity. After scaling a deployment from 1 to 100 replicas, you notice some requests fail sporadically. Cilium logs show "identity mismatch" or "unknown identity" errors for new replicas.
Cilium's identity system: each distinct set of security-relevant labels gets a numeric identity, allocated cluster-wide (via CRDs or a kvstore), so all pods with the same labels share one identity. When a policy references "pods with label app=web", Cilium resolves it to, say, identity 1234, and packets carrying identity 1234 are allowed through. Identity resolution for local endpoints happens in the Cilium agent (which runs on every node).
When scaling up: New pods are created. Cilium agent on the node receives pod creation event. Agent assigns the new pod an identity by computing labels. This normally takes <100ms. But if agent is slow or overloaded, identity assignment can be delayed or race. If a packet is sent before identity is assigned, Cilium sees "unknown identity" (0 or unset) and drops the packet.
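The race described above can be sketched as a toy timeline (illustrative Python, not Cilium code): a packet sent before the agent has resolved the pod's identity carries "unknown" (0) and fails the policy check; the same traffic passes once the identity exists.

```python
UNKNOWN = 0
identities = {}          # pod name -> identity, filled in by the "agent"

def agent_allocates(pod, identity):
    """Agent resolves the pod's label set to a numeric identity."""
    identities[pod] = identity

def send_packet(src_pod, allowed_identities):
    """Datapath check: unknown identity (0) never matches an allow rule."""
    ident = identities.get(src_pod, UNKNOWN)
    return "allow" if ident in allowed_identities else "drop"

allowed = {1234}                         # policy allows identity of app=web
print(send_packet("web-xyz", allowed))   # drop: identity not yet assigned
agent_allocates("web-xyz", 1234)
print(send_packet("web-xyz", allowed))   # allow
```

This is why the failures are sporadic and concentrated on new replicas: the window closes as soon as allocation completes, so only traffic in the first moments after pod start is affected.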
Diagnosis: Check Cilium agent load: kubectl top pod -n kube-system -l k8s-app=cilium --containers=cilium-agent. If CPU/memory near limits, agent is bottlenecked. Check Cilium logs for identity allocation errors: kubectl logs -n kube-system -l k8s-app=cilium -c cilium-agent | grep -i "identity\|allocat" | tail -20. Look for "unable to allocate identity" or delays.
Verify identity assignment: List identities: cilium identity list | grep -A 2 "app=web". New replicas should all have the same identity as old replicas. If they have different identities (e.g., 1234 vs 5678), policy is mismatched. Check why: kubectl get pods -n default -o wide -L app. New pods must have same labels as old pods.
Solution: (1) Increase Cilium agent resources: kubectl set resources daemonset/cilium -n kube-system --limits=cpu=2000m,memory=2Gi --requests=cpu=1000m,memory=1Gi. A less-throttled agent processes pod events faster. (2) If your version exposes identity allocation tuning (cache sizes, allocator timeouts — check cilium-agent --help), adjust it; option names vary across releases. (3) Ramp replicas up gradually (scale in batches, or cap the HPA scale-up rate) so the agent isn't hit with 100 pod-create events at once. Note that a PodDisruptionBudget does not help here — it only limits voluntary evictions, not scale-up.
Test scaling: Scale the deployment: kubectl scale deployment web --replicas=100. Then check identities: because identities are shared per label set, cilium identity list | grep app=web should show a single identity that all 100 replicas resolve to — if new replicas show a different identity, their labels differ from the old pods. Check endpoint status: cilium endpoint list -o wide | grep web should show all 100 pods as "ready" with that same identity.
Debug identity races: watch the agent's view in real time with cilium monitor (trace and policy-verdict events carry identities) and grep the agent logs for identity allocation events: kubectl logs -n kube-system -l k8s-app=cilium -c cilium-agent | grep -i identity. You'll see identity resolution per endpoint. If assignment is slow, the agent logs usually say why (API server delay, slow CRD/kvstore identity allocation, etc.).
Follow-up: Scaling works now. But you notice identity allocation takes 2 seconds (very slow). Meanwhile, older pods have instant (0ms) identity allocation. What's different and how would you optimize?
Your Cilium cluster enforces L7 (application layer) policies: HTTP requests to /admin are denied, but /api/users are allowed. After upgrading Cilium, all HTTP traffic is denied (even /api/users). L4 policies (port-based) still work. No error messages in Cilium logs.
L7 policy requires application-layer parsing, and Cilium does not do that in eBPF: L4 matching (TCP/UDP ports) happens in the eBPF datapath, but L7 rules (HTTP paths and headers, gRPC methods, Kafka topics) are enforced by a per-node Envoy proxy. The eBPF program's job for L7 policy is to redirect the selected flows into that proxy, which parses requests and applies the rules. If all L7 traffic is denied while L4 policies still work, the proxy path is broken or the rules aren't being translated — and anything reaching the proxy without a matching allow rule is denied by default.
Root cause: the upgrade may have changed how policies are interpreted. Check the version: cilium version, then read the upgrade guide for breaking changes in policy handling. In CiliumNetworkPolicy, HTTP rules belong under toPorts[].rules.http (e.g. http: [{method: "GET", path: "/api/.*"}]); a policy written against an outdated or incorrect schema can be silently ignored, in which case traffic redirected to the proxy falls back to the default (deny).
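The default-deny behavior at L7 is worth internalizing: the proxy matches each request path against the policy's allow rules (Cilium treats the path as a regular expression; full-match anchoring in this sketch is an assumption), and anything unmatched is denied. The rules below are illustrative, mirroring this scenario's /api vs /admin split:

```python
import re

# Example allow rules, as an L7 HTTP policy's `path` patterns.
rules = [r"/api/users", r"/api/orders/.*"]

def l7_verdict(path):
    """Allow if any rule's regex matches the request path, else default deny."""
    for pattern in rules:
        if re.fullmatch(pattern, path):
            return "allow"
    return "deny"   # default deny: no rule matched

for p in ("/api/users", "/api/orders/42", "/admin"):
    print(p, l7_verdict(p))
```

This also explains the failure mode after a bad upgrade: if the rule list the proxy receives is empty (policy silently ignored), every request falls through to the deny branch.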
Verify policy syntax: Check the policy YAML: kubectl get cnp -A -o yaml | grep -B 2 -A 5 "rules:" (L7 rules should appear under toPorts[].rules). Compare to the upgrade documentation: https://cilium.io/blog/2023/01/cilium-1.13-release (check for breaking changes). Rewrite the policy in the current schema if needed.
Verify the L7 proxy is enabled: L7 policies require Cilium's embedded Envoy proxy. Check: cilium status | grep -i proxy, or look for the Envoy process inside the Cilium pod (ps aux | grep envoy). If the proxy isn't running, L7 rules cannot be enforced. Enable it: set enable-l7-proxy: "true" in the cilium-config ConfigMap (or the corresponding Helm value), then restart agents: kubectl rollout restart daemonset/cilium -n kube-system.
Debug L7 matching: cilium monitor --type l7 streams L7 request verdicts in real time — you should see each HTTP request with its allow/deny decision. For more detail, set debug: "true" in cilium-config (or run the agent with --debug), restart agents, then check logs: kubectl logs -n kube-system -l k8s-app=cilium -c cilium-agent | grep -i "l7\|http" | head -50. If no requests appear at all, the proxy isn't seeing traffic (a redirection problem, not a rule problem).
Verify traffic reaches the L7 proxy: older Cilium versions redirect traffic to the proxy via iptables (iptables -t nat -L -v | grep -i proxy on the node); newer versions redirect in eBPF. cilium status --verbose reports active proxy redirects, and cilium monitor --type l7 confirms requests arriving. If no redirect exists, the L7 part of the policy was never installed.
Workaround during the incident: temporarily reduce the L7 policy to L4: kubectl patch cnp policy-name --type=merge -p '{"spec":{"ingress":[{"fromEndpoints":[{"matchLabels":{"role":"client"}}],"toPorts":[{"ports":[{"port":"8080","protocol":"TCP"}]}]}]}}' (dropping the rules.http section). This allows L4-only matching (port 8080, no path restriction) until the L7 policy format is fixed.
Follow-up: The L7 proxy is enabled and policies use the current schema. HTTP traffic is still denied. You enable debug logs and see "HTTP request parsing failed". What would cause HTTP parsing to fail in the L7 proxy?
You're running Cilium with eBPF-based service load balancing in a hybrid cluster: 50% bare-metal nodes, 50% cloud VMs. Traffic between pods on bare-metal nodes works at 10Gbps, but traffic crossing cloud VMs is capped at 1Gbps. Cilium is installed identically on all nodes. You suspect eBPF is the issue.
eBPF performance depends on kernel JIT compilation. eBPF programs are compiled to native CPU instructions by the kernel's eBPF JIT. If the JIT is disabled, eBPF runs in interpreter mode (several times slower; many distro kernels now build with CONFIG_BPF_JIT_ALWAYS_ON, which forces the JIT on). Bare-metal and cloud nodes may differ in kernel build, kernel version, or CPU features, so check each class of node.
Check JIT status on nodes: SSH to each node and run: cat /proc/sys/net/core/bpf_jit_enable. Values: 0 = disabled (interpreter), 1 = enabled, 2 = enabled with debug output. Enable the JIT: echo 1 > /proc/sys/net/core/bpf_jit_enable. For a permanent setting: echo "net.core.bpf_jit_enable=1" >> /etc/sysctl.conf && sysctl -p. (On kernels built with CONFIG_BPF_JIT_ALWAYS_ON this reads 1 and cannot be disabled.)
Also check kernel version: uname -r on each node. Older kernels (pre-5.10) have slower eBPF JIT or don't support certain instructions. Cloud VMs might be running older kernel versions. Upgrade: apt-get install linux-image-generic, then reboot.
Verify CPU frequency scaling: cloud VMs may cap CPU frequency (power saving). Check: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq. On bare metal it's often the maximum CPU frequency (e.g., 3.8GHz); on cloud VMs it may be capped at 2.4GHz. To prefer speed over power saving, switch the governor: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (note a hard frequency cap imposed by the hypervisor cannot be raised from inside the guest).
Verify Cilium eBPF program is same: bpftool prog show | grep -i cilium. Compare program size (bytes) between bare-metal and cloud nodes. If sizes differ, different program version deployed. Check Cilium daemonset version: kubectl get daemonset cilium -n kube-system -o yaml | grep image. Should be identical on all nodes. If not, update to match.
Test eBPF performance directly: run iperf3 between pods pinned to bare-metal nodes, then between pods pinned to cloud nodes. Start a server pod (kubectl run iperf-server --image=networkstatic/iperf3 -- iperf3 -s) and a client pod scheduled onto a cloud node via a nodeSelector in the pod spec (kubectl run has no node-selector flag; use --overrides or a manifest), then run iperf3 -c <server-pod-ip> from it. If bandwidth is 1Gbps vs 10Gbps, the datapath on the cloud nodes is the bottleneck. Cilium's Prometheus metrics (forwarded packet and byte counters) can confirm lower packet rates on the cloud nodes.
Cilium-specific tuning: make sure the BPF maps are sized for the load on busy nodes (e.g. bpf-lb-map-max for the load-balancer map, bpf-ct-global-tcp-max for conntrack; check your version's defaults). Note that a BPF hash map lookup is O(1) regardless of size — sizing these maps fixes capacity problems (drops when a table fills), not lookup speed. Cilium already uses eBPF tail calls internally to chain its programs; that is not an optional knob to enable.
Follow-up: JIT is enabled on all nodes. CPU freq is high. You still see 1Gbps throughput on cloud VMs. You notice cloud VMs have virtio networking (virtual device), while bare-metal use physical NICs. Could this be the issue and how would you measure NIC throughput independent of Cilium eBPF?
Your Cilium deployment uses eBPF kprobes to monitor system calls. After running for 3 weeks, kernel memory usage on nodes suddenly spikes 40% (from 800MB to 1.1GB). Cilium agent is still healthy, but kernel is running out of memory. You suspect eBPF memory leak.
eBPF programs consume kernel memory for: (1) BPF maps (hashtables storing state), (2) verifier buffers (during program load), (3) tracing buffers (kprobes/uprobes output). If memory spikes after 3 weeks, it's likely a leak in one of these areas.
Diagnosis: Check kernel memory usage per eBPF component. Use: grep -i vmalloc /proc/meminfo (VmallocUsed) for overall kernel allocations, and bpftool map show for BPF maps (recent bpftool reports per-map locked memory; bpftool -j map show exposes it as bytes_memlock). For a given map, check occupancy: bpftool map dump name map-name | wc -l (entry count). If the count grows over time without corresponding pod or connection activity, it's a leak.
Common causes: (1) A BPF map not cleaning up old entries. Example: Cilium stores per-connection state in its conntrack map. If entries aren't expired when connections close, they accumulate. Check: cilium bpf ct list global | wc -l. If there are 100K+ entries when only 100 pods exist, connection state is leaking. (2) Kprobe buffer pressure: if tracing output is too verbose, ring buffers fill and memory is consumed queuing events. Check: cat /sys/kernel/debug/tracing/instances/*/trace | wc -l. (3) Verifier memory: rarely an issue (it is freed after program load), but possible if programs are reloaded very frequently.
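Cause (1) is easy to model: without expiry, a connection-tracking table grows linearly forever; with a TTL it stabilizes around rate × TTL. A toy simulation (numbers are illustrative, not Cilium's actual timeouts):

```python
def simulate(seconds, conns_per_sec, ttl=None):
    """Return the table size after `seconds` of new connections,
    optionally expiring entries older than `ttl` seconds."""
    table = {}   # conn_id -> creation time
    for t in range(seconds):
        for i in range(conns_per_sec):
            table[(t, i)] = t
        if ttl is not None:
            table = {k: v for k, v in table.items() if t - v < ttl}
    return len(table)

print(simulate(600, 100))            # no expiry: 60000 entries and climbing
print(simulate(600, 100, ttl=60))    # 60s TTL: steady state ~6000 entries
```

This is the signature to look for when sampling the real map: a leak grows without bound over hours, while a healthy map with working timeouts plateaus.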
Solution 1 - Bound map sizes: tune BPF map capacity, e.g. cilium config set bpf-policy-map-max 8192 (check your version's default first). Smaller maps cap memory but will drop new entries when full, so monitor occupancy. To track total Cilium map memory, sum the locked bytes reported by bpftool -j map show (requires a bpftool/kernel recent enough to report bytes_memlock).
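Summing per-map memory can be scripted against bpftool's JSON output. The sample JSON below is fabricated, but uses the real field names bpftool emits on recent kernels (name, bytes_memlock); in practice you would feed it the output of bpftool -j map show.

```python
import json

sample = json.loads("""[
  {"id": 10, "name": "cilium_lb4_serv", "bytes_memlock": 4096},
  {"id": 11, "name": "cilium_ct4_glob", "bytes_memlock": 1048576},
  {"id": 12, "name": "other_map",       "bytes_memlock": 8192}
]""")

# Sum locked memory across Cilium's maps only (names start with "cilium").
cilium_bytes = sum(m["bytes_memlock"] for m in sample
                   if m.get("name", "").startswith("cilium"))
print(cilium_bytes)  # 1052672
```

Run periodically (cron or a node-exporter textfile collector), this gives the time series needed to spot a slowly leaking map.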
Solution 2 - Rely on entry expiration: there is no generic per-map TTL knob, but Cilium's conntrack entries already expire on timeouts (tunable via options such as bpf-ct-timeout-regular-tcp), and several maps are LRU maps that evict the least-recently-used entry when full. Verify your version's timeout settings rather than assuming entries live forever, and confirm expiry is actually shrinking the map over time (bpftool map dump name map-name | wc -l sampled at intervals).
Solution 3 - Reduce tracing verbosity: if you run syscall/kprobe tracing alongside Cilium (e.g. Cilium Tetragon or custom probes), reduce its scope or disable it — the Cilium agent itself does not trace syscalls. For Cilium's own datapath events, raise monitor aggregation (monitor-aggregation: maximum in the config) to cut per-packet event volume and buffer pressure.
Solution 4 - Restart Cilium agents: Force reload all eBPF programs (clears verifier buffers and transient state): kubectl rollout restart daemonset/cilium -n kube-system. Before restart, check kernel memory: free -h | grep Mem. After restart, monitor memory drop (should be immediate if leak is in verifier buffers).
Long-term: Monitor BPF map memory continuously (e.g. export bytes_memlock from bpftool into Prometheus, or use whatever BPF memory metric your monitoring stack provides) and alert above a threshold such as 1GB. Use cilium metrics list | grep memory for agent-level memory tracking. Also audit the Cilium upgrade history — memory leaks are often fixed in newer versions; compare cilium version to the latest stable release (check GitHub releases).
Follow-up: You've restarted Cilium and memory returns to normal. But after 1 week, memory spikes again. You find a Cilium CiliumNetworkPolicy with 10K rules (accidentally created by script). How does policy complexity affect eBPF memory?