Production scenario: Your Kubernetes API server latency jumped from 80ms to 5+ seconds overnight. Metrics show etcd has 50K active watches. Your observability team says CPU is pegged on the etcd leader node. You're on-call at 3 AM.
First, I'd verify etcd health and watch load: etcdctl --endpoints=https://etcd-0.etcd:2379 --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status, and etcdctl alarm list with the same TLS flags. A high watch count means API server clients (controllers, webhooks, kubectl get --watch) are streaming changes—each watch is a gRPC stream consuming memory and CPU on the etcd member serving it.
Root causes: (1) Unfiltered watches—a controller watching all pods in all namespaces; (2) Network flaps causing watch reconnects and expensive re-lists; (3) Slow consumers—a controller processing events slowly, forcing events to pile up in per-watch buffers. Note there is no kube-apiserver flag that directly caps per-watch memory; instead, a watcher that falls behind the watch cache window has its watch closed with 410 Gone and must re-list, which adds even more load.
Immediate mitigation: Check kube-apiserver logs for watch errors (e.g. "watch chan error" or "required revision has been compacted")—signs watchers are falling behind. Adding etcd members does not add write capacity (quorum cost grows), though spreading client endpoints across members can spread watch fan-out; and --max-request-bytes on etcd bounds write request size, not watch load, so raising it won't help here. For active controllers, scope the watches: add namespace scoping or field selectors—fieldSelector=metadata.namespace=prod (or simply watching only that namespace) can cut delivered events by an order of magnitude.
Chronic fix: Enable audit logging on the API server to identify the problematic controller: --audit-policy-file=/etc/kubernetes/audit-policy.yaml with a Metadata-level rule for watch verbs. Keep --watch-cache=true and tune per-resource cache sizes with --watch-cache-sizes (e.g. pods#5000; --default-watch-cache-size defaults to 100) so watchers are served from the API server cache rather than etcd. Monitor with apiserver_watch_events_total and etcd_server_has_leader metrics.
Follow-up: A webhook is processing events at 1 event/sec but you have 10K pod creates/sec. How would you redesign this to not tank etcd?
Your team deployed a new monitoring system that queries the API every 5 seconds for all pods, all nodes, and all endpoints across the cluster. After 10 minutes, the API server becomes unresponsive and developers report connection refused errors.
This is request and connection exhaustion on the API server. Each list-then-watch cycle holds a connection, and every full-cluster LIST of pods, nodes, and endpoints is an expensive read. With 100 workers querying every 5 seconds, the inflight-request limits (--max-requests-inflight, default 400) saturate, the API server starts returning 429s, and once node-level connection and file-descriptor limits are exhausted too, new clients see outright connection refused.
Check kube-apiserver metrics: apiserver_current_inflight_requests, and the key diagnostic apiserver_request_duration_seconds_bucket—if LIST requests are slow (>1s), the big LISTs are bypassing or thrashing the watch cache. SSH to the API server and run ss -tan | grep ESTAB | wc -l—compare the count against the node's file-descriptor and conntrack limits to confirm exhaustion.
Emergency: Restart kube-apiserver with increased --max-requests-inflight=2000 --max-mutating-requests-inflight=1000 (defaults 400/200). For the monitoring tool: implement exponential backoff plus jitter on reconnect. Use field selectors so filtering happens server-side: ?fieldSelector=status.phase=Running. Better: switch from polling to informer-based caching—client-go's cache.NewSharedIndexInformer maintains a local cache from a single watch, turning the 5-second polls into local reads.
Long-term: Implement client-side rate limiting (client-go's flowcontrol package provides a token-bucket limiter, flowcontrol.NewTokenBucketRateLimiter). Break down apiserver_request_total{verb="LIST"} by user agent, or use audit logs, to catch offenders—the metric itself has no source-IP label. Use API Priority and Fairness (enabled by default in recent Kubernetes) with FlowSchema and PriorityLevelConfiguration objects to isolate monitoring traffic from critical workloads.
Follow-up: You've enabled APF. How would you configure a FlowSchema and PriorityLevelConfiguration to prevent monitoring queries from starving critical controllers?
Your API server shows watch events taking 200ms to propagate from etcd to clients. Developers say kubectl describe takes 3 seconds even though it should return fresh data. You trace it to etcd compaction stalling the backend for 40 seconds every hour.
etcd watch latency follows the write path: writes commit to the raft log, get applied to the MVCC store, then are fanned out to watchers. Historical-revision compaction runs against the same backend, and on large keyspaces its batches can stall applies, spiking watch latency. Check etcd logs: grep -i compaction etcd.log | tail -5. Verify with metrics: etcd_server_slow_apply_total and the compaction pause histogram (etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds).
Compaction tuning: with Kubernetes, the API server normally requests compaction every 5 minutes (--etcd-compaction-interval), and each run can take 30-60s on a large database. To let etcd drive it instead, set --auto-compaction-mode=periodic --auto-compaction-retention=1h (compact hourly, keep 1h of history) to reduce frequency. Monitor DB size via etcdctl endpoint status. Compaction alone does not shrink the file—if the DB is several GB, run defrag off-peak to reclaim space: etcdctl defrag --endpoints=https://etcd-0:2379 (defrag blocks the member, so do one member at a time).
Watch propagation: also check gRPC keepalive settings. etcd's --grpc-keepalive-min-time (default 5s) rejects clients that ping more frequently—if a client's keepalive interval is below it, etcd resets the connection and the watch must be re-established. On the API server side, --etcd-compaction-interval=5m controls how often the API server requests compaction; lengthen or stagger it if compactions land during peak traffic (--etcd-count-metric-poll-interval only polls object counts and has nothing to do with compaction). For immediate relief: temporarily disable auto-compaction (--auto-compaction-retention=0), run defrag, then re-enable.
Follow-up: Defrag succeeds but database grows from 500MB to 2GB in a week. What pattern is causing this and how would you investigate etcd key-value distribution?
A new custom controller in your cluster creates a watch on ConfigMaps with a label selector. After 1 week, the controller starts crashing with OOMKilled. etcd shows 100K active watches from this controller (one per pod somehow). The controller only has 1 replica.
The controller is creating watches in a tight loop without cleanup or deduplication. Each Watch call opens a server-side watcher; if the returned watcher is never closed, they accumulate. Check the controller code for a missing defer cancel() on the watch context, or a watch.Interface whose Stop() is never called. If using client-go informers, verify informer.Run(stopCh) and close(stopCh) are properly paired on shutdown.
Diagnosis: Exec into the controller pod and check open file descriptors: lsof -p $(pidof controller-name) | wc -l—a huge count confirms the leak on the client side. Check controller logs for a panic/restart loop—a crashing process drops its TCP connections, but etcd may not reap the dead watchers until keepalive timeouts fire, so rapid restarts can compound the server-side count. Grab a heap profile: go tool pprof http://localhost:6060/debug/pprof/heap (requires pprof enabled)—look for gRPC stream buffers dominating the heap.
Immediate fix: Add defer cancel() on every watch context path. Patch the controller to check whether a watch for a given (namespace, name) already exists before creating a new one. Restart the controller and confirm its memory (kubectl top pod, or RSS via ps) stabilizes. Note that etcd watches are not attached to leases—lease revoke will not close them; server-side watchers are torn down when the owning client connection closes, which the restart accomplishes.
Long-term: Add watch metrics to the controller: prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "active_watches"}, []string{"resource"}). Set an alert: active_watches > 100. Use distributed tracing (e.g. Jaeger) to follow the watch lifecycle. Enforce tests: mock the watch interface and assert Stop()/cancel is always called.
Follow-up: The leak is in an informer factory. How does DefaultResyncPeriod interact with informer cache refresh and could it trigger duplicate watches?
Your API server watch events contain 10MB objects (large ConfigMaps with binary data). Network shows 40% packet loss between API server and etcd client watching from a pod. Watchers start failing with connection reset by peer.
Large watch events (10MB) over an unreliable network cause TCP segment loss and timeouts. Each watch event is an etcd Event proto streamed over gRPC; if the pod's egress path drops packets, the TCP retransmit queue fills and the connection times out. Check pod network loss: kubectl exec pod -- ping -c 200 -i 0.01 etcd-service | grep loss. If loss >5%, investigate: network policy, the CNI plugin (Calico/Cilium), the node network path.
etcd side: the default --grpc-keepalive-time=2h means etcd sends no keepalives for two hours, so dead connections linger silently—lower it to 30s and set --grpc-keepalive-timeout=10s. For large events, --max-request-bytes caps write request size (default ~1.5MB), so 10MB objects imply it was already raised—make sure the client's max receive message size matches. Raise client timeouts—etcdctl --dial-timeout=30s --command-timeout=30s—so TCP retransmission has time to recover on a lossy path.
Solution: Note that compaction (--auto-compaction-mode=periodic) only trims old revisions; it does not compress or shrink individual watch events, so it won't fix this. On the client (pod): implement watch reconnect logic with exponential backoff. Better: don't send 10MB in a single event. Split the ConfigMap into smaller pieces (<1MB each, which is also the practical Kubernetes object size limit) or store the binary data in object storage (e.g. S3) and keep only a reference in the ConfigMap—the standard pattern for large payloads.
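The splitting step is mechanical; a stdlib sketch of chunking a payload into ConfigMap-sized pieces (sizes and names illustrative):

```go
package main

import "fmt"

// splitChunks slices data into pieces of at most chunkSize bytes, so
// each piece fits in its own ConfigMap and watch events stay small.
// The returned slices alias the input; copy them if data is mutated.
func splitChunks(data []byte, chunkSize int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	blob := make([]byte, 10<<20)       // a 10MB payload
	chunks := splitChunks(blob, 1<<20) // 1MB pieces
	fmt.Println(len(chunks))           // 10
}
```

Name the pieces deterministically (e.g. app-config-0 … app-config-9) and store the chunk count plus a checksum in an index ConfigMap so consumers can reassemble and validate.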
Network policy check: kubectl describe networkpolicies -n namespace—verify no egress rules are dropping traffic. If fragmentation is occurring, reduce the pod network MTU so segments fit the path: ip link set dev eth0 mtu 1400 (inside the pod).
Follow-up: You split the ConfigMap into 10 pieces to reduce event size. Watch latency still 200ms. What else could be the bottleneck between etcd write commit and your client receiving the event?
You upgraded etcd from v3.4 to v3.5. Suddenly, API server watchers are receiving duplicate events for the same object update. Some watchers see the new value, others see stale. Consistency is broken.
etcd guarantees that a single watcher receives events in revision order, so duplicates or out-of-order delivery point at either a version-specific bug or reconnect behavior: a watcher that reconnects from an older start revision legitimately replays events it has already delivered. Early v3.5 patch releases also shipped known bugs in the apply/watch path that were fixed in later 3.5.x patches—check the etcd changelog for your exact patch level.
Verify: Check watchers for events with the same revision but different values. Use etcdctl to manually watch: etcdctl watch key, trigger an update, verify the revision increments monotonically. If not, you've hit a bug. Check the etcd version: etcdctl version. Review the watch logic in client code: ensure you're deduplicating by (key, revision), not just key.
Fix: Upgrade etcd to the latest v3.5 patch release—multiple watch-path and durability fixes landed across the 3.5.x series (check the changelog against your symptom). Upgrade via rolling restart: take one member at a time, upgrade the binary, restart, and wait for the member to rejoin healthy before the next (a 3-member cluster keeps quorum with one member down). Monitor: etcd_server_has_leader should stay 1 throughout.
Workaround if you can't upgrade: note that --unsafe-no-fsync=false is the default and governs durability, not event ordering—leave it alone; it won't shrink any reorder window. The real workaround is client-side: keep a dedup cache of seen (key, revision) pairs, skip duplicates, and buffer and sort incoming events by revision before applying state changes.
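The client-side dedup-and-order step might look like this stdlib sketch (the event struct is a stand-in for the real etcd event type):

```go
package main

import (
	"fmt"
	"sort"
)

// event is a pared-down watch event: the key that changed and the etcd
// revision at which it changed.
type event struct {
	key      string
	revision int64
}

// orderAndDedup sorts a batch of events by revision and drops any
// (key, revision) pair already recorded in seen, so replays after a
// reconnect are processed exactly once and in revision order.
func orderAndDedup(events []event, seen map[event]bool) []event {
	sort.Slice(events, func(i, j int) bool { return events[i].revision < events[j].revision })
	out := events[:0]
	for _, e := range events {
		if seen[e] {
			continue
		}
		seen[e] = true
		out = append(out, e)
	}
	return out
}

func main() {
	seen := map[event]bool{}
	batch := []event{{"a", 7}, {"a", 5}, {"a", 7}} // out of order, with a duplicate
	fmt.Println(orderAndDedup(batch, seen))        // [{a 5} {a 7}]
}
```

In production, cap the seen map (e.g. evict entries older than the last compacted revision) so the dedup cache itself cannot leak.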
Follow-up: How would you test your watcher deduplication logic to handle out-of-order events from etcd? What chaos scenario would catch this bug in CI?
Your Deployment controller is watching Pods, ReplicaSets, and Deployments in parallel. It processes 100K watch events/sec but runs a full list-compare every 30 seconds to detect missed events. This list scan takes 5 seconds and blocks all watch processing, causing cascading reconciliation failures.
The periodic list-sync (resync loop) is a known pattern in operators to detect events that etcd watch may have dropped or that were missed during controller restart. But doing it synchronously blocks event processing. The controller should run resync on a separate goroutine with backpressure (bounded work queue).
Check the client-go usage: cache.NewSharedIndexInformer takes a resyncPeriod (0 = never; note that informer resync replays the local cache to handlers—it does not re-list from the API server, so a custom full LIST-and-compare is a separate code path). If that periodic work runs on the same goroutine as the watch handler, the controller blocks. Refactor: route watch events and resync events into separate queues (workqueue.NewRateLimitingQueue) and rate-limit the resync path to a fraction of worker capacity.
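One way to keep resync from ever blocking watch processing is two bounded queues with strict priority for live events—a stdlib sketch of the idea (a deliberate simplification of what client-go's workqueue gives you):

```go
package main

import "fmt"

// dispatcher keeps watch events and resync events in separate bounded
// queues so a slow resync can never block live watch delivery. Resync
// items are dropped on overflow and retried next period.
type dispatcher struct {
	watchQ  chan string
	resyncQ chan string
}

func newDispatcher(watchCap, resyncCap int) *dispatcher {
	return &dispatcher{
		watchQ:  make(chan string, watchCap),
		resyncQ: make(chan string, resyncCap),
	}
}

// enqueueWatch reports whether the live event was accepted.
func (d *dispatcher) enqueueWatch(key string) bool {
	select {
	case d.watchQ <- key:
		return true
	default:
		return false
	}
}

// enqueueResync drops on overflow: resync is best-effort background work.
func (d *dispatcher) enqueueResync(key string) bool {
	select {
	case d.resyncQ <- key:
		return true
	default:
		return false
	}
}

// next prefers watch events; a resync item is handed out only when no
// live event is pending. The bool reports whether the item was live.
func (d *dispatcher) next() (string, bool) {
	select {
	case k := <-d.watchQ:
		return k, true
	default:
	}
	select {
	case k := <-d.resyncQ:
		return k, false
	default:
		return "", false
	}
}

func main() {
	d := newDispatcher(8, 2)
	d.enqueueResync("resync/pod-a")
	d.enqueueWatch("watch/pod-b")
	k, live := d.next()
	fmt.Println(k, live) // watch/pod-b true: live events win
}
```

The small resync queue capacity is the backpressure: when resync can't keep up, it sheds its own load instead of starving reconciliation.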
Immediate: Increase the resync interval to hours, or disable it (0), if the watch path is reliable—informers already recover from dropped watches by re-listing on 410 Gone. Verify watch health by confirming apiserver_watch_events_total is stable and monotonically increasing. If the periodic full compare is needed for correctness (custom resource tracking), run it in the background: a dedicated worker pool, gated by a semaphore allowing one concurrent scan, feeding its own queue while watch handling stays on separate goroutines.
Long-term: Bound and pace the full list scan: give each scan a deadline, use paginated LISTs (limit/continue) so no single request materializes 100K objects, and skip a scan when nothing changed (compare the list's resourceVersion against the previous scan's). Only force a resync when you actually detect missed watch events, e.g. by tracking gaps between expected and received revisions. Track controller_resync_duration_seconds and alert if resync takes >5s.
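Revision-gap detection can be as simple as remembering the last delivered revision; a stdlib sketch, assuming a watch where consecutive revisions are expected (for a ranged watch you would instead compare against the revision returned by a periodic paginated list):

```go
package main

import "fmt"

// gapDetector tracks the last revision delivered by the watch; a jump
// of more than one revision means events were missed and a resync
// should be scheduled.
type gapDetector struct {
	lastRev int64
}

// observe records a delivered revision and reports whether a gap was
// detected relative to the previous one.
func (g *gapDetector) observe(rev int64) bool {
	gap := g.lastRev != 0 && rev > g.lastRev+1
	if rev > g.lastRev {
		g.lastRev = rev
	}
	return gap
}

func main() {
	var g gapDetector
	fmt.Println(g.observe(10)) // false: first observation, nothing to compare
	fmt.Println(g.observe(11)) // false: contiguous
	fmt.Println(g.observe(15)) // true: revisions 12-14 were missed
}
```

When observe returns true, enqueue a resync item for the affected range rather than triggering an immediate full scan.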
Follow-up: Resync is now async and takes 10 seconds. You detect a missed event at T=30s but the queue is full. How would you prioritize it over background resync events?
You're migrating to a new Kubernetes cluster and running dual API servers (old and new) for 1 week. Clients are split 50/50 between them. Both watch the same etcd. After 3 days, new cluster API server lags 5+ seconds behind old cluster on event delivery. etcd revision numbers match, but delivery is slow.
Both API servers connect to same etcd cluster and watch the same events. Latency difference means the new API server is slower at processing watch callbacks or its gRPC connection to etcd is saturated. Check: (1) CPU usage on new API server vs old—if new is 90%+ and old is 40%, new is bottlenecked. (2) Network: ethtool -S eth0 | grep drop on new server—if packets dropped, network bottleneck. (3) etcd connection pool: netstat -an | grep ESTABLISHED | grep etcd | wc -l on new server—if different from old, check API server concurrency flags.
Common cause: the new API server launched with different flags—e.g. a different --etcd-servers-overrides or a lower --max-requests-inflight. Verify the startup flags are identical: ps aux | grep kube-apiserver on both. If the new one has lower concurrency, raise --max-requests-inflight to match the old (default 400). Also compare --etcd-compaction-interval (default 5m)—if it differs, the new server triggers compaction at different times, and a compaction landing mid-peak can delay event delivery.
Investigation: Enable request tracing on the new API server: --audit-policy-file plus an audit webhook to an external collector, tracing individual watch requests (request ID, received time, etcd round-trip time). Compare logs between servers to find where the gap opens. Check the API server's etcd client latency metrics (etcd_request_duration_seconds, broken down by operation)—if p99 is high on the new server, the bottleneck is between it and etcd, not in its handlers.
Solution: Match API server flags exactly (generate both from one template). Run a load test: replay old-cluster traffic against the new cluster and measure latency—if it's still slow at equal load, it's a config issue, not capacity. After verifying both serve identically, shift traffic gradually. Monitor histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb="WATCH"}[5m])) on both to confirm parity before cutting over 100%.
Follow-up: New API server has identical flags but still lags. You notice it's using a different etcd endpoint (etcd-2 vs etcd-0). How would etcd member latency affect watch delivery speed, and how would you measure this?