AWS Interview Questions

EKS Architecture: Control Plane and Data Plane


Your EKS managed node group has a max size of 100 nodes. During a traffic spike, Kubernetes Cluster Autoscaler attempts to scale up but stops at 50 nodes. Pods remain in Pending state. The node group reaches 50 nodes and doesn't scale further. Explain why and how to fix.

The issue is likely one of these limits: (1) EC2 vCPU quotas — AWS accounts have per-family On-Demand vCPU quotas (for C-family instances this is the "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" quota, code L-1216C47A). If your node instances use 8 vCPUs each and you request 100 nodes, that's 800 vCPUs; a 400-vCPU quota stops scaling at exactly 50 nodes. Check quotas: `aws service-quotas list-service-quotas --service-code ec2 --query 'ServiceQuotas[?contains(QuotaName, `Running`)]'`. Request an increase: `aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 800` (enough for 100 nodes at 8 vCPUs each). (2) VPC subnet capacity — if the node group's subnets have only 50 available IP addresses between them, autoscaling stops at 50 nodes. Check: `aws ec2 describe-subnets --subnet-ids subnet-xxxxx | jq '.Subnets[0].AvailableIpAddressCount'`. If low, add subnets to the node group or recreate them with larger CIDR blocks (subnet CIDRs cannot be resized in place). (3) ASG max size misconfiguration — the node group's stated max and the underlying ASG's can drift; verify with `aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg | jq '.AutoScalingGroups[0].MaxSize'`. (4) IAM permissions for Cluster Autoscaler — verify its role can call `ec2:DescribeInstances`, `autoscaling:DescribeAutoScalingGroups`, `autoscaling:SetDesiredCapacity`, etc. Check attached policies: `aws iam list-attached-role-policies --role-name eks-autoscaler-role`. (5) Cluster Autoscaler errors — check its logs: `kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50`. Look for messages like "Failed to increase node group size" or quota errors such as `VcpuLimitExceeded`. Most common cause: the vCPU quota. Request an increase immediately — AWS typically approves these within hours. For production, keep the vCPU quota well above peak expected node count.
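The arithmetic behind cause (1) is worth scripting as a sanity check before opening a quota ticket. A minimal sketch — the function name is invented for illustration; the real quota value comes from Service Quotas, not from this script:

```shell
# Hypothetical helper: given an On-Demand vCPU quota and the vCPU count
# of your node instance type, compute the ceiling the quota places on
# node count. Fetch the real quota value with:
#   aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A
max_nodes_for_quota() {
  local quota_vcpus=$1
  local vcpus_per_node=$2
  echo $(( quota_vcpus / vcpus_per_node ))   # integer division: whole nodes only
}

max_nodes_for_quota 400 8   # prints 50 — a 400-vCPU quota caps 8-vCPU nodes at exactly 50
```

If the result matches the node count where scaling stalls, the quota is almost certainly the culprit.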

Follow-up: You increase vCPU limits to 1000. Node group still scales only to 50 nodes. Cluster Autoscaler logs show no errors. You check available IPs in subnets — plenty available (1000+). What else?

You're running a stateless API workload on EKS with Horizontal Pod Autoscaler (HPA) configured to scale based on CPU. During a traffic spike (2x normal), HPA scales pods from 10 to 50. But nodes don't scale fast enough. Pods are in Pending state for 3 minutes waiting for nodes. How do you improve this?

HPA scales pods faster than Cluster Autoscaler can provision nodes (typically 3-5 minutes from scale-up decision to node Ready). This creates a window where pods are Pending. To reduce it: (1) Use Karpenter instead of Cluster Autoscaler — Karpenter provisions EC2 instances directly (no ASGs, no per-instance-type node groups) and can have new nodes joining in well under a minute, versus several minutes for ASG-based scaling. It also consolidates and removes underutilized nodes faster. Deploy: `helm install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace karpenter --create-namespace` (it additionally needs an IAM role and NodePool/EC2NodeClass configuration). (2) Keep warm headroom — overprovision with low-priority placeholder pods that get preempted when real workloads arrive, so spare nodes exist before the spike; Reserved Instances or Savings Plans make that always-on headroom cheaper. Not ideal if spikes are unpredictable. (3) Use Fargate for burst capacity — Pending pods matching a Fargate profile launch without waiting for node provisioning. Configure: `aws eks create-fargate-profile --cluster-name my-cluster --fargate-profile-name burst-profile --selectors namespace=default`. Fargate costs more per vCPU-hour but removes node-capacity waits. (4) Raise HPA's target CPU threshold — e.g., scale at 85% instead of 70%. This reduces scaling churn and gives node provisioning time to keep up, at the cost of less headroom per pod: `kubectl patch hpa my-hpa -p '{"spec":{"targetCPUUtilizationPercentage":85}}'`. (5) Tune Cluster Autoscaler — a shorter `--scan-interval` (default 10s) detects Pending pods sooner; `--scale-down-delay-after-add` (default 10m) controls how soon freshly added nodes become removal candidates. (6) Mix Spot with on-demand — Spot instances are cheaper but interruptible; both Karpenter and Cluster Autoscaler can mix the two. Most effective for stateless APIs: Karpenter (fast provisioning) or Fargate (no node wait). Karpenter is the usual production choice because it's more efficient and cost-effective than Cluster Autoscaler plus multiple ASGs.
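A hedged sketch of what the Karpenter configuration mentioned above might look like, assuming the Karpenter v1 CRDs; the pool name, limits, and architecture are placeholders, and an `EC2NodeClass` named `default` is assumed to exist separately:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default            # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default      # assumes a separately defined EC2NodeClass
      requirements:
        # Let Karpenter choose between Spot and On-Demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "800"             # hard cap on total vCPUs this pool may provision
  disruption:
    # Consolidate underutilized nodes shortly after they become idle
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Because requirements are expressed as constraints rather than a fixed instance type, Karpenter can pick whichever matching instance launches fastest and cheapest — the main reason it outpaces per-ASG scaling.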

Follow-up: You deploy Karpenter and scale time drops to 30 seconds. Pending pod time is now acceptable. But your cluster runs 80 nodes during peak and only 10 during off-peak (costing $5000/month). How do you optimize for cost?

Your EKS cluster's control plane metrics show etcd database CPU at 90% utilization during normal operations (not spikes). The control plane is AWS-managed (you don't manage the master nodes). Workload performance is degrading, with API calls (kubectl get pods) taking 2-3 seconds. How do you diagnose and fix?

The EKS control plane is managed by AWS, but you can observe it via control plane logs and metrics. High etcd CPU during normal operations suggests one of: (1) Excessive API requests — applications or tooling making too many Kubernetes API calls, especially long-running watch streams (kubectl watch, controllers, and operators all open them; too many overload the API server and etcd). Enable control plane audit logging to see who is calling what, and inspect API server metrics with `kubectl get --raw /metrics`. (2) Large object count — with hundreds of thousands of objects (pods, deployments, configmaps), etcd spends CPU indexing and serving them. Check per-resource counts with `kubectl get --raw /metrics | grep apiserver_storage_objects` (note that `kubectl get all -A` covers only a handful of resource types). If counts are very large, consider namespace cleanup or splitting into multiple clusters. (3) etcd backups — AWS backs up etcd automatically and you cannot control the timing; this is rarely the cause, but degradation at regular times of day is worth correlating against the control plane log groups: `aws logs describe-log-groups --log-group-name-prefix /aws/eks/my-cluster`. (4) Control plane scaling lag — EKS scales the control plane automatically based on load; you don't choose instance sizes, and sustained heavy load can outpace it (note that `aws eks describe-cluster --name my-cluster | jq '.cluster.logging.clusterLogging'` shows which log types are enabled, not a control plane "tier"). If client-side causes are ruled out, open an AWS support case. (5) In-cluster add-ons — metrics-server, the VPC CNI, and autoscalers all talk to the API server; check their logs: `kubectl logs -n kube-system -l k8s-app=metrics-server`. Fixes: (1) Reduce watch pressure — use shared informer caches instead of raw watch/list loops in your applications. (2) Clean up dead objects — `kubectl delete pods -A --field-selector=status.phase=Failed` removes failed pods clogging etcd. (3) Upgrade the EKS cluster version — newer Kubernetes releases include etcd and API server optimizations. (4) Split workloads across multiple smaller clusters. Most common cause: excessive API calls from metrics collection or controllers that re-list instead of using informer caches.
Note that `kubectl top nodes` and `kubectl top pods -A --sort-by=memory | head -20` measure data-plane usage; for the control plane itself, rely on the API server `/metrics` endpoint and the EKS control plane logs in CloudWatch.
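Control plane logging has to be switched on before the API and audit logs appear in CloudWatch. A minimal sketch using an eksctl ClusterConfig — the cluster name and region are placeholders:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster       # placeholder
  region: us-east-1      # placeholder
cloudWatch:
  clusterLogging:
    # Log types most useful for diagnosing API/etcd load:
    # "api" shows request volume, "audit" shows who is making the calls.
    enableTypes: ["api", "audit", "controllerManager"]
```

The same result can be achieved imperatively with `eksctl utils update-cluster-logging --cluster my-cluster --enable-types api,audit --approve`.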

Follow-up: You check API requests and find that metrics-server is making 1000s of calls/sec to kubelet on each node, querying pod metrics. You scale down metrics collection, but etcd CPU drops only 30%. What else is consuming etcd CPU?

You're rolling out a new container image to your EKS deployment. Rolling update creates new pods with the new image while terminating old ones. During the update, some requests fail with 503 errors even though there are always pods ready to serve traffic. Explain the issue and fix.

The 503 errors during rolling updates are likely due to: (1) The load balancer still routing to terminating pods — when a pod is deleted it enters Terminating and receives SIGTERM, but ALB/NLB target deregistration is asynchronous and can lag by several seconds, so requests land on a pod that has already begun shutting down. Fix: add a `preStop` hook that sleeps (e.g., 10-15 seconds) so the container keeps serving while the load balancer deregisters it; with the AWS Load Balancer Controller in IP target mode, also enable its pod readiness gates so new pods only count as ready once registered in the target group. (Note: a PodDisruptionBudget — `kubectl create pdb my-pdb --selector=app=my-app --min-available=2` — protects availability during node drains and other voluntary evictions, but does not govern Deployment rollouts.) (2) New pods not ready before old pods terminate — if a new container takes 5 seconds to become ready but the rollout terminates old pods sooner, there's a gap with too few serving pods. Fix: set a readinessProbe and an update strategy of `spec.strategy.rollingUpdate.maxUnavailable: 0` with `maxSurge: 2`, so Kubernetes only removes an old pod after a surge pod reports ready. (3) terminationGracePeriodSeconds too short — Kubernetes sends SIGTERM and waits terminationGracePeriodSeconds (default 30s) before force-killing. If the container needs longer to drain in-flight work, requests fail during shutdown. Increase: `spec.template.spec.terminationGracePeriodSeconds: 60`. (4) ALB connection draining not configured — the target group's deregistration delay controls how long existing connections are allowed to finish before a target is dropped. Check/set it: `aws elbv2 modify-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:... --attributes Key=deregistration_delay.timeout_seconds,Value=30`.
Best practice: combine (1) a preStop sleep (plus readiness gates), (2) a readinessProbe with `maxUnavailable: 0`, (3) an adequate terminationGracePeriodSeconds, and (4) ALB connection draining. Together these make rolling updates transparent to users.
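The pieces above fit together in a single Deployment manifest. A hedged sketch — app name, image, port, and health path are placeholders, and the `preStop` sleep assumes a `sleep` binary exists in the image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # placeholder
spec:
  replicas: 10
  strategy:
    rollingUpdate:
      maxUnavailable: 0   # never dip below the current ready count
      maxSurge: 2         # bring up 2 new pods before removing old ones
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: my-registry/my-app:v2   # placeholder
          readinessProbe:
            httpGet:
              path: /health              # placeholder health endpoint
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 2
          lifecycle:
            preStop:
              exec:
                # Keep serving while the ALB deregisters this target.
                command: ["sleep", "15"]
```

The ordering matters: the preStop sleep runs before SIGTERM is delivered, so the grace period must be longer than the sleep plus the app's own shutdown time.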

Follow-up: You configure PDB, readinessProbe, and increase terminationGracePeriodSeconds to 60. Rolling updates still have 503s, but now less frequently. You inspect the ALB access logs and see requests hitting both old and new pod IPs. Why doesn't the rolling update phase out old pods completely?

You're using EKS with the VPC CNI plugin for pod networking. Pods are assigned IPs from your VPC subnets. You have 50 nodes, each supporting 30 pods = 1500 pods max. Your cluster has 1200 running pods with 800 pending. You have plenty of EC2 capacity (nodes aren't maxed), but pods still can't schedule. What's the bottleneck?

The bottleneck is IP address exhaustion in your VPC subnets, not compute capacity. The VPC CNI assigns one VPC IP per pod (plus IPs for the nodes themselves). If your subnets have 1500 usable IPs and ~1250 are already consumed, Pending pods have nowhere to get an address. Check available IPs: `aws ec2 describe-subnets --subnet-ids subnet-xxxxx | jq '.Subnets[0].AvailableIpAddressCount'`. If that's below the number of Pending pods, this is the issue. Solutions: (1) Subnet CIDRs cannot be resized in place — expanding means recreating the subnet, which isn't practical for a running cluster. (2) Add new subnets with larger CIDR blocks — `aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.1.0.0/24` — then configure the node group's subnet list or launch template to use them and roll the nodes. (3) Use Amazon VPC IP Address Management (IPAM) — centrally plans and tracks IP allocation across VPCs and subnets; more setup, but it prevents this class of problem in large deployments. (4) Reduce pods per node — cap the kubelet's `--max-pods` (e.g., 20 instead of 30) via the node group's bootstrap or launch template settings. This slows per-node IP consumption but reduces cluster density. (5) Enable prefix delegation — the VPC CNI can attach /28 prefixes (16 IPs each) to ENIs instead of individual IPs, greatly raising per-node pod density on Nitro-based instance types: `kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true` (new nodes pick it up). Note that prefixes still come out of the subnet, so this complements rather than replaces larger subnets. Most practical here: add new subnets with larger CIDR blocks.
For future clusters, size subnets to support ~10x the expected pod count. A /20 subnet has 4096 addresses (AWS reserves 5 per subnet), comfortably supporting ~3000 pods after node and overhead usage.
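Rather than hand-building launch templates, the subnet list and pod cap from options (2) and (4) can be declared together. A hedged sketch using an eksctl managed node group — the names, sizes, and subnet ID are placeholders:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder
  region: us-east-1         # placeholder
managedNodeGroups:
  - name: large-subnet-ng
    instanceType: m5.large
    desiredCapacity: 50
    maxSize: 100
    maxPodsPerNode: 20      # caps pod density, slowing per-node IP consumption
    subnets:
      - subnet-yyyyy        # the new, larger-CIDR subnet(s)
```

Declaring `maxPodsPerNode` here lets eksctl wire the kubelet `--max-pods` flag into the bootstrap for you, avoiding manual launch template edits.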

Follow-up: You add new subnets with larger CIDRs and enable prefix delegation. IP exhaustion is resolved, and pods schedule immediately. But node utilization drops from 90% to 40%. Why?

Your EKS cluster uses Spot instances to reduce costs. During a workload update, AWS reclaims a Spot instance (2-minute warning). Pods on that instance are evicted. Your deployment has 3 replicas and 1 pod is on the Spot instance being reclaimed. That pod's termination blocks the rolling update for 2 minutes. How do you improve?

The pod is being terminated by Spot reclamation, which Kubernetes doesn't control; when the node disappears mid-rollout, the rolling update stalls waiting for replacement pods to become ready, causing the 2-minute delay. To improve: (1) Use a Pod Disruption Budget so voluntary evictions (including node drains) keep N pods available: `kubectl create pdb my-pdb --selector=app=my-app --min-available=2`. With 3 replicas and min-available=2, a drain evicts at most 1 pod at a time, leaving 2 to handle traffic. Note a PDB only governs voluntary evictions — a raw Spot reclamation with no drain simply kills the node, which is why (2) matters. (2) Deploy the AWS Node Termination Handler — it watches for the Spot interruption notice and reacts within the 2-minute warning: it cordons the node (no new pods land there) and drains existing pods (honoring PDBs and terminationGracePeriodSeconds), so the rollout continues on healthy nodes. Deploy (after `helm repo add eks https://aws.github.io/eks-charts`): `helm install aws-node-termination-handler eks/aws-node-termination-handler -n kube-system --set enableSpotInterruptionDraining=true`. (3) Use on-demand instances for critical workloads and Spot for fault-tolerant ones — steer each class with node selectors or taints/tolerations, e.g. a `nodeSelector` of `workload-type: fault-tolerant` for pods that tolerate Spot interruptions. (4) Spread replicas across nodes/AZs with pod anti-affinity so a single reclaimed node disrupts at most one replica (`podAntiAffinity` with `topologyKey: kubernetes.io/hostname` matching the app's labels). Best practice: combine the PDB, the Node Termination Handler, anti-affinity, and a Spot+on-demand mix — Spot with resilience patterns for cost savings, on-demand for the critical core.
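The PDB and anti-affinity pieces above, written out as manifests. A hedged sketch — the app label, names, and image are placeholders:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb              # placeholder
spec:
  minAvailable: 2           # with 3 replicas, drains evict at most 1 pod at a time
  selector:
    matchLabels:
      app: my-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # placeholder
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Keep replicas on separate nodes so one Spot reclamation
          # disrupts at most one of them.
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: my-app
      containers:
        - name: api
          image: my-registry/my-app:v2   # placeholder
```

With `required` anti-affinity, the scheduler refuses to co-locate replicas; if nodes are scarce, `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative.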

Follow-up: You implement PDB (min-available=2) and Node Termination Handler. Spot reclamation now gracefully terminates 1 pod, and the other 2 keep handling traffic. But during a rolling update, when you update the deployment image, the new pods take 10 seconds to become ready. Old pods are already disrupted by Spot, so there's a brief gap. How do you eliminate the gap?
