Your Fargate tasks (web service, API service, worker service) are running in the same ECS cluster, same VPC/subnet. They can reach each other via private IPs in development. But in production, you enabled service discovery (AWS Cloud Map / Route53). Now: "api-service can't resolve api-service.example.local. Connection timeout. DNS lookup returns empty." Walk through debugging.
Fargate service discovery troubleshooting (DNS resolution failure): (1) Verify service discovery is enabled: (a) ECS Service → Details. Look for "Service Registries". Should show: `api-service.example.local` pointing to the Cloud Map namespace. (b) If blank, service discovery isn't configured. Fix: add a service registry (Cloud Map namespace + service name). (c) Cloud Map Console: verify the service exists in the namespace and its name matches `api-service.example.local`. (2) DNS resolution test (from within a task): (a) Exec into a running task: `aws ecs execute-command --cluster prod --task <task-id> --container <container> --interactive --command "/bin/sh"` (requires enableExecuteCommand=true on the service and SSM permissions on the task role). (b) Inside the task: `nslookup api-service.example.local`. An empty answer usually means no healthy instances are registered in the Cloud Map service, or the namespace's private hosted zone isn't associated with the task's VPC.
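Step (2)'s lookup can also be scripted with the Python standard library, which makes it easy to loop or log from inside the task — a minimal sketch (the service name and port are the scenario's; `resolve_service` is a hypothetical helper):

```python
import socket

def resolve_service(name, port=8080):
    """Resolve a service-discovery name and return the unique IPs, or [] on
    failure -- an empty list matches the 'DNS lookup returns empty' symptom."""
    try:
        infos = socket.getaddrinfo(name, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as e:
        print(f"resolution failed for {name}: {e}")
        return []
    ips = sorted({info[4][0] for info in infos})
    print(f"{name} -> {ips}")
    return ips
```

Run it inside the task (via execute-command) so it uses the same VPC resolver path the application does — resolving from your laptop proves nothing about the private hosted zone.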
Follow-up: DNS resolution now works: api-service.example.local → 10.0.1.50. But connection times out (no TCP). Security group or network ACL issue?
You have 3 Fargate tasks running an API service in different availability zones (AZ-a, AZ-b, AZ-c). You configured service discovery to register all 3 tasks. A client (another Fargate task) does: `curl http://api-service.example.local:8080`. DNS resolves to one IP. But you have 3 IPs (one per AZ). Is DNS doing load balancing? How do you verify traffic is distributed across AZs?
Service discovery DNS load balancing across AZs: (1) DNS behavior (Cloud Map creates Route53 records with multivalue answer routing): (a) When the client queries api-service.example.local, Route53 returns A records for up to eight healthy registered instances — here all 3 IPs: 10.0.1.50, 10.0.2.50, 10.0.3.50 — in varying order. (b) Most client stacks take the first record in the response and cache the answer for the TTL. (c) So there is coarse, randomized distribution, but it is NOT real load balancing: nothing is connection-aware, resolver caching skews the spread, and no target health is checked at connect time. (d) Some clients retry the next IP when a connection fails; many do not. (2) Verify distribution: (a) If clients cache DNS aggressively (e.g., a JVM with an unbounded DNS cache), one task can receive most of the traffic even though Route53 rotates answers. (b) For true load balancing, put a Network Load Balancer in front of the tasks: Layer 4 (TCP), per-connection balancing with target health checks. (3) Setup: (a) Create a Network Load Balancer: (i) Target group: type "IP", port 8080, health check on /health. (ii) Do not register targets manually — when the ECS service is configured with the load balancer, ECS registers and deregisters task IPs (10.0.1.50:8080, 10.0.2.50:8080, 10.0.3.50:8080) automatically. (b) ECS Service: attach the NLB target group in the service's load balancer configuration. (c) Service discovery name: point api-service.example.local at the NLB (Route53 alias record to e.g. api-nlb-123.elb.amazonaws.com) instead of at individual task IPs; clients resolve the name → NLB → 3 tasks. (4) Verification: (a) Run 100 requests from a client task: `for i in {1..100}; do curl -s http://api-service.example.local:8080/hostname; done | sort | uniq -c`. Each curl opens a fresh TCP connection, so output should show roughly 33 hits per task. (b) CloudWatch: NLB does not expose a per-target request count (that is an ALB metric); use NewFlowCount/ProcessedBytes at the load balancer and AZ level, plus per-task application logs, to confirm all 3 targets receive traffic.
(c) CloudWatch Logs: task logs should show a similar request count across the 3 tasks. (5) Cost trade-off: (a) Service discovery alone (Cloud Map + Route53): near-free (fractions of a dollar per million DNS queries), but only coarse DNS-level distribution. (b) NLB: ~$16/month base + NLCU usage charges, but real per-connection load balancing. (6) Alternative (if NLB cost is a concern): keep service discovery without an NLB and do client-side load balancing — resolve all A records, hold a connection pool across all task IPs (not just the first), round-robin requests over them, and re-resolve at least once per TTL. More operational burden, in every client.
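The client-side alternative in (6) can be sketched in Python — a hypothetical helper, assuming clients call `refresh()` at least once per DNS TTL so task churn is picked up:

```python
import itertools
import socket

class RoundRobinResolver:
    """Resolve all A records for a service name and hand out IPs round-robin,
    instead of always connecting to the first record in the DNS response."""

    def __init__(self, hostname, port):
        self.hostname = hostname
        self.port = port
        self._cycle = None

    def refresh(self):
        # getaddrinfo surfaces every A record the resolver returned, not just one.
        infos = socket.getaddrinfo(
            self.hostname, self.port, socket.AF_INET, socket.SOCK_STREAM
        )
        ips = sorted({info[4][0] for info in infos})
        self._cycle = itertools.cycle(ips)
        return ips

    def next_ip(self):
        """IP to use for the next connection; resolves lazily on first use."""
        if self._cycle is None:
            self.refresh()
        return next(self._cycle)
```

The design trade-off is exactly the one the section names: this moves the balancing logic into every client (refresh cadence, failure handling), which the NLB would otherwise centralize.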
Follow-up: You added NLB, traffic is now balanced. But one task in AZ-a keeps failing health checks. NLB removes it from pool. Other 2 tasks get 50/50 traffic. Customers in AZ-a experience higher latency (cross-AZ). How do you minimize cross-AZ traffic?
Fargate task network performance is degraded: `netperf` between 2 tasks shows 10 Mbps throughput, where comparable EC2 instances achieve ~1 Gbps. The network overhead is massive. VPC endpoints and security groups are configured normally. What's consuming bandwidth?
Diagnose Fargate network performance degradation: (1) 10 Mbps vs ~1 Gbps (a 100x gap) suggests: (a) Task placement spread across AZs (cross-AZ hops add latency and per-GB transfer cost). (b) Noisy neighbor (other tenants competing for the host's bandwidth). (c) Task-size bandwidth limit exhausted. (d) Slow DNS resolution delaying connection setup (hurts short-lived connections, not sustained throughput). (2) Investigate each: (a) Task placement: check the tasks' subnets/AZs in the task details, then run `netperf` between tasks in the SAME AZ vs DIFFERENT AZs. If same-AZ is markedly faster, placement is a factor; if both are slow, keep investigating. (b) Task-size bandwidth limits: AWS does not publish exact per-size Fargate network limits, but available bandwidth scales with task size, and small tasks (0.25-0.5 vCPU) get a low baseline with burst credits — once burst is exhausted, sustained throughput can fall to tens of Mbps. Action: benchmark at your task size; if a 0.25 vCPU task sustains only ~10 Mbps, retest at 1+ vCPU. (c) Check CPU throttling: packet processing is CPU work, so a task pinned at 100% CPU starves its own network stack. Check CloudWatch CPUUtilization; upgrade CPU if saturated. (d) DNS resolution latency: `time nslookup api-service.example.local`. Should be <10 ms; if >100 ms, the resolver path is overloaded — but note this delays connection setup, it does not cap steady-state throughput. (3) Best practices for network performance: (a) Use awsvpc network mode (the only mode on Fargate). (b) Keep chatty services in the same AZ where possible. Note: Fargate does not support the task placement strategies/constraints available on the EC2 launch type — instead, restrict the service's subnets to a single AZ, accepting the availability trade-off. (c) Allocate >=1 vCPU + 2 GB memory as a floor for network-sensitive tasks. (d) Enhanced networking (ENA tuning) is not something you control on Fargate; if you need peak, tunable network performance, migrate to the EC2 launch type with network-optimized instances. (4) Monitoring: (a) ECS Container Insights publishes NetworkRxBytes/NetworkTxBytes per task. (b) VPC Flow Logs: enable for the tasks' ENIs and compute bytes/sec per ENI from the records. (c) Compare expected throughput (from your own benchmarks at that task size) vs actual; if actual < 50% of expected, keep tuning.
(5) Solution: (a) Upgrade the task from 0.25 vCPU to 1 vCPU: sustained network throughput improves with task size (in this scenario, roughly 10 Mbps → 100+ Mbps). (b) Restrict the service's subnets so communicating tasks share an AZ: removes cross-AZ latency and transfer cost. (c) Cost (us-east-1 on-demand, vCPU component only — memory is billed separately): 0.25 vCPU ≈ $0.010/hour, 1 vCPU ≈ $0.040/hour. Difference ≈ $0.030/hour ≈ $22/month per task. Acceptable for a ~10x network performance gain. (6) Verification: after the upgrade + placement changes, re-run netperf. Expect 100+ Mbps (10x improvement).
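The flow-log analysis in (4)(b) can be sketched as a small parser — a sketch assuming the default v2 flow log record format (14 space-separated fields: version, account-id, interface-id, ..., bytes at position 9, start/end epochs at 10/11):

```python
from collections import defaultdict

def bytes_per_second(flow_log_lines):
    """Aggregate VPC Flow Log (default v2 format) records into average
    bytes/sec per ENI, for comparison against the expected throughput tier."""
    # eni -> [total_bytes, earliest_start, latest_end]
    totals = defaultdict(lambda: [0, float("inf"), 0])
    for line in flow_log_lines:
        fields = line.split()
        if len(fields) < 14 or fields[0] != "2":
            continue  # skip headers, NODATA stubs, or custom-format records
        eni, nbytes = fields[2], int(fields[9])
        start, end = int(fields[10]), int(fields[11])
        t = totals[eni]
        t[0] += nbytes
        t[1] = min(t[1], start)
        t[2] = max(t[2], end)
    return {
        eni: total / max(end - start, 1)
        for eni, (total, start, end) in totals.items()
    }
```

Feed it the lines exported from CloudWatch Logs; an ENI averaging ~1.25 MB/s (10 Mbps) when you expected ten times that points at the task-size limit in (2)(b).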
Follow-up: Network throughput improved to 100Mbps. But service-to-service latency is still 50ms (should be <5ms within VPC). Tracing shows DNS lookup takes 20ms. Why is Cloud Map DNS so slow?
Your ECS Fargate service has a bug: the service discovery name is hard-coded in the application (`api-service.example.local`). During deployment, you want to rename the service to `api-v2.example.local`. Old tasks still register under `api-service.example.local`. Clients expecting `api-v2.example.local` get connection timeouts. How do you orchestrate a clean switchover?
Safely switch service discovery names without downtime: (1) Problem: hard-coded service name in the app. Renaming requires a code change + deployment + client switchover. (2) Orchestration plan (zero-downtime): (a) Phase 1 (preparation, before code deploy): (i) Create the new service in Cloud Map: `api-v2.example.local`. (ii) Don't register tasks yet. (iii) Verify DNS resolves to an empty set (expected). (b) Phase 2 (dual registration): (i) Keep the existing ECS service registered under `api-service.example.local`. (ii) Make the same tasks resolvable under `api-v2.example.local` as well — see (3), since ECS itself only allows one registry per service. (iii) Running tasks now appear under both DNS names. (iv) Clients can use either name (transitional period). (c) Phase 3 (client switchover, after deployment completes): (i) One by one, redeploy client tasks with the hard-coded name updated to `api-v2.example.local`. (ii) Old clients still use `api-service.example.local` (works, dual registration). (iii) Once all clients are updated, phase 4. (d) Phase 4 (cleanup, remove old name): (i) Remove the `api-service.example.local` registration. (ii) Keep only `api-v2.example.local`. (iii) Monitor for 1 hour (in case a forgotten client tries the old name). (3) Implementation note: `serviceRegistries` is a property of the ECS service (create-service/update-service input), not the task definition, and an ECS service supports only ONE service registry. Dual registration therefore means one of: (a) run a second ECS service from the same task definition, registered under the new name; (b) register the task IPs in the second Cloud Map service yourself via the servicediscovery RegisterInstance API (scripted, or from a deployment hook); (c) skip dual registration and use the CNAME alias in (6). A create-service fragment for the new name:

```json
{
  "serviceName": "api-v2",
  "taskDefinition": "api-service",
  "serviceRegistries": [
    { "registryArn": "arn:aws:servicediscovery:...:service/example/api-v2" }
  ]
}
```

(4) Phase timing: (a) Total switchover: 1-2 hours (deploy new code 10 min, client redeploys 5-60 min depending on client count, cleanup 10 min). (b) Risk: a client that isn't updated keeps using the old name. Graceful (the service still works), but the migration is incomplete.
(5) Rollback plan: (a) If issues arise, drop back to single-name registration (remove the extra registration, keep `api-service.example.local`). (b) Keep both names registered for 48 hours (grace period), then remove the old name. (6) Alternative (if changing the hard-coded name isn't feasible yet): (a) Use an alias: create a CNAME in Route53 from `api-v2.example.local` → `api-service.example.local`. (b) Clients querying `api-v2` transparently resolve to `api-service`. (c) Later, once all app code is updated, flip the alias direction. (d) Simplest path overall: no dual registration at all, just the DNS alias.
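The scripted dual registration in (3)(b) can be sketched with the Cloud Map RegisterInstance API via boto3 — a sketch, not the definitive mechanism: the function name is hypothetical, service IDs/task IDs/IPs are placeholders, and the client is passed in so it can be stubbed:

```python
def register_under_second_name(sd_client, second_service_id, tasks):
    """Register each running task's private IP in a second Cloud Map service,
    so the tasks resolve under both the old and the new DNS name.

    sd_client: a boto3 servicediscovery client (or a test stub).
    tasks: iterable of (instance_id, ip, port) tuples, e.g. built from
           the output of ecs DescribeTasks.
    """
    operation_ids = []
    for instance_id, ip, port in tasks:
        resp = sd_client.register_instance(
            ServiceId=second_service_id,
            InstanceId=instance_id,  # reuse the task ID so later deregistration is easy
            Attributes={
                "AWS_INSTANCE_IPV4": ip,
                "AWS_INSTANCE_PORT": str(port),
            },
        )
        operation_ids.append(resp["OperationId"])
    return operation_ids
```

Remember the flip side: anything registered this way must also be deregistered by you (DeregisterInstance) when tasks stop — ECS only manages the one registry attached to the service.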
Follow-up: During phase 2, tasks are registered under both names. But clients are now hitting the VPC DNS resolver's per-ENI limit (1,024 packets/sec), and lookups start failing with errors like "SERVFAIL". How do you handle DNS scaling?
Your Fargate service uses service discovery with ECS service networking. During deployment, ECS stops old tasks and launches new ones. For 30 seconds, service discovery shows: some tasks registered, some de-registered, some pending. Clients get "host not found" or stale IP errors. How do you implement graceful deployment with service discovery?
Graceful Fargate deployment with service discovery (zero-downtime): (1) Root cause: old tasks are deregistered before new tasks are healthy and resolvable. (a) Old tasks deregister immediately. (b) New tasks take 10-30 sec to boot + pass health checks. (c) Client DNS queries during that gap hit stale cache entries or get an empty answer. (2) Desired flow (make-before-break, blue-green pattern): (a) Blue (current): running tasks registered in service discovery. (b) Green (new): new task definition, not yet registered. (c) Deployment flow: (i) Launch green tasks (new code). Do NOT deregister blue yet. (ii) Wait for green tasks to pass health checks (30-60 sec). (iii) Green tasks register in service discovery (appear in DNS). (iv) Deregister blue tasks (DNS now points only to green). (v) Stop/terminate blue tasks. (3) ECS service configuration for this behavior: (a) The built-in rolling deployment already follows this order when deploymentConfiguration is set to minimumHealthyPercent: 100, maximumPercent: 200 — ECS will not stop an old task until a healthy replacement is running. (The CODE_DEPLOY blue/green controller is an option too, but it requires an ALB or NLB.) (b) Health check grace period: healthCheckGracePeriodSeconds = 60 gives slow-starting tasks time to stabilize before health checks count (this parameter applies when a load balancer is attached to the service). (c) Connection draining: if an ALB/NLB is in front, set the target group deregistration delay to 60 sec so in-flight requests complete. (d) Service registries: ECS registers/deregisters Cloud Map instances automatically as tasks start, stop, and change health. (4) Task health check best practices: (a) Define a container health check in the task definition:

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 10,
  "timeout": 5,
  "retries": 2
}
```

(b) With interval 10 and retries 2, a task is marked unhealthy after roughly 20-30 sec of consecutive failures. (c) ECS then deregisters it from service discovery. (5) Deployment timeline: (a) T=0: trigger ECS deployment. (b) T=0-30 sec: green tasks launching, no health checks passing yet. (c) T=30-60 sec: green tasks healthy and registered; DNS now shows both blue + green IPs. (d) T=60-120 sec: existing connections still flow to blue tasks; green tasks pick up new connections.
(e) T=120 sec: blue tasks deregistered and stopped. Blue IPs leave DNS (subject to record TTL). (f) T=120+ sec: only green tasks receive traffic. Deployment complete. (g) Total: ~2 min for the full switchover, with effectively zero downtime for most clients. (6) Monitoring: (a) CloudWatch: service task count rises (blue + green together), then blue drains to 0. (b) Application logs: request count per task; green tasks should show an increasing share. (c) Stale DNS: VPC Flow Logs will show connections still arriving at old IPs from clients with cached answers. Normal; it clears after the record TTL expires (60 sec is a common setting). (7) Cost: no additional cost. Same Fargate infrastructure, just orchestrated deliberately.
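The settings in (3) can be expressed as an input fragment for `aws ecs update-service --cli-input-json` — a sketch using the scenario's cluster/service names; healthCheckGracePeriodSeconds takes effect when a load balancer is attached:

```json
{
  "cluster": "prod",
  "service": "api-service",
  "deploymentConfiguration": {
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  },
  "healthCheckGracePeriodSeconds": 60
}
```

minimumHealthyPercent 100 forbids ECS from dipping below the desired count during the rollout; maximumPercent 200 lets it run blue and green side by side, which is what makes the switchover make-before-break.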
Follow-up: Deployment is working, but during switchover, some clients get "Connection reset" error. They connected to a blue task mid-flight, task deregistered, connection dropped. How do you prevent connection loss?
You have 50 Fargate tasks across 5 services. Each service has its own Cloud Map namespace + service discovery entry. Debugging a multi-service issue requires checking: "Does service-a reach service-b? Does service-b reach service-c?" You're manually testing DNS + pinging. With 50 tasks, this is tedious. How do you automate connectivity validation?
Automate multi-service connectivity validation: (1) Problem: 50 tasks × 5 services = a manual testing matrix explosion. Need automated, repeatable validation. (2) Connectivity test harness (Lambda + EventBridge): (a) Lambda function (deployed inside the VPC, so the Cloud Map private zone resolves): (i) Input: list of services and the expected connectivity graph. (ii) For each target: HTTP GET its /health endpoint via the service discovery name. (iii) Log: success (200 OK) or failure (timeout, 5xx, DNS error). (iv) Generate report: connectivity matrix. Note: run from a single Lambda, every "source" row exercises the same network path — the Lambda's own ENI; genuine per-source-pair testing is the SQS variant in (4). (b) EventBridge trigger: daily at 2 AM, run the Lambda. On failure, alert on-call. (3) Implementation (sketch):

```python
import json

import boto3
import requests

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

def test_connectivity(services, topic_arn):
    results = {}
    for source in services:
        results[source] = {}
        for target in services:
            if source == target:
                continue  # skip self
            try:
                resp = requests.get(f"http://{target}.example.local/health", timeout=5)
                results[source][target] = (
                    "OK" if resp.status_code == 200 else f"FAIL: {resp.status_code}"
                )
            except requests.RequestException as e:
                results[source][target] = f"FAIL: {e}"
    # Count every non-OK pair, emit it as a metric, and page only on failure.
    failures = sum(
        1 for pairs in results.values() for status in pairs.values() if status != "OK"
    )
    cloudwatch.put_metric_data(
        Namespace="Connectivity",
        MetricData=[{"MetricName": "FailureCount", "Value": failures}],
    )
    if failures:
        sns.publish(TopicArn=topic_arn, Message=json.dumps(results, indent=2))
    return results
```

(4) Integration with SQS (for distributed testing): (a) Instead of one Lambda testing everything, delegate to the ECS tasks themselves. (b) SQS message: {"source": "service-a", "targets": ["b", "c", "d"]}. (c) The service-a task consumes the message, probes all targets, and reports back. (d) Parallel: all 5 services test simultaneously (5x faster), and each row now reflects that service's real network position. (e) Results → CloudWatch Logs + DynamoDB. (5) Alerting: (a) CloudWatch alarm: if the connectivity failure count > 0, page on-call. (b) Create an incident in Jira: include the connectivity matrix + the last 10 min of VPC Flow Logs for the failed pair.
(c) Automated troubleshooting: (i) If service-a can't reach service-b, Lambda checks: SG rules (a-b), DNS resolution (can a resolve b's name?), b's health status. (ii) Generates hypothesis: "SG rule missing" or "DNS stale" or "service-b crashed". (iii) Suggest fix in alert. (6) Validation cadence: (a) Daily: full connectivity matrix. (b) Post-deployment: trigger connectivity test after ECS update. If any failures, rollback deployment. (c) On-demand: developer can trigger test manually for debugging. (7) Cost: Lambda ~$0.50/month (daily invocation), DynamoDB (store results) <$1/month, SNS <$1/month. Total: ~$3/month for comprehensive validation. ROI: prevents 30-min debugging session when connectivity breaks (typical incident = $300+ productivity loss).
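The hypothesis step in the automated troubleshooting above can be sketched as a pure decision function — a sketch where the three upstream checks (SG rules, DNS resolution, target health) are assumed to be performed elsewhere and passed in as booleans, and the ordering of causes is a judgment call:

```python
def diagnose(dns_resolves, sg_allows, target_healthy):
    """Map the three upstream connectivity checks onto the most likely
    failure cause, in the order the checks eliminate possibilities."""
    if not dns_resolves:
        return "DNS stale or service not registered in Cloud Map"
    if not target_healthy:
        return "target service unhealthy/crashed (registered but failing health checks)"
    if not sg_allows:
        return "security group rule missing between source and target"
    return "no obvious cause; check NACLs, route tables, and application-level errors"
```

Keeping the decision logic pure (no AWS calls) means it can be unit-tested offline and reused verbatim by both the Lambda harness and the SQS-driven task probes.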
Follow-up: Connectivity test shows all services OK at 2 AM, but at 10 AM, service-a can't reach service-b. Logs show no DNS errors, but connection times out. Security group rule or network ACL changed?