Your Lambda function in a private subnet needs to call an external API (e.g., Stripe). You configured a NAT gateway in the public subnet, added a route to the private subnet's route table sending `0.0.0.0/0` to the NAT, but the function still times out trying to reach the API. Trace the packet path and identify where it fails.
The packet path is: Lambda ENI (private subnet) → private subnet's route table → NAT gateway (public subnet) → public subnet's route table → internet gateway (IGW) → internet → back. If it fails, check each hop: (1) Lambda's security group: verify outbound rules allow the destination port (443 for HTTPS). Run `aws ec2 describe-security-groups --group-ids sg-xxxxx` and inspect `IpPermissionsEgress`. (2) Private subnet's route table: a route for `0.0.0.0/0` must target the NAT gateway ID. Run `aws ec2 describe-route-tables --route-table-ids rtb-xxxxx`. (3) NAT gateway: it must be in the `available` state. Run `aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxxxx`; if the state is `pending` or `failed`, wait or recreate it. (4) Public subnet's route table: the public subnet containing the NAT needs a `0.0.0.0/0` route pointing to the IGW. (5) IGW: it must be attached to the VPC. Run `aws ec2 describe-internet-gateways --internet-gateway-ids igw-xxxxx`. (6) Elastic IP: NAT gateways require an EIP. Verify with `aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxxxx | jq '.NatGateways[0].NatGatewayAddresses'`. If the Lambda still times out after confirming every hop, check the NACLs on both subnets (NAT gateways do not have security groups of their own) and whether the external API's firewall is blocking the NAT's EIP.
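The route-table hop checks above can be sketched programmatically. This is a minimal illustration over invented sample data shaped like the `describe-route-tables` and `describe-nat-gateways` JSON; all IDs and IPs are hypothetical:

```python
import ipaddress

def find_route(route_table, dest_ip):
    """Return the most specific route whose CIDR covers dest_ip (longest-prefix match)."""
    addr = ipaddress.ip_address(dest_ip)
    matches = [r for r in route_table["Routes"]
               if addr in ipaddress.ip_network(r["DestinationCidrBlock"])]
    return max(matches,
               key=lambda r: ipaddress.ip_network(r["DestinationCidrBlock"]).prefixlen,
               default=None)

# Invented sample: a private subnet's route table and the NAT gateway inventory.
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
]}
nat_gateways = {"nat-0abc": {"State": "available", "SubnetId": "subnet-public"}}

route = find_route(private_rt, "54.187.1.1")  # some external API endpoint
assert route is not None and "NatGatewayId" in route, "no default route to a NAT"
assert nat_gateways[route["NatGatewayId"]]["State"] == "available", "NAT not available"
print("egress path via", route["NatGatewayId"], "looks sane")
```

The longest-prefix match mirrors what the VPC router does: internal traffic matches the `local` route, everything else falls through to the `0.0.0.0/0` NAT route.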
Follow-up: All the hops check out. You curl the external API from an EC2 instance in the same private subnet and it works. What's different about Lambda's egress?
You have two private subnets in the same AZ (AZ-a). In subnet-1's route table you create a route sending traffic for the CIDR 10.0.2.0/24 (which belongs to a different VPC) to an instance's ENI in subnet-2. The traffic never reaches the destination. Why?
The route points at an instance ENI in another subnet. That is a legal route target (the NAT-instance/appliance pattern), but it only forwards traffic if the instance's source/destination check is disabled and the instance is actually configured to forward packets, and even then, an instance in this VPC has no path to a CIDR that lives in a different VPC. Route table targets can be: (1) Internet gateway (traffic leaving the VPC for the internet), (2) NAT gateway (private subnets reaching the internet), (3) VPC peering connection (inter-VPC traffic), (4) Transit gateway (hub-and-spoke networking), (5) Virtual private gateway / VPN connection (on-prem connectivity), or (6) Network interface (a forwarding appliance, with its source/destination check disabled). To reach a different VPC, use VPC peering or a Transit Gateway. Steps: (1) Create a VPC peering connection between your VPC and the target VPC: `aws ec2 create-vpc-peering-connection --vpc-id vpc-xxxxx --peer-vpc-id vpc-yyyyy`. (2) Accept the peering connection in the peer VPC: `aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-xxxxx`. (3) Add routes in both VPCs' route tables pointing the remote CIDR at the peering connection (note that peering requires non-overlapping CIDRs). (4) Ensure security groups and NACLs on both sides allow the traffic. For a permanent hub-and-spoke setup across multiple VPCs, a Transit Gateway is cleaner and more scalable.
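The target-type triage above can be expressed as a small classifier. The VPC CIDR, route entries, and IDs below are invented for illustration (in particular, the local VPC CIDR is assumed not to overlap the remote 10.0.2.0/24):

```python
import ipaddress

# Hypothetical CIDR of the local VPC (not stated in the scenario).
VPC_CIDR = ipaddress.ip_network("10.10.0.0/16")

def diagnose(route):
    """Classify a route-table entry the way the target list in the answer does."""
    dest = ipaddress.ip_network(route["DestinationCidrBlock"])
    if dest.subnet_of(VPC_CIDR):
        return "intra-VPC: the implicit 'local' route already covers this"
    if "VpcPeeringConnectionId" in route or "TransitGatewayId" in route:
        return "ok: valid inter-VPC target"
    if "NetworkInterfaceId" in route:
        return "suspect: a plain ENI target cannot reach a CIDR in another VPC"
    return "no usable target for this destination"

bad = {"DestinationCidrBlock": "10.0.2.0/24", "NetworkInterfaceId": "eni-0abc"}
good = {"DestinationCidrBlock": "10.0.2.0/24", "VpcPeeringConnectionId": "pcx-0abc"}
print(diagnose(bad))   # flags the ENI route from the scenario
print(diagnose(good))  # passes once peering is in place
```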
Follow-up: You set up VPC peering. Instances in one VPC can ping instances in the other, but TCP traffic on port 80 fails. Where do you check?
A developer created a new public subnet with route table entries for the IGW, but instances launched there can't reach the internet. They can ping from within the VPC, but external traffic times out. What's the likely oversight?
The developer likely created the public subnet and route table but didn't enable auto-assign public IP or didn't attach an Elastic IP to the instances. For an instance to use an IGW, it needs a public IP (auto-assigned or an EIP), because the IGW performs one-to-one NAT between the public and private addresses. Check: (1) Subnet's auto-assign setting: run `aws ec2 describe-subnets --subnet-ids subnet-xxxxx --query 'Subnets[0].MapPublicIpOnLaunch'`. If false, enable it with `aws ec2 modify-subnet-attribute --subnet-id subnet-xxxxx --map-public-ip-on-launch` (this only affects instances launched afterward). (2) Instance's public IP: run `aws ec2 describe-instances --instance-ids i-xxxxx | jq '.Reservations[0].Instances[0].PublicIpAddress'`. If null, either the subnet didn't auto-assign or the instance was launched without a public IP. (3) Launch new instances with the `--associate-public-ip-address` flag, or assign an EIP: `aws ec2 allocate-address --domain vpc` then `aws ec2 associate-address --allocation-id eipalloc-xxxxx --instance-id i-xxxxx`. (4) Verify the route table has the IGW route: `aws ec2 describe-route-tables --route-table-ids rtb-xxxxx` should show a `0.0.0.0/0` route whose GatewayId is an `igw-` ID. (5) Check the security group allows outbound traffic on ports 443/80: egress rules default to allow-all, but verify with `aws ec2 describe-security-groups --group-ids sg-xxxxx`. If none of these are the issue, check NACLs: run `aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx` and ensure both inbound and outbound rules allow the traffic.
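The first two checks condense into a short script. The JSON fragments below are invented stand-ins for the relevant fields of `describe-subnets` and `describe-instances` output:

```python
# Invented sample data mirroring the fields the checks above inspect.
subnet = {"SubnetId": "subnet-0abc", "MapPublicIpOnLaunch": False}
instance = {"InstanceId": "i-0abc", "SubnetId": "subnet-0abc", "PublicIpAddress": None}

problems = []
if not subnet["MapPublicIpOnLaunch"]:
    problems.append("subnet does not auto-assign public IPs at launch")
if not instance.get("PublicIpAddress"):
    problems.append("instance has no public IP, so the IGW has no address to translate")

for p in problems:
    print("FIX:", p)
```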
Follow-up: The subnet has auto-assign enabled, instances have public IPs, and the route table is correct. You SSH to the instance and `curl` to an external site — it hangs. What's the NACL issue?
Your VPC uses a custom NACL on a private subnet that explicitly allows inbound SSH (port 22) from your office IP and allows all outbound traffic. An EC2 instance in that subnet can SSH out to external servers, but external users cannot SSH into it from your office. Explain the asymmetry and fix it.
NACLs (network ACLs) are stateless: inbound and outbound rules are evaluated independently, so return traffic must be explicitly allowed in the opposite direction. Outbound SSH initiated by the instance works because the allow-all outbound rule passes the initial packets and, evidently, an inbound rule is admitting the responses that arrive on the instance's ephemeral ports. Inbound SSH from the office needs two things: an inbound allow for port 22 from the office IP (which the NACL supposedly has) and an outbound allow for responses back to the client's ephemeral ports (covered by allow-all outbound). If inbound SSH still fails, the port-22 rule is probably not matching: a lower-numbered deny rule shadows it, its CIDR does not actually cover your office's public IP, or it lives on a NACL that is not associated with this subnet. Fix: (1) Re-add the inbound SSH rule with a low rule number and the correct source range: `aws ec2 create-network-acl-entry --network-acl-id acl-xxxxx --ingress --rule-number 100 --protocol tcp --port-range From=22,To=22 --cidr-block 203.0.113.0/24 --rule-action allow`. (2) Ensure responses to instance-initiated connections are allowed inbound on ephemeral ports: `aws ec2 create-network-acl-entry --network-acl-id acl-xxxxx --ingress --rule-number 110 --protocol tcp --port-range From=1024,To=65535 --cidr-block 0.0.0.0/0 --rule-action allow`. (3) Verify with `aws ec2 describe-network-acls --network-acl-ids acl-xxxxx`, paying attention to rule numbers: rules are evaluated in ascending order and the first match wins. Security groups are stateful, so they don't have this issue and are often easier to work with; NACLs add a second, subnet-level layer of defense.
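Because the first-match-by-rule-number semantics are easy to get wrong, here is a tiny stateless evaluator for reasoning about a rule set. The rules and addresses are invented to mirror the scenario:

```python
import ipaddress

def evaluate(rules, src_ip, port):
    """Stateless NACL evaluation: ascending rule number, first match wins."""
    for r in sorted(rules, key=lambda r: r["RuleNumber"]):
        if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(r["CidrBlock"])
                and r["From"] <= port <= r["To"]):
            return r["RuleAction"]
    return "deny"  # the implicit catch-all '*' rule

inbound = [
    {"RuleNumber": 100, "CidrBlock": "203.0.113.0/24", "From": 22, "To": 22,
     "RuleAction": "allow"},
    {"RuleNumber": 110, "CidrBlock": "0.0.0.0/0", "From": 1024, "To": 65535,
     "RuleAction": "allow"},
]

assert evaluate(inbound, "203.0.113.7", 22) == "allow"      # office SSH SYN
assert evaluate(inbound, "198.51.100.9", 22) == "deny"      # anyone else on 22
assert evaluate(inbound, "198.51.100.9", 40000) == "allow"  # return traffic, ephemeral port
```

Adding a rule such as `{"RuleNumber": 90, "CidrBlock": "0.0.0.0/0", "From": 22, "To": 22, "RuleAction": "deny"}` to the list shows how a lower-numbered deny silently shadows the office allow.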
Follow-up: You add the inbound NACL rules, but SSH still fails from your office. You can SSH from the instance outbound. Where's the problem now?
You have a VPC with subnets in multiple AZs. You create a single NAT gateway in AZ-a and add routes in all subnets' route tables pointing to it. Subnets in AZ-a work fine, but subnets in AZ-b experience intermittent failures reaching the internet. Why?
A NAT gateway is a zonal resource. Instances in AZ-b can use a NAT in AZ-a, but every packet then crosses AZs: the traffic is billed as inter-AZ data transfer (roughly $0.01/GB in each direction), it adds latency, and any trouble in AZ-a (a saturated NAT, a zonal network event) surfaces as intermittent failures in AZ-b with no fallback. The best practice is one NAT gateway per AZ so that each AZ is self-contained. Fix: (1) Allocate a new EIP: `aws ec2 allocate-address --domain vpc` (note the AllocationId). (2) Create a NAT gateway in a public subnet in AZ-b: `aws ec2 create-nat-gateway --allocation-id eipalloc-xxxxx --subnet-id subnet-PUBLIC-in-AZ-b`. (3) Find the route table for AZ-b's private subnets (`aws ec2 describe-route-tables --filters Name=tag:Name,Values=private-AZ-b`) and repoint the default route: `aws ec2 replace-route --route-table-id rtb-xxxxx --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-NEW`. (4) Keep AZ-a's subnets routed to the AZ-a NAT. This setup also improves resilience: if one NAT fails, only its own AZ is affected.
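A quick way to audit the per-AZ layout is to map each private subnet's default-route NAT to its AZ. Everything below is invented sample data; the misrouted subnet mirrors the scenario:

```python
# AZ of each private subnet and of each NAT gateway (hypothetical IDs).
subnet_az = {"subnet-priv-a": "us-east-1a", "subnet-priv-b": "us-east-1b"}
nat_az = {"nat-a": "us-east-1a", "nat-b": "us-east-1b"}
# Which NAT each subnet's 0.0.0.0/0 route points at; subnet-priv-b is misrouted.
default_route = {"subnet-priv-a": "nat-a", "subnet-priv-b": "nat-a"}

cross_az = [s for s, nat in default_route.items() if nat_az[nat] != subnet_az[s]]
print("subnets paying cross-AZ NAT traffic:", cross_az)  # ['subnet-priv-b']
```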
Follow-up: You deploy NATs in both AZs. The subnets in AZ-b still have intermittent failures. The NAT gateway status is AVAILABLE. What else?
Your organization uses a transit gateway to connect 5 VPCs. A subnet in VPC-1 should reach a subnet in VPC-5, but the route table in VPC-1's subnet has no explicit route to VPC-5's CIDR. However, the transit gateway attachment exists and is associated. Should the traffic work?
No. Creating and associating a transit gateway attachment does not add routes anywhere; two route tables must each carry an entry. First, the subnet's route table in VPC-1 needs a static route for VPC-5's CIDR pointing at the TGW. (VPC route tables cannot propagate routes from a transit gateway; propagation into a VPC route table only works with a virtual private gateway, so this route must be static.) Second, the TGW's own route table needs a route toward VPC-5's attachment, via propagation or a static route. To fix: (1) Find the subnet's route table: `aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx`. (2) Add the static route: `aws ec2 create-route --route-table-id rtb-xxxxx --destination-cidr-block 10.5.0.0/16 --transit-gateway-id tgw-xxxxx`. (3) Propagate VPC-5's attachment into the TGW route table: `aws ec2 enable-transit-gateway-route-table-propagation --transit-gateway-route-table-id tgw-rtb-xxxxx --transit-gateway-attachment-id tgw-attach-xxxxx`. (4) Verify both sides: `aws ec2 describe-route-tables --route-table-ids rtb-xxxxx` and `aws ec2 search-transit-gateway-routes --transit-gateway-route-table-id tgw-rtb-xxxxx --filters Name=state,Values=active`. (5) Confirm both attachments are available: `aws ec2 describe-transit-gateway-attachments --filters Name=transit-gateway-id,Values=tgw-xxxxx`. Remember the return path: VPC-5's subnet route tables need a mirrored route back to VPC-1's CIDR via the TGW. If traffic still fails, check security groups and NACLs on both subnets.
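To make the two-route-table requirement concrete, this sketch checks both sides using invented data shaped like the respective describe calls:

```python
def has_route(routes, cidr):
    """True if any route entry covers exactly this destination CIDR."""
    return any(r.get("DestinationCidrBlock") == cidr for r in routes)

# VPC side: the subnet's route table needs a static route to the TGW.
vpc_routes = [{"DestinationCidrBlock": "10.5.0.0/16", "TransitGatewayId": "tgw-0abc"}]
# TGW side: the TGW route table needs a propagated or static route to VPC-5's attachment.
tgw_routes = [{"DestinationCidrBlock": "10.5.0.0/16", "AttachmentId": "tgw-attach-vpc5"}]

assert has_route(vpc_routes, "10.5.0.0/16"), "add a static route in the subnet's route table"
assert has_route(tgw_routes, "10.5.0.0/16"), "enable propagation or add a TGW static route"
print("both route tables cover 10.5.0.0/16")
```

Removing either entry makes the corresponding assertion fire, which is exactly the failure mode in the question: the attachment exists, but one of the two tables lacks the route.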
Follow-up: You add the route, the attachment is available, but traffic still doesn't flow. A traceroute from a source instance hangs at the TGW hop. What's missing?
Your VPC has a subnet with a /28 CIDR block (11 usable IPs after AWS's 5 reserved addresses). You've already launched 2 EC2 instances and now get an error when trying to launch a third: "Insufficient IP addresses available in subnet." How do you fix this without re-architecting?
A /28 CIDR block has 16 total addresses, and AWS reserves 5 in every subnet (the network address, the VPC router, the DNS server, one reserved for future use, and the broadcast address), leaving 11 usable IPs. With only 2 instances launched, something else is consuming the rest: (1) Other ENIs: check with `aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-xxxxx`. Every ENI holds an IP, and an instance can have several. (2) Managed services: RDS instances, VPC-attached Lambda functions, interface VPC endpoints, and load balancers all place ENIs in the subnet. (3) The subnet is simply too small. The immediate fix without re-architecting: create a new subnet with a larger CIDR block (a /24 gives 251 usable IPs) and launch new instances there: `aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.0.1.0/24 --availability-zone us-east-1a`, then migrate workloads over time. Longer term, size subnets generously from the start: /24 or larger per subnet for production. If you absolutely cannot add subnets, stop non-essential instances or delete unused ENIs to free IPs, but that approach does not scale.
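The arithmetic generalizes: AWS reserves five addresses in every subnet, so usable capacity is 2^(32 − prefix) − 5:

```python
def usable_ips(prefix_len):
    """Usable IPv4 addresses in an AWS subnet of the given prefix length."""
    return 2 ** (32 - prefix_len) - 5

assert usable_ips(28) == 11   # the /28 from the scenario
assert usable_ips(24) == 251  # the recommended minimum for production
print(usable_ips(20))         # a /20 leaves 4091 addresses
```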
Follow-up: You check network interfaces and see only 2 running instances, no extra ENIs. But `describe-network-interfaces` shows 5 interfaces in the subnet, all with status "in-use". What are the extra 3?
You're setting up a hybrid cloud environment with an on-premises network (192.168.0.0/16) connecting to your AWS VPC (10.0.0.0/16) via a VPN. The VPN connection is up, and you can ping from AWS to on-prem, but traffic from on-prem to AWS times out. Routes and security groups look correct. Where do you check next?
VPN traffic must work in both directions, and the route tables on both sides must point at the VPN, not just one. Debug steps: (1) Confirm VPN connection status: `aws ec2 describe-vpn-connections --vpn-connection-ids vpn-xxxxx` should show state "available" and at least one tunnel with telemetry status "UP". (2) Check AWS-side routes: `aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx` should show a route for 192.168.0.0/16 pointing to the virtual private gateway (VGW); alternatively, enable propagation so VGW-learned routes appear automatically: `aws ec2 enable-vgw-route-propagation --route-table-id rtb-xxxxx --gateway-id vgw-xxxxx`. (3) Check NAT translation: if the on-prem network NATs its outbound traffic, the source IPs AWS sees may fall outside 192.168.0.0/16, so the return routes will not match. (4) Check the on-prem route table: the on-prem router must have a route for 10.0.0.0/16 pointing into the VPN tunnel. This is typically configured during provisioning but is easy to miss; ask your network team to verify. (5) Check NACLs on the target subnet: `aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx` must allow inbound traffic from 192.168.0.0/16, and outbound return traffic too, since NACLs are stateless. (6) Check the customer gateway: `aws ec2 describe-customer-gateways --customer-gateway-ids cgw-xxxxx` should reflect your on-prem VPN endpoint's public IP; if it is wrong, the tunnel will not form. The most common culprits are missing routes on the on-prem side and NACLs blocking the traffic.
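The tunnel-health check in step (1) can be mechanized against the `VgwTelemetry` section of the `describe-vpn-connections` output. The sample below is invented; a real Site-to-Site VPN has two tunnels, and one being down is a common silent degradation:

```python
# Invented sample mirroring the VgwTelemetry shape of describe-vpn-connections.
vpn = {
    "State": "available",
    "VgwTelemetry": [
        {"OutsideIpAddress": "198.51.100.1", "Status": "UP"},
        {"OutsideIpAddress": "198.51.100.2", "Status": "DOWN"},  # redundant tunnel down
    ],
}

up_tunnels = [t for t in vpn["VgwTelemetry"] if t["Status"] == "UP"]
assert vpn["State"] == "available", "the VPN connection itself is not up"
print(f"{len(up_tunnels)}/2 tunnels up")  # traffic flows, but with no redundancy
```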
Follow-up: The routes and NACLs are correct on both sides. You run `tcpdump` on an EC2 instance and see packets arriving from on-prem, but no response is sent. Why?