Two containers on the same bridge network 172.17.0.0/16. Container A (172.17.0.2) tries to ping Container B (172.17.0.3). Ping fails. You run `iptables -L FORWARD` on host: chain is empty (no rules). But ping fails. What's blocking the traffic, and where do you look next?
Even with an empty FORWARD chain, traffic can be blocked. First check the chain *policy*, not just the rules: `iptables -L FORWARD` prints `Chain FORWARD (policy DROP)` or `(policy ACCEPT)` in its header, and Docker sets the default FORWARD policy to DROP, so an empty chain can still drop everything (also note `iptables -L` only shows the filter table). Then check: (1) `brctl show` or `bridge link` on the host → verify both containers' veth interfaces are attached to docker0. If one is missing, that's the issue (container not on the bridge). (2) Host kernel bridge/netfilter coupling: `cat /proc/sys/net/bridge/bridge-nf-call-iptables`. If 1, bridged traffic traverses the iptables FORWARD chain (so the DROP policy above applies); if 0, the bridge forwards at Layer 2 without consulting iptables at all. (3) Inside Container A, check the routing table: `ip route show` → should contain a connected route like `172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2`. If missing, A doesn't know how to reach B. (4) ARP: if A can't resolve B's MAC, ping fails. Run `docker exec a ip neigh` (alpine images often lack `arp`) → B's IP should map to a MAC; a missing or `FAILED` entry means ARP is broken. To reproduce cleanly: `docker run -d --name b alpine:latest sleep 1000 && docker run -d --name a alpine:latest sleep 1000 && docker exec a ping -c 1 172.17.0.3` (use B's actual IP from `docker inspect b`). If that works, the network is fine. If ping fails, check: `docker exec a ip addr show` confirms eth0 has an IP, `docker exec a ip route show` confirms the route, `docker exec a ip neigh` shows the neighbor entry. For packet-level visibility, run `tcpdump -i docker0 icmp` on the host and watch whether the echo request ever crosses the bridge. Result: usually it's the FORWARD policy (not a rule), veth attachment, or ARP, rather than an explicit iptables rule.
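The routing-table check in step (3) can be scripted. A minimal sketch (the `has_link_route` helper and the sample output are illustrative inventions, not Docker APIs) that parses `ip route show` output and confirms a connected route for the bridge subnet exists:

```python
import ipaddress

def has_link_route(ip_route_output: str, subnet: str) -> bool:
    """Return True if `ip route show` output contains a directly
    connected (dev) route exactly matching the given subnet."""
    net = ipaddress.ip_network(subnet)
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            # First field of a route line is the prefix ("default" raises).
            route = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            continue
        if route == net and "dev" in fields:
            return True
    return False

# Sample output as seen inside a healthy bridge-networked container.
sample = ("default via 172.17.0.1 dev eth0\n"
          "172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2")
assert has_link_route(sample, "172.17.0.0/16")
```

In practice you would feed it the output of `docker exec a ip route show`; an absent connected route points at step (3) as the failure.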
Follow-up: After troubleshooting, you find Container A's veth isn't bridged (just isolated). How do you manually add it to the docker0 bridge?
Your Docker daemon runs with `--bip 10.0.0.1/8` (custom bridge IP; note `--bip` takes the bridge's own host address, not the bare network address). You launch two containers. Container A (10.0.0.2) tries to reach external host 203.0.113.1. The packet leaves Container A's eth0, crosses the docker0 bridge, then exits via the host's eth0. Does the host forward it? What NAT rule applies?
Yes, the host forwards it if IP forwarding is enabled and NAT is configured. Container A's packet (src=10.0.0.2, dst=203.0.113.1) arrives at the docker0 bridge. The host checks forwarding: `cat /proc/sys/net/ipv4/ip_forward` must be 1. If 0, the packet is dropped (no routing between interfaces). With forwarding on, the kernel routes the packet toward eth0 and it traverses the iptables nat POSTROUTING chain. Docker adds a MASQUERADE rule: `iptables -t nat -A POSTROUTING -s 10.0.0.0/8 ! -o docker0 -j MASQUERADE`. This means: packets sourced from 10.0.0.0/8 (containers) leaving on any interface other than docker0 get their source IP rewritten to the outgoing interface's address (e.g., 203.0.113.2). Result: (1) the packet leaves host eth0 with src=203.0.113.2 (host IP), dst=203.0.113.1. (2) The external host replies to 203.0.113.2; the reply returns to the host. (3) The host's connection tracking (conntrack) matches the reply to the flow and reverses the NAT: destination rewritten back to 10.0.0.2, packet sent out docker0 → Container A. Test: `docker run -d --name c1 alpine:latest sleep 1000 && docker exec c1 wget -qO- http://203.0.113.1/` (alpine ships wget, not curl). Verify NAT: `iptables -t nat -L POSTROUTING` shows the MASQUERADE rule. Verify conntrack: `conntrack -L | grep 10.0.0.2` shows the tracked connection for the container. Result: containers reach external networks via NAT masquerading.
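The MASQUERADE-plus-conntrack round trip above can be modeled in a few lines. This is a toy illustration of the bookkeeping, not the kernel's actual conntrack structures; the class and method names (`Masquerade`, `outbound`, `inbound`) are invented for the sketch:

```python
HOST_IP = "203.0.113.2"  # the host's outbound-interface address

class Masquerade:
    """Toy source-NAT table: outbound flows from containers get their
    source rewritten to (HOST_IP, nat_port); replies are matched by
    nat_port and reversed, mimicking conntrack's un-NAT step."""

    def __init__(self):
        self.table = {}        # nat_port -> (orig_src_ip, orig_src_port)
        self.next_port = 32768

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        nat_port = self.next_port
        self.next_port += 1
        self.table[nat_port] = (src_ip, src_port)
        # Tuple as seen on the wire after POSTROUTING.
        return (HOST_IP, nat_port, dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        orig = self.table.get(dst_port)
        if dst_ip != HOST_IP or orig is None:
            return None        # no conntrack entry: reply is dropped
        return (src_ip, src_port, *orig)   # de-NAT back to the container

nat = Masquerade()
wire = nat.outbound("10.0.0.2", 40000, "203.0.113.1", 80)
reply = nat.inbound("203.0.113.1", 80, wire[0], wire[1])
assert reply == ("203.0.113.1", 80, "10.0.0.2", 40000)
```

The `None` branch is why the follow-up's "eth0 down" scenario matters: a reply that never reaches the host, or reaches it after the conntrack entry expires, can never be de-NATed.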
Follow-up: External host returns a SYN-ACK packet to 203.0.113.2 but your Docker host's eth0 is down. Where's the packet lost?
You expose port 8080 from Container A using `-p 8080:8080`. External client connects to host:8080. How does the packet traverse from client → host → container? Which iptables chains and rules process it?
Packet flow: external client (src=client_ip, dst=host_ip:8080) → host NIC → iptables nat PREROUTING chain → DNAT rewrites the destination to container_ip:8080 → routing decision → FORWARD chain → docker0 bridge → container's veth → container receives the packet (src=client_ip, dst=container_ip:8080). Docker installs the DNAT rule in its own chain, reached from PREROUTING: `iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 8080 -j DNAT --to-destination 10.0.0.2:8080`. For the reply (container → client) no explicit SNAT rule is needed: conntrack remembers the DNAT'd flow and rewrites the reply's source from 10.0.0.2 back to host_ip automatically (Docker only adds a MASQUERADE rule for the hairpin case, where a container addresses its own published port). Result: (1) client sends SYN to 203.0.113.100:8080 (host). (2) The PREROUTING/DOCKER DNAT changes dst to 10.0.0.2:8080. (3) The packet is routed to docker0; the container receives it with dst=container IP. (4) Container replies with src=10.0.0.2, dst=client_ip. (5) Conntrack un-DNATs the reply, rewriting src to the host IP. (6) The reply reaches the client as if from the host. Verify: `iptables -t nat -L DOCKER` shows the DNAT rules; `conntrack -L | grep 8080` shows the tracked, NAT'd flow. Test: `docker run -d -p 8080:8080 alpine:latest nc -l -p 8080`, then from an external machine `curl http://host:8080`; the external end only ever sees the host IP, never the container IP.
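The DNAT match step can be sketched as a rule table, purely illustratively (the `DNAT_RULES` list and `prerouting` function are hypothetical stand-ins for the kernel's nat PREROUTING/DOCKER chains, not an iptables API):

```python
# Each published port becomes one rule: (proto, dport) -> rewrite target.
DNAT_RULES = [
    (("tcp", 8080), ("10.0.0.2", 8080)),
]

def prerouting(proto, dst_ip, dst_port):
    """Rewrite the destination if a published-port rule matches;
    otherwise leave the packet's destination unchanged."""
    for (r_proto, r_dport), target in DNAT_RULES:
        if proto == r_proto and dst_port == r_dport:
            return target
    return (dst_ip, dst_port)

# Matches the published TCP port: destination rewritten to the container.
assert prerouting("tcp", "203.0.113.100", 8080) == ("10.0.0.2", 8080)
# Unpublished port: passes through untouched.
assert prerouting("tcp", "203.0.113.100", 443) == ("203.0.113.100", 443)
```

This shape also previews the follow-up: with multiple containers, each `-p hostport:containerport` publication adds one distinct `(proto, hostport)` key, so rules never overlap unless you reuse the host port.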
Follow-up: Multiple containers expose the same port on different IPs. How does the DNAT rule differentiate which container gets the traffic?
A container runs with `--network none`. It has no network access. But you need to add a network interface later without restarting the container. Can you use `docker network connect`? What happens to the container's IP stack?
Yes: `docker network connect mynetwork container-id` adds a network interface to a running container without restarting it, and the container's IP stack updates live. Initially (`--network none`) the container has only loopback (lo), no eth0. On connect, the daemon: (1) allocates a new veth pair, (2) attaches one end to the network's bridge, (3) moves the other end into the container's network namespace (the equivalent of `ip link set vethX netns <pid>`), (4) renames it eth0 (or eth1, eth2, ... if interfaces already exist), assigns an address from the network's IPAM pool, brings it up, and installs the connected route plus a default route via the network's gateway. Running processes see the new interface immediately. Caveat: some Docker versions refuse `docker network connect` while a container is attached to the special `none` network; if you hit that error, disconnect first (`docker network disconnect none container-id`), then connect.
Follow-up: You connect the same container to 3 different Docker networks. Can the container reach services on all 3 networks? Are the routing rules auto-generated?
Host network mode: `--network host`. Container A and Container B both use host network. They each try to bind to port 8080. What happens? Can both listen on the same port simultaneously?
No, only one can bind successfully. Both containers share the host's network namespace, and with it a single port space. Port 8080 is a global resource at the OS level. When Container A binds to 0.0.0.0:8080, the socket is registered in the kernel's socket table; when Container B tries to bind 0.0.0.0:8080, the kernel returns EADDRINUSE (Address already in use). Result: the first container binds successfully, the second fails. Test: `docker run --network host -d --name a alpine:latest nc -l -p 8080`, then `docker run --network host -d --name b alpine:latest nc -l -p 8080` → Container B exits with an error; `docker logs b` shows "Address already in use". Workaround: bind different ports (Container A: 8080, Container B: 8081), or bind specific addresses if the host has multiple NICs. Contrast with bridge networking: each container has its own network namespace, hence its own port space, so both can bind 8080 on their respective container IPs (172.17.0.2:8080, 172.17.0.3:8080) with no conflict. Security: --network host is powerful but risky; port conflicts and weaker service isolation. For production: avoid --network host unless necessary (e.g., network monitoring that needs raw access to host traffic).
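The EADDRINUSE behavior is plain kernel socket semantics and can be reproduced in a single process, which here stands in for the one shared namespace (the helper name `second_bind_errno` is invented for the demo):

```python
import errno
import socket

def second_bind_errno() -> int:
    """Bind two TCP sockets to the same address/port in the same
    network namespace; return the errno the second bind fails with
    (0 if it unexpectedly succeeds)."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        a.bind(("127.0.0.1", 0))      # kernel picks a free port
        a.listen()
        port = a.getsockname()[1]
        try:
            b.bind(("127.0.0.1", port))  # same port, same namespace
        except OSError as e:
            return e.errno
        return 0
    finally:
        a.close()
        b.close()

assert second_bind_errno() == errno.EADDRINUSE
```

The second `--network host` container hits exactly this errno; in bridge mode each container's bind happens in a different namespace, so the race never occurs.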
Follow-up: You need host network access but want port isolation. Can you use --network host with a network namespace wrapper to re-isolate the port namespace?
Overlay network (multi-host Docker Swarm): Container A on Node 1 (10.0.0.2/overlay) tries to reach Container B on Node 2 (10.0.0.3/overlay). Same subnet, different hosts. Packets are encapsulated in VXLAN. Explain the packet flow: app layer → container → veth → overlay bridge → VXLAN tunnel → remote host → remote bridge → remote container.
Overlay networks use VXLAN (Virtual Extensible LAN) to tunnel container traffic across hosts. Packet flow: (1) Container A's app sends a packet: src=10.0.0.2, dst=10.0.0.3. (2) Routing in the container's netns sends it out eth0, which is attached to the overlay bridge on the host. (3) The packet arrives at the overlay bridge (which lives in a separate network namespace managed by the daemon). (4) The bridge's forwarding database maps Container B's MAC to the VXLAN port: the destination lives on a remote host. (5) The bridge forwards to the VXLAN device (vxlan0): the kernel encapsulates the original frame in a VXLAN header inside a UDP datagram (src_port derived from a hash of the inner flow, dst_port=4789, original frame as payload). (6) The kernel wraps it in an outer IP header: src=Node1_IP (host IP), dst=Node2_IP (remote host IP). (7) The packet crosses the underlay network to Node 2. (8) Node 2's kernel receives UDP on port 4789; the VXLAN driver strips the encapsulation, recovering the original frame (src=10.0.0.2, dst=10.0.0.3). (9) It is switched onto the local overlay bridge → Container B's veth → Container B. Show: `docker network ls | grep overlay` lists overlay networks. Inside the container: `ip addr show eth0` shows the 10.0.0.0/24 address. On the host: `ip -d link show` lists the vxlan device, and `bridge fdb show dev vxlan0` shows learned MAC-to-remote-VTEP mappings. Verify the tunnel: `tcpdump -i eth0 'udp dst port 4789'` on the underlay shows encapsulated packets. Test: create an overlay network, deploy containers on multiple nodes, verify connectivity. Result: applications see transparent Layer 2 connectivity across hosts; VXLAN handles the tunneling automatically.
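The encapsulation and its per-packet overhead can be made concrete. A sketch assuming the standard RFC 7348 header layout (`vxlan_encap` is a hypothetical helper for illustration, not a kernel API):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP destination port

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348) to an inner L2 frame.
    Word 1: flags (0x08 = VNI-valid bit) + 24 reserved bits.
    Word 2: 24-bit VNI + 8 reserved bits."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    header = struct.pack("!II", 0x08000000, vni << 8)
    return header + inner_frame

# On-the-wire overhead per packet: outer Ethernet (14) + outer IPv4 (20)
# + outer UDP (8) + VXLAN header (8) = 50 bytes.
OVERHEAD = 14 + 20 + 8 + 8

pkt = vxlan_encap(b"\x00" * 60, vni=42)
assert len(pkt) == 60 + 8          # VXLAN header alone adds 8 bytes
assert OVERHEAD == 50
```

The 50-byte figure is why overlay MTUs are typically lowered (e.g., 1450 inside the overlay on a 1500-byte underlay): inner frames near the underlay MTU would otherwise fragment after encapsulation.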
Follow-up: VXLAN encapsulation adds 50 bytes overhead per packet. For high-throughput services, does this impact performance, and how do you monitor it?
Macvlan driver: you create the network with `docker network create -d macvlan -o parent=eth0 ...` and run a container with `--ip 192.168.1.100`, putting it directly on the physical network (not the Docker bridge). The container can reach the gateway 192.168.1.1 and other macvlan containers, but the host (also on 192.168.1.x) cannot reach the container, and the container cannot reach the host. Why is host ↔ container blocked while container ↔ container works?
Macvlan attaches containers directly to the physical network (bypassing docker0); each container gets its own MAC address and appears as a separate device on the LAN. The host ↔ container block is by design: the kernel deliberately prevents traffic between a macvlan *parent* interface (the host's eth0) and its macvlan *children*. In bridge mode, child-to-child frames are switched internally by the macvlan driver, so container ↔ container works; but frames between the parent and a child are never hairpinned, and a physical switch won't loop a frame back out the port it arrived on, so host → container (and container → host-on-eth0) silently fails. Enabling promiscuous mode on eth0 does *not* fix this; it only changes which frames the NIC accepts, not the parent/child filtering in the driver. Solution: give the host its own macvlan child and route through it: `ip link add macvlan-shim link eth0 type macvlan mode bridge && ip addr add 192.168.1.200/32 dev macvlan-shim && ip link set macvlan-shim up && ip route add 192.168.1.100/32 dev macvlan-shim`. Host traffic to the container now flows child-to-child, which bridge mode permits. Alternative: dedicate a separate NIC (eth1) as the macvlan parent so the host's primary interface isn't involved at all. Test: `docker network create -d macvlan --subnet 192.168.1.0/24 --gateway 192.168.1.1 -o parent=eth0 macnet && docker run --rm --net macnet --ip 192.168.1.100 alpine:latest ping -c 1 192.168.1.50` (host IP) → fails until the shim interface and route exist, then works.
Follow-up: Macvlan (promiscuous) has security implications: all traffic on eth0 visible to containers. How do you limit container visibility to trusted networks?
Container with custom network mode `--network container:another-container`. Both containers share network namespace (same IP, same ports). Container A binds to port 8080. Container B tries to bind to 8080. What happens? Can they coexist on same port?
No, only one succeeds. `--network container:X` makes both containers share the same network namespace (single network stack). Port 8080 is a shared resource in that namespace. When Container A binds to 0.0.0.0:8080, port is occupied. Container B's bind attempt fails (EADDRINUSE). This is identical behavior to `--network host` (one network stack, one port space). Use case: sidecar pattern in Kubernetes or Docker Compose: main app container + logging sidecar share network. Sidecar (log agent) listens on 8080 locally (localhost:8080), main app connects on 127.0.0.1:8080 (no external exposure, efficient local communication). Verify: (1) `docker run -d --name main alpine:latest nc -l -p 8080`. (2) `docker run -d --network container:main --name sidecar alpine:latest nc -l -p 8080` → fails (EADDRINUSE). (3) Instead: sidecar on 8081: `docker run -d --network container:main --name sidecar alpine:latest nc -l -p 8081` → succeeds. Inside container: `docker exec main ss -tulpn` shows both 8080 and 8081 (shared netns). On host: `ss -tulpn | grep -E ':(8080|8081)'` shows bindings (container processes). For production: this pattern is safe (sidecar + app tightly coupled), but requires careful port coordination. Audit: `docker inspect container | jq '.HostConfig.NetworkMode'` shows "container:other-container-id".
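The localhost-only sidecar communication can be demonstrated in one process standing in for the shared namespace (the `scrape_localhost` helper and the payload are invented for the demo):

```python
import socket
import threading

def scrape_localhost() -> bytes:
    """In one namespace, the 'app' listens on a loopback port and the
    'sidecar' scrapes it over 127.0.0.1; nothing is published externally."""
    app = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    app.bind(("127.0.0.1", 0))   # loopback only, kernel-chosen port
    app.listen()
    port = app.getsockname()[1]

    def serve():
        conn, _ = app.accept()
        conn.sendall(b"metrics: ok\n")
        conn.close()

    t = threading.Thread(target=serve)
    t.start()
    sidecar = socket.create_connection(("127.0.0.1", port))
    data = sidecar.recv(64)
    sidecar.close()
    t.join()
    app.close()
    return data

assert scrape_localhost() == b"metrics: ok\n"
```

Binding the "app" to 127.0.0.1 instead of 0.0.0.0 is the same discipline the sidecar pattern relies on: traffic stays inside the shared namespace, with no exposure to the bridge or host.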
Follow-up: Your logging sidecar needs to scrape metrics on port 9090 (app already uses it). How do you expose different services on same port through different interfaces?