Docker Interview Questions

Linux Namespaces and cgroups


A container can see host processes in `ps aux` output. It sees PIDs like 1, 7, 42 (host sshd, cron, kernel threads). Container should see only its own processes (PIDs 1, 2, etc. mapped to container's PID 1). What went wrong with PID namespace isolation?

The container was started with `--pid=host`, which skips PID namespace isolation. Without it, each container gets its own PID namespace (created via `clone`/`unshare` with `CLONE_NEWPID`), and processes are renumbered inside it: the container's init is PID 1, children are 2, 3, etc. The host sees the same processes under high PIDs (e.g., 48392, 48393). Check: `docker run myapp ps aux` shows PID 1 as /myapp; on the host, `ps aux | grep myapp` shows the real host PID (e.g., 48392). With `--pid=host`, the container's `ps aux` shows the host's full process table. Fix: remove the `--pid=host` flag. Verify: `docker run myapp ps aux | head -5` now shows only container processes. To audit: `docker inspect container | jq '.HostConfig.PidMode'` shows `"host"` when shared and empty/null by default (isolated). For production: avoid `--pid=host` unless explicitly needed (e.g., a system monitoring agent). To confirm isolation directly, compare namespace inodes: on the host, `readlink /proc/<container-pid>/ns/pid`; inside the container, `readlink /proc/1/ns/pid`. Identical inodes = shared namespace (`--pid=host`); different = isolated (correct).
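The inode comparison above can be sketched on any Linux host without Docker — two processes share a PID namespace exactly when their `/proc/<pid>/ns/pid` symlinks resolve to the same `pid:[inode]` value (the PIDs here are illustrative):

```shell
#!/bin/sh
# Sketch: PID-namespace identity via /proc. A container started without
# --pid=host would show a DIFFERENT inode than the host; two processes in
# the same namespace (as below) show the same one.
mine=$(readlink /proc/self/ns/pid)    # e.g. pid:[4026531836]
shell=$(readlink /proc/$$/ns/pid)     # the invoking shell's PID namespace

echo "this process: $mine"
echo "parent shell: $shell"

if [ "$mine" = "$shell" ]; then
    echo "SHARED pid namespace"
else
    echo "ISOLATED pid namespace"
fi
```

Run the same `readlink` against a containerized PID from the host to see a different inode when isolation is working.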

Follow-up: Your monitoring app needs to see all host processes for metrics. Can you grant it limited process visibility without --pid host?

A container's memory usage spikes. `docker stats` shows 800MB used, but container's memory limit is 1GB. Suddenly OOM killed (exit code 137). Why? The app only allocated 512MB intentionally. What's consuming the extra 288MB?

Memory limits are enforced at the kernel level via cgroups, and the container's accounted usage includes more than the app's allocations: (1) app heap/stack (rss/anon), (2) kernel memory charged to the cgroup (page cache, network and socket buffers, slab), (3) runtime overhead. Cgroups v1 reports these as `cache`/`rss` in `memory.stat`; v2 reports `file`/`anon`. The app allocated 512MB, but the kernel cached ~288MB of page cache for the container's filesystem I/O; total 800MB, all counted against the 1GB limit. When the app (or kernel) needs more, the kernel first reclaims page cache (shrinker and writeback paths) to stay under 1GB. If reclaim can't free enough, the cgroup OOM killer fires, selects the process consuming the most memory (usually the main app), and sends SIGKILL. Result: exit code 137 (128 + SIGKILL). Diagnose: (1) `docker stats app` before the crash shows the memory trend. (2) Inside the container, `free -h` and `cat /proc/meminfo` before OOM show page cache usage. (3) On the host, read the cgroup's `memory.stat` (cgroups v1: `/sys/fs/cgroup/memory/docker/<container-id>/memory.stat`; v2: `/sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.stat`) — it breaks usage into cache, rss, swap, etc. Fixes: (1) increase the limit: `-m 2g`. (2) disable swap for the container by setting `--memory-swap` equal to `--memory`. (3) tune reclaim with `vm.swappiness=0` so the kernel prefers dropping page cache over swapping anonymous memory. Verify: `docker run -m 1g --memory-swap 1g myapp` prevents swap use and makes the OOM killer more predictable.
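The accounting in this scenario can be made concrete by parsing a `memory.stat`-style breakdown. The sample values below are made up to mirror the question (512MB rss + 288MB cache = 800MB); a real file lives under the cgroup paths above:

```shell
#!/bin/sh
# Sketch: split cgroup memory usage into app memory vs kernel page cache.
# Fabricated v1-style memory.stat written to /tmp so the example is
# self-contained; 536870912 = 512MB, 301989888 = 288MB.
cat > /tmp/memory.stat.sample <<'EOF'
cache 301989888
rss 536870912
swap 0
EOF

awk '
    $1 == "cache" { cache = $2 }
    $1 == "rss"   { rss = $2 }
    END {
        printf "rss   %4d MB (app allocations)\n", rss / 1048576
        printf "cache %4d MB (kernel page cache)\n", cache / 1048576
        printf "total %4d MB counted against the limit\n", (rss + cache) / 1048576
    }
' /tmp/memory.stat.sample
```

The `total` line is what the limit and the OOM killer see, even though the app only asked for the `rss` portion.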

Follow-up: Cgroups v2 separates kernel memory accounting. How do you monitor just user-memory to track app allocations separately from kernel overhead?

A container runs as user 1000 (not root). Inside: `ls -la /etc/shadow` shows root-owned file (UID 0). Container should not be able to read it. But it reads it successfully: `cat /etc/shadow` works. What's the security hole?

The user namespace (userns) is not isolated. The container runs as UID 1000 locally, but in the host's UID namespace: without userns remapping, UID 1000 in the container is UID 1000 on the host, and host-owned files keep their host permissions. Standard: `stat /etc/shadow` shows mode 0640, owner root:root — so if the container's user is in the shadow group (e.g., GID 42), or the file was left world-readable (0644), the container can read it. The fix is user namespace remapping: with userns-remap enabled, the container's UID 0 maps to an unprivileged host range (e.g., host UID 100000, with container UIDs 1-65535 mapping to 100001-165535). Root inside the container is non-root on the host: even if /bin/bash runs as UID 0 inside, it's UID 100000 on the host (unprivileged). Fix: (1) enable remapping in the daemon config `/etc/docker/daemon.json` with `"userns-remap": "containeruser"` (or `"default"`, which makes Docker create a `dockremap` user); the subordinate ranges come from `/etc/subuid` and `/etc/subgid`. Note this is a daemon-wide setting, not a per-container `docker run` flag — per container you can only opt *out* with `--userns=host`. (2) Restart the daemon and verify: inside the container, `id` shows 0 (root); on the host, `ps aux | grep bash` shows the process running as UID 100000. (3) Verify file access against a host-owned file: `docker run -v /etc/shadow:/host-shadow:ro alpine:latest cat /host-shadow` → fails with EACCES (correct), because the container's root is host UID 100000, which has no read permission on a root-owned 0640 file.
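The daemon-side pieces of remapping can be sketched as below. The files are written to `/tmp` so the example has no side effects; the real paths are `/etc/docker/daemon.json`, `/etc/subuid`, and `/etc/subgid`, and `containeruser`/`100000` are the illustrative names from this answer:

```shell
#!/bin/sh
# Sketch: daemon config and subordinate-UID range behind userns-remap.
cat > /tmp/daemon.json <<'EOF'
{ "userns-remap": "containeruser" }
EOF

# Subordinate range: container UID 0 -> host 100000, and container UIDs
# 1-65535 follow as host 100001-165535 (65536 IDs total).
echo 'containeruser:100000:65536' > /tmp/subuid

# Sanity-check the mapping arithmetic: container UID u -> host UID base + u.
base=100000
u=1000
echo "container uid $u -> host uid $((base + u))"
```

So a process that is UID 1000 inside the remapped container appears as UID 101000 on the host, with no privileges over real host files.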

Follow-up: User namespace remapping breaks volume mounts (UIDs don't match). How do you fix permission mismatches when sharing host files with userns container?

Container A and Container B run on the same host. Container A can see Container B's mount points in `/proc/1/mountinfo`. Container A tries to `mount -t tmpfs /mnt/container-b` and succeeds—it can now create files in Container B's mount namespace. Is this an isolation failure? How should mount namespaces work?

By default, each container has its own mount namespace (mnt namespace): mounts made inside Container A (even as root) do not appear in Container B. The scenario therefore indicates a shared mnt namespace — which is unusual, because Docker has no flag to share mount namespaces between containers; it would require something out-of-band, e.g., `nsenter` into another container's namespaces from a privileged container, or a host bind mount with shared propagation (`-v /mnt:/mnt:shared`). Note that `--ipc=shareable` and `--network=container:other` share only the IPC and network namespaces respectively; they do not share the mount namespace. Verify isolation: inside Container A, `mount -t tmpfs -o size=100m tmpfs /mnt/test` → the mount shows up in Container A's `df -h`, not in Container B's. Demonstrate: `docker run -d --name a alpine:latest sleep 1000`, `docker run -d --name b alpine:latest sleep 1000`, then compare `docker exec a mount` and `docker exec b mount` — each container maintains its own mount table. Definitive check on the host: compare `readlink /proc/<pid-of-A>/ns/mnt` and `readlink /proc/<pid-of-B>/ns/mnt` — same inode = shared (the failure in the scenario); different inodes = isolated (correct). Also audit volume propagation: `docker inspect a | jq '.Mounts[].Propagation'` — `shared`/`rshared` lets mounts leak across namespaces; the default `rprivate` does not. Fix: start containers without namespace-sharing tricks or shared propagation; the Docker default is safe.
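The same inode check works for mount namespaces. Runnable without Docker or root — a plain subshell shares our mnt namespace and shows the same inode, where an isolated container would show a different one:

```shell
#!/bin/sh
# Sketch: mount-namespace identity via /proc/<pid>/ns/mnt. Same resolved
# inode = shared mount namespace; different = isolated.
a=$(readlink /proc/self/ns/mnt)
b=$(sh -c 'readlink /proc/self/ns/mnt')   # child process, same mnt namespace

echo "parent: $a"
echo "child:  $b"
[ "$a" = "$b" ] && echo "same mnt namespace (expected for a plain subshell)"
```

Substituting the PIDs of two containers' init processes turns this into the audit check described above.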

Follow-up: Your K8s sidecar pattern intentionally shares network namespace between app and sidecar container. Does this also share mnt namespace, or are they independent?

You set CPU limit: `docker run --cpus 1 myapp`. The app can use up to 1 CPU. But how does the kernel enforce this? Show the cgroups mechanism, CPU quota, and what happens when the app exceeds 1 CPU for a microsecond.

CPU limits are enforced via CFS (Completely Fair Scheduler) bandwidth control in cgroups. `--cpus 1` translates to `cpu.cfs_quota_us = 100000` (microseconds) against the default `cpu.cfs_period_us = 100000`: in every 100ms period, the container may consume up to 100ms of CPU time (1 full CPU). Mechanism: (1) the kernel tracks runtime per cgroup. (2) As the container's processes run, they draw down the period's quota. (3) When the quota is exhausted mid-period, the cgroup's runnable tasks are throttled — dequeued from the CPU until the period ends. So if the app tries to use 101ms in a 100ms period, it runs for 100ms and then pauses. (4) At the period boundary, the quota refills. Show: `cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat` (cgroups v1; on v2, `cpu.stat` under the container's scope) reports `nr_throttled` (throttle count) and `throttled_time` (nanoseconds spent throttled; v2 calls it `throttled_usec`). Run a CPU-heavy workload using an image that includes stress-ng: `docker run --cpus 1 -d <stress-image> stress-ng --cpu 4 --timeout 10s`, then `docker stats` shows CPU% capped at ~100% even with 4 spinning workers. Multiple cores: `--cpus 2` raises the quota to 200000us per 100000us period. Key contrast with memory: exceeding the CPU quota never kills the process (no OOM equivalent) — it's only throttled, pausing until the quota resets.
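The `--cpus`-to-quota translation is simple arithmetic: `quota_us = cpus × period_us`. A minimal sketch with the default 100ms period (the `--cpus` values are just examples):

```shell
#!/bin/sh
# Sketch: how --cpus maps onto CFS bandwidth settings.
period=100000   # cpu.cfs_period_us default, in microseconds

for cpus in 1 1.5 2; do
    # awk handles the fractional --cpus values (e.g. 1.5 -> 150000)
    quota=$(awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d", c * p }')
    echo "--cpus $cpus  ->  cpu.cfs_quota_us=$quota cpu.cfs_period_us=$period"
done
```

This is why fractional limits like `--cpus 0.5` work: the container simply gets 50ms of runtime per 100ms period, spread across however many cores its threads land on.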

Follow-up: CPU throttling causes latency spikes. Your app has p99 tail latency SLA. How do you prevent throttling while capping overall resource use?

Network namespace isolation: two containers, default bridge network. Container A `nc -l -p 8080` binds to port 8080 on bridge IP 172.17.0.2. Container B tries `nc 172.17.0.2:8080`. Does it connect? Explain the network path and how bridge networking spans namespaces.

Yes, Container B connects successfully. Both containers have separate network namespaces (isolated TCP/IP stacks) but are connected via the docker0 bridge. When Container A listens on port 8080, it binds a socket on eth0 inside its own namespace — one end of a veth pair whose other end is attached to docker0. When Container B connects, it sends a TCP SYN to 172.17.0.2:8080, and the packet travels: Container B's netns → its veth device → docker0 bridge → Container A's veth → Container A's netns → the listening socket. Show: Container A: `docker run -d --name a alpine:latest nc -l -p 8080`; Container B: `docker run -d --name b alpine:latest sleep 1000`; then `docker exec b nc -zv 172.17.0.2 8080 && echo OK`. Result: OK. The bridge forwards frames based on MAC addresses (L2). Each container has its own network namespace (separate routing table, netfilter rules, socket table). Show: `docker exec a ip link show` and `docker exec b ip link show` → different interfaces. On the host: `brctl show` or `bridge link` shows the veth interfaces attached to docker0. Verify isolation: in Container A, `netstat -tulpn` shows the 8080 binding; in Container B, `netstat -tulpn` shows nothing (its socket table doesn't know about A's socket). But B can still connect via bridge forwarding. Result: network namespaces isolate process and socket tables, while the bridge provides L2 forwarding between them.

Follow-up: If you run Container C with `--network host`, can it see Container A and B's socket bindings in netstat?

A container's IPC namespace is the same as the host (`--ipc host`). Inside the container, you can `ipcs -m` (list shared memory segments) and see host's IPC objects. Normally, containers should be isolated. How do you verify namespace isolation and prevent accidental sharing?

The IPC namespace controls System V IPC objects (shared memory segments, semaphores, message queues) and POSIX message queues. With `--ipc host`, the container shares the host's IPC namespace (no isolation); without it, each container gets an isolated IPC namespace (the default). Verify isolation: (1) `docker inspect container | jq '.HostConfig.IpcMode'` shows `"host"` when shared (empty or `"private"` otherwise). (2) Compare namespace inodes on the host: `readlink /proc/<container-pid>/ns/ipc` vs `readlink /proc/1/ns/ipc` — different inodes = isolated, same = shared. (3) Inside the container, `ipcs -m` lists the host's shared memory segments under `--ipc host`, and is empty in a fresh isolated namespace. Audit production: `docker inspect $(docker ps -q) | jq -r '.HostConfig.IpcMode'` — any `"host"` on a container that shouldn't have it is an audit finding. Fix: remove `--ipc host`; in docker-compose, remove `ipc: host`; in Kubernetes, remove `hostIPC: true` from the Pod spec. Verify after the fix: `docker run alpine:latest ipcs -m` shows no segments (correct: the container has its own IPC namespace). Security impact: sharing IPC lets one container read or modify host processes' shared memory — data exposure and a potential privilege-escalation path. Best practice: no `--ipc host` unless there's an explicit requirement (e.g., a system monitoring agent).
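The audit step can be sketched offline. The sample input below is a stand-in for the output of `docker inspect $(docker ps -q)` piped through jq (container names and modes are invented), so the example runs without Docker:

```shell
#!/bin/sh
# Sketch: flag containers whose IpcMode is "host". Fabricated sample of
# "<name> <IpcMode>" lines, as a real audit script would extract via jq.
cat > /tmp/ipcmodes.sample <<'EOF'
/web private
/agent host
/db private
EOF

awk '$2 == "host" { print "AUDIT FINDING: " $1 " shares host IPC namespace" }' \
    /tmp/ipcmodes.sample
```

In a real audit, replace the heredoc with `docker inspect $(docker ps -q) | jq -r '.[] | .Name + " " + .HostConfig.IpcMode'`.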

Follow-up: Two containers need to share SHM for IPC without using --ipc host. How do you set up isolated-but-shared IPC?

Your containerized database (PostgreSQL, 8GB data) runs with `-m 16g` memory limit. During backup, memory spike to 15GB (page cache for backup file I/O). Cgroups count page cache as memory usage. When backup ends, page cache should be reclaimed, but memory stays high (14GB) for hours. Is this a memory leak, or is page cache pressure normal?

Page cache behavior is normal, not a leak. Cgroups count cached file pages as memory usage (included in the limit). When the database reads and writes the large backup file, the kernel caches those pages as an optimization for future access; after the backup, the cache isn't immediately freed — the kernel keeps it until something else needs the memory. If the app later allocates while the cgroup is near its limit, the kernel reclaims cached pages transparently (LRU eviction). Memory stays high simply because nothing has created reclaim pressure. Verify: (1) `docker stats` shows memory high. (2) Inside the container, `free -h` or `cat /proc/meminfo` lists cache separately from application memory. (3) The cgroup's `memory.stat` shows the breakdown (the field is `cache` in cgroups v1, `file` in v2): e.g., `docker exec db cat /sys/fs/cgroup/memory.stat | grep -E '^(cache|file) '`. (4) To prove it's reclaimable: on the host, as root, `sync && echo 3 > /proc/sys/vm/drop_caches` drops clean page cache (not app memory), and usage falls back to baseline. This is the kernel's caching working as designed. For production: (1) don't alarm on high memory after backups. (2) If the limit is tight and backup I/O plus the app's working set would exceed it, raise the limit during the backup window. (3) `vm.swappiness=0` biases reclaim toward dropping page cache rather than swapping anonymous memory. Verify: `docker inspect db | jq '.HostConfig.Memory'` shows the limit, watch `docker stats` post-backup, then force `drop_caches` and watch memory normalize.
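The "is it cache or is it a leak?" check reduces to reading the cache figure next to the total. A minimal sketch against the host's `/proc/meminfo` (values are in kB; run it inside the container for the container's view):

```shell
#!/bin/sh
# Sketch: separate reclaimable page cache from total memory before
# concluding anything is leaking.
cached=$(awk '$1 == "Cached:" { print $2 }' /proc/meminfo)
total=$(awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo)

echo "page cache: $((cached / 1024)) MB of $((total / 1024)) MB total"
echo "this portion is reclaimable; it is not an application leak"
```

If the post-backup "extra" memory shows up here as cache, it will be given back under allocation pressure (or after `drop_caches`), which is exactly the distinction between a leak and normal caching.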

Follow-up: Page cache keeps memory high. For a memory-constrained pod running on K8s, does this trigger eviction even though it's not app memory?
