You run `docker run alpine:latest /bin/sh`. Break down exactly what happens from command entry to your shell prompt appearing. Where do containerd and runc fit? What happens to the image file on disk?
(1) Docker CLI parses args and sends an API request to the Docker daemon (dockerd) over `/var/run/docker.sock`; dockerd calls containerd's gRPC API. (2) containerd checks whether `alpine:latest` exists locally (content-addressed compressed blobs under `/var/lib/containerd/io.containerd.content.v1.content/`); if missing, it pulls from the registry. (3) containerd creates a snapshot (a CoW layer on top of the image rootfs) under `/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/`. (4) containerd builds the container spec (OCI Runtime Specification JSON: rootfs path, mounts, cgroup resources, namespaces). (5) containerd launches a containerd-shim, which invokes the OCI runtime: `runc create` then `runc start`. (6) runc reads the spec and makes the kernel calls: `clone()`/`unshare()` for namespaces (pid, mount, network, ipc, uts, user), writes to the cgroup filesystem for resource limits (memory, cpu), `pivot_root()` into the rootfs. (7) runc `execve()`s the init process (/bin/sh). (8) That process runs as PID 1 inside the isolated namespaces; runc itself exits after start, leaving the shim as the container's supervising parent. Show this: `docker run -d alpine:latest sleep 1000`, then `ps aux` on the host shows the containerd-shim process (runc is gone — it only runs during create/start); inside the container, `ps aux` shows only sleep as PID 1. Verify layers: `docker inspect alpine:latest | jq '.[0].RootFS.Layers'` lists the immutable image layers; `docker inspect $(docker ps -q) | jq '.[0].GraphDriver'` shows the mutable CoW container layer.
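The namespace isolation in step (6) can be observed without Docker at all: every process's namespace membership is exposed as symlinks under `/proc`. A minimal sketch in plain shell (no container runtime assumed):

```shell
# Each entry under /proc/<pid>/ns/ is a symlink naming a namespace,
# e.g. "pid:[4026531836]"; two processes share a namespace iff the
# links resolve to the same inode. Compare this process against PID 1:
for ns in pid mnt net ipc uts; do
  mine=$(readlink /proc/self/ns/$ns)
  init=$(readlink /proc/1/ns/$ns 2>/dev/null || echo "unreadable")
  if [ "$mine" = "$init" ]; then verdict="shared"; else verdict="separate"; fi
  echo "$ns: $mine ($verdict vs PID 1)"
done
```

Run on the host every line reads `shared`; run inside a freshly started container, the namespaces runc unshared read `separate`.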
Follow-up: If runc fails to exec the init process, where does the error surface? Can you recover the container state?
You run a container with `--privileged` flag. Compare to standard unprivileged container. What kernel capabilities do you gain? Where in the runtime is this enforced? How would you audit if a running container is actually privileged?
`--privileged` lifts the runtime's security restrictions wholesale: all capabilities, no seccomp filter, no AppArmor/SELinux profile, no masked `/proc` or `/sys` paths, an unrestricted device cgroup, and host devices exposed (`/dev` is populated from the host, not a curated set of pseudo-devices). There is no single `privileged` field in the OCI spec — dockerd expands the flag into the spec it hands runc: full capability sets (bounding, effective, permitted, inheritable), an empty seccomp profile, and the full host device list. Kernel capabilities: an unprivileged container gets a curated default list (CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_KILL, etc.); a privileged one gets every capability the kernel defines (41 on recent kernels: CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, etc.). Enforcement: runc's `libcontainer` applies the spec's capability sets via `capset()`/`prctl()` before exec and skips installing any seccomp filter. To audit: (1) `docker inspect $(docker ps -q) | jq '.[0].HostConfig.Privileged'` (shows true/false). (2) Inside the container: `grep CapEff /proc/self/status` — privileged shows the full bitmask, unprivileged a subset; `capsh --print` decodes it into names. (Note `getcap` is the wrong tool here — it reads *file* capabilities, not process capabilities.) (3) Check device access: `ls -la /dev/sd*` (privileged sees host disks, unprivileged sees only container pseudo-devices). (4) For prod: use `kubectl get pod -o json | jq '.spec.containers[].securityContext.privileged'` to scan for privileged pods.
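Audit step (2) needs no extra tooling — the raw capability masks are plain text in `/proc`. A minimal sketch:

```shell
# CapEff / CapBnd in /proc/<pid>/status are hex bitmasks of the process's
# effective and bounding capability sets. A --privileged container shows
# every bit set (e.g. 000001ffffffffff on a 41-capability kernel); a
# default container shows a sparse subset; a fully-dropped one shows zero.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
cap_bnd=$(awk '/^CapBnd:/ {print $2}' /proc/self/status)
echo "effective: $cap_eff"
echo "bounding:  $cap_bnd"
```

`capsh --decode=$cap_eff` (from the libcap package, if installed) turns the mask back into capability names.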
Follow-up: A container needs CAP_SYS_ADMIN for one operation but not all 41 caps. How do you grant selective caps without --privileged?
You run `docker run -m 512m myapp:latest`. Explain how memory limits are enforced from the OCI spec down to Linux kernel cgroups. What happens when the app tries to allocate 600MB? Show the Linux kernel trace.
(1) Docker CLI parses `-m 512m`; dockerd writes it into the OCI spec: `"resources": { "memory": { "limit": 536870912 } }` (bytes). (2) runc translates that into a write to the cgroup filesystem: on cgroup v2, `memory.max` in the container's cgroup (e.g. `/sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max` with the systemd driver); on v1, `memory.limit_in_bytes`. (3) The kernel's memory controller enforces the limit at page allocation, not at `malloc()`: with default overcommit, `malloc(600MB)` succeeds immediately (it only reserves address space), and pages are charged to the cgroup as they are first touched. When the charge would exceed 512MB, the kernel first attempts reclaim (LRU eviction, swap if allowed); if reclaim can't free enough, the cgroup-local OOM killer fires. (4) The OOM killer SIGKILLs the worst offender in the cgroup — here the app, which is PID 1, so the container dies. (5) The shim reaps the exit and Docker reports exit code 137 (128 + 9 = SIGKILL). Kernel path, roughly: page fault → `__alloc_pages_slowpath()` → memory-cgroup charge fails → direct reclaim within the cgroup → OOM killer. Trace: `dmesg | tail -20` on the host shows a line like `Killed process PID (app) total-vm:700000kB, anon-rss:512000kB`. Note the app gets no catchable error — it is killed mid-instruction; `malloc()` returning NULL is rare under overcommit. Verify: `docker run -m 512m --name test myapp; docker inspect test | jq '.[0].State'` shows `OOMKilled: true` and the exit code. Test: `stress-ng --vm 1 --vm-bytes 600m --timeout 10s` in a 512m container → OOM kill within seconds.
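The two constants in this flow are worth sanity-checking — the byte value that lands in `memory.max` and the exit code Docker reports. A quick sketch:

```shell
# -m 512m is converted to bytes before it reaches the OCI spec / cgroup file:
limit=$((512 * 1024 * 1024))
echo "memory.max: $limit bytes"       # 536870912

# Docker reports a signal-killed container as 128 + signal number;
# the OOM killer uses SIGKILL (9):
sigkill=9
echo "exit code: $((128 + sigkill))"  # 137
```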
Follow-up: The OOM killer picks the "worst offender" PID. How do you ensure your critical service PID is never killed in a memory-constrained pod with multiple processes?
containerd stores image layers in `/var/lib/containerd/io.containerd.content.v1.content/`. You have 3 containers running the same base image (node:18). How many copies of the base image are stored on disk? Show the CoW mechanism that prevents 3x storage waste.
One copy. The immutable image layers are stored once (content store plus snapshotter); each container adds only a thin writable snapshot on top. With a 400MB base, three containers use 400MB + 3×(container-layer size), not 1.2GB. CoW mechanism: OverlayFS (AUFS in legacy setups) stacks layers — the image layers form the read-only `lowerdir`, the container layer is the writable `upperdir`. On the first write to a file, OverlayFS copies it up from lower to upper, then applies the write; reads of untouched paths go straight to the shared lower layer, backed by the same file on disk for all three containers. Show: start three `node:18` containers and run `docker diff <container-id>` on each — only container-specific writes appear. `docker system df` shows the three containers sharing one copy of the image. In containerd directly: `ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/` shows one snapshot dir per container, each with an `fs/` (upper) and `work/` dir; the overlay mount's `lowerdir=` option chains the parent image layers. Result: 1× storage for the base, 3× only for the deltas. At scale the savings dominate: 1000 containers on a 400MB image keep one base copy instead of 1000 — roughly 400MB × 999 ≈ 400GB of duplication avoided.
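The scale claim reduces to arithmetic; `delta_mb` below is a hypothetical per-container writable-layer size, chosen only for illustration:

```shell
base_mb=400     # shared read-only image layers
n=1000          # number of containers
delta_mb=50     # hypothetical CoW writes per container
naive=$((base_mb * n))               # every container carrying its own base copy
shared=$((base_mb + delta_mb * n))   # one base + N writable deltas
echo "naive:  ${naive} MB"           # 400000 MB, i.e. ~400 GB
echo "shared: ${shared} MB"
echo "saved:  $((naive - shared)) MB"
```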
Follow-up: OverlayFS performance degrades when container layer has 100k files (inode lookup slow). How do you monitor and detect this degradation?
You run a container with `--network host`. Compare to default bridge network. What's the runtime difference? Can the container still be isolated from host processes? Show a debugging approach to verify network namespace isolation.
`--network host` skips network-namespace isolation: the OCI spec dockerd generates simply omits the network namespace from `linux.namespaces`, so runc never does `unshare(CLONE_NEWNET)`, and the container uses the host's interfaces and ports directly. Default bridge mode: the spec includes a network namespace; runc creates it, and dockerd's libnetwork then plumbs a veth pair — one end becomes the container's `eth0`, the other attaches to the `docker0` bridge. The container can still be isolated in every other dimension: `--network host` leaves the pid, mount, ipc, and uts namespaces untouched. Verification: (1) `lsns -t net` on the host — a bridge container adds a net-namespace entry, a host-network container doesn't. (Note `ip netns list` won't show them: Docker doesn't symlink its namespaces into `/var/run/netns`.) (2) Compare `docker run --network host alpine:latest ip link show` with `docker run alpine:latest ip link show`: host mode lists all host interfaces; bridge mode shows only `lo` and the container's `eth0`. (3) Inside a host-network container, `ps aux` still shows only container processes (PID namespace intact), but `netstat -tulpn` shows every port the host is listening on, including ones you never bound. Debugging: `nsenter -n -t $(docker inspect -f '{{.State.Pid}}' <container>) ip addr` enters the container's net namespace from the host — for a host-network container the output is identical to the host's own `ip addr`, confirming there is no separate namespace.
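The `netstat` observation in (3) follows from `/proc/net/*` being rendered per network namespace — the same path shows different contents on either side of the isolation boundary. A small sketch:

```shell
# /proc/net/tcp lists this namespace's TCP sockets; column 4 is the state,
# and 0A means LISTEN. Count the listeners visible from *this* namespace:
listeners=$(awk 'NR > 1 && $4 == "0A"' /proc/net/tcp 2>/dev/null | wc -l)
echo "TCP listeners visible here: $listeners"
# In a --network host container this count matches the host's; in a
# bridge container it counts only sockets bound inside the container.
```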
Follow-up: A container with host network shouldn't listen on privileged ports (< 1024) without root. How does the runtime enforce this?
runc (the OCI runtime) is responsible for actually starting containers. Explain the journey of a container's first process: from OCI config.json → runc create → runc start → first syscall executed. What if runc crashes mid-start?
(1) containerd prepares the OCI bundle: a directory with `rootfs/` and `config.json`. (2) `runc create --bundle <dir> <id>`: runc reads config.json, sets up cgroups and namespaces, mounts the rootfs, and records state under `/run/runc/<id>/state.json`. The init process is started but deliberately parked — it blocks on a fifo before exec. (3) By that point runc's child has entered the namespaces (`clone()`/`unshare()`), written the cgroup limits via the cgroup filesystem, `pivot_root()`ed into the rootfs, and set the environment from the spec. (4) `runc start <id>` signals through the fifo; the parked process calls `execve(init-process, args)` and PID 1 begins. The first syscall after the execve belongs to the app itself (shell, app, etc.). Show: `runc create --bundle /bundle mycontainer && runc start mycontainer`. If runc crashes mid-create (e.g. OOM during cgroup setup): containerd sees the non-zero exit and fails the creation, but state may be left half-built under `/run/runc/`. Recovery: `runc list` shows the stale container; `runc delete <id>` (or `runc delete --force <id>`) cleans it up. Verify: `strace -f runc create ...` shows the syscall sequence (unshare, cgroup writes, pivot_root, execve). Note runc is not a resident process — it exits after start — so a runc crash only matters during create/start; containerd's shim surfaces it as a failed task rather than respawning anything. Monitor: `docker events | grep container` shows the state transitions; a failed creation shows an "error" event.
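The bundle is just a directory convention, easy to sketch by hand. This is abbreviated — a real config.json also needs namespaces, mounts, and user fields, and `runc spec` will generate a complete default for you:

```shell
# Build the skeleton of an OCI bundle: rootfs/ next to config.json.
bundle=$(mktemp -d)
mkdir -p "$bundle/rootfs"

# The essential top-level fields: spec version, the process execve'd at
# `runc start`, and where the rootfs lives (relative to the bundle dir).
cat > "$bundle/config.json" <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": { "args": ["/bin/sh"], "cwd": "/" },
  "root": { "path": "rootfs" }
}
EOF
ls "$bundle"
# With runc installed (as root): runc create --bundle "$bundle" demo && runc start demo
```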
Follow-up: If the rootfs (base image) is corrupted and the init process binary doesn't exist, when is this discovered—at runc create or runc start?
You have a containerd daemon on a remote host. Your local Docker CLI connects via Docker socket forwarding (ssh -L). Trace the full call chain from `docker run` locally to a process starting on the remote host. Where does containerd sit, and where does runc sit?
(1) `ssh -L /tmp/docker.sock:/var/run/docker.sock remote-host` forwards the remote Docker daemon's socket to a local path (OpenSSH supports unix-socket forwarding). (2) Local: `docker -H unix:///tmp/docker.sock run alpine:latest ps` — the Docker CLI sends the API request through the tunnel; there is no local daemon involved. (3) The request lands on the remote dockerd, which calls the remote containerd over gRPC — everything server-side lives on the remote host. (4) Remote containerd parses the image ref, checks its content store or pulls. (5) Remote containerd allocates a container ID and creates the OCI bundle (config.json + rootfs snapshot). (6) Remote containerd launches a shim, which invokes the remote runc: `runc create` then `runc start`. (7) Remote runc reads config.json and execs the process on the remote host. Full chain: local Docker CLI → SSH tunnel → remote dockerd → remote containerd → remote runc → process on the remote host. containerd and runc always sit on the same host; only the API client crosses the network — you can't drive runc remotely. (Modern alternative: `DOCKER_HOST=ssh://user@remote-host docker run ...` lets the CLI manage the tunnel itself.) Verify: run a detached container through the tunnel, then `docker -H unix:///tmp/docker.sock exec <id> ps aux` shows the process, and `ssh remote-host ps aux | grep containerd-shim` confirms it lives on the remote machine.
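The two tunnel setups can be sketched side by side; `user@remote-host` is a placeholder and nothing here contacts a real daemon:

```shell
# Manual tunnel: forward the remote daemon's unix socket to a local path
# (-n/-N/-T: no stdin, no remote command, no tty), then point -H at it:
#
#   ssh -nNT -L /tmp/docker.sock:/var/run/docker.sock user@remote-host &
#   docker -H unix:///tmp/docker.sock run -d alpine:latest sleep 1000
#
# Built-in (Docker 18.09+): the CLI opens the SSH connection itself:
DOCKER_HOST="ssh://user@remote-host"
echo "CLI would connect via: $DOCKER_HOST"
```

Either way the daemon, containerd, the shim, and runc all execute on `remote-host`; only the API client is local.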
Follow-up: If the SSH tunnel drops mid-container-start, does the remote process continue running? Can you reconnect and manage it?
A container has both `--init` and custom `--entrypoint /myapp`. Explain what PID 1 actually is in this case. How does `--init` (tini) handle zombie processes differently from /myapp as direct PID 1?
Without `--init`: PID 1 is /myapp directly. If /myapp spawns children that exit and it never `wait()`s on them, they linger as zombies (exit status unreaped); `ps aux` shows them as `<defunct>` with state `Z`. Worse, any orphaned descendant anywhere in the container is re-parented to PID 1, so /myapp inherits reaping duty for processes it never spawned — most apps install no SIGCHLD handling for this, so zombies accumulate and pin PID-table entries. With `--init`: Docker injects tini as PID 1 (visible as `/sbin/docker-init` in `ps`), which spawns /myapp — your `--entrypoint` — as its child. tini does two things: it forwards signals (SIGTERM, etc.) to the child, and it loops on `waitpid(-1, ...)`, reaping every child that dies, including orphans re-parented to it, so zombies never persist. So with both flags, PID 1 is tini and /myapp runs as its child; the app needs no init-process logic of its own.
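The zombie mechanics are reproducible in a few lines of shell, no container required (`/tmp/zombie_pid` is just a scratch file for this sketch):

```shell
# The subshell backgrounds a short-lived child, then execs a longer sleep.
# The longer sleep inherits the child but never wait()s on it, so once the
# child exits it lingers as a zombie until its parent dies and init (or
# tini, in an --init container) reaps it.
(sleep 1 & echo $! > /tmp/zombie_pid; exec sleep 3) &
parent=$!
sleep 2                                   # child has exited, parent still alive
zpid=$(cat /tmp/zombie_pid)
state=$(awk '{print $3}' "/proc/$zpid/stat")   # field 3 is the process state
echo "child $zpid state: $state"               # Z = zombie
wait "$parent"
```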
Follow-up: Your app is a Go binary compiled with cgo. Does cgo require special handling for zombie reaping, or does --init solve it?