Docker Interview Questions

Rootless Containers and User Namespaces


Security mandate: no container process may run as real root (host UID 0). The standard Docker daemon allows root-in-container. Implement rootless Docker on a host. Explain what changes in the runtime architecture, and what security guarantees are achieved.

Rootless Docker: run the Docker daemon and containers as an unprivileged user (e.g., UID 1000), not as root (UID 0). Two layers of isolation: (1) the daemon runs as user `dockeruser` (UID 1000). (2) Containers run in a user namespace: container UID 0 is remapped to an unprivileged host UID. (By default rootless Docker maps container UID 0 to the rootless user's own UID; this discussion assumes a mapping where container UID 0 lands on the subordinate base, i.e., host UID 100000.) Install: (1) as root, prepare prerequisites: install `uidmap`, create user `dockeruser`, allocate subuid/subgid ranges in /etc/subuid and /etc/subgid (e.g., `dockeruser:100000:65536`, meaning container UIDs 0-65535 map to host UIDs 100000-165535). (2) as `dockeruser`, run `dockerd-rootless-setuptool.sh install` to set up the per-user daemon (typically as a systemd user service). (3) Client: `docker context use rootless` to connect to the rootless daemon socket. Result: (1) container UID 0 inside = host UID 100000 (unprivileged, can't access host resources). (2) if a container escapes, the attacker is UID 100000 (non-root on host). (3) the Docker socket is unprivileged (owned by UID 1000); other users can't access it. Architecture change: (1) standard Docker: daemon (root) + containerd (root) + runc (root). (2) Rootless: daemon (UID 1000) + containerd (UID 1000) + runc (UID 1000), all inside a user namespace set up by RootlessKit. (3) Cgroups: rootless requires cgroups v2 (unified hierarchy), delegated via systemd. (4) Network: rootless uses slirp4netns (a user-space network stack), not the bridge driver (which requires CAP_NET_ADMIN on the host). Security: (1) container root = host non-root, can't perform privileged host operations. (2) kernel exploits from inside a container are less impactful (the attacker lands as an unprivileged host UID). (3) all containers share the daemon's single UID mapping (unlike per-container remapping schemes). Verify: `docker run alpine:latest id` shows UID 0 inside; run a long-lived container (`docker run -d alpine sleep 300`) and `ps` on the host shows the process under the remapped host UID, not root. Audit: `ps aux | grep dockerd` shows the daemon running as UID 1000 (not 0).
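The remapping above is plain offset arithmetic. A minimal sketch, assuming the `dockeruser:100000:65536` mapping used in this example (`map_to_host` is a hypothetical helper, not a Docker command):

```shell
# A uid_map line of "0 100000 65536" means: container UID c (0..65535)
# maps to host UID 100000 + c; any UID outside the range has no mapping.
map_to_host() {
  base_container=0; base_host=100000; length=65536; c="$1"
  if [ "$c" -ge "$base_container" ] && [ "$c" -lt $((base_container + length)) ]; then
    echo $((base_host + c - base_container))
  else
    echo "unmapped"
  fi
}
map_to_host 0       # container root -> 100000
map_to_host 1000    # -> 101000
map_to_host 70000   # outside the 65536-wide range -> unmapped
```

This is why a container escape lands the attacker at host UID 100000: the kernel translates every credential through this table.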

Follow-up: Rootless Docker can't use bridge networking (needs CAP_NET_ADMIN). How do you configure networking for rootless containers?

A container's root (UID 0 inside) needs to read the host file `/etc/config.json`, which is owned by host root (UID 0) with mode 0600, i.e., readable only by root. Container UID 0 (mapped to host UID 100000) can't read it (permission denied). How do you grant access to sensitive host files from a rootless container?

User namespace remapping creates a UID mismatch: container UID 0 ≠ host UID 0. `/etc/config.json` is owned by host root (UID 0), mode 0600. When container UID 0 (mapped to host 100000) tries to read, the kernel checks: file owner = 0, process UID = 100000, not equal, and the "other" bits grant nothing → denied. Solutions: (1) Change file ownership on the host: `chown 100000:100000 /etc/config.json` so the mapped UID owns it. Caveat: requires root on the host. (2) Change file mode: `chmod 644 /etc/config.json` (world-readable), but this weakens security (every host user can read the sensitive config). (3) Bind-mount read-only: `docker run -v /etc/config.json:/config.json:ro ...` prevents writes but does not fix the ownership mismatch by itself. (4) Copy into a Docker-managed volume: stage the file into a named volume (e.g., with `docker cp` via a helper container) so it is created with ownership the container's UID can read. (5) Best practice: bake non-secret config into the image or pass it via environment variables rather than mounting from the host. If you must mount: (1) on the host, create the config file owned by UID 100000 (or world-readable if it isn't sensitive). (2) mount it into the container with the `:ro` flag. Result: the container reads the config without write access, least privilege. Verify: inside the rootless container, `cat /config.json` works. Test: `docker run -v /etc/config.json:/config.json:ro alpine:latest cat /config.json` (should work after the chown on the host). For prod: store secrets in managed storage (Docker secrets, Kubernetes Secrets), not host files.
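The denial above is the ordinary discretionary access check, which compares only numeric IDs. A simplified sketch (group bits omitted; `can_read` is a hypothetical helper, with modes written as decimal digit strings like 600):

```shell
# Kernel-style read check reduced to owner/other bits: allowed if the
# matching permission digit has the read bit (4) set.
can_read() {  # args: file_owner_uid mode_digits process_uid
  owner=$1; mode=$2; uid=$3
  if [ "$uid" -eq "$owner" ]; then digit=$(( mode / 100 % 10 ))
  else digit=$(( mode % 10 )); fi
  if [ $(( digit & 4 )) -ne 0 ]; then echo allowed; else echo denied; fi
}
can_read 0 600 100000       # root-owned 0600, remapped container root -> denied
can_read 0 644 100000       # after chmod 644 -> allowed
can_read 100000 600 100000  # after chown 100000 -> allowed
```

The three calls correspond to the starting state, solution (2), and solution (1) above.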

Follow-up: You use Ceph or NFS shared storage for container volumes. Does user namespace remapping affect UID/GID mapping on remote storage?

Kubernetes cluster with rootless containerd. You deploy a Pod that requires `privileged: true` (for kernel module loading, direct hardware access). Rootless mode should prevent this. Explain the enforcement and any workarounds.

Rootless containerd enforces that all containers are unprivileged (no blanket CAP_SYS_ADMIN, no access to host namespaces). When a Pod requests `privileged: true`, the request fails. Enforcement: (1) Runtime level: a privileged container needs full host device access and capabilities, which a rootless runtime cannot grant; container creation errors out. (2) Cluster policy: Pod Security admission (the `baseline`/`restricted` profiles; PodSecurityPolicy was removed in Kubernetes 1.25) can reject privileged Pods before they reach the node. (3) Result: Pod startup fails with an error such as "privileged mode not supported in rootless". Workarounds (not recommended, defeats the rootless purpose): (1) Switch that host to a standard (rootful) runtime. (2) Or grant specific capabilities instead of full privileged mode: `securityContext.capabilities.add: [SYS_MODULE]`. Rootless supports capabilities, but only within the remapped user namespace; kernel module loading in particular requires the capability in the initial user namespace, so it will not actually work from rootless containers. (3) For hardware access (GPU, NIC): use a device plugin abstraction, not privileged mode; the plugin manages device access safely. For production: (1) rootless + unprivileged = secure, but limits some workloads. (2) For workloads needing privileged mode, use a separate host with rootful containerd (behind strict access control). (3) Audit: `kubectl get pod -o json | jq '.spec.containers[].securityContext.privileged'` detects privileged requests; alert if one is attempted in a rootless cluster. (4) Enforce cluster-wide with Pod Security admission or a validating admission webhook. Verify: deploy a privileged Pod in the rootless cluster and watch the Pod status → error. A Pod that only adds capabilities may start (if policy allows), though the added capabilities are confined to the user namespace.
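A cheap detection sketch for the audit step, assuming you scan rendered Pod JSON (the embedded manifest below is a stand-in for `kubectl get pod -o json` output):

```shell
# Flag any container that requests privileged mode in a rootless cluster.
pod_json='{"spec":{"containers":[{"name":"app","securityContext":{"privileged":true}}]}}'
if printf '%s' "$pod_json" | grep -q '"privileged":[[:space:]]*true'; then
  echo "privileged request detected"
else
  echo "ok"
fi
```

A real audit would use `jq` against the API server, but a pattern match like this is enough to wire into a CI gate on manifests.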

Follow-up: Rootless containerd with CAP_SYS_MODULE (kernel module loading). Is this safe, or does it break rootless security model?

Rootless Docker on multiple hosts. Host A: subuid range 100000-165535 for user `dockeruser`. Host B: subuid range 200000-265535 for same user. Two containers (one on Host A, one on Host B) try to share a volume via NFS. UID mapping mismatch causes permission errors. How do you normalize UID/GID across hosts?

Subuid mismatch: Host A maps container UID 0 → host 100000. Host B maps container UID 0 → host 200000. An NFS file created on Host A by container UID 0 is owned by 100000 on the NFS server. When Host B's container (UID 0 = host 200000) tries to read, NFS checks: file owner = 100000, process UID = 200000 → denied. Solutions: (1) Allocate matching subuid ranges on all hosts: configure /etc/subuid to use the same base (e.g., all hosts use 100000-165535). This requires coordination at the infrastructure level. (2) NFS squashing: export with `all_squash,anonuid=100000,anongid=100000` so every client acts as UID 100000; both containers then write and read as the same UID (at the cost of per-user accounting). (3) Export with `no_root_squash` (risky: a client's root can act as root on the export). (4) Use a shared volume driver: some volume plugins (e.g., Portworx, REX-Ray) abstract UID handling. (5) In Kubernetes: use managed storage via PersistentVolumeClaims with `fsGroup`/supplemental-group handling rather than raw NFS paths. Practical: (1) standardize subuid on all cluster nodes: all use 100000-165535 for the rootless user. (2) document it in infrastructure-as-code. (3) for NFSv4: `idmapd` (from `nfs-utils`) translates user names, so configure `/etc/idmapd.conf` with a matching domain on all hosts; note that with `sec=sys` authorization is still by numeric UID, so matching subuid bases remains the real fix. Result: container UID 0 consistently maps to host 100000 across all hosts, and NFS file access works. Verify: `grep dockeruser /etc/subuid` shows the same range on every host (`getent passwd 100000` typically returns nothing, since subordinate UIDs have no passwd entries). Test: create a file on Host A, read it on Host B via the NFS-mounted volume.
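A small sketch for the standardization check, assuming a cluster-wide base of 100000 (a temp file stands in for each host's `/etc/subuid`):

```shell
# Verify dockeruser's subuid base matches the cluster standard on this host.
standard_base=100000
printf 'dockeruser:100000:65536\n' > /tmp/subuid   # stand-in for /etc/subuid
base=$(awk -F: '$1 == "dockeruser" {print $2}' /tmp/subuid)
if [ "$base" = "$standard_base" ]; then
  echo "subuid base OK"
else
  echo "MISMATCH: $base"
fi
```

Run via configuration management (Ansible, etc.) across all nodes so a drifted host fails loudly before it corrupts NFS ownership.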

Follow-up: NFS idmapping is complex and slow (daemon call per UID lookup). For large container clusters, is there a better approach?

Rootless container operations need `docker exec` to run commands. But the `docker` command-line tool connects to the daemon socket (`$XDG_RUNTIME_DIR/docker.sock`, owned by the rootless daemon user), and a regular user can't access another user's socket. Can multiple users on the same host each run rootless Docker independently?

Yes, multiple users can each run rootless Docker, but each gets its own daemon and socket. Each user needs: (1) subuid/subgid allocations (in /etc/subuid, /etc/subgid) that do not overlap any other user's range. (2) rootless Docker installed for that user. (3) a separate socket (normally `/run/user/<uid>/docker.sock`). Setup: (1) as root, allocate distinct ranges: `echo 'user1:100000:65536' >> /etc/subuid && echo 'user1:100000:65536' >> /etc/subgid`, then `echo 'user2:165536:65536' >> /etc/subuid && echo 'user2:165536:65536' >> /etc/subgid`. (2) user1 installs rootless Docker: `dockerd-rootless-setuptool.sh install` (runs the daemon as user1). (3) user1's daemon socket: `/run/user/$(id -u user1)/docker.sock`. (4) user1 can run `docker ps`, `docker run`, etc. (5) user2 does the same (separate subuid range, separate daemon, separate socket). Isolation: (1) user1's containers are isolated from user2's (separate daemons, separate UID ranges). (2) user1 can't access user2's daemon socket (file permissions). (3) if user1's container escapes, the attacker is still confined to user1's unprivileged UIDs. Architecture: multiple rootless daemons (one per user) run simultaneously, each with its own cgroup subtree, socket, and container namespaces. Verify: (1) `ps aux | grep dockerd` shows multiple dockerd processes (one per user). (2) `ls -la /run/user/*/docker.sock` shows a socket per user. (3) `sudo -iu user1 docker ps` lists user1's containers; `sudo -iu user2 docker ps` lists user2's (non-overlapping). Limitations: (1) performance overhead (multiple daemons, multiple slirp4netns processes). (2) cgroup resource limits need cgroups v2 with systemd delegation (hybrid mode doesn't work well). (3) more management complexity. For production: a single rootless daemon shared by a team is simpler; apply fine-grained access control via RBAC if using Kubernetes.
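Per-user ranges must not overlap, and this is easy to check mechanically. A sketch that scans a subuid file for overlapping allocations (a temp file stands in for `/etc/subuid`):

```shell
# Non-overlapping allocations: user2 starts exactly where user1's range ends.
cat > /tmp/subuid <<'EOF'
user1:100000:65536
user2:165536:65536
EOF
# Sort ranges by start; flag any start that falls before the previous end.
awk -F: '{print $2, $2 + $3}' /tmp/subuid | sort -n |
  awk 'NR > 1 && $1 < prev_end { print "OVERLAP"; bad = 1 }
       { prev_end = $2 }
       END { if (!bad) print "OK" }'
```

Overlapping ranges would let one user's remapped container UIDs collide with another's on shared host resources, so this check belongs next to the allocation step.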

Follow-up: Each user's rootless daemon runs 50 containers. Subuid/subgid is 65536 per user. Can 50 unique container UIDs fit, or do they conflict?

Container in rootless Docker calls `getuid()` and gets 0. But `stat /etc/passwd` shows owner UID 65534 (nobody). Inside container: user 0 (root) shouldn't be able to read `/etc/passwd`. But it can. Why? Is this a user namespace isolation failure?

This is expected behavior, not an isolation failure. Two things are happening: permission bits still apply inside the container, and unmapped owners are displayed as the overflow UID. Inside the container: (1) `getuid()` returns 0 (container UID 0). (2) `/etc/passwd` in the container rootfs is world-readable (mode 0644, typical for passwd), so the read succeeds via the "other" permission bits regardless of owner. (3) The owner shows as 65534 (nobody) because the file's host owner UID is not mapped into the container's user namespace; the kernel substitutes the overflow UID for display. Note the limit: if the file had mode 0600 and an unmapped owner, container root could NOT read it — capabilities such as CAP_DAC_OVERRIDE held in a user namespace apply only to files whose owner is mapped into that namespace. User namespace guarantees: (1) container UID 0 ≠ host UID 0. (2) if container root escapes, it acts as an unprivileged host UID (e.g., 100000). (3) ownership mapping: a file owned by container UID 0 appears as UID 100000 on the host (visible via `ls -l` under the rootless data root, e.g., `~/.local/share/docker/...` for rootless Docker). Security model: container UID 0 is privileged within its namespace (for mapped resources), but across the namespace boundary it is an ordinary unprivileged UID. Verify: (1) inside the container: `id` shows 0 and `cat /etc/passwd` works (world-readable). (2) inside the container: `touch /root/marker && stat /root/marker` shows UID 0; on the host, `stat` on the same file under the data root shows UID 100000, and `sudo -u \#100000` can read it while other users cannot. This is correct isolation.
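The 65534 owner can be reproduced with the reverse mapping direction: a sketch assuming this section's `0 100000 65536` map (`host_to_container` is a hypothetical helper):

```shell
# A host UID maps into the container only if it falls inside the subuid
# range; any unmapped owner is presented as the overflow UID 65534 (nobody).
host_to_container() {
  hs=100000; cs=0; len=65536; h="$1"
  if [ "$h" -ge "$hs" ] && [ "$h" -lt $((hs + len)) ]; then
    echo $((cs + h - hs))
  else
    echo 65534
  fi
}
host_to_container 100000  # container root's own files -> 0
host_to_container 0       # host root's files: unmapped -> 65534 (nobody)
```

So `stat /etc/passwd` showing 65534 simply means the file's host owner sits outside the map; the read still succeeds because of the 0644 "other" bits, not because of any ownership.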

Follow-up: Container root (UID 100000 on host) tries to access `/etc/shadow` (mode 0640, owner root:shadow). Even though container is UID 0, it can't read. Why is namespace mapping sometimes insufficient?

Rootless Kubernetes cluster: Kubelet runs rootless, uses rootless containerd. All workloads are unprivileged. But PersistentVolume (hostPath type) points to `/var/log` on host. Container mounts it, tries to write logs. Permission denied (volume owner is host root). Container can't write even though it's UID 0 (remapped to 100000 on host). Solve this.

HostPath volumes with rootless containerd: the volume is bind-mounted straight from the host filesystem, and host-side ownership does not change (remapping applies to the container's view, not to the host resource). `/var/log` on the host is owned by root:root, so container UID 0 (host UID 100000) can't write to it. Solutions: (1) change ownership of a dedicated host directory: `sudo chown 100000:100000 /var/log/app-logs` (prefer a subdirectory over `/var/log` itself). The container (host UID 100000) can then write. (2) `sudo chmod 777 /var/log` (bad security; avoid). (3) Use a Pod volume instead of hostPath: `emptyDir` (a kubelet-managed directory the Pod can write freely). (4) Simply avoid hostPath with rootless (design limitation). (5) An init container running `chmod`/`chown` on the mount won't help here: it runs with the same remapped, unprivileged UID and can't change root-owned host files. Best practice: (1) for logs in a rootless cluster, use `emptyDir` or centralized logging (Loki, ELK). (2) avoid hostPath with rootless; use PersistentVolumes backed by managed storage (NFS, cloud storage) with proper UID handling. (3) if hostPath is necessary, pre-provision it on the host with the correct ownership (UID 100000). Verify: inside the Pod, `id` shows 0, `mount | grep var/log` shows the hostPath mount, and `touch /var/log/test` succeeds (after the ownership fix). Audit: check Pod manifests for hostPath in rootless clusters and document any justified exceptions.
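A pre-provisioning sketch for the hostPath case. In production this runs as root on the real host; here a `/tmp` path and a commented-out `chown` keep it runnable unprivileged, and the UID 100000 base is this section's assumed mapping:

```shell
# Pre-create the hostPath directory with ownership the rootless Pod can use.
dir=/tmp/app-logs              # stand-in for /var/log/app-logs on the host
mkdir -p "$dir"
# chown 100000:100000 "$dir"   # real step: hand it to the remapped container root (needs root)
chmod 0750 "$dir"              # owner rwx, group rx, no world access
stat -c '%a' "$dir"
```

Keeping the mode at 0750 rather than 0777 preserves least privilege: only the mapped container root (and its group) can write.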

Follow-up: Multiple rootless clusters on same host, each with different Kubelet UID. HostPath between clusters conflicts (one cluster's UID 100000 ≠ another's UID 150000). How do you isolate HostPath resources per cluster?

Rootless Docker on a host: `dockeruser` (UID 1000) runs daemon. Subuid range: 100000-165535. You `docker cp` a file from host to container. Inside container, file ownership changes. Trace the UID mapping: what UID does the file have on host vs. inside container? Explain the remapping.

UID remapping happens at the user-namespace boundary, not inside the Docker daemon. Trace: (1) host: you (UID 1000) own file `myfile` (UID 1000:GID 1000). (2) `docker cp myfile container:/data` → the daemon (running as UID 1000) reads the file and writes it into the container filesystem; `docker cp` into a container sets ownership to root, i.e., container UID 0:GID 0. (3) Inside the container: the file appears as UID 0:GID 0. (4) On the host: the rootless data root lives under the user's home (`~/.local/share/docker`, not `/var/lib/docker`), and `ls -l` there shows the file as UID 100000:GID 100000 (remapped). Remapping mechanics: the daemon runs inside a user namespace; the container's UID 0 maps to host UID 100000 (under this example's mapping). When the container writes a file as UID 0, the kernel records host UID 100000. Config: `/etc/subuid` has `dockeruser:100000:65536` (user 1000 may map host UIDs 100000-165535 onto container UIDs 0-65535). Inside the container: UIDs 0-65535 exist; on the host they appear as 100000-165535. Show: (1) on the host: `id` shows 1000. (2) Inside the container: `id` shows 0 and `stat /data/myfile` shows UID 0 (container view). (3) On the host, outside the container: `stat` on the file under `~/.local/share/docker/...` shows UID 100000. (4) Verify the mapping rules: `cat /proc/$(pidof dockerd)/uid_map`. For `docker cp`: the remapping is transparent to the user. To avoid permission issues, make sure the host file is readable by UID 1000 (the daemon's UID) before copying; if the host file is owned by root with mode 0600, the rootless daemon cannot read it and the copy fails.
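The `uid_map` check at the end can be scripted; a sketch parsing a single-line map in the `/proc/<pid>/uid_map` format (a temp file stands in for the real procfs file):

```shell
# Format per line: <container-start> <host-start> <length>
printf '0 100000 65536\n' > /tmp/uid_map   # stand-in for /proc/<dockerd-pid>/uid_map
read cstart hstart len < /tmp/uid_map
# A file created as container UID 0 (e.g., via docker cp) lands on host as:
echo $((hstart + 0 - cstart))   # -> 100000
```

The same arithmetic, with the real procfs path, predicts the host-side owner you will see under `~/.local/share/docker` for any container-created file.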

Follow-up: You `docker cp` a file into rootless container, then app modifies it. When you `docker cp` back out, is file ownership preserved?
