Container runs HAProxy (non-root user `haproxy`). HAProxy needs to bind to port 80 (privileged port < 1024). Linux rule: only root (UID 0) can bind to ports < 1024. Container user is not root. How do you grant port binding permission without elevating to full root?
Use Linux capabilities: specifically `CAP_NET_BIND_SERVICE`. By default, Docker starts containers with a small subset (about 14) of the 40+ Linux capabilities. `CAP_NET_BIND_SERVICE` allows binding to ports < 1024 without root. Grant it: `docker run --cap-add=NET_BIND_SERVICE --user haproxy:haproxy haproxy:latest` (note: NET_BIND_SERVICE is already in Docker's default capability set, so `--cap-add` matters mainly after a `--cap-drop=ALL`). Caveat: for a non-root `--user`, the capability sits in the bounding set but not the effective set on most runtimes, so the binary must carry a file capability (`setcap`) or the runtime must support ambient capabilities. Result: HAProxy runs as the non-root `haproxy` user yet binds port 80 (the kernel checks capabilities, not UID). Verify: (1) inside container: `getcap /usr/sbin/haproxy` shows `cap_net_bind_service` if the file capability is set. (2) `capsh --print` shows which capability sets hold it. (3) `ss -tulpn | grep :80` shows the port 80 listener owned by `haproxy`, not root. Dockerfile: run `setcap 'cap_net_bind_service=+ep' /usr/sbin/haproxy` during build to bake the capability into the binary, or rely on runtime capability grants. Security: (1) CAP_NET_BIND_SERVICE is narrow (privileged-port binding only), far safer than full root. (2) Still run as `haproxy`, never root. (3) Grant each service only what it needs: HAProxy gets CAP_NET_BIND_SERVICE; the app container stays unprivileged. Verify the negative case: start HAProxy as `haproxy` with `--cap-drop=NET_BIND_SERVICE` and no file capability — the bind fails (EACCES on port 80); with the capability, it succeeds. Audit: `docker inspect container | jq '.HostConfig.CapAdd'` shows granted caps.
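A minimal sketch of the build-time approach, assuming the Debian-based `haproxy` image (the `libcap2-bin` package providing `setcap` is an assumption about the base image):

```dockerfile
FROM haproxy:latest
USER root
# setcap comes from libcap2-bin on Debian-based images.
RUN apt-get update && apt-get install -y --no-install-recommends libcap2-bin \
    && setcap 'cap_net_bind_service=+ep' /usr/sbin/haproxy \
    && rm -rf /var/lib/apt/lists/*
# Drop back to the unprivileged user; the file capability travels with the binary.
USER haproxy
```

With the file capability baked in, no `--cap-add` is needed at run time, because NET_BIND_SERVICE is already in Docker's default bounding set; the file capability is what raises it into the effective set for the non-root process.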
Follow-up: Your HAProxy needs to reload its config on signal (SIGHUP), which requires writing temporary files. What minimal capability set allows config reload without breaking it?
Container needs to monitor system resources (CPU, memory, disk). The monitoring app reads `/proc/stat`, `/proc/meminfo`, `/proc/diskstats` — all world-readable, so an unprivileged user succeeds. But the agent also issues privileged network ioctls (interface-configuration requests such as `SIOCSIFFLAGS`, and some `SIOCETHTOOL` operations) that require `CAP_NET_ADMIN`; note that plain read-only enumeration like `ioctl(SIOCGIFCONF)` is unprivileged. Why does the privileged ioctl fail without the cap, and how do you add it selectively?
Privileged network ioctls require `CAP_NET_ADMIN`. Reading `/proc/stat` is world-readable (mode 0444), so any user can read it, and read-only interface enumeration (`SIOCGIFCONF`, `/proc/net/dev`, read-only netlink) needs no capability either. But ioctls that touch interface or firewall state check `CAP_NET_ADMIN` in the kernel; without it the call fails with EPERM (operation not permitted). Grant selectively: `docker run --cap-add=NET_ADMIN monitoring:latest`. Result: the privileged ioctl succeeds. Why selective? (1) CAP_NET_ADMIN is powerful: interface configuration, routing, firewall manipulation. Granting it widens the attack surface. (2) For monitoring you only need to read, not modify. Better: use `/proc/net/` and `/sys/class/net/` (readable without caps) instead of privileged ioctls. Example: `cat /proc/net/dev` lists interfaces and stats (no cap needed). Dockerfile: point the monitoring app at `/proc` instead of ioctls and avoid CAP_NET_ADMIN entirely. If a privileged ioctl is unavoidable (vendor tool), grant CAP_NET_ADMIN but combine it with other restrictions: `--read-only` root fs (app can't modify the image) and `--security-opt no-new-privileges` (app can't gain further privileges via setuid binaries). Verify: without `--cap-add`, `docker run alpine:latest ip link set lo down` fails ("Operation not permitted"); with `--cap-add=NET_ADMIN` it succeeds, while read-only `ip link show` works either way. Check: `capsh --print | grep net_admin` inside the container with `--cap-add`. Audit: `docker inspect container | jq '.HostConfig.CapAdd'` lists added caps.
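A capability-free sketch of interface enumeration via `/proc/net/dev` (field positions follow the standard procfs layout: interface name, then 8 RX counters, then TX counters):

```shell
#!/bin/sh
# List interfaces with RX/TX byte counters from /proc/net/dev.
# No capability is needed: the file is world-readable.
# Skip the two header lines; replace "iface:" with "iface " so awk re-splits fields.
awk 'NR > 2 { sub(/:/, " "); print $1, "rx_bytes=" $2, "tx_bytes=" $10 }' /proc/net/dev
```

Running this inside an unprivileged container prints one line per interface (e.g. `lo rx_bytes=... tx_bytes=...`), which covers most monitoring needs without CAP_NET_ADMIN.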
Follow-up: CAP_NET_ADMIN is too broad for your use case. Are there fine-grained capabilities, or should you use seccomp filters instead?
Container tries to use `ptrace()` to attach to a process for debugging (e.g., `gdb` or `strace`). Fails with EPERM. Debugging is needed for prod troubleshooting. Enable ptrace selectively without granting full debug access.
`ptrace()` attach requires `CAP_SYS_PTRACE` to trace processes you don't own (and, under Yama `ptrace_scope`, anything beyond your own descendants); older Docker default seccomp profiles also blocked the syscall outright (it has been allowed since Docker 19.03 on kernel 4.8+). Grant: `docker run --cap-add=SYS_PTRACE container-image`. Result: the app can attach debuggers to processes in its own container. Verify: inside the container, run `strace ./app` (fails without the cap on restrictive setups; works with it). Why selective? (1) CAP_SYS_PTRACE allows debugging any process visible in the container's PID namespace, and combined with `--pid=host` that means host processes — a real security issue. (2) Better: restrict to self-debugging. Implementation: (1) grant CAP_SYS_PTRACE but keep the default seccomp profile in place (never run `seccomp=unconfined`). (2) Or tighten further with a custom seccomp profile that filters `ptrace` by its request argument (arg 0), allowing only safe requests such as PTRACE_TRACEME. (3) For production: enable in staging/dev, not prod (reduces attack surface). Alternatively: use non-intrusive observability (logs, metrics, APM) instead of live ptrace. Verify: `docker run --cap-add=SYS_PTRACE image strace -e trace=write ./app` works; without the cap it fails on restrictive setups. Audit: `docker inspect container | jq '.HostConfig.CapAdd[]'` shows SYS_PTRACE. Security: if the container also gets `--pid=host` or `--privileged`, CAP_SYS_PTRACE lets an attacker inside it debug host processes. Mitigate: never combine it with host namespaces, and add `--security-opt no-new-privileges`, a read-only fs, and resource limits.
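A sketch of the argument-filter idea in Docker's seccomp profile JSON; this fragment would be merged into a copy of the default profile (replacing its existing `ptrace` entry), not used alone. PTRACE_TRACEME is request value 0 on Linux, so the rule below allows only that request:

```json
{
  "syscalls": [
    {
      "names": ["ptrace"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 0, "value": 0, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
```

Any `ptrace` call with a different request (PTRACE_ATTACH, PTRACE_PEEKTEXT, ...) falls through to the profile's default deny action, so a process can opt into being traced but cannot attach to others.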
Follow-up: You need strace for debugging, but `--cap-add SYS_PTRACE` breaks your security policy. What's a safer debugging approach?
Seccomp: Docker applies a default seccomp profile that blocks certain dangerous syscalls (`mount`, `kexec_load`, `reboot`, `bpf`, etc.). Your app needs to fork+exec child processes; the default profile allows this. But you want to audit what syscalls are being made. How do you trace and selectively allow/deny syscalls?
Seccomp (Secure Computing Mode) filters syscalls in the kernel. Docker's default profile is an allowlist: `defaultAction` is `SCMP_ACT_ERRNO` and 300+ syscalls are explicitly allowed, which effectively denies around 44 dangerous ones (`mount`, `kexec_load`, `bpf`, ...) — `execve` is allowed, or containers could not run anything. To audit: (1) run with `--security-opt seccomp=unconfined` (no filtering) and trace: `strace -f ./app`, or trace from the host with `bpftrace`/`perf trace`. (2) Denials from `SCMP_ACT_ERRNO` are silent; to surface them, use an audit-mode profile (`SCMP_ACT_LOG`) and read `journalctl -k | grep SECCOMP` or `ausearch -m SECCOMP`. (3) Tools: `seccomp-tools` to inspect compiled filters, `bpftrace` to trace syscalls dynamically. To selectively allow: (1) copy the default profile (JSON) and edit its allowlist. Docker's format groups syscalls as `{"names": ["fork", "vfork", "clone"], "action": "SCMP_ACT_ALLOW"}` entries under a `defaultAction` of `SCMP_ACT_ERRNO`, with `defaultErrnoRet` and an `archMap`. (2) Pass it in: `docker run --security-opt seccomp=/path/to/profile.json app`. Result: fork+exec allowed, dangerous syscalls still denied. Verify: (1) inside the container, spawning a child works. (2) `mount -t tmpfs none /mnt` fails with EPERM (operation not permitted). (3) With `unconfined` (and CAP_SYS_ADMIN granted) the mount would work. For production: keep the default profile (safe); ship a custom profile only for apps that need exceptions. Audit trail: log all seccomp violations for compliance.
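A minimal sketch of the Docker-format allowlist shape (the real default profile allows 300+ syscalls; this fragment only illustrates the structure — a usable profile needs the full set your app requires):

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fork", "execve", "wait4", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

`defaultErrnoRet: 1` makes every unlisted syscall fail with EPERM. Usage: `docker run --security-opt seccomp=/path/to/profile.json app`. In practice, start from the published default profile and edit, rather than writing from scratch.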
Follow-up: Your hardened custom profile denies `execve`, but one helper process in the container legitimately needs to spawn children (the Docker default profile itself allows `execve`). How do you allow execve only for specific processes without unblocking it globally?
Container runs a web app. Attacker gains shell inside and tries `mknod /dev/sda b 8 0` to create a host disk device node. Should fail (dangerous). But the app legitimately needs `mknod /tmp/fifo p` (a FIFO) for inter-process communication. How do you block device creation but allow FIFOs?
The `mknod` syscall creates device nodes and FIFOs, but the privilege rules differ: creating FIFOs (S_IFIFO) and regular files needs no capability at all, while block/char device nodes require `CAP_MKNOD`. Important: Docker's default capability set actually includes MKNOD, so a default container can create device nodes — it just can't open them, because the device cgroup denies access. For defense in depth, drop the cap explicitly: `docker run --cap-drop=MKNOD app`. Verify: `docker inspect container | jq '.HostConfig.CapDrop'` shows MKNOD. Test: with the cap dropped, `mknod /tmp/dev b 8 0` fails (EPERM, operation not permitted) while `mkfifo /tmp/fifo` still succeeds, because FIFO creation is unprivileged — the legitimate IPC use case keeps working with no extra grants. If you also want kernel-level enforcement independent of capabilities: (1) use a seccomp profile with an argument filter: allow the `mknod`/`mknodat` syscalls only when the mode's file-type bits (mask S_IFMT, 0170000) equal S_IFIFO (0010000), denying block/char types; seccomp can compare syscall arguments with `SCMP_CMP_MASKED_EQ`. (2) Or re-architect: replace FIFOs with Unix domain sockets (more flexible, and no mknod involved). Verify: with the custom profile, `mkfifo /tmp/fifo` works and `mknod /dev/sda b 8 0` fails. Test: `docker run --security-opt seccomp=/profile.json app`, then confirm FIFO creation succeeds and device creation is denied.
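A sketch of the FIFO-only filter in Docker profile JSON. The mode argument is index 1 for `mknod` and index 2 for `mknodat`; with `SCMP_CMP_MASKED_EQ`, `value` is the mask and `valueTwo` the expected result, so 61440 (S_IFMT, octal 0170000) and 4096 (S_IFIFO, octal 0010000) are given in decimal as the JSON format expects numbers. This fragment would replace the `mknod`/`mknodat` entries in a copy of the default profile:

```json
{
  "syscalls": [
    {
      "names": ["mknod"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 1, "value": 61440, "valueTwo": 4096, "op": "SCMP_CMP_MASKED_EQ" }
      ]
    },
    {
      "names": ["mknodat"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 2, "value": 61440, "valueTwo": 4096, "op": "SCMP_CMP_MASKED_EQ" }
      ]
    }
  ]
}
```

Calls whose mode bits encode a block or character device fail to match the rule and fall through to the profile's default deny action.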
Follow-up: Seccomp profile is complex and hard to maintain. If upstream changes syscall usage, profile breaks. How do you generate/auto-update profiles?
AppArmor (or SELinux on RHEL): Linux MAC (Mandatory Access Control) system. Container runs app under AppArmor profile. Profile restricts: file access, network ports, capabilities. App tries to read `/etc/secrets` (outside container's allowed paths). Denied by AppArmor (not by UID/GID). How do you tune AppArmor profile for container without compromising security?
AppArmor is a MAC system on Ubuntu/Debian. Docker runs containers under the `docker-default` profile (generated at daemon start: allows most access, denies writes to `/proc/sys`, `/sys`, and similar sensitive paths). To tune: (1) write a custom profile on the host, e.g. `profile docker-app flags=(attach_disconnected,mediate_deleted) { ... }` with explicit file, network, and capability rules, including `deny /etc/secrets r,`. (2) Load it into the kernel: `apparmor_parser -r /etc/apparmor.d/docker-app`. (3) Start the container under it: `docker run --security-opt apparmor=docker-app app`. Verify: `aa-status` lists loaded profiles and confined processes; denials appear as `apparmor="DENIED"` lines in `dmesg` or `journalctl -k`. Tune iteratively: put the profile in complain mode (`aa-complain`) so violations are logged but allowed, use `aa-logprof` to fold the logged accesses into the profile, then re-enable enforcement (`aa-enforce`). Security: keep explicit `deny` rules for secret paths (deny beats allow in AppArmor), grant only needed capabilities in the profile, and never run production containers with `--security-opt apparmor=unconfined`.
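A sketch of such a profile; the profile name, binary path, and content paths are illustrative:

```
# /etc/apparmor.d/docker-app -- illustrative profile for the web app container
#include <tunables/global>

profile docker-app flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network inet tcp,            # allow TCP; no raw sockets

  /usr/local/bin/app ix,       # app binary may execute, inheriting this profile
  /var/www/** r,               # read-only content
  /tmp/** rw,                  # scratch space
  deny /etc/secrets r,         # explicit deny overrides any allow rule
  deny /proc/sys/** w,         # no kernel tuning

  capability net_bind_service, # only the capability the app needs
}
```

Load with `apparmor_parser -r /etc/apparmor.d/docker-app`, then start the container with `--security-opt apparmor=docker-app`; a read of `/etc/secrets` inside it should produce an `apparmor="DENIED"` entry in the kernel log.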
Follow-up: AppArmor profiles live on the host, and a container is attached to a profile by name when it starts. If you edit the profile file afterwards, does the running container see the change? How do you hot-reload AppArmor profiles safely?
Pod runs in Kubernetes with `securityContext: privileged: false`. How many capabilities does it have by default? What's the difference between the capabilities available in Docker vs Kubernetes containers?
A Kubernetes Pod with `privileged: false` (the default) gets essentially the same capability set as an unprivileged Docker container: the runtime's default list of about 14 (NET_BIND_SERVICE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, SETGID, SETUID, SETPCAP, SETFCAP, NET_RAW, SYS_CHROOT, KILL, AUDIT_WRITE, MKNOD). Kubernetes delegates to the container runtime (containerd/CRI-O running runc), and most runtimes use the same default list as Docker. Differences: (1) Kubernetes customizes via `securityContext`: `capabilities.add: [NET_ADMIN]`, `capabilities.drop: [ALL]` (drop everything, then add back selectively). (2) `privileged: false` does not itself enforce non-root — `runAsNonRoot: true` does; host networking is a separate Pod field (`hostNetwork`). (3) `seccompProfile` and `seLinuxOptions` in the SecurityContext give fine-grained control. Verify: (1) `kubectl get pod POD -o json | jq '.spec.containers[].securityContext'` shows the settings; inside the pod, `capsh --print | head -20` lists the capability sets. (2) Compare: `docker run alpine:latest capsh --print` vs `kubectl exec POD -- capsh --print` — usually similar unless a custom profile is applied. For production: (1) keep `privileged: false` and enforce it cluster-wide with Pod Security admission (the successor to PodSecurityPolicy). (2) add only required capabilities: `capabilities.add: [NET_BIND_SERVICE]`. (3) drop the rest: `capabilities.drop: [ALL]`. (4) enable seccomp: `securityContext.seccompProfile.type: RuntimeDefault` (the runtime's default profile). Verify: an audit script that flags privileged containers, unexpected caps, and disabled seccomp across all pods.
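A sketch of the drop-all-then-add pattern in a Pod spec (names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: haproxy
spec:
  containers:
  - name: haproxy
    image: haproxy:latest
    securityContext:
      privileged: false
      runAsNonRoot: true              # this, not privileged:false, enforces non-root
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]                 # start from zero...
        add: ["NET_BIND_SERVICE"]     # ...and add back only what is needed
      seccompProfile:
        type: RuntimeDefault          # runtime's default seccomp profile
```

Dropping ALL first makes the granted set explicit and auditable: `kubectl get pod haproxy -o jsonpath='{.spec.containers[0].securityContext.capabilities}'` shows exactly what was added back.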
Follow-up: Kubernetes Pod adds CAP_SYS_ADMIN for legitimate need. Does this break the security model, or can you mitigate it?
Your seccomp profile blocks `chmod()` syscall (prevents file permission changes). But your app needs to create a temporary file with mode 0600 (restrictive permissions). Seccomp blocks the syscall before it runs. How do you audit what syscalls your app needs without breaking the profile?
Use seccomp's log action to record would-be violations without blocking, then build the profile from the logs. Steps: (1) switch the profile to audit mode: instead of `"defaultAction": "SCMP_ACT_ERRNO"` (deny), use `"defaultAction": "SCMP_ACT_LOG"` (allow but log). (2) Run: `docker run --security-opt seccomp=/audit-profile.json app`. (3) Collect the logged syscalls from the kernel/audit log: `journalctl -k | grep SECCOMP` or `ausearch -m SECCOMP` (seccomp events do not appear in `docker logs`). (4) Identify the syscalls the app actually needs. (5) Update the profile's allowlist: `{"names": ["chmod", "fchmod", "fchmodat"], "action": "SCMP_ACT_ALLOW"}`. (6) Switch back to enforce mode: `"defaultAction": "SCMP_ACT_ERRNO"`. (7) Retest. For production: (1) never ship audit mode (it allows everything, enforcing nothing). (2) Tools can automate profile generation from recorded syscalls — e.g. Podman's `oci-seccomp-bpf-hook` traces a run and emits a profile (the AppArmor-side analog is `aa-logprof`). (3) Iterate: add necessary syscalls, keep dangerous ones denied. (4) Test in staging before prod. Verify: (1) audit mode: app runs, syscalls are logged. (2) enforce mode: app runs, denied syscalls fail with the profile's errno (typically EPERM). (3) If a syscall shows up in the audit log and the app breaks under enforcement, add it to the allowlist. Commonly needed syscalls: `openat`, `read`, `write`, `mmap`, `mprotect`, `chmod`, `chown`, `socket`, `connect`. Typically denied: `mount`, `umount2`, `kexec_load`, `bpf`, `reboot`, and (in older profiles) `ptrace` — not `execve`, which every container needs.
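A sketch of the audit-mode profile: everything unlisted is allowed but logged, while a handful of truly dangerous syscalls stay denied even during the audit run (the per-rule `errnoRet` field is part of the OCI/Docker profile format):

```json
{
  "defaultAction": "SCMP_ACT_LOG",
  "syscalls": [
    {
      "names": ["mount", "umount2", "kexec_load", "bpf"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 1
    }
  ]
}
```

Run the workload under this profile, harvest the `SECCOMP` entries from `journalctl -k` or `ausearch -m SECCOMP`, turn the observed syscall names into `SCMP_ACT_ALLOW` entries, then flip `defaultAction` back to `SCMP_ACT_ERRNO` for enforcement.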
Follow-up: Audit mode works but is slow (every syscall logged). For large workloads, logging overhead impacts performance. How do you optimize seccomp logging?