Docker Interview Questions

Container Resource Limits and OOM Kills

Java app in container: `-m 1024m` (1GB limit), JVM arg `-Xmx512m` (512MB heap). App crashes with OOM kill (exit code 137). You inspect logs: heap usage peaked at 400MB, well below the 512MB limit. But total container memory hit 1GB and triggered OOM killer. What's consuming the extra 600MB?

JVM heap (`-Xmx`) is only one component of the JVM's memory footprint. The other consumers:

1. Off-heap memory: direct ByteBuffers, NIO buffers, memory-mapped files (not tracked by the heap GC).
2. Code cache: JIT-compiled methods.
3. Metaspace: class metadata, unbounded by default unless `-XX:MaxMetaspaceSize` is set.
4. Thread stacks: ~1MB per thread by default.
5. Kernel memory charged to the cgroup: page cache for files the app reads.

Worked example: 512MB (heap) + 50 threads × 1MB = 50MB (stacks) + ~150MB (Metaspace) + 100-200MB (direct buffers, e.g., Netty or Kafka clients) + kernel page cache ≈ 912MB to 1GB+. When the app allocates more direct buffers or spawns more threads, the total exceeds 1GB → OOM killer. Diagnose:

1. Start the JVM with `-XX:NativeMemoryTracking=summary`, then inside the container run `jcmd <pid> VM.native_memory summary`.
2. `ps aux | grep java` shows RSS (resident set size, total RAM in use).
3. `grep Rss /proc/<pid>/smaps` sums resident memory per mapped region.

Fix:

1. Increase the container limit for headroom: `-m 2048m`.
2. Cap Metaspace: `-XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m`.
3. Shrink thread stacks: `-XX:ThreadStackSize=512` (512KB per thread vs the ~1MB default).
4. Cap the code cache: `-XX:ReservedCodeCacheSize=128m`.
5. Cap direct memory: `-XX:MaxDirectMemorySize=256m`.

Result: a predictable memory budget that won't silently climb to 1GB. Verify: `docker stats app` shows the memory trend before and after tuning.
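The budget arithmetic above can be sketched as a quick check. This helper and its component sizes are illustrative assumptions, not measured values:

```python
# Rough JVM-in-container memory budget (all values in MB; illustrative, not measured).
def jvm_footprint_mb(heap=512, threads=50, stack_per_thread_mb=1,
                     metaspace=150, direct_buffers=200, code_cache=128):
    """Sum the major JVM memory consumers, inside and outside the heap."""
    return heap + threads * stack_per_thread_mb + metaspace + direct_buffers + code_cache

limit_mb = 1024
total = jvm_footprint_mb()
print(f"{total}MB of {limit_mb}MB limit")  # → 1040MB of 1024MB limit
```

Even before counting page cache, the default assumptions already exceed the 1GB limit, which is exactly the failure mode in the question.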

Follow-up: The JVM periodically runs a full stop-the-world GC, and the OOM killer fires during the GC pause. How do you prevent OOM kills caused by GC timing?

Container memory spikes to 1.5GB (limit 1GB). The OOM killer selects PID 15 and sends SIGKILL. But PID 1 is a small init process, and PID 15 is the child worker doing the real work. After the OOM kill, PID 15 is dead while PID 1 keeps running, and the app becomes unstable (missing worker). How do you ensure the OOM killer targets the right process?

The Linux OOM killer ranks processes by `oom_score`, a "badness" heuristic: roughly the share of the cgroup's memory a process holds, adjusted by `oom_score_adj` (admin-tunable, -1000 to +1000). All processes in the container's cgroup share one limit and are all candidates; the killer picks the highest score, usually the process with the largest RSS. When PID 15 is selected, the kernel kills it with SIGKILL (uncatchable, immediate); PID 1 survives because it is small. Fix:

1. Steer the killer with `oom_score_adj`: `echo -500 > /proc/1/oom_score_adj` (negative = less likely to be killed), `echo 100 > /proc/15/oom_score_adj` (positive = more likely). Docker's `--oom-score-adj` flag sets this for the container's main process at start.
2. `--oom-kill-disable` turns the OOM killer off for the container (risky: allocations block and the container can hang at the limit).
3. Better: size the limit from reality. `docker stats` shows the actual peak; set the limit ~20% higher as a buffer.
4. Use `--memory-reservation` (soft limit): `-m 1024m --memory-reservation 768m`. Under host memory pressure the kernel reclaims the container back toward 768MB before the 1GB hard limit is reached (it is a reclaim target, not a swap threshold).
5. Run a proper minimal init as PID 1 (tini, dumb-init) so it stays a small supervisor that reaps children.

Verify: `docker run -d myapp && docker inspect myapp | jq '.HostConfig.Memory'` confirms the limit. Inside the container: `cat /proc/self/oom_score` and `cat /proc/<pid>/oom_score_adj` show the current settings. Test: use stress-ng to trigger an OOM and watch which PID is killed.
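The selection logic can be sketched loosely. This simplifies the kernel's real badness calculation, and the process list is hypothetical:

```python
# Simplified OOM victim selection: score ≈ RSS share of the cgroup limit
# (scaled to ~0..1000, as the kernel does) plus oom_score_adj.
def oom_victim(procs, limit_bytes):
    """procs: list of (pid, rss_bytes, oom_score_adj).
    Returns the pid with the highest badness score, i.e., the likely kill target."""
    def badness(proc):
        pid, rss, adj = proc
        return rss * 1000 // limit_bytes + adj
    return max(procs, key=badness)[0]

MB = 2**20
procs = [(1, 10 * MB, -500),    # small init, protected by a negative adjustment
         (15, 600 * MB, 0)]     # large worker
print(oom_victim(procs, 1024 * MB))  # → 15
```

Flipping the adjustments (a strongly negative `oom_score_adj` on the worker) would redirect the kill toward PID 1 instead, which is why tuning these values per process matters.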

Follow-up: Your app uses memory pools (pre-allocated). Pool size is static at startup but wrong estimate causes OOM. How do you dynamically adjust pool size at runtime?

Container A: 512MB limit. Container B: 1GB limit. Both on same host. Host has 8GB total RAM. Initially A uses 100MB, B uses 500MB. Later, B allocates 1.5GB (exceeds limit). Does B get OOM-killed, or does it spill to swap? How does memory overcommit affect co-location?

What happens depends on the swap configuration. If host swap is disabled (common, and required on classic Kubernetes nodes), the memory limit is hard: when B tries to allocate beyond 1GB, the kernel cannot reclaim enough within B's cgroup and the OOM killer fires on a process in B. B gets OOM-killed even though the host still has ~6.5GB free, because the cgroup enforces the limit, not the host total. If host swap is enabled, note that with only `-m 1gb` set Docker by default also allows an equal amount of swap; with an explicit `--memory-swap 1.5gb`, B may spill up to 500MB (1.5GB total minus the 1GB RAM limit) into swap, but swap is disk I/O and performance degrades sharply.

Memory overcommit: running containers whose combined limits exceed physical RAM. Example: A (512MB) + B (1GB) + C (2GB) = 3.5GB of limits on an 8GB host is safe even if all three peak simultaneously. But 6 containers × 2GB = 12GB of limits on 8GB: if usage climbs, the host-level OOM killer starts killing container processes by global badness score. Best practice:

1. Avoid overcommit: keep the sum of container limits ≤ host physical RAM (minus system overhead).
2. Use `--memory-reservation` soft limits so reclaim kicks in before hard limits.
3. Enable swap only for workloads that tolerate latency spikes.

Verify: `docker stats` shows memory per container; `free -h` on the host shows total available. Audit total limits vs host RAM: `for c in $(docker ps -q); do docker inspect $c | jq '.HostConfig.Memory'; done | awk '{sum+=$1} END {print sum/1e9 " GB"}'`.
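The overcommit audit reduces to one ratio. A minimal sketch, with hypothetical limit values standing in for real `docker inspect` output:

```python
# Audit container memory limits against host RAM (limits here are example
# values; in practice you would collect them from `docker inspect`).
def overcommit_ratio(limits_bytes, host_ram_bytes):
    """Return sum(limits) / host RAM; a value > 1.0 means the host is overcommitted."""
    return sum(limits_bytes) / host_ram_bytes

GB, MB = 2**30, 2**20
limits = [512 * MB, 1 * GB, 2 * GB]                 # A, B, C from the example
print(round(overcommit_ratio(limits, 8 * GB), 2))   # → 0.44 (safe)
print(round(overcommit_ratio([2 * GB] * 6, 8 * GB), 2))  # → 1.5 (overcommitted)
```

A ratio comfortably below 1.0 means every container can hit its limit at once without host-level OOM kills.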

Follow-up: Swap is too slow for your app (p99 latency SLA). You disable swap but allow memory to be shared (page sharing between containers). How do you detect and tune for page sharing?

Container started with `docker run -m 512m node:18 npm start`. The Node app stores all user sessions in process memory (no Redis or other external store). 100 concurrent users ≈ 500MB of memory. The OOM killer fires (exit 137), the app restarts, and all sessions are lost. For a resilient system, you need to prevent OOM and recover gracefully. Explain a production strategy.

Strategy:

1. Memory telemetry + graceful degradation: monitor usage in-app (`process.memoryUsage()` in Node) and log warnings at ~80% of the limit.
2. Session eviction: a bounded LRU cache (e.g., max 1000 sessions, evict the oldest on insert); evict more aggressively once usage crosses the warning threshold.
3. Offload sessions to persistent storage: Redis (a separate container or managed service); the app reads and writes sessions there instead of holding them in process memory.
4. Size the limit above expected peak plus buffer: `docker run -m 1024m` (roughly double the 500MB peak).
5. Add a soft limit so reclaim starts early: `docker run -m 1024m --memory-reservation 900m`; under host pressure the kernel reclaims toward 900MB before the 1GB hard kill (a reclaim target, not a swap trigger).
6. Health check + restart policy: `docker run --health-cmd 'curl -f http://localhost/health || exit 1' --restart unless-stopped`. If memory pressure makes the app unresponsive, the health check fails and the container restarts (clean state; sessions are lost unless offloaded, but the app recovers).
7. Graceful shutdown: on SIGTERM the app flushes sessions to Redis and closes connections (the OOM killer's SIGKILL cannot be caught, so this only covers orderly stops and pre-OOM restarts).

Verify: run with an intentionally tight limit (e.g., `-m 256m`), apply load, and watch the behavior: warning logs, session eviction, health-check-driven restarts.
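The bounded LRU eviction from step 2 can be sketched in a few lines. The app in the question is Node, so this Python version is only an illustration of the idea; the capacity and session shape are assumptions:

```python
from collections import OrderedDict

# Minimal bounded LRU session store: inserting past capacity evicts the
# least recently used session instead of growing memory without limit.
class SessionStore:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._data = OrderedDict()

    def put(self, session_id, session):
        if session_id in self._data:
            self._data.move_to_end(session_id)   # refresh recency on update
        self._data[session_id] = session
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict least recently used

    def get(self, session_id):
        if session_id not in self._data:
            return None
        self._data.move_to_end(session_id)       # reads also refresh recency
        return self._data[session_id]

    def keys(self):
        return list(self._data)

store = SessionStore(capacity=2)
store.put("a", {}); store.put("b", {}); store.get("a"); store.put("c", {})
print(sorted(store.keys()))  # → ['a', 'c']  ("b" was least recently used)
```

Bounding the store turns an OOM kill (all sessions lost at once) into gradual eviction of the coldest sessions.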

Follow-up: Sessions are sensitive; you don't want them lost even during restarts. Should you use durable session storage, and what's the latency cost?

Kubernetes pod with memory request 512MB, limit 1024MB. The node has 8GB free. The pod's app grows to 1GB (at its limit) while the kubelet is monitoring node memory pressure. Does the pod get evicted, or does the OOM killer fire inside the container?

The OOM killer fires inside the container; the kubelet does not preemptively evict a pod for exceeding its own limit. Sequence:

1. The pod's app grows to the 1GB limit.
2. The container's cgroup enforces the limit; the kernel first tries to reclaim reclaimable pages (page cache) within the cgroup.
3. If the app still allocates beyond 1GB, the OOM killer SIGKILLs a process in the container.
4. The container exits with code 137 (OOM-killed); the kubelet sees the crash and restarts the container per `restartPolicy` (`Always` by default), with clean state.

Node-level pressure is a different mechanism: when node memory falls below the eviction threshold (e.g., `--eviction-hard=memory.available<100Mi`), the kubelet evicts pods ordered (simplified) by QoS class: (1) BestEffort (no requests/limits) first; (2) Burstable (requests < limits) next, prioritizing pods using the most memory above their request; (3) Guaranteed (requests == limits) last. Verify: `kubectl describe node <node>` shows allocatable memory and allocated resources. This pod (request 512MB, limit 1GB) is Burstable, so it can be evicted under node pressure; if it were Guaranteed (request == limit), it would be among the last evicted. Prevent OOM kills and evictions:

1. Set the memory request equal to the limit (Guaranteed QoS).
2. Set the limit from max expected usage (not unlimited).
3. Set accurate requests so the scheduler's bin-packing doesn't over-schedule the node.

Verify: `kubectl get pod <pod> -o json | jq '.spec.containers[].resources'` shows request/limit. Test: create a pod with a tight memory limit, run stress-ng to trigger OOM, then `kubectl describe pod <pod>` shows the last state as OOMKilled with exit code 137.
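The QoS classification above can be sketched as a small function. This is simplified: the real Kubernetes rules also consider CPU requests/limits and apply across every container in the pod:

```python
# Simplified QoS classification from per-container (memory_request, memory_limit)
# pairs, in bytes; None means unset. Real K8s rules also include CPU.
def qos_class(containers):
    if all(req is None and lim is None for req, lim in containers):
        return "BestEffort"
    if all(req is not None and req == lim for req, lim in containers):
        return "Guaranteed"
    return "Burstable"

MB = 2**20
print(qos_class([(512 * MB, 1024 * MB)]))  # → Burstable (the pod in the question)
print(qos_class([(512 * MB, 512 * MB)]))   # → Guaranteed
print(qos_class([(None, None)]))           # → BestEffort
```

Under node pressure, eviction order follows this classification, which is why setting request == limit is the strongest protection.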

Follow-up: Two pods evicted due to node memory pressure. One is a stateless app, one is a cache holding hot data. How do you prioritize which pod should be evicted?

Container memory limit set via cgroups v1. The app uses 800MB and `docker stats` reports 800MB, but `grep '^Vm' /proc/<pid>/status` shows different numbers (VmRSS, VmSwap, VmPeak). Why the discrepancy, and which tool is most accurate?

Different tools measure different things:

1. `docker stats` reads the cgroup's `memory.stat`: usage across all processes in the container, including page cache.
2. `/proc/<pid>/status` is per-process: VmRSS (pages currently resident in RAM), VmSwap (pages swapped out), VmHWM (peak RSS). Note that VmPeak is peak virtual size, not peak RSS.

The discrepancy comes from:

1. Scope: the cgroup sums every process in the container (init + app + children); `/proc/<pid>/status` covers one process.
2. Page cache: charged to the cgroup but not part of any process's RSS.
3. Swap: cgroups v1 tracks memory+swap separately (`memory.memsw.usage_in_bytes`); `docker stats` aggregates.

Which is most accurate depends on the question:

1. "How much RAM is this container using?" → `docker stats` (cgroup total).
2. "How much RAM is the main app process using?" → `grep VmRSS /proc/<pid>/status`.
3. "Is the container swapping?" → `grep swap /sys/fs/cgroup/memory/docker/<container-id>/memory.stat` (v1, with swap accounting enabled).

Verify: allocate 100MB in the app and check all three: `docker stats`, VmRSS, and `memory.stat` should all increase. For OOM debugging, read `total_rss`, `total_cache`, and `total_swap` from `memory.stat` separately: if usage near the limit is mostly cache, those pages are reclaimable and won't immediately trigger the OOM killer. For memory budgeting, use the cgroup total (`docker stats`) as the source of truth and plan container limits accordingly.
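The `memory.stat` check described above is easy to script. A sketch with a synthetic dump (the sample numbers are invented, not from a real container):

```python
# Parse a cgroup v1 memory.stat dump and separate resident from reclaimable
# memory. The sample text below is synthetic.
def parse_memory_stat(text):
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = """total_rss 734003200
total_cache 268435456
total_swap 0"""              # 700MB rss, 256MB cache, no swap

stats = parse_memory_stat(sample)
limit = 1024 * 2**20
charged = stats["total_rss"] + stats["total_cache"]
print(charged > limit)                        # → False (956MB, still under the 1GB limit)
print(stats["total_cache"] / charged > 0.2)   # → True (a reclaimable cushion exists)
```

When `charged` approaches the limit but a large share is cache, the kernel can reclaim those pages before resorting to the OOM killer.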

Follow-up: You see high page cache usage in memory.stat. How do you determine if it's beneficial (file reads accelerated) or wasteful (stale cache)?

A container runs a Redis-like in-memory data store holding 5GB of data, with a `-m 6gb` limit. During a replication snapshot, memory temporarily needs 7GB (5GB data + 2GB snapshot buffer). The container gets OOM-killed mid-snapshot (exit 137), the replicas never receive the snapshot, and resync fails. How do you safely trigger large memory operations without OOM?

The snapshot needs temporary extra memory (data duplication). Options:

1. Raise the limit before the snapshot. In Kubernetes, memory limits were historically immutable, so this meant recreating the pod; a rolling restart loses in-memory state unless it is replicated, so for a large store it is usually better to provision headroom up front.
2. Copy-on-write snapshot: fork() a child; the child writes the snapshot while parent and child share pages until one modifies them. Redis `BGSAVE` works this way: the parent keeps serving, the child dumps to disk, and the extra memory is only the pages the parent dirties during the dump (worst case approaches the full dataset under heavy writes).
3. Throttle the snapshot: serialize incrementally with a small buffer (e.g., 100MB chunks streamed to disk) instead of materializing 2GB at once; memory overhead is one chunk buffer.
4. Pre-snapshot check: before BGSAVE, verify `limit - current_usage >= expected_overhead`; otherwise defer the snapshot.
5. `--memory-swap 10gb` as a fallback: the snapshot can spill to swap instead of triggering an OOM kill, trading latency for availability.

For production: use CoW-based snapshots (BGSAVE is standard for Redis; LSM stores like RocksDB compact in the background), set the memory limit with headroom (e.g., 7GB for 5GB of data), configure swap if the latency is acceptable, and avoid snapshot schedules so frequent that they compete for memory. Verify: run `BGSAVE` while watching `docker stats`: with CoW working, the spike should stay below the limit. Test: fill memory with `redis-benchmark -c 100 -n 1000000`, then `BGSAVE` and watch the stats.
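Option 3 (streaming in bounded chunks) can be sketched generically. The function name, chunk size, and record shapes below are illustrative, not any particular store's API:

```python
import io

CHUNK = 1 * 2**20  # 1MB buffer: peak extra memory is ~CHUNK, not the full snapshot

def stream_snapshot(records, out):
    """Serialize records into `out`, flushing whenever the buffer fills, so the
    snapshot never materializes in memory all at once."""
    buf = io.BytesIO()
    for rec in records:
        buf.write(rec)
        if buf.tell() >= CHUNK:
            out.write(buf.getvalue())
            buf = io.BytesIO()       # drop the filled buffer; overhead stays bounded
    out.write(buf.getvalue())        # flush the final partial chunk

out = io.BytesIO()
stream_snapshot((b"x" * 1024 for _ in range(5000)), out)
print(len(out.getvalue()))  # → 5120000 (all 5000 records written)
```

The same idea applies to any large dump: the total written is unchanged, but transient memory drops from the snapshot size to one chunk.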

Follow-up: Your container supports BGSAVE but also handles write requests during snapshot. Writes increase memory (new data). Can the snapshot + writes cause OOM during the operation?

Container with `-m 512m` runs multiple workers (processes). One worker leaks memory (grows 10MB/hour). After 2 days, container hits OOM (exit 137). Kubelet restarts. Cycle repeats. How do you identify the leaking process, patch, and prevent regression?

Identify:

1. Per-process memory inside the container: `ps aux | sort -k4 -rn | head -5` lists the top consumers by %MEM. A process whose RSS grows steadily over time is the leaker.
2. `pmap -x <pid>` breaks RSS down by region (heap, stacks, mapped libraries); a heap segment that keeps growing without corresponding workload growth points to a leak.
3. Language tooling: Node.js (`clinic`, or `node --inspect` heap snapshots diffed over time), Python (`tracemalloc`, `memory_profiler`), Go (`pprof`), native code (`valgrind --leak-check=full`, though slow).

Diagnose: sample per-process RSS periodically (e.g., daily), plot the trend, and isolate the process with the upward slope. Patch:

1. Fix the leak in the app code (the real fix).
2. Interim workaround: scheduled restarts (e.g., a Kubernetes CronJob or liveness-based recycling) well before the ~2-day OOM deadline.
3. Raising the memory limit only buys time; it is mitigation, not a fix.

Prevent regression:

1. Add a soak test to CI/CD: run the app for N hours and fail the build if RSS growth exceeds a threshold.
2. Use memory profiling in tests (e.g., tools like pytest-memray for Python).
3. Production monitoring: export per-process RSS to Prometheus and alert when the growth rate exceeds a threshold (e.g., 10MB/hour).

Verify: run a container with a known leak for 24h and confirm the alert fires; fix the leak, retest, and confirm the alert clears.
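The growth-rate alert from the regression steps reduces to a slope over sampled RSS values. A sketch with synthetic `(hour, rss_mb)` samples; the 5MB/hour threshold is an assumed SLO, not a standard:

```python
# Flag a leak when the least-squares slope of RSS samples exceeds a threshold.
def growth_rate_mb_per_hour(samples):
    """samples: list of (hour, rss_mb). Returns the least-squares slope."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

leaky = [(h, 200 + 10 * h) for h in range(24)]    # grows 10MB/hour
steady = [(h, 200 + (h % 2)) for h in range(24)]  # flat, with 1MB jitter

THRESHOLD = 5  # MB/hour, assumed alert threshold
print(growth_rate_mb_per_hour(leaky) > THRESHOLD)   # → True
print(growth_rate_mb_per_hour(steady) > THRESHOLD)  # → False
```

Using a fitted slope rather than a raw before/after difference keeps the alert robust against jitter in individual samples.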

Follow-up: Memory leak is in a third-party library. You can't patch upstream quickly. How do you live-patch or work around the leak?
