Java app in container: `-m 1024m` (1GB limit), JVM arg `-Xmx512m` (512MB heap). App crashes with OOM kill (exit code 137). You inspect logs: heap usage peaked at 400MB, well below the 512MB limit. But total container memory hit 1GB and triggered OOM killer. What's consuming the extra 600MB?
JVM heap (`-Xmx`) is only one component of JVM memory. Other consumers: (1) Off-heap memory: direct ByteBuffers, NIO buffers, memory-mapped files (not tracked by the heap GC). (2) Code cache: JIT-compiled methods. (3) Metaspace: class metadata, unbounded by default unless `-XX:MaxMetaspaceSize` is set. (4) Thread stacks: ~1MB per thread by default (`-Xss`). (5) Kernel memory: page cache for files the app reads, which counts against the cgroup limit. With `-Xmx512m`, heap is 512MB. With 50 threads: 50 × 1MB = 50MB. Direct buffers (e.g., Netty, Kafka clients): often 100-200MB. Result: 512MB (heap) + 50MB (threads) + 150MB (Metaspace) + 200MB (direct buffers) + code cache + kernel page cache ≈ 912MB to 1GB+. When the app allocates more direct buffers or spawns more threads, the total exceeds 1GB → OOM killer. Diagnose: (1) Start the JVM with `-XX:NativeMemoryTracking=summary`, then inside the container run `jcmd <pid> VM.native_memory summary` to break the footprint down into heap, Metaspace, threads, and code cache. (2) Cap off-heap explicitly: `-XX:MaxDirectMemorySize=256m`, `-XX:MaxMetaspaceSize=256m`. (3) Or let the JVM size itself from the container limit: `-XX:MaxRAMPercentage=50`.
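The arithmetic above can be sketched as a quick budget check. This is a minimal sketch; the component sizes are illustrative assumptions (the kind of numbers `jcmd` would report), not measured values:

```python
# Rough JVM container-memory budget; all figures are illustrative assumptions.
def jvm_budget_mb(heap=512, threads=50, stack_per_thread=1,
                  metaspace=150, direct_buffers=200, code_cache=48):
    """Estimate total JVM footprint in MB from its major components."""
    return heap + threads * stack_per_thread + metaspace + direct_buffers + code_cache

limit_mb = 1024
total = jvm_budget_mb()
print(f"estimated peak: {total} MB, headroom: {limit_mb - total} MB")  # 960 MB, 64 MB
if total > 0.9 * limit_mb:
    print("WARNING: under 10% headroom; expect OOM kills under load")
```

Note that page cache is deliberately excluded: it is reclaimable, but it still counts against the cgroup limit, which is why even a "safe" 960MB estimate can tip over 1GB.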
Follow-up: JVM periodically does full GC (stop-the-world). OOM killer triggers during GC pause. How do you prevent OOM kills caused by GC timing?
Container memory spike to 1.5GB (limit 1GB). OOM killer selects PID 15 and sends SIGKILL. But PID 1 (main app) is a small init process; PID 15 is a child worker. After OOM kill, PID 15 dies, but PID 1 continues. App becomes unstable (missing worker). How do you ensure OOM killer targets the right process?
The Linux OOM killer picks victims by a "badness" heuristic: roughly the process's memory footprint (RSS, swap, page tables) as a share of the available memory or cgroup limit, shifted by the admin-tunable `oom_score_adj`. By default, all processes in the cgroup share the limit and are equally eligible; the process with the highest badness score (usually the largest RSS) is killed. When PID 15 is selected, the kernel kills it with SIGKILL (uncatchable, immediate). PID 1 survives because it's small. Fix: (1) Set `oom_score_adj` to steer the killer: `echo -500 > /proc/1/oom_score_adj` (negative = less likely to be killed), `echo 100 > /proc/15/oom_score_adj` (positive = more likely). (2) Docker flag `--oom-kill-disable` disables the OOM killer for the container (risky: the app hangs at the limit instead of being killed; not supported with cgroup v2). (3) Better: size the limit from actual usage: `docker stats` shows the real peak; set the limit ~20% higher as buffer. (4) Use `--memory-reservation` (soft limit): `-m 1024m --memory-reservation 768m` — under host memory pressure the kernel reclaims the container back toward 768MB, giving an early pressure signal before the 1GB hard limit (it does not itself cause swapping). (5) Run a proper init that supervises children (tini, dumb-init), so a killed worker is restarted rather than silently missing. Verify: `docker run -d myapp && docker inspect myapp | jq '.[0].HostConfig.Memory'` confirms the limit. Inside the container: `cat /proc/self/oom_score` and `cat /proc/self/oom_score_adj` show the current score and adjustment.
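A simplified model of the selection helps make the scenario concrete. This sketch only mirrors the "memory share plus `oom_score_adj`" shape of the kernel heuristic (the real formula also counts swap and page-table pages); the PIDs and sizes match the scenario above:

```python
def oom_badness(rss_bytes, limit_bytes, oom_score_adj=0):
    """Toy badness score: memory share in thousandths, shifted by oom_score_adj.
    Mirrors the shape of the kernel heuristic, not its exact formula."""
    return rss_bytes * 1000 // limit_bytes + oom_score_adj

limit = 1024 * 2**20  # 1 GiB cgroup limit
procs = {
    1:  oom_badness(50 * 2**20, limit, oom_score_adj=-500),   # protected init
    15: oom_badness(700 * 2**20, limit, oom_score_adj=100),   # fat worker
}
victim = max(procs, key=procs.get)
print(victim)  # 15: the worker is chosen, init's negative adj shields it
```

With the adjustments flipped (protecting the worker, exposing a sacrificial process), the same ranking would steer the kill toward a process the supervisor can cheaply restart.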
Follow-up: Your app uses memory pools (pre-allocated). Pool size is static at startup but wrong estimate causes OOM. How do you dynamically adjust pool size at runtime?
Container A: 512MB limit. Container B: 1GB limit. Both on same host. Host has 8GB total RAM. Initially A uses 100MB, B uses 500MB. Later, B allocates 1.5GB (exceeds limit). Does B get OOM-killed, or does it spill to swap? How does memory overcommit affect co-location?
Assuming swap is unavailable to the container (host swap disabled, or `--memory-swap` set equal to `-m`), the memory limit is hard. When B tries to allocate 1.5GB against a 1GB limit, the kernel first reclaims page cache inside B's cgroup; once nothing reclaimable is left, the OOM killer fires on a process in B's cgroup. Result: B is OOM-killed. The host still has ~7GB free, but B can't use it (the limit is enforced by the cgroup). Note: if the host does have swap and `--memory-swap` is unset, Docker by default lets the container use swap up to the same amount as `-m`. With an explicit `-m 1g --memory-swap 1.5g`, B can spill up to 500MB into swap (1.5GB total minus the 1GB RAM limit), but swap is disk I/O and performance degrades. Memory overcommit: if the combined limits of your containers exceed physical RAM, you're overcommitting. Example: A (512MB) + B (1GB) + C (2GB) = 3.5GB of limits on an 8GB host; even if all three hit their limits simultaneously, usage is 3.5GB (safe). With heavier overcommit (e.g., 6 containers × 2GB limit = 12GB on an 8GB host), the host itself runs out when enough containers approach their limits, and the global OOM killer starts killing containers. Best practice: (1) keep the sum of container limits ≤ host physical RAM; (2) use memory reservations (soft limits) so reclaim pressure starts before hard limits; (3) enable swap only for workloads that tolerate latency spikes. Verify: `docker stats` shows per-container memory; `free -h` on the host shows totals. Audit: `for c in $(docker ps -q); do docker inspect -f '{{.HostConfig.Memory}}' $c; done | awk '{sum+=$1} END {print sum/1e9 " GB"}'` totals container limits for comparison against host RAM.
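The overcommit audit above can also be done programmatically. A minimal sketch, assuming the per-container limits have already been collected (e.g., from `docker inspect -f '{{.HostConfig.Memory}}'`) into a list of byte counts:

```python
def overcommit_ratio(limits_bytes, host_ram_bytes):
    """Sum per-container hard limits and compare against physical RAM.
    A ratio > 1.0 means simultaneous limit-hits can trigger host-level OOM."""
    return sum(limits_bytes) / host_ram_bytes

GB = 1024 ** 3
limits = [int(0.5 * GB), 1 * GB, 2 * GB]   # containers A, B, C from the example
ratio = overcommit_ratio(limits, 8 * GB)
print(f"committed {sum(limits)/GB:.1f} GB on 8 GB host, ratio {ratio:.2f}")
assert ratio <= 1.0, "overcommitted: container limits exceed host RAM"
```

Running this in a cron job (or admission hook) turns the "sum of limits ≤ host RAM" rule into an enforced invariant instead of a convention.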
Follow-up: Swap is too slow for your app (p99 latency SLA). You disable swap but allow memory to be shared (page sharing between containers). How do you detect and tune for page sharing?
Container `docker run -m 512m node:18 npm start`. Node app stores all user sessions in memory (Redis-less). 100 concurrent users = 500MB memory. OOM killer fires (exit 137). App restarts, sessions lost. For a resilient system, you need to prevent OOM and recover gracefully. Explain a production strategy.
Strategy: (1) Memory monitoring + graceful degradation: add memory telemetry inside the app (e.g., `process.memoryUsage()` in Node) and log warnings at an 80% threshold. (2) Session eviction: an LRU cache with a maximum size (e.g., 1000 sessions max, evict the oldest on insert), plus pressure-based eviction: `if (memUsage > 0.8 * limit) { evict LRU sessions; }`. (3) Offload sessions to persistent storage: use Redis (separate container or managed service); the app reads and writes sessions in Redis instead of holding them in process memory. (4) Size the container limit above expected peak plus buffer: `docker run -m 1024m` (double the 500MB peak). (5) Add a memory reservation (soft limit): `docker run -m 1024m --memory-reservation 900m` — under host memory pressure the kernel reclaims the container back toward 900MB, signaling trouble before the 1GB hard kill (it does not itself trigger swapping). (6) Health check + restart policy: `docker run --health-cmd 'curl -f http://localhost/health || exit 1' --restart unless-stopped`. If memory pressure makes the app unresponsive, the health check fails and the container restarts (clean state; sessions lost, but the app recovers). (7) Graceful shutdown on SIGTERM: flush sessions to Redis and close connections. Note this only helps orderly restarts — an OOM kill is SIGKILL, which cannot be caught, which is why (1)-(3) must keep the app from reaching the limit in the first place. Verify: set an intentionally tight limit (e.g., 256m), apply load, and watch behavior (warnings, session eviction, restart): `docker run -m 256m --health-cmd 'curl -f http://localhost/health || exit 1' node:18 node app.js` under stress, monitoring the logs.
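Item (2)'s bounded LRU store can be sketched as follows. The app in the scenario is Node, but the eviction logic is language-agnostic; this Python version with a hypothetical `max_sessions` cap shows the shape:

```python
from collections import OrderedDict

class SessionStore:
    """Bounded LRU session store: inserting past max_sessions evicts the
    least-recently-used entry instead of growing without limit."""
    def __init__(self, max_sessions=1000):
        self.max_sessions = max_sessions
        self._store = OrderedDict()

    def put(self, session_id, data):
        if session_id in self._store:
            self._store.move_to_end(session_id)   # refresh recency
        self._store[session_id] = data
        while len(self._store) > self.max_sessions:
            self._store.popitem(last=False)       # evict oldest first

    def get(self, session_id):
        data = self._store.get(session_id)
        if data is not None:
            self._store.move_to_end(session_id)   # reads also refresh recency
        return data

store = SessionStore(max_sessions=2)
store.put("a", {}); store.put("b", {})
store.get("a")        # touch "a", so "b" becomes least-recently-used
store.put("c", {})    # evicts "b"
print(sorted(store._store))  # ['a', 'c']
```

The cap turns an unbounded OOM failure into a bounded, observable eviction rate, which you can alert on long before the container limit matters.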
Follow-up: Sessions are sensitive; you don't want them lost even during restarts. Should you use durable session storage, and what's the latency cost?
Kubernetes pod with memory request 512MB, limit 1024MB. Node has 8GB free. Pod's app memory grows to 1GB (at limit). Kubelet monitors via `--reserved-memory` settings. Does the pod get evicted, or does OOM killer fire inside the container?
Kubelet monitors memory; the OOM killer fires inside the container. For a per-container limit breach, kubelet doesn't preemptively evict. Scenario: (1) The app grows toward the 1GB limit. (2) The container's cgroup enforces the limit; the kernel first tries to reclaim page cache within the cgroup. (3) If the app still allocates beyond 1GB, the kernel OOM-kills a process in the cgroup (SIGKILL). (4) The container exits with code 137 (OOMKilled). Kubelet detects the crash and, per the restart policy (`restartPolicy: Always` by default), restarts the container (clean state). Node-level pressure is different: when the node's `memory.available` falls below the eviction threshold (e.g., `--eviction-hard=memory.available<100Mi`), kubelet evicts pods, ranked by QoS class: (1) BestEffort (no requests/limits): evicted first. (2) Burstable (requests < limits): evicted next. (3) Guaranteed (requests == limits): evicted last. Verify: `kubectl describe node <node>` shows the `MemoryPressure` condition and allocated resources; `kubectl describe pod <pod>` shows `OOMKilled` as the last state's reason after a container-level kill.
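The QoS ordering follows mechanically from requests and limits. A minimal sketch, checking memory only and assuming requests have already been defaulted (Kubernetes sets an unset request equal to the limit, and computes the class across all containers and resources):

```python
def qos_class(request_mb, limit_mb):
    """Memory-only approximation of Kubernetes QoS classification.
    Pass effective values: an unset request defaults to the limit."""
    if request_mb is None and limit_mb is None:
        return "BestEffort"     # evicted first under node pressure
    if request_mb is not None and request_mb == limit_mb:
        return "Guaranteed"     # evicted last
    return "Burstable"          # requests set below limits

print(qos_class(512, 1024))    # Burstable: the pod in this scenario
print(qos_class(None, None))   # BestEffort
print(qos_class(1024, 1024))   # Guaranteed
```

The pod in the question (request 512MB, limit 1024MB) is Burstable: it survives its own limit breaches only via restart, and sits in the middle of the eviction order under node pressure.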
Follow-up: Two pods evicted due to node memory pressure. One is a stateless app, one is a cache holding hot data. How do you prioritize which pod should be evicted?
Container memory limit set via cgroups v1. App uses 800MB. `docker stats` reports 800MB. But `cat /proc/meminfo` inside the container reports the host's totals, and `ps aux` shows the app's RSS well below 800MB. Why do the tools disagree, and which number does the OOM killer enforce?
Different tools measure differently: (1) `docker stats` reads the cgroup's memory accounting and reports roughly usage minus inactive page cache. (2) `/proc/meminfo` is not namespaced: inside the container it shows the host's memory, not the cgroup's. (3) `ps`/`top` RSS counts only a process's resident pages and excludes shared page cache attributed to the cgroup, so it can read well below the cgroup figure. The OOM killer enforces the cgroup counter, which includes page cache: `memory.usage_in_bytes` against `memory.limit_in_bytes` on cgroup v1, `memory.current` against `memory.max` on v2. To see what the kernel sees, read those files plus `memory.stat` under `/sys/fs/cgroup/`. Because reclaimable page cache inflates the counter but is freed under pressure, subtract `inactive_file` from usage to estimate the real working set.
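The working-set estimate can be sketched from the cgroup counters. Field names follow cgroup v2's `memory.stat`; the sample values below are made up for illustration:

```python
def working_set_bytes(usage_bytes, memory_stat_text):
    """Subtract reclaimable inactive file cache from the cgroup usage counter,
    mirroring how `docker stats` and kubelet report working-set memory."""
    stats = {}
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return usage_bytes - stats.get("inactive_file", 0)

sample_stat = """anon 419430400
file 268435456
inactive_file 209715200
active_file 58720256"""
usage = 700 * 2**20  # memory.current: 700 MiB
print(working_set_bytes(usage, sample_stat) // 2**20, "MiB")  # 500 MiB
```

Here a raw 700MiB counter conceals 200MiB of droppable cache: the app is really using ~500MiB, and alerting on the raw counter would page you for memory the kernel can reclaim for free.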
Follow-up: You see high page cache usage in memory.stat. How do you determine if it's beneficial (file reads accelerated) or wasteful (stale cache)?
Container running a Redis clone in-memory data store. Size: 5GB. `-m 6gb` limit. During a replication snapshot, memory temporarily needs 7GB (5GB data + 2GB snapshot). Container gets OOM-killed mid-snapshot (exit 137). Replicas don't get snapshot, resync fails. How do you safely trigger large memory operations without OOM?
Snapshots need temporary extra memory (data duplication). Options: (1) Raise the limit temporarily before the snapshot; in Kubernetes, update the pod's memory limit upward (a rolling update, no downtime if the old pod drains gracefully). (2) Use a copy-on-write (CoW) snapshot: `fork()` a child process that writes the snapshot to disk while the parent keeps serving. Linux CoW means parent and child share pages, so extra memory is needed only for pages the parent modifies while the child is writing — typically a small fraction, though worst case under heavy writes approaches a full copy. Redis's `BGSAVE` works exactly this way. Result: usually no OOM. (3) Throttle snapshot writes: stream to disk incrementally instead of allocating 2GB at once (`buffer_size = 100MB`, write chunk, repeat); memory overhead is one chunk buffer. (4) Pre-snapshot check: before `BGSAVE`, verify `limit - current_memory >= expected_overhead`; if not, defer the snapshot. (5) Configure swap headroom (e.g., `-m 6g --memory-swap 10g`) so CoW copies can spill to disk, trading latency for availability. For production: use CoW-based snapshots (`BGSAVE` for Redis; RocksDB compacts in the background), set the limit with headroom (e.g., 7GB for 5GB of data), configure swap if acceptable, and space snapshots out so they don't compete for memory. Verify: run `BGSAVE` and watch `docker stats` — usage should spike but stay below the limit if CoW is working. Test: `redis-benchmark -c 100 -n 1000000` to fill memory, then `BGSAVE`, watch stats.
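Option (3)'s chunked streaming can be sketched as follows. The chunk size is shrunk for the example; a real store would stream 100MB chunks to a file, but the invariant is the same — peak extra memory is one chunk, not the whole snapshot:

```python
import io

def stream_snapshot(records, out, chunk_bytes=100 * 2**20):
    """Serialize records to `out` in bounded chunks so peak extra memory is
    ~chunk_bytes instead of the full snapshot size."""
    buf = bytearray()
    for record in records:
        buf += record
        if len(buf) >= chunk_bytes:
            out.write(bytes(buf))
            buf.clear()              # bound the in-flight buffer
    if buf:
        out.write(bytes(buf))        # flush the final partial chunk

out = io.BytesIO()
data = [b"x" * 10 for _ in range(100)]      # 1000 bytes of "data"
stream_snapshot(data, out, chunk_bytes=64)  # buffers at most ~70 bytes at once
print(len(out.getvalue()))  # 1000
```

Combined with the pre-snapshot headroom check in option (4), this makes the snapshot's memory cost a constant you can budget for, instead of a multiple of the dataset size.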
Follow-up: Your container supports BGSAVE but also handles write requests during snapshot. Writes increase memory (new data). Can the snapshot + writes cause OOM during the operation?
Container with `-m 512m` runs multiple workers (processes). One worker leaks memory (grows 10MB/hour). After 2 days, container hits OOM (exit 137). Kubelet restarts. Cycle repeats. How do you identify the leaking process, patch, and prevent regression?
Identify: (1) Inside the container, monitor per-process memory over time: `ps aux | sort -k4 -rn | head -5` shows the top consumers; sample it periodically and diff — the process whose RSS grows monotonically is the leaker. (2) `pmap -x <pid>` shows detailed memory regions: a steadily growing `[heap]`/anon region points at allocation leaks, while growing file mappings point at mmap or cache growth. (3) Drill into the allocation site with language-level tooling: heap snapshots via `node --inspect` for Node, `jmap -histo <pid>` for Java, a profiling allocator (e.g., Valgrind's massif) for native code. Patch: fix the unbounded growth (unclosed handles, caches without eviction, event listeners never removed). Prevent regression: (1) recycle workers proactively — restart a worker when its RSS crosses a threshold or after N requests (the pattern behind gunicorn's `max_requests`), so a slow 10MB/hour leak can never accumulate for two days; (2) add a soak test to CI that runs the workload for many iterations and fails if RSS trends upward; (3) alert on per-process RSS growth in production well before the 512MB container limit is reached.
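The "RSS grows over time" check can be automated. A sketch that flags a process whose sampled RSS trends steadily upward — the sampling interval and growth threshold are illustrative assumptions:

```python
def is_leaking(rss_samples_mb, min_growth_mb=5):
    """Flag a leak when RSS never decreases across samples and total growth
    exceeds min_growth_mb. Crude, but catches steady 10MB/hour leaks."""
    if len(rss_samples_mb) < 2:
        return False
    monotonic = all(b >= a for a, b in zip(rss_samples_mb, rss_samples_mb[1:]))
    growth = rss_samples_mb[-1] - rss_samples_mb[0]
    return monotonic and growth >= min_growth_mb

hourly_rss = [120, 130, 141, 150, 161]   # one worker, sampled hourly
print(is_leaking(hourly_rss))            # True: steady climb
print(is_leaking([120, 118, 121, 119]))  # False: normal fluctuation
```

Feeding it per-PID RSS parsed from periodic `ps` output (or cgroup stats) gives an early-warning alert days before the container would hit its limit.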
Follow-up: Memory leak is in a third-party library. You can't patch upstream quickly. How do you live-patch or work around the leak?