Docker Interview Questions

Union Filesystems and OverlayFS


A container is writing to a file on what should be a read-only filesystem layer. The write succeeds, but the file doesn't appear in the final image or persist after the container stops. The developer thinks the file was written to disk. Explain what's happening and why writes to read-only layers don't actually modify the layer.

This is copy-on-write (CoW) behavior in overlayfs. Explain:
(1) Docker images are composed of read-only layers stacked on top of each other, and the container gets a thin read-write layer on top.
(2) When the container opens a file from a lower (read-only) layer for writing, overlayfs performs a "copy-up": the entire file is copied into the writable layer and the write happens there. The kernel never modifies the lower layer.
(3) The file now exists in two places: the original (unmodified) in the read-only layer below, and the modified copy in the writable layer above. Reads return the version from the writable layer.
(4) When the container is removed (or run with --rm), the writable layer is deleted and the modified file is gone. The read-only layers remain unchanged.
(5) This is why writes don't appear in the final image: images are built from read-only layers only. The container's writable layer is ephemeral and is never committed to the image unless you explicitly run docker commit.
(6) This is also why deleting a file in a container doesn't reduce image size: deletion creates a "whiteout" in the writable layer (a marker saying "this file is deleted") but removes nothing from the lower layers. The file still exists there; it merely appears deleted in the merged view.
Example: the base image has /var/log/app.log (100MB) in a read-only layer. The container appends to it. Copy-up copies the 100MB file into the writable layer, the append lands there, and reads return the modified copy; the original 100MB in the read-only layer is untouched. When the container is removed, the writable layer (with the modified copy) is deleted, so the next container run starts fresh with no appended data. This is why you need volumes for persistent container data.
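The bookkeeping above can be sketched as a toy in-memory model. This is an illustration only: real overlayfs is a kernel filesystem, and copy-up actually occurs when a lower-layer file is opened for writing, not on the write itself.

```python
# Toy model of overlayfs layering -- illustration only, not real kernel code.
class Overlay:
    def __init__(self, *lower_layers):
        self.lowers = list(lower_layers)   # read-only image layers, topmost first
        self.upper = {}                    # the container's writable layer
        self.whiteouts = set()             # deletion markers

    def read(self, path):
        if path in self.whiteouts:         # whiteout hides lower-layer copies
            raise FileNotFoundError(path)
        if path in self.upper:             # modified copy in writable layer wins
            return self.upper[path]
        for layer in self.lowers:          # fall through to read-only layers
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # "copy-up": the lower layer is never touched; new content
        # lands in the writable layer only
        self.whiteouts.discard(path)
        self.upper[path] = data

    def delete(self, path):
        # deletion is a whiteout marker, not a removal from lower layers
        self.upper.pop(path, None)
        self.whiteouts.add(path)

    def stop_container(self):
        # on container removal the writable layer is discarded;
        # lower layers are unchanged
        self.upper, self.whiteouts = {}, set()

base = {"/etc/motd": "hello"}
fs = Overlay(base)
fs.write("/etc/motd", "patched")
assert fs.read("/etc/motd") == "patched"   # merged view shows the copy
assert base["/etc/motd"] == "hello"        # read-only layer untouched
fs.stop_container()
assert fs.read("/etc/motd") == "hello"     # next run starts fresh
```

The three assertions mirror the scenario in the question: the write appears to succeed, the lower layer is never modified, and nothing survives the writable layer being discarded.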

Follow-up: How much overhead does copy-on-write add for large files? If a container modifies a 1GB file, does the entire 1GB get copied?

Your container application writes logs to /var/log/app.log. The file grows over the container's lifetime. When the container stops, the logs are gone. You can't see historical logs. Meanwhile, the overlay2 storage driver is accumulating writable layer data. How do you persist logs without using volumes?

Logs in the container's writable layer are ephemeral. Solutions:
(1) Use volumes: docker run -v /var/log/app app creates an anonymous volume at /var/log/app, backed by host storage that bypasses the overlay2 writable layer, so logs persist after the container is removed. With a named volume, docker volume create app-logs then docker run -v app-logs:/var/log/app app, Docker manages the storage location.
(2) Bind-mount a host directory: docker run -v /host/logs:/var/log/app app writes logs to /host/logs, which persists after the container stops.
(3) Emit logs to stdout/stderr: modify the app to log to stdout instead of files. Docker captures this automatically, and docker logs container-name retrieves them.
(4) Use logging drivers: configure Docker's logging driver (json-file, syslog, awslogs) to capture stdout/stderr and store it outside the overlay2 layer, locally or in a central service.
(5) Use a sidecar logging container: run a separate container that tails log files from the app container (via a shared volume) and ships them to external storage or a logging service.
(6) Rotate logs inside the container: even with overlay2, rotation bounds log size, though rotated logs are still lost when the container is removed unless exported.
Best practice: use logging drivers (especially in production). This centralizes logs, prevents overlay2 bloat, and ensures logs persist. For local development, bind-mount a host directory. Example: docker run -d -v $(pwd)/logs:/var/log/app app captures logs to the host's ./logs directory.
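As a concrete sketch of the logging-driver option, a minimal /etc/docker/daemon.json using the json-file driver's max-size and max-file rotation options (the values shown are illustrative defaults, not recommendations):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

With this in place, each container's stdout/stderr is captured into rotated JSON files managed by the daemon, outside the container's writable layer, and remains readable via docker logs.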

Follow-up: What's the performance difference between writing to overlay2 layers vs. volumes? Which is faster?

You have a production system running 100 containers with overlay2 storage driver. Each container generates a writable layer (delta) of ~500MB. The host's /var/lib/docker/overlay2 directory consumes 50GB. You're running low on disk space. The layers are filled with temporary files and logs that could be cleaned up. How do you reclaim disk space from overlay2?

Overlay2 can consume significant disk space if writable layers grow large. Reclaim space:
(1) Identify large layers: du -sh /var/lib/docker/overlay2/*/diff shows the size of each layer's writable delta.
(2) Stop and remove unused containers: docker container prune deletes stopped containers and their writable layers, freeing space immediately.
(3) Clean up images: docker image prune removes dangling images; docker image prune -a also removes images not referenced by any container.
(4) Use docker system prune -a for aggressive cleanup: it removes stopped containers, unused networks, unused images, and build cache (add --volumes to include unused volumes).
(5) Reduce writable-layer growth: redirect logs and temporary files to volumes or logging drivers (as described in the previous question) so they don't accumulate in overlay2.
(6) Set container size limits: docker run --storage-opt size=2g app caps the writable layer at 2GB (with overlay2 this requires an xfs backing filesystem mounted with pquota). Once full, writes fail, forcing the app to use external storage (volumes) instead.
(7) Consider alternative storage drivers with care: btrfs and zfs offer native snapshots and different space characteristics, but overlay2 is the recommended default and devicemapper is deprecated.
(8) Monitor overlay2 usage: alert when /var/lib/docker/overlay2 exceeds a threshold so you can act before running out of space.
(9) For production, use persistent storage (volumes, network storage) for data-bearing containers and keep overlay2 for stateless containers whose writable layers should stay small.
Example: a production app with an EBS-backed volume for /data. The app writes temporary data to /tmp (overlay2), but persistent data goes to /data (the volume). When the container is removed, /tmp is discarded and /data persists. This keeps overlay2 consumption low.
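The du -sh inspection can be scripted. Below is a hypothetical helper (not a Docker tool) that reports the largest layer deltas under an overlay2-style root; point it at /var/lib/docker/overlay2 with root privileges, or at any test directory with the same layout:

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return total

def largest_diffs(overlay2_root, top=5):
    """Return the `top` largest (size_bytes, layer_id) pairs, biggest first.

    Each layer directory under overlay2 keeps its writable delta in `diff/`,
    which is exactly what `du -sh /var/lib/docker/overlay2/*/diff` measures.
    """
    sizes = []
    for layer in os.listdir(overlay2_root):
        diff = os.path.join(overlay2_root, layer, "diff")
        if os.path.isdir(diff):
            sizes.append((dir_size(diff), layer))
    return sorted(sizes, reverse=True)[:top]
```

Usage: largest_diffs("/var/lib/docker/overlay2") tells you which layers to investigate before pruning; mapping a layer ID back to a container or image is a separate step (docker inspect).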

Follow-up: What's the difference between overlay2 and devicemapper? When should you use each?

You're debugging container I/O performance. A container's writes to /data are slow. You've verified the host has plenty of disk I/O throughput. The issue is overlay2 copy-on-write overhead. When the container writes to a file in a lower layer, the entire file is copied to the writable layer, which is slow for large files. How do you optimize I/O for write-heavy containers?

Copy-on-write can cause I/O bottlenecks for write-heavy workloads. Optimize:
(1) Use volumes for write-heavy data: docker run -v data-vol:/data app. The /data path bypasses overlay2 and writes directly to host storage (or network storage), avoiding copy-up overhead entirely.
(2) Use bind mounts for high-throughput I/O: docker run -v /host/data:/container/data app. Like volumes, bind mounts skip the copy-on-write path.
(3) Use tmpfs for ephemeral data: docker run --tmpfs /tmp app. Temporary files live in memory, avoiding disk I/O entirely.
(4) Shrink the files the container will modify: copy-up duplicates the whole file on first write, so splitting one large mutable file into smaller pieces reduces each copy-up's cost.
(5) Use faster backing storage for /var/lib/docker: SSD instead of HDD. Copy-up still happens, but at higher throughput.
(6) Avoid modifying lower-layer files at all: even an append triggers a full copy-up on the first write, so design the app to write fresh files in the writable layer or, better, a volume (e.g., append to a new log rather than rewriting a large image file).
(7) Profile I/O: use iostat and docker stats to identify where I/O is slow. If first writes to image files are consistently slow, move those paths to volumes; if not, overlay2 is fine.
Example: for a database container, use a volume: docker run -v postgres-data:/var/lib/postgresql/data postgres. Database writes go directly to storage, not through overlay2. For ephemeral containers (stateless web servers), overlay2 is fine because the writable layer stays small. This ensures write-heavy containers use fast I/O paths.
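To answer "is the storage path the bottleneck?" empirically, a small hypothetical benchmark helper (not a Docker tool) can be run inside the container once against a path on the overlay filesystem and once against a volume mount. Note it measures steady-state sequential write throughput to a fresh file; copy-up cost shows up separately, as latency on the first write to an existing lower-layer file.

```python
import os, time

def write_throughput(path, mb=64, chunk=1 << 20):
    """Write `mb` MiB sequentially to a temp file under `path`; return MiB/s."""
    buf = b"\0" * chunk
    target = os.path.join(path, "bench.tmp")
    start = time.monotonic()
    with open(target, "wb") as f:
        for _ in range(mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())       # force data to disk so the timing is honest
    elapsed = time.monotonic() - start
    os.remove(target)
    return mb / elapsed
```

Usage inside a container: compare write_throughput("/tmp") (overlay2 writable layer) against write_throughput("/data") (a volume). A large gap points at the storage path; similar numbers point elsewhere.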

Follow-up: How does a union filesystem differ from traditional filesystem copy-on-write? Are they the same mechanism?

You're building a Docker image with many layers. Each layer adds files, some of which become obsolete in later layers. For example, layer 1 adds a 100MB build tool, layer 2 deletes it. The final image is 100MB larger than it should be because the deleted file is still in layer 1, just hidden by a whiteout. How do you minimize image size when layers contain deletions?

Overlay2 stores deletions as whiteouts, not removals, so deleting a file in a later layer never reduces image size. Minimize image size:
(1) Don't add files you'll delete: reorganize the Dockerfile so temporary files never enter any layer in the first place.
(2) Combine RUN commands: each RUN instruction creates a separate layer, so cleanup must happen in the same RUN as installation: RUN apt-get install -y gcc && apt-get clean && rm -rf /var/lib/apt/lists/*. If the cleanup runs in a later RUN, it only adds a whiteout in the new layer; the earlier layer still carries the 100MB.
(3) Use multi-stage builds: build in one stage and COPY only the needed artifacts into the final stage, so build tools never appear in the final image at all.
(4) Use .dockerignore to exclude unnecessary files from COPY: don't include build artifacts, test files, or dependencies that you'd delete later.
(5) For unavoidable deletions (e.g., from a base image), accept them. The image is larger than optimal, but it's a one-time cost.
(6) Squash layers to remove whiteouts: docker build --squash (an experimental daemon feature) merges all layers into one, removing whiteouts. However, squashing prevents layer reuse and caching.
Example: BAD, two instructions: RUN apt-get install -y gcc followed by RUN rm -rf /var/lib/apt/lists/*. The 100MB stays in the first layer, merely hidden by a whiteout in the second. GOOD, one instruction: RUN apt-get install -y gcc && rm -rf /var/lib/apt/lists/*. Cleanup happens in the same layer, so the image is much smaller. For images with many layers and deletions, consider squashing, but be aware it breaks caching and layer reuse.
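In Dockerfile form (package names illustrative):

```dockerfile
# BAD: two layers -- the second RUN only adds a whiteout;
# the package lists still occupy the first layer.
RUN apt-get update && apt-get install -y gcc
RUN rm -rf /var/lib/apt/lists/*

# GOOD: one layer -- install and cleanup in the same RUN,
# so the cache files never land in any committed layer.
RUN apt-get update && apt-get install -y gcc \
    && rm -rf /var/lib/apt/lists/*
```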

Follow-up: What exactly is a whiteout in overlayfs? How does Docker implement file deletion in union filesystems?

You're running containers with overlay2 storage driver. Over time, the host's inode count is increasing even though you're removing containers. The inode count never decreases. Eventually, you hit the inode limit and can't create new files, even though disk space is available. What's happening and how do you recover?

Inode exhaustion is a separate issue from disk space exhaustion. Explain:
(1) Inodes are filesystem metadata structures; every file and directory consumes one, even empty files.
(2) Overlay2 creates an inode for each file in each layer, and every copy-up creates another inode in the writable layer. Images with huge numbers of small files (e.g., node_modules trees) are especially inode-hungry.
(3) Removing a container frees its writable layer's inodes, but dangling layers from failed builds, intermediate images, and build cache keep accumulating.
(4) Check inode usage: df -i shows used vs. total inodes per filesystem.
(5) Identify inode hogs: find /var/lib/docker/overlay2 -type f | wc -l shows the total file count. Millions of entries point at overlay2 as the culprit.
(6) Clean up: docker system prune -a removes stopped containers, unused images, and networks; docker container prune, docker image prune, docker volume prune, and docker builder prune target specific resource types.
(7) Recovery at the limit: freeing inodes means deleting files. Unlike disk space, ext4's inode count is fixed at format time (check with tune2fs -l /dev/sdXY | grep -i inode), so if pruning isn't enough you must recreate the filesystem with more inodes (mkfs.ext4 -N) or move /var/lib/docker to a filesystem with dynamic inode allocation such as xfs or btrfs.
(8) Prevention: monitor inode usage and alert when it exceeds 70-80%; run prune commands on a schedule (cron).
Example: run docker system prune -a to reclaim inodes from unused images and containers. Monitor inode usage with df -i and alert if usage exceeds 80%. This prevents hitting inode limits unexpectedly.
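The df -i check can be automated for alerting. A minimal sketch using os.statvfs, which exposes the same inode counters df -i reports (the function names and the 80% threshold are illustrative):

```python
import os

def inode_usage(path="/"):
    """Inode usage for the filesystem containing `path`, like `df -i`.

    Returns (used, total, percent_used). Some filesystems (e.g., btrfs)
    report 0 total inodes because they allocate inodes dynamically.
    """
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree
    pct = 100.0 * used / st.f_files if st.f_files else 0.0
    return used, st.f_files, pct

def inodes_ok(path="/", threshold=80.0):
    """True if inode usage is below the alert threshold (percent)."""
    _used, _total, pct = inode_usage(path)
    return pct < threshold
```

In practice you would point this at the filesystem holding /var/lib/docker and wire inodes_ok into a cron job or monitoring agent that triggers a prune or pages an operator.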

Follow-up: What's the difference between inode exhaustion and disk space exhaustion? Why does one occur without the other?

You're implementing a container that modifies a configuration file from the base image. The app reads a default config from the image, then overwrites it based on environment variables. The writable layer creates a copy of the entire config file (copy-on-write). For large config files (500MB), this overhead is significant. Design an approach that avoids unnecessary copying.

CoW can be expensive for large files because the whole file is copied on first write. Optimize:
(1) Don't modify read-only files: instead of rewriting a config file from the base image, create a new file in the writable layer and keep the original untouched. Example: the base image has /etc/config/default.conf (read-only); the container creates /etc/config/app.conf (written fresh, no copy-up); the app reads the defaults first, then the app-specific overrides.
(2) Use volumes for mutable config: mount config as a volume instead of modifying the image copy. docker run -v /host/config.conf:/etc/config/app.conf app. The volume is separate from overlay2, so no CoW occurs.
(3) Use environment variable substitution: pass config via environment variables and have the app read env vars instead of rewriting files.
(4) Generate config at startup: use templating (Jinja2, envsubst) in the entrypoint to render config files fresh into the writable layer. Creating a new file is cheap; copying up a large existing one is not.
(5) Split large files: if config really is 500MB (unusual but possible), break it into chunks and modify only the chunks that change.
(6) Note there is no in-place modification path through overlayfs: opening a lower-layer file for writing, including via a writable mmap, triggers a full copy-up.
(7) Measure CoW overhead before optimizing: profile startup with docker stats and iostat. If copy-up accounts for less than ~1% of total I/O, it's not worth optimizing.
Example: the base image has /data/config.json (100MB, read-only). Instead of modifying it (a costly 100MB copy-up), the app creates /data/config.override.json (<1MB, generated fresh in the writable layer) and merges the two at startup. This avoids CoW overhead while keeping config flexible.
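The default-plus-override pattern can be sketched as a startup merge. The function name, file layout, and APP_ environment prefix are illustrative, not a real library API:

```python
import json, os

def load_config(default_path, override_path=None, env_prefix="APP_"):
    """Merge a read-only default config with an optional small override file
    and environment variables, without ever writing to the default."""
    with open(default_path) as f:          # lives in a read-only layer: only read
        cfg = json.load(f)
    if override_path and os.path.exists(override_path):
        with open(override_path) as f:     # small file created fresh, no copy-up
            cfg.update(json.load(f))
    for key, value in os.environ.items():  # env vars override both files
        if key.startswith(env_prefix):
            cfg[key[len(env_prefix):].lower()] = value
    return cfg
```

The large default file is only ever opened for reading, so overlayfs never copies it up; all mutation happens in a tiny override file or in environment variables.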

Follow-up: How can you measure copy-on-write overhead empirically? What tools and metrics should you use?

A container is run with overlay2 storage. It creates a deep directory structure (e.g., /a/b/c/d/e/f/g/h/file.txt) and modifies the file. Overlayfs intercepts the write and copies the entire file. But what about the directory structure itself? Do directories get copied on every write? Explain how overlayfs handles directory metadata.

Overlayfs handles directories differently than regular files. Understand:
(1) When a file is modified, only that file is copied up. The directory path above it is recreated in the writable layer as directory entries (a cheap metadata operation); directory contents are never copied.
(2) Creating a new file in a lower-layer directory works the same way: overlayfs materializes the parent path in the writable layer and adds the file there. No whiteout is involved; whiteouts mark deletions only, and an "opaque" directory marker is used when an entire directory is replaced.
(3) Changing directory metadata (permissions, ownership) copies up only the directory entry itself, again metadata only, never its contents.
(4) Lookup cost: to resolve a path, overlayfs must merge entries from every layer, so the first access to a deep path in an image with many layers does more inode lookups. These lookups are cheap, and the kernel's dentry cache makes repeated accesses fast; only the first traversal pays the full cost.
(5) For read-heavy deep structures the overhead is negligible; for write-heavy structures the cost comes from copying files up, not from traversing directories.
(6) In practice, keep structures reasonably shallow (tens of levels, not thousands); typical applications rarely exceed 20 levels, and directory depth is rarely a bottleneck compared with file copy-up. If you suspect traversal overhead, profile with strace -e trace=open,openat, which shows the path-resolution syscalls.
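The claim that path depth is rarely a bottleneck is easy to sanity-check on any filesystem. This measures ordinary path resolution with a warm dentry cache on the local filesystem, not overlayfs specifically, so treat the numbers as an illustration of how cheap lookups are, not a CoW benchmark:

```python
import os, tempfile, time

def avg_open_time(path, n=200):
    """Average seconds to open `path` and read one byte, over n iterations."""
    start = time.perf_counter()
    for _ in range(n):
        with open(path, "rb") as f:
            f.read(1)
    return (time.perf_counter() - start) / n

root = tempfile.mkdtemp()
deep_dir = os.path.join(root, *["d"] * 50)   # 50 levels of nesting
os.makedirs(deep_dir)
for directory in (root, deep_dir):
    with open(os.path.join(directory, "file.txt"), "w") as f:
        f.write("x")

shallow = avg_open_time(os.path.join(root, "file.txt"))
deep = avg_open_time(os.path.join(deep_dir, "file.txt"))
# Once the dentry cache is warm, deep opens are typically within the
# same order of magnitude as shallow ones.
```

Running the same comparison inside a container (against paths baked into lower image layers) would additionally exercise overlayfs's merged lookups.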

Follow-up: How does overlayfs cache directory metadata? Is it invalidated when lower layers change?
