Your Docker hosts are filling up with logs. Containers generate 50GB/day, consuming all available disk space. Older containers' logs are deleted automatically by Docker, but you lose debugging information. You need to centralize logs to prevent disk overflow and maintain historical logs for compliance. Design a logging strategy for a 100-container production environment.
Default Docker logging (the json-file driver) writes to host disk, causing capacity issues. Solution: (1) Use a centralized logging driver instead of the local json-file driver. Configure the Docker daemon (/etc/docker/daemon.json): "log-driver": "splunk", "log-opts": {"splunk-token": "...", "splunk-url": "https://splunk:8088"}. This sends logs directly to Splunk instead of to disk. (2) Set log rotation to prevent unbounded growth: "log-opts": {"max-size": "10m", "max-file": "3"} for the local json-file driver. This keeps at most 3 files of 10MB each, totaling 30MB per container. (3) Use the awslogs driver in AWS environments: "log-driver": "awslogs", "log-opts": {"awslogs-group": "/ecs/myapp", "awslogs-region": "us-east-1"}. Logs stream to CloudWatch. (4) For on-premises, use syslog or journald: "log-driver": "journald". Logs go to the host's systemd journal. (5) Tag logs so they can be filtered downstream: "log-opts": {"tag": "{{.ImageName}}|{{.Name}}|{{.ID}}", "labels": "app"} (the labels option takes a list of label keys). Note that tag and labels attach metadata for correlation and filtering in the aggregator; they don't reduce volume by themselves. (6) Configure the log level in your application to reduce verbosity: DEBUG logs in dev, INFO or ERROR in production. In docker-compose: logging: driver: awslogs, options: awslogs-group: /logs/app, awslogs-region: us-east-1. This offloads logs from host disk to a centralized system, prevents disk exhaustion, and maintains historical logs for compliance.
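Steps (1) and (2) both live in /etc/docker/daemon.json. A minimal sketch for the local-rotation case (the sizes are illustrative; the daemon must be restarted, and the settings apply only to containers created afterwards):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

To switch to a centralized driver, replace "json-file" with "splunk" or "awslogs" and put that driver's options under "log-opts" instead.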
Follow-up: What's the performance impact of different logging drivers? Does sending logs to a remote system affect container performance?
You're using the json-file logging driver. A container generates 10,000 lines/second during peak traffic. The json-file driver can't keep up with the write rate. Container I/O blocks waiting for log writes to complete, slowing down the app significantly. How do you prevent logging from degrading performance?
High-volume logging with synchronous I/O causes performance degradation. Solutions: (1) Use the journald or syslog driver, which buffers writes rather than blocking on every line. (2) Reduce logging verbosity: emit only ERROR and WARN logs to stderr, not DEBUG and INFO. Use log-level configuration in your app. (3) Use non-blocking message delivery: docker run --log-driver json-file --log-opt mode=non-blocking. If the logger can't keep up, logs are dropped instead of blocking the app. (4) Configure the log buffer size: --log-opt max-buffer-size=10m lets the logger buffer up to 10MB (the default is 1MB) before messages are dropped. (5) Offload logging to a dedicated service or sidecar: instead of writing logs synchronously, send them asynchronously to a logging sidecar that batches writes. Example in your app: spawn a thread that reads from a log queue and writes to stdout in batches. (6) Use structured logging with efficient serialization: instead of free-form text, use JSON or protobuf. This reduces parsing overhead downstream. Configuration: docker run --log-driver json-file --log-opt mode=non-blocking --log-opt max-buffer-size=10m app. This decouples application performance from logging I/O, ensuring logging doesn't block the main thread.
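Option (5) can be sketched in a few lines of Python. AsyncBatchLogger is a hypothetical name, and the queue and batch sizes are illustrative; the point is that the request path only enqueues, mirroring the drop-instead-of-block trade-off of mode=non-blocking:

```python
import queue
import sys
import threading

class AsyncBatchLogger:
    """Sketch of option (5): application threads enqueue without blocking,
    while a daemon thread drains the queue and writes lines in batches."""

    def __init__(self, stream=sys.stdout, maxsize=10000, batch=100):
        self.stream = stream
        self.queue = queue.Queue(maxsize=maxsize)  # bounded: drop, don't block
        self.batch = batch
        self.dropped = 0  # lines lost to back-pressure (see the follow-up)
        threading.Thread(target=self._writer, daemon=True).start()

    def log(self, line):
        """Called from request-handling code; never blocks on I/O."""
        try:
            self.queue.put_nowait(line)
        except queue.Full:
            self.dropped += 1  # same trade-off as mode=non-blocking

    def _writer(self):
        while True:
            lines = [self.queue.get()]  # wait for at least one line
            while len(lines) < self.batch:
                try:
                    lines.append(self.queue.get_nowait())
                except queue.Empty:
                    break
            self.stream.write("\n".join(lines) + "\n")
            self.stream.flush()
```

Tracking a dropped counter also answers the follow-up: you can export it as a metric to know when you're losing logs.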
Follow-up: What happens when logs are dropped due to buffer overflow? How do you know if you're losing logs?
You're using Splunk for centralized logging. Your app logs contain sensitive data (API keys, passwords, PII). The logs are forwarded to Splunk unencrypted and stored without redaction. You need to encrypt logs in transit and scrub sensitive fields before they're indexed. Design a secure logging pipeline.
Sensitive data in logs is a security risk. Implement: (1) Encrypt logs in transit to Splunk using TLS/HTTPS: configure the splunk logging driver with splunk-url: "https://splunk:8088" and leave certificate verification enabled (don't set splunk-insecure-skipverify). (2) Authenticate with the Splunk token: --log-opt splunk-token=abc123, and store the token in a secrets manager rather than in plain daemon configuration. (3) Redact sensitive fields at the source: in your app, before logging, strip API keys, passwords, and PII. Use regex to redact: log_message = re.sub(r'api_key=\S+', 'api_key=***', log_message). (4) Control what metadata the driver attaches: the labels and env log options decide which container labels and environment variables are copied onto every log entry, so keep secrets out of them. (5) Use Splunk's redaction rules to scrub sensitive patterns: define regex rules in Splunk configuration to mask API keys, SSNs, etc. (6) Encrypt logs at rest in Splunk: enable Splunk encryption at rest in configuration. (7) Use structured logging with explicit fields: instead of logging raw strings, use key-value pairs, then selectively exclude sensitive keys from logging. Example: logger.info({user_id: 123, api_key: 'REDACTED', status: 'ok'}). This ensures sensitive data is scrubbed before it reaches the logging system and is encrypted in transit and at rest.
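The single re.sub call in step (3) generalizes to a table of patterns. A minimal Python sketch; the patterns shown (api_key, password, a US SSN shape) are examples and would need to be extended for your own secret formats:

```python
import re

# Illustrative redaction rules applied before a line reaches the log driver.
REDACTIONS = [
    (re.compile(r"(?i)api_key=\S+"), "api_key=***"),
    (re.compile(r"(?i)password=\S+"), "password=***"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),  # US SSN shape
]

def scrub(message: str) -> str:
    """Apply every redaction pattern to one log message."""
    for pattern, replacement in REDACTIONS:
        message = pattern.sub(replacement, message)
    return message
```

Scrubbing at the source is the only layer that protects every downstream system at once; Splunk-side rules only protect Splunk.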
Follow-up: How do you balance security (redacting everything) with debuggability (needing to see enough detail to diagnose issues)?
You're using syslog driver, but logs from multiple containers with the same name get mixed together. The syslog entries don't include container ID or hostname, making it impossible to correlate logs to specific containers. Logs are impossible to parse. How do you tag and structure logs for correlation?
Unstructured logs without container context are useless at scale. Implement structured logging with rich metadata: (1) Use the tag option in the logging driver to include container metadata: --log-opt tag="{{.ImageName}}|{{.Name}}|{{.ID}}". This adds image name, container name, and container ID to every log line. (2) Use labels to add custom metadata: docker run --label app=myapp --label version=1.0 app. Then configure logging to include labels: --log-opt labels=app,version. (3) Emit structured logs from your app: instead of plain text, use JSON format with fields: timestamp, service_name, container_id, user_id, request_id, message. Example: {"timestamp": "2026-04-07T10:00:00Z", "service": "api", "container_id": "abc123", "request_id": "xyz789", "message": "request processed"}. (4) Use correlation IDs: when a request enters the system, assign a UUID and propagate it through all services. Log this ID in every related log entry. (5) Configure syslog with the RFC 5424 format, which supports structured-data elements; with the util-linux logger you can test it: logger --rfc5424 -t myapp 'request processed' (structured-data fields can be attached with --sd-id and --sd-param). (6) At query time, use a log aggregation tool (ELK, Splunk, Datadog) that parses structured logs and allows filtering by container, service, or request_id. This ensures every log entry is richly tagged and easily correlated to a specific container and request.
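Steps (3) and (4) together look like this in Python. The field names (service, request_id) and the helper names are illustrative, not a standard API; the key point is that every record carries the correlation ID:

```python
import json
import sys
import time
import uuid

def make_request_id() -> str:
    """Assigned once at the edge (load balancer / gateway), then propagated."""
    return str(uuid.uuid4())

def log_event(service: str, request_id: str, message: str, **fields) -> dict:
    """Emit one structured JSON log line to stdout for the Docker log driver."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "request_id": request_id,
        "message": message,
        **fields,  # any extra context: user_id, status, latency_ms, ...
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record  # returned so callers/tests can inspect what was logged
```

Because every line is valid JSON with a request_id field, any aggregator that parses JSON can index and filter on it without custom parsing rules.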
Follow-up: How do you implement distributed tracing with correlation IDs across microservices? What's the standard format?
You're running a service on AWS with CloudWatch Logs driver. During a traffic spike, 50 containers are writing logs simultaneously. CloudWatch ingestion throughput is limited to 1MB/second. Log writes get throttled, and older logs are dropped. Your debug logs disappear exactly when you need them most. How do you handle high-volume logging with CloudWatch?
CloudWatch ingestion is rate-limited, so traffic bursts get throttled. Solutions: (1) Reduce log verbosity: only log ERROR and WARN in production, not DEBUG and INFO. In your app, use conditional logging based on environment: if ENV == 'production': LOG_LEVEL = ERROR else: LOG_LEVEL = DEBUG. (2) Batch log writes: instead of logging on every request, aggregate logs in memory and flush periodically (e.g., every 100 logs or 5 seconds). (3) Use FireLens for better throughput: FireLens is an ECS log router (built on Fluent Bit) that can buffer and batch logs before sending them to CloudWatch or other backends. In the ECS task definition: logDriver: awsfirelens, options: Name: datadog. (4) Use sampling: log only a percentage of requests. Log 100% of ERROR, 10% of INFO, 1% of DEBUG. This reduces volume while maintaining visibility of issues. (5) Use a sidecar aggregator: run a separate logging container (Fluentd, Vector) that collects logs from all app containers and sends them to CloudWatch in batches. (6) Switch to a higher-throughput logging backend: Datadog, Splunk, or ELK can absorb more volume than a throttled CloudWatch log stream, and FireLens can route to them directly. For production, combine sampling (log roughly 10% of non-error traffic), a higher LOG_LEVEL, and FireLens with a high-throughput backend. This prevents log throttling while keeping cost and performance under control.
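The per-level sampling in step (4) reduces to one comparison per log call. A minimal Python sketch; the rates mirror the text (100% ERROR, 10% INFO, 1% DEBUG), and the injectable rand parameter is just there to make the decision testable:

```python
import random

# Illustrative per-level sample rates: keep every error, thin the rest.
SAMPLE_RATES = {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.1, "DEBUG": 0.01}

def should_log(level: str, rand=random.random) -> bool:
    """Return True if this line should be emitted under the sampling policy.
    Unknown levels default to always-log so nothing is silently lost."""
    return rand() < SAMPLE_RATES.get(level, 1.0)
```

This is static sampling; the follow-up's "intelligent sampling" would replace the fixed table with rates derived from current error rate or latency.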
Follow-up: How do you implement intelligent sampling? What about sampling based on error rate or latency?
You're using the json-file logging driver with log rotation (max-size: 10m, max-file: 3). A container's logs are rotated, but the rotated log files are never compressed, wasting disk space. You have 100 containers, each with 30MB of uncompressed rotated logs. How do you reduce disk usage from rotated logs?
Docker's json-file driver doesn't compress rotated logs by default, causing disk bloat. Solutions: (1) Enable compression: recent Docker versions support a "compress": "true" log-opt for the json-file driver, which gzips rotated files. Alternatively, use logrotate on the host: configure /etc/logrotate.d/docker with /var/lib/docker/containers/*/*-json.log { daily, rotate 7, compress, delaycompress, copytruncate, missingok, notifempty } (copytruncate is needed so Docker keeps writing to the same file). Compression typically shrinks text logs to 5-10% of their original size. (2) Use a different driver that supports compression: journald compresses logs automatically. (3) Use external logging: send logs to CloudWatch or Splunk where compression is handled server-side, so you don't store uncompressed logs on the host. (4) Use a log aggregator sidecar: run Fluentd or Logstash that reads logs, compresses them, and ships them to storage. (5) Set aggressive rotation limits: max-size: 5m, max-file: 2 (instead of 10m and 3). Keep only 10MB total on disk per container instead of 30MB. (6) Clean up old rotated logs: run a cron job that deletes Docker logs older than 30 days: find /var/lib/docker/containers -name "*-json.log.*" -mtime +30 -delete. For production: use journald (built-in compression) or external logging (Splunk). For local dev: enable the compress log-opt or use logrotate. This reduces disk usage from 30MB to roughly 3-5MB per container while maintaining log history for debugging.
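Options (1) and (5) combine into a single daemon.json. A sketch, assuming a Docker version whose json-file driver supports the compress option; the size limits are illustrative:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "5m",
    "max-file": "2",
    "compress": "true"
  }
}
```

With compression enabled, only the current file stays uncompressed; rotated files are gzipped in place, so the worst-case per-container footprint is well under the nominal max-size × max-file budget.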
Follow-up: What's the trade-off between log compression and query performance? Do compressed logs cause latency when searched?
You're debugging a production issue and need to trace a specific request through multiple containers. The app's logs include the request ID, but the Docker logging driver doesn't preserve it in the structured format. You're searching through raw json-file logs and filtering manually, which is time-consuming. How do you set up structured logging for multi-container debugging?
Debugging multi-container issues requires tracing a request through logs. Implement: (1) Use structured JSON logging from your app: emit logs as JSON with all relevant fields including request_id, user_id, service_name, timestamp. Example in Node.js: console.log(JSON.stringify({timestamp: new Date(), request_id: req.id, service: 'api', message: 'request received'})); (2) Configure Docker logging to preserve the structured output: use journald or awslogs, or a log aggregator that parses JSON. (3) Implement request correlation: when a request enters the system (e.g., at the load balancer or API gateway), assign a UUID. Pass this ID to all downstream services via headers (X-Request-ID or similar). Each service logs this ID. (4) Use a log aggregation platform (ELK, Datadog, Splunk) that parses JSON logs and indexes request_id, then query all logs for a specific request_id across all containers. (5) Use distributed tracing (Jaeger, Datadog APM): instead of just logs, trace request flow through all services with timing information. (6) In your app, emit logs with trace context: include trace_id, span_id, and parent_span_id (OpenTelemetry format) so log aggregators can reconstruct the request flow. At query time, filter on request_id:"xyz789" and group the results by service to see the request's path. This lets you search for a single request across all containers and trace its path, making production debugging fast and reliable.
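The propagation half of step (3) is small enough to sketch. This is a hypothetical handler, not a framework API: it reuses an inbound X-Request-ID (set at the gateway) or mints one, logs it as structured JSON, and returns the headers to forward downstream:

```python
import json
import sys
import uuid

def handle_request(headers: dict) -> dict:
    """Reuse the inbound correlation ID or mint one, log it, and return the
    headers that should be forwarded to downstream services."""
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    sys.stdout.write(json.dumps({
        "service": "api",  # hypothetical service name
        "request_id": request_id,
        "message": "request received",
    }) + "\n")
    # Propagate the same ID so every hop logs the same value.
    return {**headers, "X-Request-ID": request_id}
```

Because every service applies the same reuse-or-mint rule, the first hop mints the ID exactly once and every later hop inherits it.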
Follow-up: What's the overhead of emitting structured JSON logs on every request? How does it affect performance?
You're running containers on hosts with limited disk space (e.g., edge devices, IoT). You can't store logs locally or send them to a remote system (bandwidth is limited). You need some form of logging for debugging without consuming disk or bandwidth. What's the trade-off you need to make?
On resource-constrained systems, you must choose between visibility and resources. Implement a tiered approach: (1) Use in-memory circular-buffer logging: your app keeps the last N log lines in memory (e.g., the last 1,000 lines, or a fixed-size ring buffer). When the buffer fills, the oldest lines are overwritten. On error, dump the buffer to stderr for collection. (2) Use sampling and aggregation: emit only critical errors and warnings; sample INFO logs (1 in 100); never emit DEBUG logs. This can cut volume by 99%. (3) Use metrics instead of logs: instead of logging "request_latency_ms: 150", emit a metric (counter, histogram), which is far more compact. Metrics are designed for high-volume data. (4) Use a lightweight logger: replace heavy logging frameworks with minimal ones (e.g., a plain syslog call instead of a full structured-logging stack). (5) Use remote logging via batching: buffer logs in memory and send them to a remote system in occasional bulk operations. This reduces bandwidth and I/O. (6) Emit logs only on errors and summaries: normally the app is silent; on error, it emits full context, and it periodically emits summary metrics. Example: a web server emits nothing during normal operation, but if 1,000 requests fail, it emits an error summary. (7) Use container logs only for startup/shutdown; use metrics for runtime observability. For edge devices: an in-memory circular buffer (trading visibility for resources) + error-only logging with sampling + metrics. This keeps disk and bandwidth usage minimal while maintaining observability for critical issues.
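Option (1) maps directly onto a bounded deque in Python. RingBufferLog is a hypothetical name and the capacity is illustrative; the hot path touches neither disk nor network, and history is flushed only when something goes wrong:

```python
import collections
import sys

class RingBufferLog:
    """Sketch of option (1): keep the last N lines in memory and dump
    them to stderr only when an error makes the history worth shipping."""

    def __init__(self, capacity=1000):
        # deque with maxlen silently discards the oldest entry when full.
        self.buffer = collections.deque(maxlen=capacity)

    def log(self, line):
        self.buffer.append(line)  # no disk or network I/O in the hot path

    def dump(self, stream=sys.stderr):
        """On error, flush the recent history for collection, then reset."""
        for line in self.buffer:
            stream.write(line + "\n")
        self.buffer.clear()
```

The trade-off is explicit: a crash that also loses the process loses the buffer, which is why critical errors should still be emitted (or batched out) immediately.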
Follow-up: How do you balance observability (knowing what's happening) with resource constraints on edge devices?