Kubernetes Interview Questions

Init Containers, Sidecars, and Pod Lifecycle


Your deployment uses an Envoy sidecar proxy (for service mesh). The main app container starts and immediately tries to connect to an external database. But the Envoy sidecar takes 2-3 seconds to start. Your app fails to connect before Envoy is ready and crashes with CrashLoopBackOff. How do you fix this without changing the application code?

Pod startup sequence: Kubernetes runs init containers first (sequentially), then main containers (in parallel). The issue: main app container starts immediately (doesn't wait for sidecars). App tries to use the sidecar proxy before it's ready, fails.

Solution 1 - Init container to configure network: Use an init container to set up iptables rules that intercept outbound traffic before any main container starts. Init container: initContainers: [{name: "proxy-init", image: "istio/proxyv2", securityContext: {privileged: true}, command: ["sh", "-c", "iptables -t nat -A OUTPUT -p tcp ! -d 127.0.0.1 -j REDIRECT --to-port 15001"]}]. This redirects all TCP traffic (except to localhost) to port 15001 (Envoy). Note that redirection alone doesn't make the first connection succeed: until the Envoy process is actually listening on 15001, redirected connections are refused. What it guarantees is that once Envoy is up, all traffic flows through it — pair it with retry logic or a startup probe to cover the gap.
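A minimal sketch of Solution 1 as a pod manifest (image names and the 15001 port follow Istio's conventions; the app image is a placeholder — adjust for your mesh):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-envoy
spec:
  initContainers:
    - name: proxy-init
      image: istio/proxyv2                 # any image with iptables works
      securityContext:
        privileged: true                   # NET_ADMIN is the minimum actually required
      command:
        - sh
        - -c
        - iptables -t nat -A OUTPUT -p tcp ! -d 127.0.0.1/32 -j REDIRECT --to-port 15001
  containers:
    - name: app
      image: myapp:1.0                     # hypothetical app image
    - name: envoy
      image: envoyproxy/envoy:v1.28.0      # illustrative tag
```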

Solution 2 - Startup probe on the app that checks the sidecar: startupProbe: {exec: {command: ["sh", "-c", "curl -s http://localhost:15000/stats > /dev/null"]}, failureThreshold: 30, periodSeconds: 1} (probes Envoy's admin interface from the app container — this works because containers in a pod share a network namespace). Important caveat: a startup probe does not delay the container's entrypoint; the app process starts immediately. What the probe does is suppress liveness/readiness checks and kubelet restarts for up to 30 seconds, giving the app time to retry its initial connection while Envoy comes up. Combine it with retry logic or a wait loop in the app's entrypoint.

Solution 3 - Wait loop in the app's entrypoint: An init container cannot wait for a sidecar, because all init containers must finish before any regular container (including the sidecar) starts — that pattern deadlocks the pod. Put the wait loop in the app container's own command instead: command: ["sh", "-c", "until nc -z localhost 15000; do echo waiting for sidecar; sleep 1; done; exec /app"]. This blocks the app process (not pod startup) until the sidecar's port is open, then execs the real binary.
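Since init containers finish before any sidecar starts, the wait belongs on the app container itself. A sketch (port 15000 assumes Envoy's admin interface; /app and the images are placeholders):

```yaml
containers:
  - name: app
    image: myapp:1.0                   # hypothetical app image
    command:
      - sh
      - -c
      - |
        # Block until the sidecar's port answers, then exec the real
        # binary so it becomes PID 1 and receives signals directly.
        until nc -z localhost 15000; do
          echo "waiting for sidecar"; sleep 1
        done
        exec /app
  - name: envoy
    image: envoyproxy/envoy:v1.28.0    # illustrative tag
```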

Best practice: On Kubernetes 1.28+, use native sidecars — declare Envoy as an init container with restartPolicy: Always, and the kubelet starts it before the main containers (waiting for its startup probe if one is defined), solving the ordering problem natively. On older clusters, combine an init container for lightweight setup (iptables, environment) with an entrypoint wait loop and a startup probe: initContainers: [...], containers: [{name: "app", startupProbe: {tcpSocket: {port: 8080}, failureThreshold: 30, periodSeconds: 1}}, {name: "sidecar", image: "envoy:latest"}].
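On Kubernetes 1.28+ the ordering problem has a native answer: a sidecar declared as an init container with restartPolicy: Always starts before the main containers, and the kubelet waits for its startup probe before launching the app. A hedged sketch (the /ready path and port 15021 follow Istio's convention and are assumptions; app image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-native-sidecar
spec:
  initContainers:
    - name: envoy
      image: envoyproxy/envoy:v1.28.0
      restartPolicy: Always       # makes this a native sidecar (K8s 1.28+)
      startupProbe:               # kubelet waits for this before starting "app"
        httpGet:
          path: /ready            # assumed readiness endpoint
          port: 15021
        failureThreshold: 30
        periodSeconds: 1
  containers:
    - name: app
      image: myapp:1.0            # hypothetical app image
```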

Container startup order: The kubelet starts regular containers one at a time in spec order, but it does not wait for a container to become ready before starting the next — so spec order alone is not an ordering guarantee. One implementation detail you can lean on: postStart hooks run synchronously, and the kubelet doesn't start the next container until the previous container's postStart completes. Listing the sidecar first with a blocking postStart delays the app: lifecycle: {postStart: {exec: {command: ["sh", "-c", "until curl -s http://localhost:15000/ready; do sleep 1; done"]}}}. This is the trick behind Istio's holdApplicationUntilProxyStarts; treat it as a hack, not a contract.

Follow-up: Startup probe works. But during pod updates, the old sidecar is killed before the new app connects to it (race condition). How would you ensure graceful handoff during rolling updates?

Your pod has a logging sidecar (Filebeat) that ships logs to Elasticsearch. When the main app crashes and is restarted, logs in /var/log are lost because the pod's local filesystem is ephemeral. Filebeat hasn't shipped all logs yet. You need to preserve logs even if app crashes.

Problem: Logs written to the container's writable filesystem layer (/var/log) are lost whenever that container restarts. Even with a sidecar present, it may not have shipped everything before the crash.

Solution 1 - Shared volume for logs: Use a shared emptyDir volume between app and Filebeat: volumes: [{name: "logs", emptyDir: {}}], containers: [{name: "app", volumeMounts: [{name: "logs", mountPath: "/var/log"}]}, {name: "filebeat", volumeMounts: [{name: "logs", mountPath: "/var/log"}]}]. App writes logs to /var/log, Filebeat reads and ships. emptyDir persists across container restarts (within same pod). Only lost when pod is deleted.
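Solution 1 expanded into a full manifest — image names and paths are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-filebeat
spec:
  volumes:
    - name: logs
      emptyDir: {}              # survives container restarts, deleted with the pod
  containers:
    - name: app
      image: myapp:1.0          # hypothetical app image
      volumeMounts:
        - name: logs
          mountPath: /var/log
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:8.13.0
      volumeMounts:
        - name: logs
          mountPath: /var/log
          readOnly: true        # sidecar only reads; the app owns the files
```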

Solution 2 - PersistentVolume for logs: Use a PVC for durable storage. Note that volumeClaimTemplates is a StatefulSet-only field; for a Deployment, create a PersistentVolumeClaim separately and reference it in volumes. StatefulSet example: volumeClaimTemplates: [{metadata: {name: "logs"}, spec: {storageClassName: "fast", accessModes: ["ReadWriteOnce"], resources: {requests: {storage: "10Gi"}}}}]. Logs survive pod deletion (stored on node-local or network storage). Downside: adds complexity and slower I/O.

Solution 3 - Direct log shipping from app: Instead of writing to disk, app ships logs directly to Elasticsearch/Datadog. Sidecar becomes optional. Pros: no disk I/O, logs never lost. Cons: app code must support it. Use structured logging (JSON output) and ship directly: app --log-format=json --log-backend=elasticsearch --log-endpoint=http://elasticsearch:9200.

Solution 4 - Aggressive Filebeat flushing: Tune Filebeat to ship quickly (exact option names vary by Filebeat version): a small output.elasticsearch.bulk_max_size together with a short memory-queue flush timeout (queue.mem: {flush.timeout: 1s}) pushes events roughly every second, so even if the app crashes, at most about a second of already-written logs is still in flight.
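A sketch of the corresponding filebeat.yml (option names per recent Filebeat releases — verify against your version; the Elasticsearch endpoint is a placeholder):

```yaml
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app.log
queue.mem:
  flush.timeout: 1s          # push events at least every second
  flush.min_events: 1        # don't wait to fill a batch
output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]   # hypothetical endpoint
  bulk_max_size: 100         # small batches -> smaller loss window
```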

Solution 5 - Multi-container pod with shared process namespace: Enable shareProcessNamespace: true in pod spec. All containers share the same PID namespace. If app crashes, sidecar can detect it (watch /proc/app-pid) and trigger immediate log flush before exiting.

Recommended architecture: (1) App writes logs to shared emptyDir volume. (2) Filebeat sidecar reads and ships to Elasticsearch. (3) App also logs to stdout/stderr (collected by kubelet, shipped to central logging separately). (4) Use structured JSON logging so both log paths are queryable.

Follow-up: You've added persistent logging. But Filebeat sidecar crashes (memory leak) and is restarted. During restart, app still writes logs but they're dropped. How would you handle sidecar restarts without losing app logs?

Your init container sets up database schema before the app starts. If the database is unavailable during init, the entire pod stays Pending forever (init never completes, so main container never starts). You need to allow the pod to start and retry the schema setup periodically instead.

Init containers are blocking: the pod waits for all init containers to succeed before starting main containers. If an init container fails, the kubelet retries it per the pod's restartPolicy (Always/OnFailure) with exponential backoff — the pod phase stays Pending and kubectl shows a status like Init:Error or Init:CrashLoopBackOff. This is too strict for some scenarios (e.g., a temporarily unavailable database).

Solution 1 - Move DB schema setup to app startup: Instead of an init container, have the app attempt schema setup in its entrypoint: command: ["sh", "-c", "until mysql -h db -e 'CREATE TABLE IF NOT EXISTS users (id INT)'; do sleep 5; done; exec java -jar app.jar"]. The app retries the DB every 5 seconds. Once the DB comes up, the schema is created and the app starts. While the DB is down, the app loops but the pod phase is Running (not Pending); pair this with a readiness probe so traffic isn't routed until the app is actually serving.
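Solution 1 as a container spec (SQL, credentials, and image names are placeholders; the Secret reference is an assumption about where credentials live):

```yaml
containers:
  - name: app
    image: myapp:1.0                 # hypothetical app image
    command:
      - sh
      - -c
      - |
        # Retry schema setup until the DB answers, then exec the app
        # so it replaces the shell as PID 1.
        until mysql -h db -u root -p"$DB_PASSWORD" \
          -e 'CREATE TABLE IF NOT EXISTS users (id INT PRIMARY KEY)'; do
          echo "DB not ready, retrying in 5s"; sleep 5
        done
        exec java -jar /app.jar
    env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:              # assumption: credentials in a Secret
            name: db-credentials
            key: password
```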

Solution 2 - Optional init container (non-blocking): Use init container but make it fail gracefully: initContainers: [{name: "db-setup", image: "mysql:8", command: ["sh", "-c", "mysql -h db -u root -e 'CREATE TABLE ...' || true"]}]. The || true makes init exit code 0 even if SQL fails. Pod continues regardless. Main app retries later if needed.

Solution 3 - Init container with readiness-probe fallback: The init container attempts setup but never fails the pod: initContainers: [{name: "db-setup", command: ["sh", "-c", "mysql ... || echo 'DB schema not ready, will retry via readiness probe'"]}]. The readiness probe goes on the main app container (not the init container): readinessProbe: {exec: {command: ["sh", "-c", "mysql -h db -e 'SELECT 1 FROM users LIMIT 1'"]}, failureThreshold: 5, periodSeconds: 10}. The pod starts either way; the probe retries the DB check every 10s, and traffic is only routed to the pod after readiness passes.
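The readiness-probe half of Solution 3 on the app container (assumes a mysql client is available inside the app image; image name is a placeholder):

```yaml
containers:
  - name: app
    image: myapp:1.0                  # hypothetical app image
    readinessProbe:
      exec:
        command:
          - sh
          - -c
          - mysql -h db -e 'SELECT 1 FROM users LIMIT 1'
      periodSeconds: 10               # re-check every 10s until it passes
      failureThreshold: 5             # tolerate transient DB blips
```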

Solution 4 - postStart hook for schema initialization: containers: [{name: "app", lifecycle: {postStart: {exec: {command: ["sh", "-c", "until mysql -h db -e 'CREATE TABLE ...'; do sleep 5; done"]}}}}]. The hook runs in parallel with the container's entrypoint, and the kubelet kills and restarts the container if the hook fails. Unlike an init container, a failing hook retries via ordinary container restarts instead of pinning the pod in Init.

Recommended: Combine init + app startup handler. Init container handles lightweight setup (usually succeeds). App startup handler (loop in entrypoint) handles retryable operations (DB connection). Readiness probe verifies app is fully ready. This allows flexibility: init failures don't block pod, app retries at its own pace.

Testing: To simulate database unavailability, kill the database pod: kubectl delete pod -n db mysql. Watch pod status: kubectl get pods -w. With Solution 1, pod should be Running (but not Ready). With init-only approach (no retries), pod stays Pending. Verify by checking pod events: kubectl describe pod pod-name | grep -A 10 Events.

Follow-up: App startup handler retries DB connection successfully. But app needs to run multiple initialization steps in order (schema, seed data, migrations). How would you orchestrate these steps with timeout and rollback?

Your pod has 3 containers: app, sidecar-logging, sidecar-monitoring. App container uses 2GB memory, each sidecar uses 500MB. When app crashes (OOMKilled), the entire pod is terminated—sidecars stop immediately without flushing data. You lose monitoring metrics and application traces.

OOMKilled semantics: When the app exceeds its memory limit, the kernel OOM killer targets the highest consumer in the container's cgroup and sends SIGKILL — no grace period, no preStop hook. By default only that container is restarted; the entire pod (including sidecars) goes down when the kubelet evicts the pod under node memory pressure, sending SIGTERM to every container and SIGKILL after the (often shortened) grace period. Either way, sidecars that need time to flush gracefully may not get it.

Root cause: Resource requests/limits misaligned. Check pod resource requests: kubectl get pod pod-name -o yaml | grep -A 5 resources. If no memory request is set for sidecars, scheduler assumes they're zero, packs too many pods on node. When actual usage spikes, kernel OOM killer is triggered.

Solution 1 - Set proper resource requests: Ensure all containers have memory requests: resources: {requests: {memory: "500Mi"}, limits: {memory: "500Mi"}} on sidecars. This tells scheduler how much memory pod needs (2GB + 500MB + 500MB = 3GB total). If node doesn't have 3GB available, scheduler doesn't place pod there. Pod starts on a node with sufficient memory, avoiding OOMKilled.
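Solution 1 spelled out — every container declares memory, so the scheduler reserves the full 3GB (values mirror the scenario above; image tags are illustrative):

```yaml
containers:
  - name: app
    image: myapp:1.0                       # hypothetical app image
    resources:
      requests: { cpu: "2", memory: 2Gi }
      limits:   { cpu: "2", memory: 2Gi }
  - name: sidecar-logging
    image: docker.elastic.co/beats/filebeat:8.13.0
    resources:
      requests: { memory: 500Mi }
      limits:   { memory: 500Mi }
  - name: sidecar-monitoring
    image: telegraf:1.30                   # illustrative monitoring agent
    resources:
      requests: { memory: 500Mi }
      limits:   { memory: 500Mi }
```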

Solution 2 - Understand QoS classes: QoS is assigned per pod, not per container. A pod is Guaranteed only if every container sets requests equal to limits for both CPU and memory; any mismatch makes the whole pod Burstable. So app: requests: {cpu: 2, memory: 2Gi}, limits: {cpu: 2, memory: 2Gi} plus sidecars with requests: {cpu: 500m, memory: 500Mi}, limits: {cpu: 1, memory: 1Gi} yields a Burstable pod overall. Under node memory pressure, Kubernetes evicts BestEffort pods first, then Burstable pods exceeding their requests, keeping Guaranteed pods longest. Note that a container exceeding its own memory limit is OOMKilled regardless of QoS class.

Solution 3 - Graceful shutdown handler: Add preStop hook to app to save state before crash: lifecycle: {preStop: {exec: {command: ["sh", "-c", "sleep 5; curl http://localhost:8080/shutdown"]}}}. This gives app time to signal sidecars before shutdown. Sidecars listen and flush gracefully. But doesn't help if kernel OOM killer fires (preStop not called on SIGKILL).

Solution 4 - Separate monitoring sidecar from app: Use a separate pod for monitoring (Prometheus scraper) instead of sidecar in same pod. If app pod OOMKilled, monitoring pod continues scraping metrics from other instances. No dependency on app pod's grace period. Downside: less tight coupling, higher overhead.

Solution 5 - Enable swap (careful): The kubelet requires swap to be disabled by default (--fail-swap-on); swap for workloads is only available via the NodeSwap feature in recent releases. Even where enabled, swap means disk I/O and latency spikes. Not recommended for production.

Best practice: Set memory requests equal to limits for app container (Guaranteed QoS). This reserves exact memory upfront, avoiding overcommit. Sidecars use lower limits (Burstable). Use vertical pod autoscaling (VPA) to suggest resource values based on actual usage: kubectl apply -f vpa-recommendation.yaml. Monitor node memory: kubectl describe node | grep -A 10 "Allocated resources". Alert if >85% allocated.

Follow-up: You've set resource requests correctly. Pod still OOMKilled during traffic spike (actual memory usage exceeds limit). You can't remove limits (limits prevent noisy neighbors). How would you redesign to handle bursting?

Your pod has an init container that downloads a 500MB ML model file before the app starts. Init container completes, model file is cached in emptyDir. But when the app container restarts (due to crash), the model file is deleted and re-downloaded (5 minute delay). You need to preserve the model across app restarts within the same pod.

emptyDir is pod-scoped: when pod is deleted, emptyDir is deleted. But when a single container within the pod crashes and restarts, emptyDir persists (shared with other containers). So model file should survive app container restart.

Diagnosis: Check whether the file is actually preserved. After the app crashes: kubectl get pod pod-name and note the restart count (it should increment). Check whether the model file exists: kubectl exec pod-name -c app -- ls -la /ml-models/. If the file is missing, one of two things happened: (1) the emptyDir was deleted because the entire pod was recreated (not just the container), or (2) the file was deleted intentionally (app cleanup).

Verify pod stability: kubectl get pods pod-name -o yaml | grep restartPolicy. Default is Always (restart crashed container). Check pod events: kubectl describe pod pod-name | grep -A 5 Events. Look for "pod evicted", "node deleted", "kubelet restart"—these indicate pod was deleted and recreated (emptyDir lost). If events show "container restarted", pod persisted (emptyDir should persist).

Solution 1 - Verify emptyDir usage: Explicitly mount emptyDir for model: volumes: [{name: "models", emptyDir: {sizeLimit: "1Gi"}}], initContainers: [{name: "download-model", volumeMounts: [{name: "models", mountPath: "/ml-models"}]}], containers: [{name: "app", volumeMounts: [{name: "models", mountPath: "/ml-models"}]}]. Both init and app mount same volume. If app crashes and restarts, models volume is preserved within the pod.

Solution 2 - Cache on host node: Use hostPath volume to cache on node disk: volumes: [{name: "models", hostPath: {path: "/var/cache/ml-models", type: "DirectoryOrCreate"}}]. Model is cached on node, shared across all pods on that node (if they use hostPath). Next pod on same node reuses cache. Downside: requires node disk space, pods migrate to different node = re-download.

Solution 3 - ConfigMap/Secret for the model: Only viable for tiny models — ConfigMaps are capped at 1MiB because they live in etcd. kubectl create configmap ml-model --from-file=model.bin, then mount: volumes: [{name: "model", configMap: {name: "ml-model"}}]. The data is replicated with etcd and survives pod and node loss, but a 500MB model is far beyond the limit, so this doesn't apply here.

Solution 4 - Init container with init check: Modify init to detect if model already exists: initContainers: [{name: "download-model", command: ["sh", "-c", "if [ ! -f /ml-models/model.bin ]; then curl ... > /ml-models/model.bin; fi"]}]. If file exists (from previous container in pod), skip download. App restarts quickly (model already in emptyDir).
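Solutions 1 and 4 combined into one manifest sketch (the model URL, images, and paths are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-app
spec:
  volumes:
    - name: models
      emptyDir:
        sizeLimit: 1Gi
  initContainers:
    - name: download-model
      image: curlimages/curl:8.7.1
      volumeMounts:
        - name: models
          mountPath: /ml-models
      command:
        - sh
        - -c
        - |
          # Skip the 500MB download if a previous run in this pod
          # already fetched the model into the shared emptyDir.
          if [ ! -f /ml-models/model.bin ]; then
            curl -fsSL -o /ml-models/model.bin https://models.example.com/model.bin
          fi
  containers:
    - name: app
      image: myapp:1.0        # hypothetical app image
      volumeMounts:
        - name: models
          mountPath: /ml-models
```

Note that init containers only re-run when the pod sandbox is recreated, not when the app container alone restarts; the existence check covers that re-run case cheaply.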

Recommended: Use Solution 1 (explicit emptyDir) + Solution 4 (skip re-download if exists). This ensures: (1) model persists across container restarts within same pod, (2) download is skipped if file already present, (3) pod is resilient to crashes.

Verification: Test by killing app container: kubectl delete pod pod-name --grace-period=0 --force (kill pod entirely) vs kubectl exec pod-name -c app -- kill -9 1 (kill app process, container restarts). In the second case, check that model file remains and app startup is fast.

Follow-up: The model file persists across container restarts now. But whenever the pod sandbox is recreated (node reboot, container runtime restart), the init container re-runs and disk I/O spikes. How would you optimize to skip the init work entirely when the model already exists?

Your pod spec defines init container order: init1 -> init2 -> init3. init1 and init2 are quick (200ms), but init3 is slow (10 seconds). You need to start the app after init1 and init2 complete, without waiting for init3. Init containers run sequentially by default. How do you parallelize them?

Kubernetes init containers are strictly sequential: init1 must complete before init2 starts. If you want parallel execution, you must work around the design.

Solution 1 - Move slow init to background sidecar: Instead of init3 running before app, move it to a sidecar that runs parallel with app: initContainers: [{name: "init1"}, {name: "init2"}], containers: [{name: "app"}, {name: "init3-sidecar"}]. init1 and init2 run sequentially, complete. Then app and init3-sidecar start in parallel. App can start using results from init1/init2, while init3-sidecar does background work.

Solution 2 - Combine multiple inits into one: Merge init1, init2, init3 into a single init container that parallelizes internally: initContainers: [{name: "setup", command: ["sh", "-c", "init1.sh & init2.sh & init3.sh & wait"]}]. The scripts run as background jobs of the same shell, and wait blocks until all of them finish. (Don't wrap each in a (…&) subshell as is sometimes shown — that detaches the job from the shell, and wait returns immediately.) From Kubernetes' perspective it's one init container, internally parallelized. Also note plain POSIX wait exits 0 even if a job failed; collect the PIDs and wait on each if a failure must abort the pod.
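Solution 2 as an init container (script paths and image are hypothetical; the jobs run in the current shell, not subshells, so wait actually blocks on them):

```yaml
initContainers:
  - name: setup
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # Run all three in parallel as jobs of this shell, collecting
        # PIDs so a failure in any job fails the init container.
        /scripts/init1.sh & p1=$!
        /scripts/init2.sh & p2=$!
        /scripts/init3.sh & p3=$!
        for p in "$p1" "$p2" "$p3"; do
          wait "$p" || exit 1
        done
```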

Solution 3 - Don't fork-and-exit from an init container: It's tempting to have the init container start init3 in the background and exit immediately: initContainers: [{name: "setup", command: ["sh", "-c", "init1.sh && init2.sh && nohup init3.sh > /var/log/init3.log 2>&1 &"]}]. This doesn't work: when an init container's main process exits, the container is torn down and every process in it — nohup or not — is killed, so init3 dies instantly. Background work that must outlive init belongs in a sidecar (Solution 1).

Solution 4 - Use a Job for slow init: Instead of an init container, create a separate Job for the slow setup: kubectl create job setup-job --image=myimage -- init3.sh. The main pod gates on Job completion via an init container that polls: initContainers: [{name: "wait-for-job", command: ["sh", "-c", "until kubectl get job setup-job -o jsonpath={.status.succeeded} | grep 1; do sleep 1; done"]}]. This requires kubectl in the init image and a ServiceAccount with RBAC permission to read Jobs. The Job runs independently, in parallel with pod startup.

Solution 5 - App with graceful degradation: App starts immediately after init1/init2, but doesn't use init3 results until available. Use readiness probe: readinessProbe: {exec: {command: ["sh", "-c", "test -f /var/init3-done"]}}. init3 sidecar creates /var/init3-done when complete. Until then, pod is NotReady (traffic not routed). When init3-done appears, app becomes Ready. App can still run, but won't receive traffic until fully initialized.

Best practice: Combine Solution 2 (one init container, internally parallel) with a startup probe. Have each init script write a done-flag to a volume shared between the init and app containers, and probe for the flags on the app: startupProbe: {exec: {command: ["sh", "-c", "test -f /var/init1-done && test -f /var/init2-done && test -f /var/init3-done"]}, failureThreshold: 60, periodSeconds: 1}. The app gets up to 60 seconds for all flags to appear before the kubelet starts counting probe failures.

Follow-up: You've parallelized inits (init1, init2 in parallel, init3 as background). But init1 needs to complete before init2 starts (dependency). How would you maintain ordering while still parallelizing init3?

Your pod has an init container that runs as root (to set up network namespace, mount volumes). After init completes, the app container runs as non-root (unprivileged user). During pod shutdown, you need the init-level cleanup (teardown network) but the app container is non-root and can't do it. How do you handle privileged cleanup on shutdown?

Pod lifecycle: Init containers run (usually as root). App containers run (usually as unprivileged). On shutdown, kubelet sends SIGTERM to all containers, waits grace period, then SIGKILL. There's no automatic cleanup step for init containers on shutdown.

Problem: If init container set up privileged resources (iptables rules, mount points), they need to be cleaned up on shutdown. App container is unprivileged and can't undo root-level changes. If not cleaned up, node resources leak (zombie mounts, iptables rules accumulate).

Solution 1 - Use a privileged sidecar for cleanup: Run a privileged sidecar alongside the app that handles teardown: containers: [{name: "app", securityContext: {runAsNonRoot: true}}, {name: "privileged-cleanup", securityContext: {privileged: true}, command: ["sh", "-c", "trap 'cleanup.sh' TERM; sleep infinity & wait"]}]. Two shell details matter: the trap must be installed before sleeping, and sleep must run in the background with the shell blocked in wait — a foreground sleep would keep the trap from firing until the sleep ended. On pod shutdown the kubelet sends SIGTERM, the trap runs cleanup.sh as root, and the container exits. The app container stays unprivileged.

Solution 2 - Init container does both setup and cleanup: This doesn't really work. An init container must exit for the pod to start, so it cannot stay alive to catch shutdown signals — trap 'cleanup.sh' EXIT; sleep infinity just blocks the app from ever starting. Forking setup into the background and exiting ((setsid setup.sh &) && exit 0) fails too: when the init container's main process exits, the container is torn down and its descendants are killed with it. Keep long-lived signal handling in a sidecar.

Solution 3 - Dedicated cleanup pod per node (DaemonSet): Instead of per-pod cleanup, run a DaemonSet cleanup agent on each node. Agent watches for pod shutdown events and performs cluster-level cleanup: kubectl apply -f cleanup-daemonset.yaml. DaemonSet calls cleanup scripts when pods are deleted. Simpler than per-pod cleanup, but less precise.

Solution 4 - App container with capability drop, not full unprivileged: Instead of full root, use securityContext: {runAsUser: 1000, capabilities: {drop: ["ALL"], add: ["NET_ADMIN"]}}. The app runs as a non-root user but can perform network cleanup (iptables, routing) via NET_ADMIN. Caveat: for a non-root UID the added capability lands only in the bounding set; the binary also needs file capabilities (e.g. setcap cap_net_admin+ep) for it to be effective. Setup is still done by init as root.

Solution 5 - Use systemd service management: On each node, register cleanup service with systemd. When pod shutdown event is detected (via kubelet event stream), systemd service runs cleanup: systemctl start cleanup@pod-name.service. Cleanup service (runs as root) performs teardown. This integrates with OS-level process management.

Recommended: Use Solution 1 (privileged sidecar). The app container is unprivileged (safe); the sidecar handles privileged operations (cleanup on shutdown). The sidecar runs minimal code (just a trap handler), so resource overhead is low. Example: containers: [{name: "app", securityContext: {runAsNonRoot: true}}, {name: "init-cleanup", securityContext: {privileged: true}, command: ["sh", "-c", "echo 'Cleanup handler ready'; trap 'iptables -t nat -F; umount /custom-mount' TERM; sleep infinity & wait"]}].
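The recommended sidecar as a manifest, with the trap installed before the sleep and the sleep backgrounded so SIGTERM interrupts wait promptly (images are illustrative — the cleanup image must ship iptables; the mount path is a placeholder):

```yaml
containers:
  - name: app
    image: myapp:1.0                 # hypothetical app image
    securityContext:
      runAsNonRoot: true
  - name: privileged-cleanup
    image: alpine:3.19               # needs iptables installed
    securityContext:
      privileged: true               # required for iptables/umount teardown
    command:
      - sh
      - -c
      - |
        # Install the trap first, then block in `wait` so the TERM
        # handler runs immediately on pod shutdown.
        trap 'iptables -t nat -F; umount /custom-mount 2>/dev/null; exit 0' TERM
        sleep infinity &
        wait
```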

Follow-up: Cleanup sidecar's trap handler runs, but iptables commands fail (rule already deleted by another component). How would you make cleanup idempotent and handle partial cleanup failures?

Your pod uses an init container to pull large binary artifacts (3GB total) from a corporate artifact repository over a VPN connection. Pull takes 5 minutes on first run but should use cache on subsequent pod restarts. However, you can't use emptyDir (pod scheduled on different nodes). How would you cache artifacts across pod boundaries on the same node?

Pod restarts within the same node should reuse artifacts from previous pod (if cached on node disk). emptyDir is pod-scoped, so moving to different node loses cache. Solution: use hostPath or node-level caching.

Solution 1 - hostPath volume on designated cache directory: Mount a hostPath that points to node-level cache: volumes: [{name: "artifacts", hostPath: {path: "/var/cache/artifacts", type: "DirectoryOrCreate"}}], initContainers: [{name: "fetch-artifacts", volumeMounts: [{name: "artifacts", mountPath: "/cache"}], command: ["sh", "-c", "if [ -f /cache/artifacts.tar.gz ]; then tar -xzf /cache/artifacts.tar.gz -C /app; else curl ... > /cache/artifacts.tar.gz && tar -xzf /cache/artifacts.tar.gz -C /app; fi"]}]. Every pod on the same node shares /var/cache/artifacts. Cache persists across pod restarts (not deleted when pod terminates).
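Solution 1 as a manifest sketch (the repository URL, images, and paths are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-cached-artifacts
spec:
  volumes:
    - name: artifacts
      hostPath:
        path: /var/cache/artifacts
        type: DirectoryOrCreate      # created on first use, shared by all pods on the node
    - name: app-files
      emptyDir: {}
  initContainers:
    - name: fetch-artifacts
      image: curlimages/curl:8.7.1
      volumeMounts:
        - name: artifacts
          mountPath: /cache
        - name: app-files
          mountPath: /app
      command:
        - sh
        - -c
        - |
          # Download only on cache miss; always unpack into the pod-local volume.
          if [ ! -f /cache/artifacts.tar.gz ]; then
            curl -fsSL -o /cache/artifacts.tar.gz https://artifacts.corp.example/bundle.tar.gz
          fi
          tar -xzf /cache/artifacts.tar.gz -C /app
  containers:
    - name: app
      image: myapp:1.0               # hypothetical app image
      volumeMounts:
        - name: app-files
          mountPath: /app
```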

Solution 2 - Dedicated cache DaemonSet: Run a privileged DaemonSet pod on each node that maintains local cache and serves via Unix socket or HTTP. Main pod's init container connects to the cache daemon: init container: curl http://cache-daemon:8080/fetch?artifact=app-binaries. Cache daemon pre-pulls large artifacts in background (during off-peak). Pods fetch from cache daemon (local network, fast). Cache daemon handles deduplication and cleanup.

Solution 3 - Kubelet image cache integration: Publish the artifacts as an OCI image so the kubelet's image layer cache does the work: docker build -f Dockerfile.artifacts -t corp-registry/artifacts:3.0 . && docker push corp-registry/artifacts:3.0. Then use that image as the init container and copy the files into a shared volume: initContainers: [{name: "unpack", image: "corp-registry/artifacts:3.0", command: ["cp", "-r", "/artifacts/.", "/cache/"], volumeMounts: [{name: "cache", mountPath: "/cache"}]}]. The kubelet caches image layers on the node, so subsequent pods on the same node skip the download entirely.

Solution 4 - Local artifact registry on each node: Run a local Docker registry mirror on each node (via DaemonSet). Registry pulls artifacts once, caches locally. All pods on node fetch from local registry (instant, local network). Example: kubectl apply -f registry-cache-daemonset.yaml deploys registry on each node. Configure containerd to use node-local registry as pull-through cache.

Gotcha: hostPath cache persists after pod deletion and consumes node disk space. Implement cleanup — e.g. a DaemonSet cron job that removes artifacts not accessed in 7 days: find /var/cache/artifacts -atime +7 -delete. Or size-based eviction: if [ $(du -sb /var/cache/artifacts | cut -f1) -gt 50000000000 ]; then find ... -delete; fi (evict when the cache exceeds ~50GB).

Recommended: Use Solution 1 (hostPath) for simplicity. Add init container check to skip download if artifact exists. For production: add DaemonSet cleanup job to manage disk space. Monitor cache hit rate: prometheus metric: init_container_artifacts_cache_hit_rate. If <90%, cache may be too small or cleanup too aggressive.

Follow-up: Node local cache works, but pod is scheduled on different node after cluster scale-down. You lose cache and fetch takes 5 minutes again. How would you handle cache locality in your pod affinity or proactively pre-warm caches?
