Your Kubernetes orchestrator is restarting a container every 90 seconds. The app itself works fine, but the health check endpoint is returning 200 OK inconsistently. You trace it: during high load, the health check endpoint gets queued behind slow requests and times out. Meanwhile, Kubernetes marks it as unhealthy and restarts the container. The restarts cause cascading failures. How do you fix the health check?
Health check endpoints must be fast and independent of your main workload—they're a separate concern. Issues: (1) The health check shares a thread/worker pool with business logic, so it blocks under load. (2) The health check does expensive operations (database queries, disk checks). (3) The timeout is too short. Fix: (1) Implement a dedicated, lightweight health endpoint with isolated resources. In Node.js: app.get('/health', (req, res) => res.json({status: 'ok'})) without any DB queries. (2) Set the health check timeout to at least 3-5 seconds (not 1 second). In Kubernetes: livenessProbe: httpGet: path: /health, port: 8080, initialDelaySeconds: 30, timeoutSeconds: 5, periodSeconds: 10, failureThreshold: 3. (3) For deeper checks, use a separate startup probe that validates DB connections: a startupProbe runs only during startup and stops once it succeeds. (4) Use readiness probes (return 503 during shutdown) separate from liveness probes. This decouples health checks from business traffic and gives the orchestrator accurate signals about container health.
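The probe settings above, written out as a manifest fragment (path and port taken from the text; adjust to your service):

```yaml
# Liveness probe for a lightweight /health endpoint (no DB queries).
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # let the app boot before the first check
  timeoutSeconds: 5         # tolerate brief scheduling delays under load
  periodSeconds: 10
  failureThreshold: 3       # 3 consecutive failures before a restart
```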
Follow-up: What's the difference between liveness, readiness, and startup probes? When should you implement each?
Your health check returns 200 OK, but the service is actually degraded: Redis cache is down (but not required for startup), and requests are slow. The orchestrator doesn't know about this degradation and keeps the container in service. Meanwhile, traffic piles up and clients experience timeouts. Design a multi-level health check that reports degradation.
HTTP 200 is binary—it doesn't express degradation, and orchestrators such as Kubernetes act only on the status code, not the response body. A JSON payload like {status: 'degraded', reason: 'redis_timeout', severity: 'medium'} is useful for humans and monitoring tools, but to make degradation actionable you must encode it in the status code and the endpoint split: (1) Use HTTP status codes precisely: 200 = fully healthy, 503 = unhealthy (orchestrator removes the pod from the load balancer), 200 + a degraded flag in the body for informational purposes. (2) Implement separate endpoints: /health/live (liveness—am I alive?) vs /health/ready (readiness—can I serve traffic?). Return 503 from /health/ready if Redis is down; return 200 from /health/live as long as the app process is running. (3) Configure the orchestrator's readiness probe to respect the 503: if /health/ready returns 503, the pod is removed from the load balancer but not restarted, which gives you time to debug. (4) Emit metrics separately: a counter for degradation events, a gauge per dependency's health. In Node.js: app.get('/health/ready', (req, res) => { if (!redis.connected) return res.status(503).json({status: 'degraded'}); res.json({status: 'ready'}); }). This allows graceful degradation without container restarts.
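A minimal sketch of the liveness/readiness split, written as plain functions so the HTTP wiring (e.g. Express) stays out of the way; `redisConnected` is an assumed flag maintained elsewhere by your Redis client's connect/error handlers:

```javascript
// Liveness: am I alive? Always 200 while the process runs.
function livenessResponse() {
  return { code: 200, body: { status: 'alive' } };
}

// Readiness: can I serve traffic? A 503 removes the pod from the load
// balancer without triggering a restart.
function readinessResponse(redisConnected) {
  if (!redisConnected) {
    return {
      code: 503,
      body: { status: 'degraded', reason: 'redis_timeout', severity: 'medium' },
    };
  }
  return { code: 200, body: { status: 'ready' } };
}

// Express wiring would look like:
//   app.get('/health/live', (req, res) => {
//     const r = livenessResponse(); res.status(r.code).json(r.body);
//   });
//   app.get('/health/ready', (req, res) => {
//     const r = readinessResponse(redis.connected); res.status(r.code).json(r.body);
//   });
```

Keeping the decision logic in pure functions also makes the degradation behavior trivially unit-testable, independent of the web framework.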
Follow-up: How do you coordinate health checks across multiple microservices? If service A depends on B, should A's health check fail if B is down?
You deploy a new version with a flawed health check: it queries the database on every call. Your database is now getting 100 requests/second from health checks alone (10 containers × 10 checks/sec per container). The DB becomes saturated and your actual traffic fails. The health check is causing the outage. How do you prevent this?
Health checks that query databases are expensive at scale. Prevention: (1) Use an in-memory cache for health status: a background task checks DB connectivity every second (or on events) and stores the result in memory; the health endpoint reads the cached value and returns immediately, without querying the DB. (2) If you must query the DB from the check, use a separate, isolated connection pool: max 1-2 connections for health, 20+ for business logic. (3) Cache the result with a TTL: if the cached result is fresher than 5 seconds, return it; otherwise refresh. (4) Monitor health check latency and rate: emit a metric for every health check, and alert if any takes > 100ms or the check rate spikes. (5) Don't reject probes on the app side when overloaded (a rejected probe looks like a failure to the orchestrator and triggers restarts); instead serve every probe from the cache and lower the configured frequency. In Kubernetes, set periodSeconds: 30 (health checks every 30s, not 10s). The key principle: health checks are purely informational—don't do work in them. Delegate that work to separate background tasks.
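The TTL cache in (3) can be sketched as a read-through cache; `checkDb` is an assumed function returning true/false that you would swap for a real connectivity check:

```javascript
// TTL read-through health cache: the DB is probed at most once per TTL
// window, no matter how many health checks arrive.
const TTL_MS = 5000;
let cached = { status: 'unknown', at: -Infinity };

function healthHandler(checkDb, now = Date.now()) {
  if (now - cached.at > TTL_MS) {
    // Only this path touches the database.
    cached = { status: checkDb() ? 'ok' : 'error', at: now };
  }
  return { code: cached.status === 'ok' ? 200 : 503, body: cached };
}
```

With 10 containers probed every second, this caps DB load from health checks at 10 containers / TTL rather than 100 queries per second.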
Follow-up: How do you test health check behavior under load? What's a realistic simulation?
A container's health check is timing out inconsistently. Sometimes it passes, sometimes it fails, even though the app is stable. You trace the issue: the health check is making a DNS lookup for a service that's sometimes slow. Under high DNS load, the lookup takes 3 seconds, exceeding your 2-second timeout. The container gets restarted unnecessarily. Fix the timeout and DNS issue.
DNS lookups can be unpredictable in containerized environments, especially when the DNS resolver is overloaded. Issues: (1) Health checks shouldn't depend on external services (including DNS). (2) Timeouts are too tight for operations with variable latency. Fix: (1) Remove DNS lookups from health checks: resolve service names during app startup or cache them, and store the IP in memory. (2) Increase the health check timeout: use at least 5 seconds if any I/O (network calls) is involved. (3) Add DNS caching at the OS level in the container: set up nscd (name service cache daemon) or configure the container runtime's internal DNS resolver with caching. (4) Use service discovery that doesn't rely on DNS resolution on every request (e.g., service mesh sidecars that cache endpoints). Implementation: in your health endpoint, avoid any DNS queries. If you must connect to another service, resolve it once on startup and cache the address (socket.getaddrinfo('service', 80) in Python, dns.lookup in Node.js), then use the cached IP in the health check. Set the health check timeout to 5-10 seconds and increase failureThreshold to 5 to tolerate occasional blips. This keeps health checks deterministic and fast.
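One way to keep DNS off the health path is to resolve once and memoize. Sketched here with an injectable lookup function (in real code, Node's `dns.lookup`) so the cache logic is testable; the hostname and address below are placeholders:

```javascript
// Memoizing resolver: lookupFn (e.g. require('dns').lookup) is called
// once per hostname; later calls hit the in-memory cache, so the health
// check never waits on DNS.
function makeCachedResolver(lookupFn) {
  const cache = new Map();
  return function resolve(hostname, cb) {
    if (cache.has(hostname)) return cb(null, cache.get(hostname));
    lookupFn(hostname, (err, address) => {
      if (!err) cache.set(hostname, address);
      cb(err, address);
    });
  };
}

// At startup (hypothetical service name):
//   const resolve = makeCachedResolver(require('dns').lookup);
//   resolve('service.internal', (err, ip) => { /* store ip for health checks */ });
```

A production version would also expire entries or re-resolve on connection errors, so a service that moves IPs doesn't strand the cache.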
Follow-up: How does Kubernetes' service DNS work internally? Why is it eventually consistent?
Your app container starts but the database migration takes 45 seconds to complete. Your health check runs immediately (initialDelaySeconds: 5) and fails because the database schema doesn't exist yet. Kubernetes thinks the container is dead and kills it mid-migration. Design a startup sequence that prevents this.
Startup operations that take time (migrations, cache warming, initialization) must complete before health checks run. Solutions: (1) Use Kubernetes startup probes, which don't fail the container even if they fail repeatedly—they just keep retrying until success. Set a startupProbe with a long failureThreshold (e.g., 30 retries × 10 second period = 5 minutes). The container won't be restarted during this window. (2) In your app, implement a startup phase: on boot, run migrations and cache warming to completion, then mark the app as ready. Only then allow health checks to pass. (3) If you can't use startup probes, set initialDelaySeconds conservatively: initialDelaySeconds: 60 (wait 60 seconds before the first health check). (4) Run migrations before starting the app from the container's ENTRYPOINT script, not a RUN instruction (RUN executes at image build time, not at container start): in entrypoint.sh, ./migrate.sh && exec ./app.sh. Configure probes: startupProbe: httpGet: path: /health, periodSeconds: 10, failureThreshold: 30 (allows 5 minutes). livenessProbe: httpGet: path: /health, periodSeconds: 10. Kubernetes disables liveness and readiness probes until the startup probe succeeds, so the liveness probe doesn't need its own long initialDelaySeconds. This ensures migrations complete before the container is considered alive, and Kubernetes won't interrupt startup.
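The probe settings described above as a manifest fragment (the port is assumed to be 8080; the liveness probe only begins once the startup probe has passed):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # 30 × 10s = up to 5 minutes for migrations
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10       # held off until the startup probe succeeds
```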
Follow-up: What's the difference between startupProbe and initialDelaySeconds? When should you use each?
You have a gRPC service running in a container. Docker's built-in health check mechanism doesn't support gRPC probes natively—HEALTHCHECK just runs a command, and the easy fallbacks are HTTP and TCP checks. A plain TCP port check tells you nothing useful: it passes as soon as the port is open, regardless of whether the service answers gRPC health checks, and an HTTP check against the gRPC port fails outright, so the container gets marked unhealthy even though the service is fine. How do you implement gRPC health checks?
Docker's native health checks don't understand gRPC's Health Checking Protocol. Solutions: (1) Implement the gRPC Health Checking Protocol (https://github.com/grpc/grpc/blob/master/doc/health-checking.md): the service implements the grpc.health.v1.Health service, which responds to Check and Watch RPCs. Most gRPC libraries (Go, Python, Node.js) ship health check implementations. (2) Expose a separate HTTP /health endpoint on a different port that mirrors the gRPC health status; Docker checks the HTTP endpoint. (3) Use a CLI probe in HEALTHCHECK, such as the purpose-built grpc_health_probe binary or grpcurl: HEALTHCHECK CMD grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check. (4) In Kubernetes, use an exec probe with the same command: exec: command: [grpcurl, -plaintext, localhost:50051, grpc.health.v1.Health/Check]; on Kubernetes 1.24+ you can instead use a native gRPC probe (livenessProbe: grpc: port: 50051). In Go, register the built-in health server from google.golang.org/grpc/health; in Python, use the grpcio-health-checking package. This decouples the health check implementation from the gRPC service logic and lets Docker/Kubernetes accurately assess service health.
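For option (3), a Dockerfile sketch using grpc_health_probe (this assumes the binary has been added to the image during the build, e.g. downloaded from the grpc-ecosystem/grpc-health-probe releases; the port is a placeholder):

```dockerfile
# grpc_health_probe must already exist in the image at /bin.
HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD grpc_health_probe -addr=localhost:50051 || exit 1
```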
Follow-up: How does the gRPC Health Check Protocol handle streaming services? What about asynchronous health updates?
You implement a health check that returns 200 but does heavy validation (checks 50 database tables, validates cache consistency, scans disk). The check takes 20 seconds. Because it's expensive, you run it only every 60 seconds. But between checks, the system degrades (cache becomes invalid, tables get corrupted) and users get errors. Your infrequent health check misses the degradation. How do you balance cost and accuracy?
Expensive health checks create a dilemma: frequent checks hammer resources; infrequent checks miss degradation. Solution: (1) Separate concerns: implement a fast liveness check (proves the app process is alive) and a slower but less frequent deep check (validates system state). Example: /health/live (fast, runs every 5 seconds) vs /health/deep (expensive, runs every 60 seconds). (2) Replace expensive synchronous checks with continuous async health monitoring. Instead of checking tables in the health endpoint, run a background task that continuously validates table integrity and updates a flag. The health endpoint just reads this flag. (3) Use synthetic transaction monitoring outside the container: send test requests to your app periodically and measure latency/success. This replaces the need for deep health checks inside the container. (4) Implement circuit breakers for resource consumption: if health check takes > 1 second, timeout and fail-open (assume healthy rather than blocking). (5) Use metrics and alerting to catch degradation early, separate from health probes. In Node.js: fast health endpoint responds in <10ms. Async background task (runs every 30s) validates cache; if invalid, emit metric and update status_flag. Health endpoint returns status_flag immediately. This keeps probes fast while catching degradation via async monitoring.
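The split in points (1)-(2) can be sketched as follows: the deep validation runs on a timer and only updates a flag, so both endpoints return instantly. `validateCache` is an assumed expensive check returning `{ ok, reason }`:

```javascript
// Shared flag written only by the background task, never by a probe.
const deepStatus = { healthy: true, reason: null };

function runDeepCheck(validateCache) {
  // Scheduled via setInterval(() => runDeepCheck(myValidator), 30000);
  // the expensive work happens here, off the request path.
  const result = validateCache();
  deepStatus.healthy = result.ok;
  deepStatus.reason = result.ok ? null : result.reason;
}

// /health/live: fast, proves the process and event loop are responsive.
function liveHandler() {
  return { code: 200, body: { status: 'alive' } };
}

// /health/deep: just reads the flag, so it responds in well under 10ms
// even while the background validation is mid-scan.
function deepHandler() {
  return deepStatus.healthy
    ? { code: 200, body: { status: 'healthy' } }
    : { code: 200, body: { status: 'degraded', reason: deepStatus.reason } };
}
```

Degradation detected by the background task surfaces on the next /health/deep read (and via metrics), while the probes themselves never block on the 20-second validation.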
Follow-up: How do you implement synthetic monitoring for a service that has expensive checks? What's a realistic approach?
Your container health check is working, but you notice that during routine maintenance (kernel patching, host reboots), all containers restart unnecessarily because their health checks fail during the transition. The health check has no tolerance for brief transient failures. You want to improve resilience to node maintenance. Design a robust health check configuration.
Health checks are too sensitive during node transitions. Tuning: (1) Increase failureThreshold to tolerate temporary failures: instead of restarting on the first failure, allow 3-5 consecutive failures. In Kubernetes: livenessProbe: failureThreshold: 5. (2) Increase periodSeconds to reduce check frequency during transitions: periodSeconds: 15-30 (not 10) reduces churn during temporary network hiccups. (3) For readiness probes, set successThreshold > 1 to require multiple successes before marking the pod ready again (Kubernetes requires successThreshold: 1 for liveness and startup probes). (4) If you run your own checker (a sidecar or a Docker HEALTHCHECK script), back off between retries after a failure rather than immediately retrying harder. (5) Use PodDisruptionBudgets in Kubernetes to prevent too many pods from being disrupted simultaneously during maintenance; a PDB ensures at least N replicas stay running. (6) Emit metrics: track the health check failure rate; if it spikes during maintenance windows, increase thresholds. In Kubernetes: livenessProbe: httpGet: path: /health, port: 8080, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 5, failureThreshold: 5. This configuration tolerates brief failures and network transitions, reducing unnecessary restarts during infrastructure maintenance.
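The tolerant probe plus a PodDisruptionBudget, as manifest fragments (the app label, PDB name, and replica floor are placeholders):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 5        # ~2.5 minutes of failures before a restart
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb            # placeholder name
spec:
  minAvailable: 2            # keep at least 2 replicas up during drains
  selector:
    matchLabels:
      app: myapp             # placeholder label
```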
Follow-up: How do you coordinate health checks with planned maintenance windows? Should health checks be aware of maintenance mode?