Batch pipeline passes 1GB messages between 8 workers via `multiprocessing.Queue`. IPC is 40% of runtime. How do you optimize without redesigning?
Queues pickle every message on `put()` and unpickle on `get()`, which is slow for large payloads. Solutions: (1) shared memory: `multiprocessing.shared_memory.SharedMemory` lets multiple processes access the same buffer without copying; create the block once, pass only its name through the queue. (2) `multiprocessing.Manager()` for shared dict/list: note this is slower than queues, not faster, because every access is a round-trip to the manager server. (3) optimize serialization: pickle protocol 5 with out-of-band buffers avoids copies for bytes-like objects; `msgpack`/`protobuf` help for structured data (`cloudpickle` adds generality for closures, not speed). (4) batch messages: pack 100 items into one message to amortize per-message overhead. (5) memory-map files: `mmap` a file, pass the filename via the queue, workers map the same file—no copying. Measure: profile pickle time with cProfile, compare formats with timeit. At 1GB per message, serialization plus the queue's pipe copy can dominate runtime. For batch pipelines: shared memory or mmap is the proven pattern.
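A minimal sketch of option (1): only the `SharedMemory` name travels through the queue, and the payload (shrunk to a few bytes here for illustration) is read in place. The worker function is module-level so this also works under the spawn start method.

```python
from multiprocessing import Process, Queue, shared_memory

def consumer(inbox, outbox):
    # The queue carries only the small handle (the block's name);
    # attaching to the block itself involves no copy and no pickling.
    name = inbox.get()
    shm = shared_memory.SharedMemory(name=name)
    outbox.put(bytes(shm.buf[:5]))   # read a slice directly from shared memory
    shm.close()                      # detach; only the creator unlinks

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b"hello"           # stands in for the 1GB payload
    inbox, outbox = Queue(), Queue()
    p = Process(target=consumer, args=(inbox, outbox))
    p.start()
    inbox.put(shm.name)              # tiny message instead of the payload
    print(outbox.get())              # b'hello'
    p.join()
    shm.close()
    shm.unlink()                     # creator frees the segment
```

The creator is responsible for `unlink()`; every other process just `close()`s its attachment.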
Follow-up: How do you coordinate cleanup of shared memory across processes without leaks?
Using `multiprocessing.Pool` with 8 workers. After running 1000 jobs, pool size mysteriously grows to 20+ workers despite pool(processes=8). Workers accumulate, throughput drops. What's wrong?
Pool reuses its fixed set of workers, so a growing process count means workers are dying and being replaced (replacements plus unreaped zombies inflate the count), or extra pools are being created without being closed. Likely causes: (1) exceptions or crashes in workers: an unhandled exception or a segfault in a C extension kills the worker and the pool spawns a replacement. Add try/except in the worker function and log all errors—exceptions that nobody `.get()`s are otherwise silent. (2) Pool context manager not used: creating a new Pool per batch without `close()`/`join()` leaks the old workers. Use: `with multiprocessing.Pool(8) as pool: pool.map(...)`. (3) `maxtasksperchild` set: workers are intentionally recycled after N tasks; the replacements are normal, but dead workers linger as zombies if results aren't collected. (4) hung workers: if a worker hangs, the job blocks forever unless you set a timeout on result retrieval: `result = async_result.get(timeout=60)`. Measure: `ps aux | grep python` to count actual processes, or inspect `len(pool._pool)` (private attribute, debugging only). For 1000 jobs with 8 workers: expect ~8 worker processes. If 20+, workers are leaking. Fix: catch exceptions in workers, use the context manager, set timeouts.
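A sketch of the three fixes together—exceptions returned as data so they can't silently kill a worker, the pool managed by a context manager, and per-result timeouts. `safe_worker` and its doubling task are placeholders for the real job:

```python
from multiprocessing import Pool

def safe_worker(job):
    # Return failures as data so they can't vanish just because
    # nobody called .get() on the corresponding result.
    try:
        return ("ok", job * 2)            # stand-in for the real task
    except Exception as exc:
        return ("error", repr(exc))

if __name__ == "__main__":
    with Pool(processes=8) as pool:       # context manager: no leaked pools
        pending = [pool.apply_async(safe_worker, (j,)) for j in range(100)]
        # A per-result timeout turns a hung worker into a visible error
        results = [r.get(timeout=60) for r in pending]
    failures = [msg for tag, msg in results if tag == "error"]
    print(len(results), len(failures))    # 100 0
```

If `r.get(timeout=60)` raises `multiprocessing.TimeoutError`, you know which job hung instead of watching the pool grow.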
Follow-up: How do you implement fault-tolerant worker pools that automatically restart dead workers?
Spawning 64 worker processes for CPU-bound tasks. After 1 hour, memory per process grows from 50MB to 300MB. Process count stays at 64. Is this a memory leak?
Memory growth without a leak is possible: (1) per-process caching: if each worker caches results, the cache grows. (2) garbage collection: memory is allocated but not freed until GC runs. (3) memory fragmentation: the allocator keeps freed memory, so RSS grows even after Python releases the objects. Solutions: (1) trace allocations: run `tracemalloc` in the worker to find the top allocators. (2) force GC: `gc.collect()` after each job batch to reclaim memory. (3) preallocate in the worker: reuse buffers instead of allocating per job. (4) recycle workers: `Pool(maxtasksperchild=N)` replaces each worker after N tasks, resetting its memory to baseline. (5) monitor: if growth plateaus at 300MB per process and stays there, that's steady state; if it keeps growing, it's a leak. Measure: `ps aux` before/after job batches, track RSS. For 64 * 300MB = ~19GB: check whether that's realistic for your task. Verify it's not intentional caching (by design, not a leak).
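Solution (4) is built into `Pool`. A hedged sketch—the allocation in `job` is a stand-in for real per-task memory growth:

```python
import os
from multiprocessing import Pool

def job(i):
    # Simulate per-task allocations; whatever a worker accumulates
    # (caches, fragmentation) dies with it when it is recycled.
    scratch = bytearray(1_000_000)
    return (os.getpid(), i)

if __name__ == "__main__":
    # Each worker handles at most 50 tasks, then is replaced by a fresh
    # process, resetting RSS to baseline at the cost of respawn overhead.
    with Pool(processes=4, maxtasksperchild=50) as pool:
        results = pool.map(job, range(500))
    print(len({pid for pid, _ in results}))   # > 4: workers were recycled
```

Tune `maxtasksperchild` so respawn overhead stays small relative to job runtime; with 50 tasks per life, one process spawn is amortized over 50 jobs.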
Follow-up: How do you implement periodic worker restart without losing in-flight jobs?
Using `multiprocessing.Queue` for inter-process communication. Occasional deadlock: both producer and consumer blocked, queue partially filled. Killing one process unblocks the other. What's the race condition?
Queue deadlocks usually involve: (1) the feeder-thread pitfall: a process that has `put()` items won't exit until its buffered items are flushed to the underlying pipe, so `join()`-ing a producer before draining the queue deadlocks (documented in the multiprocessing programming guidelines). (2) bidirectional communication: producer waits for a consumer ack while the consumer waits for producer data—circular wait. (3) queue full (`maxsize` reached): the producer blocks on `put()` while the consumer is blocked on something else. (4) lock contention: the queue's internal lock held across a blocking operation. Solutions: (1) one-way communication: producer only writes, consumer only reads, no acks; drain queues before joining. (2) use `timeout` on `get()`/`put()`: `queue.get(timeout=5)` raises `queue.Empty` after 5s instead of hanging indefinitely. (3) separate channels: one queue producer->consumer, another for acks consumer->producer. (4) use `multiprocessing.Pipe` (duplex by default) instead of a Queue for two-party bidirectional traffic. Measure: add logging around every get/put to detect blocks. If a timeout fires, investigate what consumer and producer were doing. Test: create the scenario that triggers the deadlock, verify the timeouts prevent the hang.
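A sketch of fixes (1)+(2): strictly one-way flow, a sentinel for shutdown instead of acks, and timeouts so neither side can block forever. The doubling work and the 5s timeout are arbitrary placeholders:

```python
import queue
from multiprocessing import Process, Queue

SENTINEL = None   # shutdown marker flows the same direction as the data

def consumer(inbox, outbox):
    while True:
        try:
            item = inbox.get(timeout=5)   # never block indefinitely
        except queue.Empty:
            continue                      # real code: log, maybe bail out
        if item is SENTINEL:
            break
        outbox.put(item * 2)              # results also flow one way only

if __name__ == "__main__":
    inbox, outbox = Queue(maxsize=100), Queue()
    p = Process(target=consumer, args=(inbox, outbox))
    p.start()
    for i in range(3):
        inbox.put(i, timeout=5)           # a full queue raises queue.Full
    inbox.put(SENTINEL)
    print([outbox.get(timeout=5) for _ in range(3)])   # [0, 2, 4]
    p.join()                              # safe: both queues were drained
```

Note the parent drains `outbox` before `join()`, avoiding the feeder-thread deadlock described above.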
Follow-up: How do you implement a producer-consumer pattern that's guaranteed deadlock-free?
Worker processes created via `multiprocessing.Process`. After spawning 1000 processes (batch job), OS reports "too many open files" error even though code doesn't open files explicitly. What's happening?
Each process consumes file descriptors: 3 for stdin/stdout/stderr plus internal pipes for IPC—and the parent keeps pipe FDs open for every child, so the parent itself hits the limit. 1000 processes * ~10 FDs ≈ 10k FDs against a typical OS limit of 1024-4096. Solutions: (1) limit process count: `min(os.cpu_count(), 32)` instead of 1000. (2) increase the OS limit: `ulimit -n 65536`, or in code: `import resource; resource.setrlimit(resource.RLIMIT_NOFILE, (65536, 65536))` (only up to the hard limit without privileges). (3) reuse a process pool: `multiprocessing.Pool(32)` creates fixed workers and reuses them for all 1000 jobs. (4) close unneeded FDs: if the worker doesn't need stdin, `os.close(0)` in the worker saves an FD—though a handful of FDs rarely matters compared to fixing the process count. (5) use a job queue (Redis, RabbitMQ) with a fixed worker pool instead of spawning 1000 processes. Best practice: always use Pool for bulk jobs, not individual Process spawns. Measure: `lsof -p PID | wc -l` shows FDs per process.
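Option (2) in code—a sketch assuming Linux (`/proc/self/fd` does not exist on macOS): raise the soft limit toward the hard limit before spawning workers, and count what is actually open.

```python
import os
import resource

def raise_fd_limit(target=65536):
    # The soft limit can be raised only up to the hard limit without
    # privileges, so clamp the request rather than letting setrlimit fail.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft

def open_fd_count():
    # Linux-only: one symlink per open descriptor for this process
    return len(os.listdir("/proc/self/fd"))

if __name__ == "__main__":
    print("soft limit now:", raise_fd_limit())
    print("open FDs:", open_fd_count())
```

Run `open_fd_count()` before and after spawning workers to see how many descriptors each child actually costs the parent.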
Follow-up: How do you configure and monitor file descriptor limits in production?
A service uses `multiprocessing` for parallelism. During shutdown, `pool.terminate()` kills workers, but some are stuck on I/O and don't die. Shutdown timeout is 30s. How do you force-kill workers safely?
`terminate()` sends SIGTERM (on Unix), which is catchable—and a process blocked in uninterruptible I/O may never act on it. Solutions: (1) implement a signal handler in the worker: `signal.signal(signal.SIGTERM, handler)` to clean up gracefully, then exit. (2) wait with a deadline: note that `Pool.join()` takes no timeout, so join the underlying worker `Process` objects with `p.join(timeout=5)` and force-kill whatever is still alive. (3) use `pool.close()` then `pool.join()` for graceful shutdown: close stops accepting work and lets in-flight jobs finish, join waits for the processes. (4) escalate to SIGKILL after the SIGTERM grace period: `os.kill(pid, signal.SIGKILL)` cannot be caught or ignored—though a process stuck in uninterruptible disk/NFS I/O (state D) won't die until the I/O returns. (5) set per-job timeouts: `async_result.get(timeout=...)` so runaway jobs surface before shutdown. Best order: (1) signal handlers in workers, (2) close()+join() for the graceful path, (3) SIGKILL as last resort. Measure: shutdown duration with/without signal handlers; it should stay well under the 30s budget.
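A sketch of the escalation path, assuming Unix. Since `Pool.join()` accepts no timeout, this joins the pool's worker `Process` objects individually (via the private `_pool` attribute, so treat it as a shutdown-only hack) and escalates to SIGKILL after a grace period:

```python
import os
import signal
import time
from multiprocessing import Pool

def shutdown(pool, grace=5.0):
    pool.close()                            # stop accepting new work
    deadline = time.monotonic() + grace
    workers = list(pool._pool)              # private attribute: shutdown only
    for p in workers:
        # Pool.join() has no timeout, but Process.join() does
        p.join(timeout=max(0.0, deadline - time.monotonic()))
    for p in workers:
        if p.is_alive():                    # grace period expired
            os.kill(p.pid, signal.SIGKILL)  # kernel-enforced, uncatchable
    pool.join()                             # reap whatever is left

if __name__ == "__main__":
    pool = Pool(processes=4)
    pool.map(time.sleep, [0.01] * 8)        # stand-in workload
    shutdown(pool)
    print("shutdown complete")
```

Here the workers finish within the grace period, so SIGKILL never fires; with a worker wedged in I/O, the same code kills it at the 5s mark.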
Follow-up: How do you implement graceful shutdown that ensures in-flight work is persisted?
Using `multiprocessing` on a 64-core machine. CPU utilization is 30% even though 32 workers are busy. Other 32 cores are idle. Why aren't all cores being used?
Possible causes: (1) processes not pinned to cores: the OS scheduler may cluster workers on a few cores, leaving others idle. Use `taskset` or `os.sched_setaffinity()` to pin each worker to a specific core. (2) not the GIL: each process has its own interpreter and its own GIL, so the GIL does not serialize separate processes—it would only matter for threads inside a single worker. (3) memory bandwidth: a memory-intensive workload can saturate the shared memory bus, stalling all cores; adding more cores doesn't help. (4) job imbalance: if some of the 32 jobs finish fast and some are slow, finished workers sit idle waiting on the stragglers. Use dynamic load balancing (e.g. `imap_unordered` with a small chunksize). (5) multiprocessing overhead: starting a process is expensive; if jobs are short, overhead dominates. Batch jobs. Solutions: (1) pin processes: for each worker i, `os.sched_setaffinity(pid, {i % 64})` (Linux). (2) profile: is the workload CPU-, memory-, or I/O-bound? (3) measure throughput, not just utilization: if utilization is 30% but throughput scales with added cores, the idle cores simply aren't needed.
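Solution (1) as a sketch, assuming Linux (`os.sched_setaffinity` is unavailable on macOS). Each worker pins itself to one core drawn from the cores the process is actually allowed to use, so the scheduler cannot stack workers:

```python
import os
from multiprocessing import Process

def pinned_worker(core):
    # 0 means "the calling process"; restrict it to a single core
    os.sched_setaffinity(0, {core})
    # ... CPU-bound work runs here, never sharing a core with a sibling ...
    assert os.sched_getaffinity(0) == {core}

if __name__ == "__main__":
    # Respect cgroup/cpuset restrictions: only pin to cores we may use
    available = sorted(os.sched_getaffinity(0))
    workers = [Process(target=pinned_worker,
                       args=(available[i % len(available)],))
               for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print("all pinned workers exited:", all(p.exitcode == 0 for p in workers))
```

Reading the current mask via `sched_getaffinity(0)` first matters in containers, where the cpuset may exclude some of the machine's 64 cores.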
Follow-up: How do you implement NUMA-aware process binding on multi-socket servers?
Worker process deadlocks inside a C extension (numpy). Pool hangs indefinitely. `pool.terminate()` doesn't kill worker (it's stuck in C code, can't be interrupted). How do you handle this?
C code that never returns to the interpreter can't be interrupted by Python-level signals: handlers only run between bytecodes, so even `signal.alarm()` won't fire a handler inside a stuck C call. Solutions: (1) external watchdog: a monitor (the parent process, or a thread in it) tracks per-job deadlines and sends `os.kill(worker_pid, signal.SIGKILL)` to the stuck worker; SIGKILL is enforced by the kernel, so it works even when the process is wedged in C code. (2) check for known fork-related hangs: threaded BLAS (OpenBLAS/MKL) plus `fork` is a classic numpy deadlock; try the spawn start method (`multiprocessing.set_start_method("spawn")`) or limit BLAS threads (`OMP_NUM_THREADS=1`). (3) pin or upgrade the extension version if the hang is a known bug. (4) isolate risky calls: run deadlock-prone C calls in a disposable child process so killing it doesn't take down the worker. Test: verify the C extension doesn't deadlock under your start method before deploying. For production: timeouts + monitoring, and report reproducible hangs upstream to the library maintainers.
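A sketch of the process-level watchdog (4): run the risky call in a disposable child and SIGKILL it on deadline. SIGKILL is delivered by the kernel, so it works even when the child is wedged inside C code where Python signal handlers would never run. Assumes Unix, a picklable `fn` under the spawn start method, and omits handling for a child that crashes without producing a result:

```python
import os
import signal
from multiprocessing import Process, Queue

def _call(outbox, fn, args):
    outbox.put(fn(*args))                  # ship the result back to the parent

def run_with_timeout(fn, args=(), timeout=30.0):
    outbox = Queue()
    p = Process(target=_call, args=(outbox, fn, args))
    p.start()
    p.join(timeout)
    if p.is_alive():                       # still wedged after the deadline
        os.kill(p.pid, signal.SIGKILL)     # kernel-level: works inside C code
        p.join()                           # reap the killed child
        raise TimeoutError(f"call exceeded {timeout}s and was killed")
    # Small timeout guards against a child that exited without a result
    return outbox.get(timeout=5)

if __name__ == "__main__":
    print(run_with_timeout(pow, (2, 10)))  # 1024
```

Spawning a fresh process per risky call costs startup time, so reserve this wrapper for the specific C calls known to hang, not the whole workload.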
Follow-up: How do you implement timeouts for C extension calls that can't be interrupted?