Python Interview Questions

GIL, Concurrency, and Parallelism


Your Django application processes 10k requests/sec. You've implemented async views with `asyncio`, but CPU-bound tasks (image processing, cryptography) are blocking the event loop. How do you scale this without redesigning the entire stack?

Use `loop.run_in_executor()` to offload blocking work without redesigning the stack. For blocking I/O, a thread pool is enough: `executor = ThreadPoolExecutor(max_workers=20); await loop.run_in_executor(executor, blocking_func)`. For CPU-bound work the GIL prevents thread-level parallelism, so use `concurrent.futures.ProcessPoolExecutor` instead. Bind CPU-heavy workers to specific cores via `os.sched_setaffinity()` to reduce cache misses. To find what is blocking, dump live thread stacks with `sys._current_frames()` or sample with `py-spy`. At scale, consider multiprocessing over Unix domain sockets (lower overhead than TCP) or move heavy compute to Celery workers on separate machines, which bypasses the GIL completely.
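
A minimal sketch of the offloading pattern (`transform`, the pool size, and the payload are illustrative; a thread pool keeps the snippet self-contained, with the process-pool swap noted in the comment):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def transform(data: bytes) -> bytes:
    # Stand-in for a CPU-bound step (image resize, hashing, ...).
    return data[::-1]

# One long-lived pool; swap in concurrent.futures.ProcessPoolExecutor to
# escape the GIL for CPU-bound work (arguments must then be picklable).
pool = ThreadPoolExecutor(max_workers=4)

async def handle_request(data: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    # The await yields control, so the event loop keeps serving other requests.
    return await loop.run_in_executor(pool, transform, data)

result = asyncio.run(handle_request(b"raw-bytes"))
print(result)  # b'setyb-war'
```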

Follow-up: If you spawn 50 worker processes for CPU-bound tasks, how do you prevent memory explosion and coordinate shutdown without data loss?

A financial backend processes trades with 8 Python worker threads. After a VM migration, tail latency (p99) spikes from 15ms to 200ms. GIL contention profiling shows threads are holding the lock for microseconds, yet stalled wall-clock time is orders of magnitude higher. What's likely happening?

Classic GIL thrashing: threads wake up, compete for the GIL, yield, and repeat without productive work. Measure with a sampling profiler such as `py-spy` rather than wall-clock timers alone. Root causes: (1) VM host CPU oversubscription (8 threads on 4 vCPUs), (2) scheduler tick conflicts: the interpreter forces a GIL release every ~5ms by default (`sys.getswitchinterval()`), causing a thundering herd on lock acquisition, (3) unnecessary contention on shared data structures (lists/dicts instead of thread-local state or queues). Solutions: pin threads to cores with `taskset` or `os.sched_setaffinity()` and reduce thread count, or migrate to free-threaded CPython 3.13+ (PEP 703), which removes the GIL entirely. For an immediate fix: raise the interval with `sys.setswitchinterval()` to reduce context switches, or switch to `multiprocessing` with IPC queues.
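
The thrashing is easy to demonstrate: pure-Python CPU work does not speed up with more threads under the GIL. A rough benchmark sketch (iteration counts are arbitrary), usable as a CI smoke test for the follow-up question:

```python
import sys
import threading
import time

def spin(n: int) -> None:
    # Pure-Python loop: holds the GIL the entire time it runs.
    while n:
        n -= 1

def timed(num_threads: int, work: int = 2_000_000) -> float:
    threads = [threading.Thread(target=spin, args=(work,)) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print(f"switch interval: {sys.getswitchinterval():.3f}s")
one = timed(1)
four = timed(4)
# On a stock GIL build, expect roughly 4x the single-thread time, not ~1x:
# the threads serialize on the GIL and add switching overhead on top.
print(f"1 thread: {one:.3f}s, 4 threads: {four:.3f}s")
```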

Follow-up: How would you simulate and test for GIL contention in CI/CD before deployment?

You're migrating a streaming data pipeline from threading to asyncio. The old code used 100 threads, each with blocking `socket.recv()`. After converting to `asyncio.StreamReader`, you see 30% performance drop and high memory usage. What went wrong?

Common pitfalls: (1) blocking calls still present (e.g., `requests.get()` instead of `aiohttp`), which freeze the single-threaded event loop for every connection at once, (2) excess coroutine creation: 100 long-lived threads become 100k short-lived coroutines, fragmenting memory, (3) event loop overhead: asyncio context switches are cheaper than OS threads but not free, and too many coroutines compete for loop time. Debug with `asyncio.run(..., debug=True)` to detect slow callbacks and blocking calls. Replace blocking libraries: `aiohttp` for HTTP, `asyncpg` for Postgres, `motor` for MongoDB. Cap concurrent coroutines with `asyncio.Semaphore(100)`. Consider a hybrid: threads (via `run_in_executor`) for truly blocking external APIs plus asyncio for your own code. Profile with `cProfile` and `py-spy` to measure event loop stall time.
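
The semaphore cap can be sketched like this (the sleep stands in for a real `aiohttp`/`asyncpg` call, and the counts are illustrative):

```python
import asyncio

async def fetch(i: int, sem: asyncio.Semaphore) -> int:
    # Only 100 coroutines may be inside this block at once; the rest
    # wait cheaply on the semaphore instead of piling up work in flight.
    async with sem:
        await asyncio.sleep(0.01)   # stand-in for an async I/O call
        return i

async def main() -> list[int]:
    sem = asyncio.Semaphore(100)
    return await asyncio.gather(*(fetch(i, sem) for i in range(1000)))

results = asyncio.run(main())
print(len(results))  # 1000
```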

Follow-up: How do you benchmark asyncio vs threading for a specific workload to decide which is worth the conversion effort?

A background job processor spawns 64 worker processes for batch job execution. After running for hours, process count grows to 200+ and jobs queue unboundedly. Adding more workers makes it worse. What's the failure mode?

Worker oversubscription: scheduling overhead is overwhelming actual work. With N cores, roughly N CPU-bound processes is optimal; 200 processes on 64 cores is 3x oversubscription, wasting a large share of CPU on context switches and cache thrashing. Likely causes: (1) the parent spawns new workers without reaping completed ones (check `multiprocessing.Pool` vs manual `Process()` management), (2) blocking in worker code makes a spawn-happy supervisor keep starting more, (3) no backpressure on the job queue. Solution: use `multiprocessing.Pool(processes=os.cpu_count())`, which maintains a fixed worker set, and add a bounded queue in front so producers block instead of piling up work. Add telemetry: log worker state changes, queue depth, and task duration. Degrade gracefully: queue jobs, don't spawn more workers, and treat `os.cpu_count()` as the ceiling. For truly dynamic workloads, use a job queue (Redis, RabbitMQ) with a fixed worker pool, not on-demand spawning.
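
The fixed-pool-plus-backpressure shape can be sketched with a bounded queue; threads are used here so the snippet is self-contained, but the same structure applies to a process pool. The sentinel pattern also answers the shutdown-without-data-loss concern:

```python
import os
import queue
import threading

job_queue: "queue.Queue[int | None]" = queue.Queue(maxsize=32)  # bounded: full queue blocks producers
results: list[int] = []
results_lock = threading.Lock()

def worker() -> None:
    while True:
        job = job_queue.get()
        if job is None:          # sentinel: queue is drained first, nothing is dropped
            return
        done = job * 2           # stand-in for real batch work
        with results_lock:
            results.append(done)

num_workers = min(4, os.cpu_count() or 1)    # fixed pool: never grows under load
workers = [threading.Thread(target=worker) for _ in range(num_workers)]
for w in workers:
    w.start()

for job in range(100):
    job_queue.put(job)           # blocks when the queue is full -> backpressure
for _ in workers:
    job_queue.put(None)          # one sentinel per worker for coordinated shutdown
for w in workers:
    w.join()

print(len(results))  # 100
```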

Follow-up: If 10% of jobs take 30x longer than average, how do you prevent them from stalling the entire pipeline?

You have 8 threads sharing a `threading.Lock()` around a critical section (DB writes). Profiling shows 30% of time is spent waiting for the lock, not executing critical code (which takes 2ms). Adding more threads makes lock contention worse. How do you unblock this?

Lock contention is a queuing problem. With a 2ms critical section, one lock saturates near 500 acquisitions/sec, and queueing delay explodes well before that. Options: (1) shrink the critical section: hoist lock-free operations out, cache reads. (2) use a reader-writer lock if reads dominate (the stdlib has none; the `readerwriterlock` package provides one). (3) shard the lock: instead of one lock, keep a dict of locks keyed by record ID, so threads touching different records don't block each other. (4) lock-free-style structures: `queue.Queue` is thread-safe without explicit locks. (5) batch writes: accumulate changes in thread-local buffers and flush periodically under one lock. (6) move to async: a single event loop plus an async DB driver eliminates threading overhead entirely. Measure by timing `lock.acquire()` with `time.perf_counter()` and sampling with `py-spy`. For database writes specifically, prefer bulk insert APIs (one DB round trip for 100 rows) over 100 individual writes.
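
Lock sharding (option 3) might look like this; `NUM_SHARDS` and the record model are illustrative:

```python
import threading

NUM_SHARDS = 16
# One lock per shard; record IDs hash to a shard, so writers to different
# records usually take different locks and don't serialize.
shard_locks = [threading.Lock() for _ in range(NUM_SHARDS)]
records: dict[int, int] = {}

def lock_for(record_id: int) -> threading.Lock:
    return shard_locks[hash(record_id) % NUM_SHARDS]

def update(record_id: int, delta: int) -> None:
    with lock_for(record_id):    # read-modify-write guarded per shard
        records[record_id] = records.get(record_id, 0) + delta

threads = [threading.Thread(target=update, args=(i % 50, 1)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(records.values()))  # 200
```

For multi-record transactions, acquire the needed shard locks in a fixed global order (e.g., sorted by shard index) so two transactions can never hold each other's locks.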

Follow-up: If you shard 100 locks across 8 threads, how do you ensure deadlock-free ordering for multi-record transactions?

A machine learning pipeline uses 4 worker processes to load and preprocess images in parallel. Adding a 5th worker decreases throughput by 15%. System has 8 CPU cores. Why?

Hyperthreading illusion vs actual core contention. Modern CPUs report 8 "cores" but may have fewer physical cores (e.g., 4 cores with 2-way hyperthreading/SMT). With 4 workers on 4 physical cores, each has full execution resources; a 5th worker causes cache thrashing and OS context-switch overhead that exceeds the benefit of extra parallelism. Use `os.cpu_count()` for logical CPUs and `psutil.cpu_count(logical=False)` for physical cores; `len(os.sched_getaffinity(0))` shows which logical CPUs the process is actually allowed to use. For CPU-bound work, worker count should match physical cores. Verify empirically: benchmark with 1, 2, 4, 8 workers and plot throughput; returns diminish sharply past the physical core count. Use `numactl --hardware` to understand NUMA topology if available. Pin processes to cores (`taskset -p -c 0 PID`) to stop the OS scheduler from migrating processes between cores, which destroys cache locality.
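
A quick way to see the logical/available distinction (physical-core counting needs the third-party `psutil`, omitted here to keep the snippet stdlib-only):

```python
import os

logical = os.cpu_count() or 1               # logical CPUs (includes SMT siblings)
try:
    # Linux-only: the set of logical CPUs this process may be scheduled on
    # (can be smaller than `logical` under cgroups/containers).
    available = len(os.sched_getaffinity(0))
except AttributeError:
    available = logical                     # macOS/Windows fallback
print(f"logical: {logical}, available to this process: {available}")
# Physical cores need a third-party lib: psutil.cpu_count(logical=False)
```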

Follow-up: How do you optimize for NUMA systems where cross-socket memory access is 2-3x slower than local?

You're debugging a production outage where asyncio tasks are timing out after 30 seconds, but the actual database query takes 2 seconds. Investigation reveals `asyncio.TimeoutError` is raised mid-response. The timeout is from a client library wrapping all tasks. How do you preserve the deadline without losing partial results?

Use `asyncio.wait_for(..., timeout=T)` at the right granularity: wrap inner operations, not the entire handler. Structure: the outer timeout is the client SLA (30s), inner timeouts are task-specific (query=5s, cache=1s). On `asyncio.TimeoutError`, catch and respond with partial data plus a retry hint. Example: the query streams 100 rows but the timeout fires after 50; return the 50 with a `Retry-After` header. Never let a timeout silently kill work: log it and track SLA misses. On Python 3.11+, use the `asyncio.timeout()` context manager for scoped deadlines and `asyncio.TaskGroup` to manage concurrent tasks. For true deadline semantics, pass an absolute deadline through context vars and check it before expensive operations: `if time.monotonic() > deadline - buffer: raise TimeoutError()`. Test timeout paths in staging; most teams only test the happy path.
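
The deadline-through-context-vars idea can be sketched like this; the variable names, budget values, and the 0.05s safety margin are hypothetical:

```python
import asyncio
import contextvars
import time

# The absolute deadline travels in a context variable so every layer
# can check the remaining budget without threading it through arguments.
deadline_var: contextvars.ContextVar[float] = contextvars.ContextVar("deadline")

def remaining() -> float:
    return deadline_var.get() - time.monotonic()

async def query_db() -> str:
    # Give up early if less than 0.05s of budget is left.
    if remaining() < 0.05:
        raise TimeoutError("deadline exhausted before query")
    await asyncio.sleep(0.01)   # stand-in for the real query
    return "rows"

async def handler() -> str:
    deadline_var.set(time.monotonic() + 1.0)   # 1s SLA for the whole request
    # The inner timeout is derived from the shared deadline, not hardcoded.
    return await asyncio.wait_for(query_db(), timeout=remaining())

result = asyncio.run(handler())
print(result)  # rows
```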

Follow-up: If a task completes after `asyncio.TimeoutError` cancellation, how do you prevent it from corrupting shared state?

You have a Python service with 200 threads handling WebSocket connections. Memory grows from 1GB to 8GB over 24 hours despite thread pool being at max capacity. Thread stacks are 8MB each. Are threads leaking or is this normal?

Likely not a thread leak (thread objects are garbage-collected), but thread stacks plus heap growth. Each thread reserves ~8MB of stack by default on Linux (configurable via `threading.stack_size()`); 200 threads is about 1.6GB of reserved virtual memory, which is normal, and only touched pages become resident. The 8GB indicates heap growth: either (1) connection handlers are caching data (use `__slots__` to reduce per-object overhead), (2) garbage accumulates between infrequent collections: trigger `gc.collect()` periodically, (3) C extensions are holding references, (4) allocator fragmentation. Use `tracemalloc` to find the top allocators: `tracemalloc.take_snapshot().statistics('lineno')[:10]`. Verify the thread count: `threading.active_count()` should be constant; if it grows, threads aren't terminating, so check for `while True` loops without exit conditions. For 200 concurrent WebSockets, consider asyncio + `aiohttp` instead: an async connection costs a few KB versus ~8MB of reserved stack per thread.
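
A minimal `tracemalloc` session along these lines (the list allocation simulates per-connection caching and is purely illustrative):

```python
import tracemalloc

tracemalloc.start()

# Simulate handlers caching data per connection (hypothetical workload).
cache = [bytes(10_000) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")
# In production, take two snapshots minutes apart and diff them with
# snapshot2.compare_to(snapshot1, "lineno") to see what is *growing*,
# not just what is big at one instant.
for stat in top[:3]:
    print(stat)
```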

Follow-up: How would you safely reduce thread stack size from 8MB to 2MB without stack overflow errors in production?
