Your event loop-based application stops processing new requests after 8 hours. CPU and memory are stable, but `asyncio.gather()` calls hang indefinitely. Restarting fixes it temporarily. What's the root cause and how do you diagnose production issues without restart?
Classic event loop stall: a pending coroutine is blocked waiting for an event that will never arrive, holding references that prevent cleanup and blocking all downstream work. Common causes: (1) a task awaiting a future that is never resolved (missing `future.set_result()` or `future.set_exception()`), (2) a swallowed `CancelledError` that leaves cleanup half-done, (3) a circular wait between coroutines (A awaits B, B awaits A), (4) a blocking call (e.g., `time.sleep()`) inside async code, freezing the whole loop. Diagnose with `asyncio.all_tasks()` to list pending tasks, then `task.get_stack()` or `task.print_stack()` to see where each one is suspended. Enable debug mode in dev (`asyncio.run(main(), debug=True)` or `PYTHONASYNCIODEBUG=1`) to catch never-awaited coroutines and slow callbacks. For production without a restart: attach `py-spy dump` to inspect thread stacks, or expose an admin endpoint that dumps the stacks from `asyncio.all_tasks()` on demand. Tasks don't expose a public "last ran" timestamp, so a watchdog must track progress itself: have each long-lived coroutine record a heartbeat, and log and alert on tasks whose heartbeat goes stale. Never call blocking operations in async code; use `loop.run_in_executor()` instead.
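The heartbeat watchdog described above can be sketched as follows; `beat`, `stale`, and `watchdog` are hypothetical names, and the 60-second threshold is an assumption to tune per workload:

```python
import asyncio
import time

# Last-heartbeat timestamp per named task (updated by the tasks themselves).
_heartbeats: dict[str, float] = {}

def beat(name: str) -> None:
    """Call from inside long-running coroutines at natural checkpoints."""
    _heartbeats[name] = time.monotonic()

def stale(threshold: float) -> list[str]:
    """Names whose last heartbeat is older than `threshold` seconds."""
    now = time.monotonic()
    return [n for n, last in _heartbeats.items() if now - last > threshold]

async def watchdog(threshold: float = 60.0, interval: float = 10.0) -> None:
    """Periodically flag silent tasks and dump pending-task stacks."""
    while True:
        await asyncio.sleep(interval)
        for name in stale(threshold):
            print(f"possible hang: {name}")
            for t in asyncio.all_tasks():
                if not t.done():
                    t.print_stack(limit=5)  # where is it suspended?
```

Run `asyncio.create_task(watchdog())` at startup and sprinkle `beat("name")` calls at checkpoints in each long-lived coroutine; a task that stops beating shows up with its suspension point in the logs.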
Follow-up: If a stuck task is holding resources (DB connection, file handle), how do you force cleanup without corrupting state?
You're building a WebSocket server with `asyncio`. Client connections are processed by a handler coroutine. After 10k concurrent connections, new connections accept but handlers don't run—they're queued indefinitely. CPU and memory are available. What happened to your event loop?
The event loop is drowning in scheduling overhead, not actual work. With tens of thousands of ready tasks, each loop iteration spends its time cycling through the ready queue (internally, the loop's `_ready` deque) rather than making real progress on any one handler. Symptoms: CPU usage is low despite many tasks, latency climbs, and new work is accepted but never scheduled promptly. Root cause: handler code spawning unbounded sub-tasks (`asyncio.create_task()` in a loop) with no backpressure. Solution: (1) cap concurrency with `asyncio.Semaphore(max_concurrent)`, (2) batch operations: instead of 10k individual tasks, chunk the work and `asyncio.gather()` each chunk, (3) measure event loop latency directly: schedule a no-op via `loop.call_soon()` (or `await asyncio.sleep(0)`) and time how long it takes to actually run; growing lag means saturation. For WebSocket servers, the low-level `asyncio.StreamReader`/`asyncio.StreamWriter` APIs carry less per-connection overhead than heavyweight frameworks. Debug mode (`debug=True`) logs slow callbacks. Consider `uvloop` (a drop-in libuv-based loop, commonly 2-4x faster) for high-concurrency workloads, and shard across multiple event loops, one per thread or process, for parallel scaling.
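A minimal sketch of the two diagnostics above, assuming made-up names (`handle`, `loop_lag`) and a placeholder `sleep(0)` standing in for real per-connection I/O:

```python
import asyncio
import time

MAX_CONCURRENT = 1000  # assumption: tune for your workload

async def handle(client_id: int, sem: asyncio.Semaphore) -> str:
    async with sem:              # backpressure: excess handlers wait here
        await asyncio.sleep(0)   # stand-in for real per-connection I/O
        return f"done-{client_id}"

async def loop_lag(samples: int = 5) -> float:
    """Scheduling latency: how long one trip through the ready queue takes."""
    worst = 0.0
    for _ in range(samples):
        t0 = time.perf_counter()
        await asyncio.sleep(0)   # yield once through the event loop
        worst = max(worst, time.perf_counter() - t0)
    return worst

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = await asyncio.gather(*(handle(i, sem) for i in range(50)))
    return results, await loop_lag()
```

On a healthy loop `loop_lag()` stays in the microseconds; if it creeps toward milliseconds, the ready queue is saturated and the semaphore cap needs lowering.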
Follow-up: How would you design a backpressure system where accepting new connections is paused when handlers are saturated?
A real-time data pipeline uses `asyncio` to fan-out work: each input triggers 100 parallel coroutines via `asyncio.create_task()`. At 100 inputs/sec (10k concurrent tasks), memory grows by 10MB/sec despite tasks completing. Tracing shows garbage collection runs every 30 seconds. How do you fix this?
Allocation churn plus reference cycles. Asyncio creates a task object (a Python object plus suspended frame state, roughly a kilobyte) per coroutine; at 10k tasks/sec you're allocating millions of objects per minute, and tasks, futures, and exception tracebacks frequently form reference cycles that only the cyclic garbage collector can reclaim. If GC runs only every 30 seconds, that garbage accumulates between cycles. Solutions: (1) replace per-item task spawning with a fixed-size worker pool fed by an `asyncio.Queue`: reuse a handful of long-lived tasks instead of creating 100 per input, (2) batch inputs: 10 inputs = 10 parallel tasks of 10 items each rather than 1,000 tasks, (3) tune GC to run more often: the defaults are `gc.set_threshold(700, 10, 10)`, so lowering the first threshold (e.g., `gc.set_threshold(200, 5, 5)`) collects more aggressively at the cost of GC overhead, (4) use `asyncio.as_completed()` or `asyncio.wait()` instead of `gather()` so finished tasks are released as they complete rather than held until the whole batch returns. Measure with `tracemalloc`: `peak = tracemalloc.get_traced_memory()[1]`, tracked over time; use `objgraph` to find unexpected reference cycles. For very high-throughput fan-out, consider a dedicated queue system (Kafka, Redis Streams) or `multiprocessing.Pool` instead of pure asyncio.
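The fixed-size pool in point (1) can be sketched like this; the `* 2` body and the pool/queue sizes are placeholder assumptions, and the bounded queue doubles as backpressure:

```python
import asyncio

async def worker(queue: asyncio.Queue, results: list) -> None:
    """Long-lived worker: pulls items until it sees the None sentinel."""
    while True:
        item = await queue.get()
        if item is None:
            queue.task_done()
            break
        results.append(item * 2)  # stand-in for real per-item work
        queue.task_done()

async def run_pool(items, n_workers: int = 8) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded = backpressure
    results: list = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(n_workers)]
    for item in items:
        await queue.put(item)     # suspends the producer when the queue is full
    for _ in workers:
        await queue.put(None)     # one sentinel per worker to shut down cleanly
    await asyncio.gather(*workers)
    return results
```

Only `n_workers` task objects ever exist, however many items flow through, which is exactly what kills the allocation churn.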
Follow-up: How do you implement a fixed-size asyncio worker pool that scales gracefully under variable load?
Your asyncio application uses `asyncio.sleep(0)` as a yield point (cooperative multitasking). Performance profiling shows 40% of CPU time is in the scheduler, not application code. Switching to `await asyncio.sleep(0.001)` (1ms) makes it worse. How do you optimize scheduler overhead?
Excessive `asyncio.sleep(0)` calls are thrashing the scheduler. Each one is a full trip through the event loop: suspend the coroutine, append it to the ready queue, resume it on a later iteration. At a million `sleep(0)` calls per second, the CPU goes to scheduler bookkeeping, not work. Root cause: code treating asyncio like a preemptive threading scheduler (yielding constantly) instead of event-driven. Solution: (1) replace `sleep(0)` yields with real I/O-driven suspension: `await connection.read()` parks the coroutine in the selector until data arrives, instead of polling, (2) use `asyncio.gather()` to batch: await a group of operations at once rather than yielding between each one, (3) measure actual work duration with `time.perf_counter()`: if a unit of work takes 1 ms, inserting a scheduling round-trip between units can double your cost, and `sleep(0.001)` is worse still because it adds timer-heap overhead on top (which is why the 1 ms sleep degraded things further), (4) for CPU-bound work disguised as async, offload to a thread pool with `loop.run_in_executor()`. Benchmark with `py-spy` to see how much time lands in the asyncio module itself. For pure event loops with minimal computation, uvloop reduces scheduler overhead by roughly 2-4x. And question the tool choice: for simple workloads with minimal I/O, plain sync code with threads may be faster.
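Point (4) in practice: instead of chunking CPU work with `sleep(0)` yields, hand it to an executor so the loop stays free. A minimal sketch, with `crunch` as a hypothetical CPU-bound function:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def crunch(n: int) -> int:
    """CPU-bound work; would freeze the event loop if run inline."""
    return sum(i * i for i in range(n))

async def main() -> list[int]:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Offload to threads instead of sprinkling sleep(0) between chunks;
        # the loop keeps servicing I/O while the workers compute.
        futures = [loop.run_in_executor(pool, crunch, 10_000) for _ in range(4)]
        return await asyncio.gather(*futures)
```

For heavy pure-Python computation a `ProcessPoolExecutor` sidesteps the GIL the same way; the call site is identical.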
Follow-up: How do you transition an asyncio codebase that overuses `sleep(0)` to true event-driven design?
A service has multiple event loops (one per thread/process) for handling different protocols (HTTP, gRPC, WebSocket). Inter-loop communication uses a shared queue. Messages are processed but sent async callbacks never fire. Thread safety debugging shows no deadlock. What's wrong?
Callbacks are being registered on the wrong event loop, or posted to it unsafely across threads. Each loop has its own callback queue; `loop.call_soon()` is only safe from the thread running that loop, and a callback posted to a loop that never gets to run it simply sits in the queue forever. In multi-loop setups: (1) identify which loop should run each callback (inside a coroutine, `asyncio.get_running_loop()` gives you the current one), (2) from any other thread or loop, use `loop.call_soon_threadsafe()`: plain `call_soon()` isn't thread-safe and also won't wake a loop that's sleeping in its selector, (3) verify the target loop is actually running and that its thread isn't stuck in a blocking synchronous call: a blocked loop cannot execute queued callbacks. Debug by logging which loop each callback is registered on and checking `loop.is_running()` before posting. For inter-loop communication, use `asyncio.run_coroutine_threadsafe(coro, target_loop)`, or a thread-safe `queue.Queue` drained via `loop.run_in_executor()` to wake the target loop. Better pattern: run all I/O on one event loop and use processes/threads only for CPU-bound work. If you must have multiple loops, remember `asyncio.Queue` is not thread-safe (use it within a loop); use `queue.Queue` between loops.
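A minimal sketch of posting work onto a specific loop from another thread, using the real APIs named above (`run_coroutine_threadsafe`, `call_soon_threadsafe`); the helper names are made up:

```python
import asyncio
import threading

def start_loop_in_thread() -> asyncio.AbstractEventLoop:
    """Run a dedicated event loop in a daemon background thread."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def work(x: int) -> int:
    await asyncio.sleep(0)   # stand-in for real async work on that loop
    return x + 1

loop = start_loop_in_thread()

# From any other thread: submit a coroutine to THAT loop, thread-safely.
# Returns a concurrent.futures.Future we can block on from this thread.
fut = asyncio.run_coroutine_threadsafe(work(41), loop)
result = fut.result(timeout=5)

# Stopping the loop must also cross threads via call_soon_threadsafe.
loop.call_soon_threadsafe(loop.stop)
```

Had the last line used plain `loop.call_soon(loop.stop)` from this thread, the stop request might never wake the loop; that asymmetry is exactly the bug in the scenario.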
Follow-up: How would you design a multi-protocol server (HTTP, gRPC, WebSocket) sharing a single event loop without protocol interference?
You're profiling an asyncio application and notice `asyncio.get_event_loop()` returns a different loop on each call from background threads. Context managers use different loops mid-request. This causes cryptic "task attached to different loop" errors. How do you enforce loop consistency?
Legacy `asyncio.get_event_loop()` behavior is confusing: in the main thread with no running loop it creates and caches a new one, while in other threads it raises unless a loop was registered with `asyncio.set_event_loop()`; code that calls `set_event_loop()` ad hoc ends up with different loops bound in different places (this implicit behavior is deprecated in recent Python for exactly this reason). Solution: (1) be explicit about which loop owns the work: from another thread, submit with `asyncio.run_coroutine_threadsafe(coro, main_loop)` rather than relying on implicit loop lookup (note that `asyncio.create_task()` takes no `loop` argument; it always uses the currently running loop), (2) inside coroutines, use `asyncio.get_running_loop()` (available since Python 3.7), which raises if no loop is active, so misuse fails loudly instead of silently creating a second loop, (3) for multi-threaded apps, give each thread exactly one dedicated loop: `loop = asyncio.new_event_loop(); loop.run_until_complete(async_main()); loop.close()`. Prefer `asyncio.run()`, which handles loop setup and teardown correctly. Never rely on implicit `get_event_loop()` in production code. Test with multiple threads in CI to catch loop binding issues early.
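The one-loop-per-thread discipline can be sketched like this; `record` and `thread_main` are illustrative names:

```python
import asyncio
import threading

results: dict[str, asyncio.AbstractEventLoop] = {}

async def record(name: str) -> None:
    # get_running_loop() raises if no loop is active, so any misuse
    # fails loudly instead of silently creating a second loop.
    results[name] = asyncio.get_running_loop()

def thread_main(name: str) -> None:
    # Each thread explicitly creates, runs, and closes exactly one loop;
    # nothing here depends on implicit get_event_loop() lookup.
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(record(name))
    finally:
        loop.close()

threads = [threading.Thread(target=thread_main, args=(f"t{i}",))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread ends up bound to its own distinct loop object, and any coroutine that checks `get_running_loop()` can assert it's on the loop it expects.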
Follow-up: If a library function expects `asyncio.get_event_loop()` but you need to use a specific loop, how do you patch it safely?
A scheduled background task runs every 5 seconds via `loop.call_later(5, callback)`. After 24 hours, the callback stops being called even though the loop is running and other tasks execute. The delay drifts: runs at T=0, T=5, T=10, T=16, T=22, eventually skipped. What's causing timing degradation?
Callback latency is drifting due to event loop saturation. Each `call_later()` schedules a single firing; if the loop is busy when the timer expires, the callback runs late, and if the callback then reschedules itself relative to "now", every delay is baked into the next deadline, so error accumulates. Root cause: no backpressure; each callback can kick off a cascade of work that delays subsequent timers. Solution: (1) replace one-shot `call_later()` chains with a self-rescheduling coroutine: `async def recurring(): while True: do_work(); await asyncio.sleep(5)`, started with `asyncio.create_task(recurring())`; to eliminate cumulative drift, compute each sleep from a fixed schedule (`next_tick += period; sleep(next_tick - now)`) rather than a flat 5 s after the work finishes. (2) measure jitter: track the gap between expected and actual fire times; if the mean exceeds a second, the loop is saturated. (3) for critical timers, use `signal.alarm()` or OS-level timers (less subject to application jitter), but these can only invoke signal handlers. (4) never do blocking work in timer callbacks; hand the real work to `asyncio.create_task()` so the callback returns immediately. Profile the loop: `debug=True` logs callbacks slower than `loop.slow_callback_duration` (100 ms by default). For precise timing, consider `threading.Timer` on a dedicated thread or an external scheduler (APScheduler) if sub-second precision is needed.
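The fixed-schedule variant of point (1) can be sketched as below; `every` is a hypothetical helper, and the short period/run-count are just for demonstration:

```python
import asyncio
import time

async def every(period: float, fn, *, max_runs: int) -> None:
    """Fixed-rate scheduling: sleep until the next absolute tick rather than
    a flat `period` after each run, so one slow run doesn't shift all the
    later ones and drift cannot accumulate."""
    next_tick = time.monotonic()
    for _ in range(max_runs):
        fn()                      # in real use, hand long work to create_task()
        next_tick += period
        delay = next_tick - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        # delay <= 0 means we're behind schedule: fire the next tick at once

ticks: list[float] = []
asyncio.run(every(0.01, lambda: ticks.append(time.monotonic()), max_runs=5))
```

With the naive `sleep(period)`-after-work version, a 1 s hiccup pushes every subsequent run back by 1 s; here the next tick still fires at its original deadline.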
Follow-up: How would you implement guaranteed "run-at-most-once" semantics for a periodic task that must not overlap with itself?
You're implementing a timeout mechanism: `asyncio.wait_for(task, timeout=10)`. If the task completes in 5 seconds but the timeout fires anyway at 10 seconds (creating duplicate work), what race condition could cause this?
The completion and timeout callbacks raced: the event loop had both "task done" and "timeout fired" scheduled, and whichever ran first won. When the task completes, `asyncio.wait_for()` cancels its timeout handle, but a timeout callback already sitting in the ready queue can still execute. Edge case: if the task takes 9.999 seconds against a 10 s timeout, completion and timeout are nearly simultaneous and the outcome is non-deterministic. Solution: (1) handle cancellation correctly: when `wait_for()` times out it cancels the task and raises `asyncio.TimeoutError`, and modern implementations wait for the task's cancellation to actually complete before raising, so cleanup finishes first, (2) if you roll your own guard, do the same: cancel the task, `await` it (swallowing `CancelledError`), and only then raise `TimeoutError`; never declare a timeout while the task might still deliver a result, (3) track task state with a flag: set `completed = True` when the task finishes so the timeout path can check `if not completed:` before doing any timeout work. This is why `asyncio.timeout()` (Python 3.11+) exists: a context manager that delivers timeouts via cancellation atomically. For earlier versions, avoid manual timeout logic; use a library that gets this right (e.g., `async-timeout`). Test timeout edge cases: set timeout = task_duration ± 10 ms and run 10k iterations to flush out race conditions.
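A sketch of the cancel-then-await guard from point (2), assuming a hypothetical wrapper `with_timeout`; `shield()` keeps `wait_for`'s own cancellation away from the task so we control cleanup explicitly:

```python
import asyncio

async def with_timeout(coro, timeout: float):
    """Race-free timeout semantics: on timeout, cancel the task and await
    its cleanup BEFORE raising, so completion and timeout can't both win."""
    task = asyncio.ensure_future(coro)
    try:
        # shield() means wait_for's timeout cancels only the shield wrapper,
        # not the task itself; we decide below what happens to the task.
        return await asyncio.wait_for(asyncio.shield(task), timeout)
    except asyncio.TimeoutError:
        task.cancel()
        try:
            await task            # let the task's cancel handlers finish
        except asyncio.CancelledError:
            pass
        raise                     # only now is the timeout official

async def fast():
    await asyncio.sleep(0)
    return 7

async def slow():
    await asyncio.sleep(5)
```

Because `TimeoutError` is re-raised only after `await task` returns, a caller can never observe both a timeout and a late result from the same call; this mirrors the soft-timeout shape the follow-up asks about.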
Follow-up: How do you implement a "soft timeout" that allows graceful cleanup vs "hard timeout" that immediately kills a task?