Python Interview Questions

C Extensions, ctypes, and Cython


Wrapping a C library with ctypes. The binding is slow: a benchmark loop takes 50ms through ctypes but only 2ms in native C. Why the 25x overhead?

ctypes has runtime overhead: argument marshalling (Python -> C types), FFI boundary crossing, error checking — roughly 1µs per call. Solutions: (1) batch calls: call a C function that loops internally; pass data via numpy arrays (zero-copy). (2) Cython for hot paths: compiles to C and calls C libraries with minimal overhead (~10-50ns per call). (3) cache ctypes references: bind `func = lib.some_function` once and reuse it. (4) numpy for bulk data: numpy arrays wrap C buffers; pass them directly. (5) profile with py-spy: if the time is in the Python->C transition, batch; if it's inside the C code, you need a faster library. For 100M calls, batching is best: a single `c_function_loop(data, n)` call runs the entire loop in C. Measure: time 1M wrapped calls vs the same work in C alone; if the ratio is >10x, the overhead is in the Python layer.
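A minimal sketch of the batching idea, using libc's `memset` as a stand-in for your C function (the library lookup and sizes here are illustrative): one FFI crossing whose loop runs in C replaces one crossing per element.

```python
import ctypes
import ctypes.util
import time

# Load the C library; fall back to the current process if find_library fails.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p

N = 100_000
buf = ctypes.create_string_buffer(N)
base = ctypes.addressof(buf)

# Per-call: one Python->C crossing per byte (the slow path).
t0 = time.perf_counter()
for i in range(N):
    libc.memset(base + i, ord("A"), 1)
per_call = time.perf_counter() - t0

# Batched: a single crossing; the loop runs entirely in C.
t0 = time.perf_counter()
libc.memset(buf, ord("A"), N)
batched = time.perf_counter() - t0

print(f"per-call: {per_call:.4f}s  batched: {batched:.6f}s")
```

Timing both paths like this is also the measurement the answer suggests: if the per-call total dwarfs the batched total, the cost is the boundary, not the C code.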

Follow-up: How do you profile relative cost of Python->C boundary crossing vs computation?

A Cython extension imports fine but runs about as fast as pure Python. Compilation succeeded, but the speedup isn't realized. What's wrong?

Common Cython mistakes: (1) not declaring types: `def func(x): ...` still uses Python dispatch; use `cdef int func(int x):` for a C-level call. (2) using Python objects: `list` and `dict` keep full Python overhead; use C arrays or typed numpy memoryviews. (3) relying on `pyximport`: it compiles on first import and is hard to control; pre-compile with `setup.py`. (4) runtime checks left on: bounds/wraparound checking and `linetrace` profiling add overhead; disable them via compiler directives for production builds. (5) GIL not released: use a `with nogil:` block where you need parallelism. Solutions: (1) annotate types: `cdef int`, `cdef double[:]` (typed memoryview). (2) build with a setup.py that runs Cython, pre-compiled. (3) profile with cProfile to confirm the Cython function is actually called and is where the time goes. Test: run pure Python vs Cython on the same logic and measure the speedup — it should be 10-100x for numeric code.
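As a sketch, the difference between Python-dispatch and C-level code in a `.pyx` file (function names hypothetical):

```cython
# Slow: untyped — every add goes through Python object protocol
def py_sum(data):
    total = 0
    for x in data:
        total += x
    return total

# Fast: typed memoryview + C locals — compiles to a plain C loop
def c_sum(double[:] data):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        total += data[i]
    return total
```

To see where Python interaction remains, build the annotated HTML with `cython -a module.pyx`: lines highlighted yellow still call into the Python API and are the ones to type.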

Follow-up: How do you debug why Cython code isn't faster than pure Python?

A C extension built against the NumPy C API. At scale (1M arrays), memory usage explodes to 10x that of pure Python. Where does the extra memory go?

C extensions can hold extra memory: (1) reference cycles: C code creates cycles (A holds B, B holds A); these are only reclaimed by the cyclic GC, and only if the extension type implements `tp_traverse`/`tp_clear` — otherwise they leak. (2) double accounting: each numpy array is a C buffer plus a Python object wrapper, and both count toward memory. (3) memory leaks: C code forgot a `Py_DECREF()` or `PyMem_Free()`, and the leaks accumulate. (4) alignment padding: C structs carry padding for alignment, using more memory than the Python equivalent. Solutions: (1) profile with tracemalloc: `tracemalloc.take_snapshot()` shows the top allocators. (2) check for leaks by running valgrind on the C code. (3) use numpy arrays directly — they're already optimized. (4) ensure every `Py_INCREF` is balanced by a `Py_DECREF`. For a 10x explosion at 1M arrays, a C-side leak or reference cycles are the likely cause. Measure a baseline: 1M arrays in pure Python vs 1M through the extension; if the extension uses 10x, debug the C code for leaks.
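A quick way to see where the extra memory is attributed, sketched with `tracemalloc` (the `bytearray` allocation is a pure-Python stand-in for the extension's arrays):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for "1M arrays": allocate something traceable.
arrays = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
# Top allocation sites by net size since the first snapshot
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

Note the diagnostic value of what tracemalloc *cannot* see: it only tracks allocations made through Python's allocators. If the process RSS grows but tracemalloc shows nothing, the memory is being grabbed with raw `malloc()` on the C side — which is precisely when to reach for valgrind.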

Follow-up: How do you detect and fix memory leaks in C extensions using valgrind?

ctypes FFI with error handling: the C functions return error codes, but ctypes ignores them by default. You check return values manually, but it's tedious. How do you automate error handling?

ctypes doesn't auto-raise on error codes. Solutions: (1) set an error-checking hook: `lib.func.errcheck = error_handler`, where `error_handler` raises an exception if the return code indicates an error. (2) wrapper function: `def func_checked(...):` call `lib.func(...)`, raise `RuntimeError` if the result is negative, otherwise return it. (3) declare `argtypes`/`restype` for type safety: `lib.func.argtypes = [c_int]; lib.func.restype = c_int`. (4) for C functions with output parameters, use `ctypes.byref`: `output = c_int(); lib.func(ctypes.byref(output))`. Best: implement one `error_handler(result, func, args)` that raises `OSError` when `result < 0`, then assign it to `errcheck` on every bound function — all calls are then checked automatically.
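A minimal `errcheck` sketch against libc's `close`, which returns -1 and sets errno on failure (the negative-return convention is an assumption about your library; adapt the predicate to its API):

```python
import ctypes
import ctypes.util
import os

# use_errno=True makes ctypes capture errno around each call
libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)

def error_handler(result, func, args):
    # Assumed convention: negative return signals failure, errno has details
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result

libc.close.argtypes = [ctypes.c_int]
libc.close.restype = ctypes.c_int
libc.close.errcheck = error_handler

# Every call is now checked automatically:
# libc.close(-1) raises OSError (EBADF) instead of silently returning -1.
```

The same three lines (`argtypes`, `restype`, `errcheck`) can be applied in a loop over every function the library exports, so the check is written exactly once.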

Follow-up: How do you implement a ctypes wrapper that auto-converts C error codes to Python exceptions?

Cython function releases GIL via `with nogil:`, allowing parallelism. But profiling shows single-threaded performance is worse (30% slower) than pure Python. Why?

Releasing the GIL has a cost: each acquire/release pair costs on the order of 1µs. If the function is short and fast, that overhead dominates. Solutions: (1) only release the GIL around long-running operations (>1ms); for short operations it isn't worth it. (2) move the release to the outer loop: release once around 1000 iterations, not 1000 times. (3) benchmark the function with and without `with nogil:`; if it runs in well under a millisecond, the release can cost more than it saves. (4) profile where the time goes: if most of it is GIL bookkeeping rather than the loop body, restructure. For parallelism to outweigh the release cost, the nogil region typically needs to run for >10ms. Test: implement both variants and benchmark single- and multi-threaded; if single-threaded is slower with nogil, don't use it for this function.
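The key restructuring sketched in Cython (function names hypothetical): release once around the batch, not once per element.

```cython
cimport cython

# Anti-pattern: pays the acquire/release cost on every iteration
@cython.boundscheck(False)
def scale_slow(double[:] data, double factor):
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        with nogil:
            data[i] *= factor

# Better: one release around the whole loop; the body must be pure C
# (no Python objects) to compile under nogil
@cython.boundscheck(False)
def scale(double[:] data, double factor):
    cdef Py_ssize_t i
    with nogil:
        for i in range(data.shape[0]):
            data[i] *= factor
```

The second form pays the GIL cost once per call, so the per-element work is free to be tiny; it is also the shape that lets several threads run `scale` concurrently on disjoint arrays.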

Follow-up: How do you decide when to release the GIL in Cython for optimal parallelism?

You're embedding Python in a C application. The C code calls back into a Python function (e.g. through a ctypes callback). If the Python code raises an exception, C never sees it — the call silently fails. How do you propagate errors?

Python exceptions don't propagate across the C boundary on their own. Solutions: (1) wrap the calls: `def safe_call(...):` run the function in try/except and return an error value on failure; the C side checks for it. (2) with ctypes callbacks specifically, an uncaught exception is only printed to stderr and a default return value is substituted — so always catch inside the callback and map to an error code. (3) exception translator: Cython can translate Python exceptions into C error returns automatically (declare `cdef int f() except -1`). (4) logging: if C can't act on Python errors, at least log all exceptions on the Python side for debugging. (5) separate processes: if C->Python calls are frequent, move the Python logic into a separate service and use sockets/IPC instead of direct calls. Best: wrap the Python calls, return an error struct/code to C, and let C handle it. Test: make sure an exception raised in Python is visible to C (via error code or log).
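A sketch of the wrap-and-return-a-status pattern on the Python side (the `(status, payload)` convention is an assumption; the C caller only has to check the integer):

```python
def safe_call(func, *args):
    """Run func, mapping any Python exception to a C-friendly status tuple.

    Returns (0, result) on success, (-1, "ExcType: message") on failure.
    """
    try:
        return 0, func(*args)
    except Exception as exc:
        return -1, f"{type(exc).__name__}: {exc}"
```

When the entry point is a `ctypes.CFUNCTYPE` callback, a catch-everything wrapper like this is essential: an exception that escapes the callback is only printed to stderr and a default value is returned to C, which is exactly the silent failure described above.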

Follow-up: How do you implement an exception translation layer between Python and C?

Cython extension uses `cdef class` (C-level class). Instances are faster than Python classes but not as fast as hoped. Profiling shows time in attribute access (`self.x`). How do you optimize?

Cdef class attribute access is only cheap from typed Cython code; access from Python, or through properties, still goes through the attribute protocol. Solutions: (1) declare `cdef` attributes: `cdef public int x` stores the value as a C struct member, so access from Cython code compiles to a direct load (Python-level access still goes through a generated property). (2) avoid repeated attribute lookups in inner loops: hoist into a local, `cdef int x = self.x`, and use `x` inside the loop. (3) for Python-level classes (no cdef), use `__slots__`. (4) small methods should be inlined by the Cython compiler; mark helper `cdef` functions `inline` if needed. (5) profile attribute access separately to confirm it really is the bottleneck: 10M direct C member accesses should finish in milliseconds, while 10M accesses through the Python attribute protocol can take on the order of a second. If you're in the slow regime: cache the access outside the loop and use `cdef public` attributes instead of properties.
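Sketched as a `cdef class` (names hypothetical): the attributes live in the object's C struct, and the hot loop hoists them into C locals.

```cython
cdef class Particle:
    # Stored as C struct members; 'public' also generates a Python-level
    # property so pure-Python code can still read/write them
    cdef public double x, vx

    cpdef void advance(self, int steps, double dt):
        # Hoist attributes into C locals: two loads total,
        # instead of one load per iteration
        cdef double x = self.x, vx = self.vx
        cdef int i
        for i in range(steps):
            x += vx * dt
        self.x = x
```

From typed Cython code, `p.x` on a `Particle p` compiles to a struct field access; only untyped Python-side code pays the property cost.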

Follow-up: How do you optimize property access in Cython for high-frequency operations?

A ctypes library function is called in a tight loop, ~100M times. Even with the function reference cached, the overhead is significant. You consider Cython, but the integration is complex. Any other options?

At 100M calls, even ctypes' ~1µs per-call overhead adds up to ~100s of CPU time in pure boundary crossing. Options: (1) Cython, if the integration is feasible (10-100x faster per call). (2) move the loop into C entirely and call the C library from C code. (3) replace the C lib with numpy/scipy equivalents if they exist (sometimes slower per element, but vectorized). (4) batch calls: move the loop into C and call once per chunk instead of once per element — usually the fastest practical option. (5) async batching: collect items and process them in bulk. (6) PyO3 (Rust) extensions, a newer alternative to ctypes/Cython with fast FFI. Practically, batching wins: 100M individual calls cost ~100s of overhead, but batched into chunks of 100 there are only 1M boundary crossings — ~1s of overhead, a 100x reduction in boundary cost from batching alone.
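A pure-Python sketch of the refactor (both underscored functions are placeholders: `_scalar_call` stands in for the ctypes-bound function, `_batch_call` for a C entry point that loops internally). The scalar API is preserved; callers that can supply a sequence get the chunked fast path:

```python
def _scalar_call(x):
    return x * 2                        # placeholder for lib.func(x)

def _batch_call(xs):
    return [x * 2 for x in xs]          # placeholder for lib.func_batch(arr, n)

def func(x):
    """Unchanged public API: one value in, one value out."""
    return _scalar_call(x)

def func_many(xs, chunk=4096):
    """Added bulk path: one FFI crossing per chunk instead of per value."""
    out = []
    for i in range(0, len(xs), chunk):
        out.extend(_batch_call(xs[i:i + chunk]))
    return out
```

In a real binding, `_batch_call` would pass a contiguous buffer (e.g. a numpy array or ctypes array) to a C function that runs the loop, so the per-element boundary cost disappears while existing single-value callers keep working.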

Follow-up: How do you refactor code to batch C library calls without changing the API?
