Python Interview Questions

Profiling and Optimization


Bottleneck function takes 200ms per call. cProfile shows 80% in "other" (not attributed to functions). Flamegraph is flat. What's creating CPU time if not in functions?

cProfile can misattribute time when the work happens in C extension calls, syscalls (I/O, signals), or tight bytecode loops, or when profiler overhead itself skews results. Solutions: (1) use `py-spy` (sampling) instead of cProfile (instrumentation); sampling shows real CPU time without instrumentation overhead. (2) use `perf` (Linux): `perf record -F 99 python script.py; perf report` attributes CPU cycles at the instruction level, including C code. (3) isolate suspects with `timeit`: `timeit.timeit(critical_code, number=10000)`. (4) generate a flamegraph with `py-spy`: `py-spy record -o profile.svg -- python script.py`. If "other" is 80%: suspect syscalls (trace with `strace -c`), disk/network I/O (`iostat`, `iotop`), or C extensions. Measure hardware counters with `perf stat python script.py` (cache misses, branch mispredictions). For CPU-bound code: if L1 cache misses are high, optimize memory access patterns.
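Of these, `timeit` isolation is the only pure-Python step; a minimal sketch, where `critical_code` is a stand-in for your suspected hot spot:

```python
import timeit

def critical_code():
    # Stand-in for the suspected hot spot: tight arithmetic in bytecode,
    # which cProfile lumps into the calling frame rather than a function row.
    return sum(i * i for i in range(1000))

# Run it many times and report per-call cost, free of instrumentation overhead.
per_call = timeit.timeit(critical_code, number=10_000) / 10_000
print(f"{per_call * 1e6:.1f} µs per call")
```

Comparing this isolated number against cProfile's attribution tells you how much the profiler itself distorts the picture.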

Follow-up: How do you profile memory allocations to find where memory is spent?

Service optimized for throughput but latency p99 is high (500ms vs p50 50ms). Profiling shows optimization reduced average time but tail latency worse. What's wrong?

Throughput optimizations (batching, caching) can increase latency variance. Solutions: (1) measure percentiles, not averages: `numpy.percentile(times, 99)` for p99. (2) identify latency outliers: if 1% of requests hit a slow path, optimize that path separately. (3) plot a histogram of the latency distribution (most requests fast, a few slow). (4) check GC pauses: periodic p99 spikes every N seconds suggest garbage collection; tune thresholds with `gc.set_threshold`. (5) find what outliers share: conditionally log slow requests (`if elapsed > 0.2: log(stack, request_id)`) to see what they have in common. (6) rebalance: some throughput optimizations (large batches, deep queues) trade latency for throughput; shrink batch sizes if p99 matters. Principle: set a p99 SLO and optimize the outlier path specifically, not the average profile.
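A minimal sketch of point (1), with a hand-rolled nearest-rank percentile so it needs no NumPy; the latency distribution here is simulated (98% of requests near 50 ms, 2% near 500 ms):

```python
import random

random.seed(0)
# Simulated latencies in ms: mostly fast, a small slow tail.
times = ([random.gauss(50, 5) for _ in range(980)]
         + [random.gauss(500, 50) for _ in range(20)])

def percentile(data, p):
    """Nearest-rank percentile; numpy.percentile(data, p) is the usual tool."""
    s = sorted(data)
    k = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[k]

mean = sum(times) / len(times)
print(f"mean={mean:.0f}ms  p50={percentile(times, 50):.0f}ms  "
      f"p99={percentile(times, 99):.0f}ms")
```

The mean stays near 60 ms while p99 exposes the ~500 ms tail: averaging hides exactly the requests the question is about.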

Follow-up: How do you detect and isolate latency outliers without logging every request?

You optimize function A, reducing per-call time 50%. Benchmark shows end-to-end improvement only 5%. Where did the speedup go?

Amdahl's Law: if A is 2% of total time, a 50% reduction yields ~1% total improvement; if A is 20%, ~10%. Solutions: (1) measure each function's contribution first: use cProfile/py-spy to find which functions consume the most time, and optimize the highest-impact ones first. (2) apply Amdahl's Law: `speedup = 1 / ((1 - f) + f / s)`, where f is the fraction of time in A and s is A's local speedup. For s = 2 on 2% of the code: `1 / (0.98 + 0.02 / 2) ≈ 1.01` (only ~1% overall). (3) profile end-to-end: measure before/after on a real workload to detect other bottlenecks. A 5% end-to-end gain from halving A implies A was only ~10% of total time; profile to find the real bottleneck.
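The formula from point (2) as a one-function sketch:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of runtime is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

# 2x faster (s=2, i.e. a 50% time reduction) on 2% of runtime: ~1% overall.
print(f"{amdahl_speedup(0.02, 2):.4f}")   # ~1.0101
# The same local speedup on 20% of runtime: ~11% overall.
print(f"{amdahl_speedup(0.20, 2):.4f}")   # ~1.1111
```

Inverting it also answers "where did the speedup go": solve for f given the observed overall speedup to estimate how big A really was.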

Follow-up: How do you systematically prioritize optimization targets using profiling data?

Optimization adds complexity (caching, batching, concurrency). Code maintainability suffers. After optimization, bug fixes are 2x slower. Is the performance gain worth the maintenance cost?

Trade-off between performance and maintainability. Solutions: (1) measure both: performance gain (% improvement) vs maintainability cost (time to debug/fix). (2) simple optimizations first: algorithmic improvements (O(n²) -> O(n log n)) often gain more than complex tricks and stay readable. (3) Knuth: "premature optimization is the root of all evil"; only optimize if profiling shows a bottleneck AND the cost of the optimization is below its benefit. (4) defer: ship the simple version, then optimize when production metrics show the need. (5) isolate complex optimizations in a separate, well-documented module. (6) do the arithmetic: does the gain justify 2x slower maintenance? A 10% speedup probably doesn't; a 50% speedup might. Principle: optimize where it matters (hot paths), keep the rest simple.
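Point (2) in practice: a data-structure change often beats clever micro-tricks and stays readable. A small sketch comparing membership tests (sizes are arbitrary):

```python
import timeit

items = list(range(10_000))
needle = -1  # worst case: absent, so the list scan checks every element

# O(n) scan per lookup.
list_time = timeit.timeit(lambda: needle in items, number=1_000)

# One O(n) conversion, then O(1) average per lookup.
item_set = set(items)
set_time = timeit.timeit(lambda: needle in item_set, number=1_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The set version is both faster and no harder to maintain, which is the kind of optimization that passes the cost-benefit test trivially.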

Follow-up: How do you measure and track maintenance costs of complex optimizations?

Profiling in development shows function F takes 40% of CPU. In production, profiling data shows F takes only 5%. Why is production different?

Production workloads differ from dev. Common causes: (1) dev uses test data (smaller, simpler) while production data has realistic scale and distribution. (2) dev is single-process; production runs under multi-process load. (3) CPU cache behavior differs: small dev data may fit in cache, production data causes cache thrashing. (4) profiling overhead differs: instrumenting profilers inflate dev numbers, while production profiling is low-overhead sampling. Solutions: (1) profile with an (anonymized) production data snapshot loaded into dev; if F drops to 5%, the dev dataset was the issue. (2) run continuous profiling in production (low-overhead sampling with `py-spy` or Austin). (3) benchmark with a realistic workload in staging. For 40% dev vs 5% prod: most likely the dev dataset makes F artificially hot, and the real production work is elsewhere.
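A toy sketch of why a function's share of runtime shifts with data size: here `f` has a roughly fixed cost, so it dominates a tiny dev dataset but fades at production scale (the functions and dataset sizes are illustrative):

```python
import time

def f():
    # Fixed-cost work (e.g., setup or parsing) independent of input size.
    return sum(i * i for i in range(200_000))

def fraction_in_f(data):
    """Share of pipeline wall time spent inside f for this dataset."""
    t0 = time.perf_counter()
    f()
    t_f = time.perf_counter() - t0
    t1 = time.perf_counter()
    total = sum(x * x for x in data)  # per-record work scales with data
    t_rest = time.perf_counter() - t1
    return t_f / (t_f + t_rest)

dev = range(10_000)        # small test dataset
prod = range(5_000_000)    # production-scale dataset
print(f"dev: f={fraction_in_f(dev):.0%}  prod: f={fraction_in_f(prod):.0%}")
```

The same function, profiled against different workloads, legitimately reports very different percentages; neither profile is "wrong", they answer different questions.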

Follow-up: How do you set up low-overhead continuous profiling in production without customer impact?

After optimization, benchmark shows 30% improvement. Code review flags: "This is hard to read. Can you simplify?" Reverting to simple version loses gains. How do you manage review feedback?

Performance vs readability is a trade-off. Solutions: (1) document the optimization: comments explaining why it is complex, what problem it solves, and the measurements proving its value. (2) make the value concrete: "30% improvement = X seconds saved per day". (3) simplify without sacrificing gains: refactor the optimized code for readability while keeping the key optimizations. (4) split commits: keep the optimization separate from feature work so it is easier to review. (5) establish performance criteria as a team: "X% improvement required to justify added complexity" makes the threshold explicit. (6) propose alternatives: sometimes a simpler optimization achieves 15% with far better readability. Principle: document trade-offs clearly; if the optimization is measured and valuable, the complexity is acceptable when well-documented.
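One way to apply points (1) and (3): keep a readable reference implementation next to the optimized one, with a comment recording the measured gain. The functions, numbers, and benchmark mentioned below are hypothetical:

```python
import string

# PERF: the precomputed translation table measured ~30% faster than the
# per-character loop on our benchmark corpus (hypothetical measurement).
# strip_simple() is kept as the readable reference; tests assert equivalence.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_fast(text: str) -> str:
    return text.translate(_PUNCT_TABLE)

def strip_simple(text: str) -> str:
    """Readable reference implementation, used to validate strip_fast."""
    return "".join(c for c in text if c not in string.punctuation)

sample = "Hello, world! (test)"
assert strip_fast(sample) == strip_simple(sample)
print(strip_fast(sample))
```

Reviewers can read `strip_simple` to understand intent, and the equivalence assertion keeps the fast path honest as the code evolves.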

Follow-up: How do you document performance optimizations for future maintainers?

Service scales to 10x load (10M requests/hour). Performance profiling shows different bottleneck at 10x vs 1x load. Optimization for 1x doesn't help at 10x. Why?

Bottlenecks change with scale. At 1x: CPU-bound (algorithm). At 10x: memory contention, GC pauses, lock contention, network. Solutions: (1) profile at scale: run load tests at realistic peak (10x) and identify the actual bottleneck; don't optimize from low-load profiles. (2) apply Little's Law (L = λW, so latency W = average queue length / throughput): at 10x load, queueing effects dominate, so reduce queue depth and contention, not just per-request time. (3) identify the exhausted resource at 10x (CPU saturated, memory full, disk I/O maxed). (4) scale horizontally if possible: add servers. (5) profile at multiple load levels (1x, 5x, 10x) to find where the bottleneck shifts. Key insight: an optimization tuned at 1x can be irrelevant or even harmful at 10x; e.g., a cache that only pays off once disk I/O saturates looks like pure CPU and memory overhead at 1x.
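The queueing effect in point (2) can be sketched with the textbook M/M/1 formula (a deliberate simplification: one server, Poisson arrivals). Latency explodes as utilization nears 100%, which per-request profiling at 1x never shows:

```python
def mm1_latency(service_rate, arrival_rate):
    """Mean time in system W = 1 / (mu - lambda); requires lambda < mu."""
    assert arrival_rate < service_rate, "unstable queue: lambda >= mu"
    return 1 / (service_rate - arrival_rate)

mu = 1000.0                        # requests/sec one worker can serve
for lam in (100, 500, 900, 990):   # roughly 1x load up to ~10x load
    w_ms = mm1_latency(mu, lam) * 1000
    print(f"lambda={lam:>4}/s  utilization={lam / mu:.0%}  latency={w_ms:6.1f} ms")
```

At 10% utilization latency is ~1 ms; at 99% it is ~100 ms with the same per-request cost, which is why shaving per-request time at 1x barely moves the 10x numbers.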

Follow-up: How do you design benchmarks to predict bottlenecks at future scale?

Optimization reduces function call overhead by 50% via manual inlining. Benchmark shows 2% end-to-end improvement. After optimization, code is harder to refactor. Is inlining worth it?

Manual inlining for a 2% gain is usually not worth the maintenance cost. Better options: (1) rely on compilers/JITs: CPython does not inline Python-level calls, but PyPy's JIT does, and Cython/C extensions get compiler inlining at `-O2`. (2) check the arithmetic: a 50% reduction on a function that is 4% of total time yields 50% × 4% = 2% overall, exactly as observed; look for functions that are 20%+ of total time. (3) algorithmic improvements usually gain more than micro-optimizations. (4) weigh the trade-off: 2% speedup vs maintainability cost is not worth it for most projects; for latency-critical domains (HFT, gaming), maybe. Principle: inline only if (a) the function is a bottleneck (>10% of time), (b) inlining gains >10%, and (c) the code stays readable (e.g., via code generation).
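A sketch of what manual inlining buys in CPython, and why the gain is capped by the function's share of total runtime (function names are illustrative):

```python
import timeit

def square(x):
    return x * x

def summed_with_calls(n):
    # One Python-level call per element: each call pays frame-setup cost.
    return sum(square(i) for i in range(n))

def summed_inlined(n):
    # Manually inlined body: no per-element call overhead, but the named
    # abstraction (and its single point of change) is gone.
    return sum(i * i for i in range(n))

t_call = timeit.timeit(lambda: summed_with_calls(10_000), number=100)
t_inline = timeit.timeit(lambda: summed_inlined(10_000), number=100)
print(f"with calls: {t_call:.3f}s  inlined: {t_inline:.3f}s")
```

The inlined loop is measurably faster per call, but if this loop is only a few percent of the program, Amdahl's Law shrinks that to the ~2% end-to-end result described above.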

Follow-up: How do you identify functions worth manual inlining vs compiler optimization?
