Python Interview Questions

Security and Common Vulnerabilities


Production Scenario Interview Questions

Your API accepts user input and generates SQL: `query = f"SELECT * FROM users WHERE id = {user_id}"`. Code review flags SQL injection risk. You argue: "We're already validating user_id as integer in the request schema." Why is this argument incomplete?

Input validation alone isn't a sufficient defense against SQL injection.

Issues: (1) validation is a gate, not a guarantee—if it is ever bypassed, incomplete, or removed in a refactor, the f-string still builds executable SQL, (2) validation must be applied everywhere, not just at the API boundary—if internal code constructs SQL from unvalidated sources, it's vulnerable, (3) string interpolation defeats parameterization—the database driver can't distinguish user data from SQL commands.

Solutions: (1) always use parameterized queries (prepared statements): `cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))` instead of f-strings, (2) use an ORM (SQLAlchemy, Django ORM), which parameterizes by default, (3) treat validation as defense-in-depth: validate AND parameterize, (4) when validating, be strict: `int(user_id)` for integers, regexes for strings, allowlists for enumerated fields like status, (5) use static analysis tools (bandit, semgrep) to catch f-string SQL at code review time, (6) test with injection payloads: `id = "1 OR 1=1"` should be rejected by validation, and even if it slips through it must be bound as a value, never executed as SQL.

Example secure code: `db.execute(text("SELECT * FROM users WHERE id = :id"), {"id": user_id})`. Never rely on validation alone—parameterization is the primary defense, validation the secondary one. SQLAlchemy parameterizes by default; never fall back to string interpolation, even "if validated."
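A minimal sketch of the difference, using the stdlib's sqlite3 driver (the `users` table and values are made up for the demo): the payload is bound as a value, so it is compared as a literal rather than parsed as SQL.

```python
import sqlite3

# In-memory database with a toy users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

def get_user(user_id):
    # Parameterized query: the driver binds user_id as data, so a payload
    # like "1 OR 1=1" cannot change the query's structure.
    return conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()

print(get_user(1))            # the single matching row
print(get_user("1 OR 1=1"))   # empty list: the payload matched nothing
```

With the vulnerable f-string version, the same payload would return every row; here it is just a string that equals no `id`.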

Follow-up: What's the difference between parameterized queries and validation? Can you parameterize dynamic table/column names, or only values? What does parameterization look like in different databases (PostgreSQL, MySQL, SQLite)?

Your file upload handler saves user files to disk: `open(f"/uploads/{filename}", "wb").write(data)`. Security team flags path traversal risk. You respond: "Filename comes from Content-Disposition header, and we validate it's alphanumeric." Is this safe?

Path traversal attacks use encoded or special characters to escape the target directory.

Issues: (1) "alphanumeric" checks are rarely strict in practice—once dots and slashes are admitted so that `file.pdf` passes, payloads like `../../../etc/passwd` or `....//....//etc/passwd` can slip through, and URL encoding (`%2e%2e%2f`) bypasses naive string checks, (2) Unicode normalization: `café.txt` may normalize differently on different filesystems, enabling bypasses, (3) path separators differ: on Windows `..\` works where a check only looks for `../`, (4) symlinks: an attacker uploads a symlink pointing outside the /uploads directory.

Solutions: (1) join safely and strip directory components: `safe_path = os.path.join("/uploads", os.path.basename(filename))`, (2) validate with a strict allowlist: `re.match(r'^[a-zA-Z0-9._-]+$', filename)`, (3) best practice—ignore the user filename entirely and generate one: `import uuid; safe_name = f"{uuid.uuid4().hex}.pdf"`, (4) verify the final resolved path stays inside /uploads: `os.path.realpath(final_path).startswith("/uploads/")`, (5) reject symlinks: check with `os.path.islink()` or refuse to follow them, (6) note that pathlib joining does NOT validate traversal—`Path("/uploads") / "../etc/passwd"` happily escapes, so always resolve and re-check the result, (7) test with payloads: `../`, `..\`, `....//....//`, URL-encoded `%2e%2e%2f`, and verify your validation catches them.

Example secure code: `safe_name = uuid.uuid4().hex; final_path = Path("/uploads") / safe_name; assert final_path.resolve().parent == Path("/uploads")`. Never trust filenames—regenerate them or enforce strict allowlists.
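The resolve-then-recheck approach can be sketched as follows (the `/uploads` directory and the `safe_path` helper are illustrative; `Path.is_relative_to` needs Python 3.9+):

```python
from pathlib import Path

UPLOAD_DIR = Path("/uploads")

def safe_path(filename):
    # Joining alone does NOT block traversal:
    # Path("/uploads") / "../etc/passwd" escapes the directory.
    candidate = (UPLOAD_DIR / filename).resolve()
    # resolve() collapses ".." segments; then confirm the result is
    # still inside the upload directory before touching the filesystem.
    if not candidate.is_relative_to(UPLOAD_DIR):
        raise ValueError(f"path traversal attempt: {filename!r}")
    return candidate

print(safe_path("report.pdf"))       # stays inside /uploads
# safe_path("../../etc/passwd")      # raises ValueError
```

Generating a `uuid4().hex` name instead of calling `safe_path` on user input sidesteps the problem entirely, which is why the answer above calls it the best practice.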

Follow-up: How do URL encoding and Unicode normalization defeat filename validation? What's the pathlib.Path approach to safe file handling? Should you ever use user-provided filenames?

Your web app uses pickle to deserialize cached objects: `cache.get(key) → pickle.loads(data)`. An attacker submits a crafted pickle payload and gains RCE. How does pickle enable code execution?

Pickle can execute arbitrary code because deserialization reconstructs arbitrary Python objects, including ones whose `__reduce__` method names a callable to invoke.

Vulnerability: (1) `pickle.loads()` rebuilds objects by calling constructors and `__setstate__` methods chosen by the byte stream, not by your code, (2) the REDUCE opcode pops a callable and its arguments off the stack and calls them—an attacker crafts a payload whose callable is `os.system`, `__import__`, or similar, (3) the attacker doesn't need your classes: callables reachable through the GLOBAL opcode are enough.

Solutions: (1) never unpickle untrusted data—NEVER, (2) use safer serialization formats: JSON, MessagePack, Protocol Buffers, (3) if you must use pickle, restrict it to trusted internal channels only, (4) implement a custom Unpickler that overrides `find_class()` to allowlist safe classes: `class RestrictedUnpickler(pickle.Unpickler): def find_class(self, module, name): if module not in ALLOWED_MODULES: raise pickle.UnpicklingError(...); return super().find_class(module, name)`, (5) use bandit or semgrep rules to catch `pickle.loads()` calls on user input.

Example vulnerable code: `pickle.loads(user_input)` → can execute arbitrary code. Example safe code: `json.loads(user_input)` → only deserializes JSON primitives, no code execution. For caching, use JSON or msgspec rather than pickle, unless the cache is strictly internal and untrusted input can never enter it. Test with pickle gadget payloads (ysoserial-style) and verify execution is blocked.
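A runnable sketch of the `find_class` allowlist from point (4), with a harmless stand-in gadget (the `SAFE` set and the `Evil` class are illustrative):

```python
import builtins
import io
import os
import pickle

# Only these builtins may be referenced by a pickle stream.
SAFE = {"dict", "list", "str", "int"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every GLOBAL reference in the stream; reject anything
        # outside the allowlist before it can be instantiated or called.
        if module == "builtins" and name in SAFE:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data round-trips fine (no GLOBAL opcodes involved):
restricted_loads(pickle.dumps({"user": 1}))

# A __reduce__ gadget serializes a call to os.system...
class Evil:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

# ...but unpickling it trips find_class before anything runs:
try:
    restricted_loads(pickle.dumps(Evil()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Note the asymmetry: `pickle.dumps(Evil())` succeeds (serializing the gadget is harmless); only `loads` is dangerous, which is why the restriction lives in the Unpickler.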

Follow-up: How do pickle gadgets work? What are ysoserial payloads? Can you safely restrict pickle to certain classes? What's the difference between pickle protocols 0-5?

Your app has a debug endpoint: `GET /debug?code=print(x)` that executes user code for live debugging. You add safeguards: remove dangerous imports (os, sys), whitelist built-ins. Attacker still gains RCE. What's the flaw?

Blacklisting is inherently fragile—attackers find bypasses.

Issues: (1) block `import os` and the attacker uses `__import__('os')` or `importlib.import_module('os')`, (2) block builtins and the attacker reaches them through dunder chains like `().__class__.__bases__[0].__subclasses__()`, (3) multiline code, comments, and encoding declarations enable obfuscation, (4) eval/exec is fundamentally unsafe—you cannot truly sandbox it at the Python level.

Solutions: (1) never allow user code execution in production—remove debug endpoints before deployment, (2) if live debugging is needed, use real debuggers (pdb, remote debuggers) behind authentication, not eval, (3) if evaluation is unavoidable (e.g., a DSL), use the RestrictedPython library: `compile_restricted(user_code, '<inline>', 'exec')` rejects dangerous constructs at compile time (raising SyntaxError on violations), and the result is executed with restricted globals, (4) run untrusted code in a sandboxed subprocess with OS-level isolation (seccomp, pledge, containers), not Python-level protection, (5) audit with static analysis: `ast.parse()` the code and reject dangerous node types before ever executing it, (6) allowlist operations rather than blacklisting: define exactly what's permitted (e.g., math operators only) and reject everything else.

Example: `from RestrictedPython import compile_restricted; code = compile_restricted(user_code, '<inline>', 'exec'); exec(code, {'__builtins__': {}}, {})` blocks imports and dunder access. Testing: try `__import__('os')`, `().__class__`, `exec`, `open` and verify they're blocked. Better: don't use eval at all—parse safe data formats (JSON, YAML) or write a small interpreter for the DSL.
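Point (6)—allowlist, don't blacklist—can be sketched with the stdlib alone: walk the AST and evaluate only the node types you explicitly permit, so `__import__`, attribute access, and calls are rejected by default rather than enumerated. (The `safe_eval` helper is illustrative, not a complete sandbox.)

```python
import ast
import operator

# The only operations this evaluator knows; everything else is rejected.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        # Names, calls, attributes, subscripts, comprehensions... all land here.
        raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 + 3 * 4"))              # 14
# safe_eval("__import__('os').system('id')")  # raises ValueError
```

Because the default branch rejects every unlisted node type, dunder-chain tricks like `().__class__` fail at the Attribute node without ever being named in a blacklist.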

Follow-up: How do Python dunder chains like ().__class__.__bases__[0].__subclasses__() bypass restrictions? What does RestrictedPython do? Is there a truly safe sandbox in Python?

Your authentication system stores passwords as SHA1(password). You argue: "SHA1 is a cryptographic hash, it's one-way." Security audit fails you. What's wrong with this approach?

A cryptographic hash is not a password hash.

Issues: (1) SHA1 is fast—designed for checksums, not passwords—so brute-force attacks are cheap, (2) SHA1 has known collision attacks and is deprecated for security use, (3) rainbow tables of precomputed SHA1(common_password) values crack common passwords almost instantly, (4) no salt: identical passwords hash to the same value, exposing reuse patterns and enabling precomputation, (5) no stretching: without a deliberately slow work factor, nothing limits the attacker's guess rate.

Solutions: (1) use dedicated password hashing algorithms: bcrypt, scrypt, argon2—designed for passwords, (2) bcrypt is the long-standing standard: `import bcrypt; hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt()); bcrypt.checkpw(password.encode(), hashed)`, (3) argon2 is the modern recommendation: `from argon2 import PasswordHasher; ph = PasswordHasher(); hashed = ph.hash(password); ph.verify(hashed, password)`, (4) both generate and embed a salt automatically, (5) tune work factors so hashing is deliberately slow: bcrypt rounds=12, argon2 time_cost=3 with memory_cost=65536 are common starting points, (6) never use MD5, SHA1, or plain SHA256 for passwords—use dedicated password hashing.

Example vulnerable code: `hashlib.sha1(password).hexdigest()` → broken. Example secure code: `argon2.PasswordHasher().hash(password)` → safe. Benchmark: a password hash should take on the order of 100-500ms per attempt (intentionally slow). Test with hashcat/john to confirm the hashes resist cracking. Migrate existing passwords opportunistically: on a user's next successful login, rehash with bcrypt/argon2 and store the new hash.
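For a dependency-free illustration of salting plus stretching, the stdlib's `hashlib.scrypt` (a memory-hard KDF) works; bcrypt/argon2-cffi offer similar store-and-verify shapes. The 16-byte salt prefix and the cost parameters here are illustrative choices, not tuned recommendations:

```python
import hashlib
import hmac
import os

def hash_password(password, *, n=2**14, r=8, p=1):
    # Fresh random salt per password: identical passwords get distinct hashes,
    # and precomputed rainbow tables become useless.
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=n, r=r, p=p)
    return salt + digest  # store the salt alongside the hash

def verify_password(password, stored, *, n=2**14, r=8, p=1):
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=n, r=r, p=p)
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(candidate, digest)

stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", stored))  # True
print(verify_password("hunter2", stored))                       # False
```

The cost parameters (`n`, `r`, `p`) are the stretching knob: raise them until a single hash takes the 100-500ms the answer above targets on your hardware.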

Follow-up: What's the difference between bcrypt, scrypt, and argon2? How do salts prevent rainbow tables? Should you ever upgrade password hashes after user login?

Your framework uses pickle for session storage: `session_data = pickle.dumps({'user_id': 123})` stored in cookie. User modifies the cookie and changes user_id to 999. How do you prevent session tampering?

Session data stored client-side without an integrity check is vulnerable to tampering.

Issues: (1) cookies are client-side data—the user can modify them freely, (2) pickle in a cookie is also a code execution vulnerability (a separate issue), (3) even with a safe format like JSON, an attacker can change values if no integrity check exists.

Solutions: (1) sign cookies with HMAC: `sig = hmac.new(secret_key, session_data, hashlib.sha256).hexdigest()` and verify with `hmac.compare_digest()` before trusting the data, (2) use framework support: Flask's `session` object signs its cookies for you, (3) use JWT (JSON Web Tokens) for stateless sessions: `jwt.encode({'user_id': 123}, secret_key, algorithm='HS256')` includes a signature that `jwt.decode()` verifies, (4) most secure: store sessions server-side—the cookie holds only a random session_id, the actual data lives in a server cache (Redis, database), (5) set HttpOnly and Secure flags on cookies to block JavaScript access and plaintext transport, (6) drop pickle entirely—use JSON or JWTs.

Example secure approach: `data = json.dumps({'user_id': 123}); sig = hmac.new(secret_key, data.encode(), hashlib.sha256).hexdigest(); cookie_value = f"{data}.{sig}"`, and on read, recompute the HMAC and check it with `hmac.compare_digest()` before using the data. Testing: modify the cookie and verify the signature check fails; modify user_id inside a JWT and verify decoding fails. Never use pickle for sessions—too risky.
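The sign-then-verify cycle can be sketched end to end with the stdlib (the key, payload encoding, and `"."` separator are illustrative choices; real apps load the key from configuration):

```python
import hashlib
import hmac
import json

SECRET_KEY = b"example-secret-key"  # hypothetical; never hardcode in practice

def sign_session(data):
    payload = json.dumps(data, separators=(",", ":")).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    # Hex-encode the payload so "." can safely separate it from the signature.
    return payload.hex() + "." + sig

def load_session(cookie):
    payload_hex, _, sig = cookie.rpartition(".")
    payload = bytes.fromhex(payload_hex)
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison; reject before parsing anything.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered session cookie")
    return json.loads(payload)

cookie = sign_session({"user_id": 123})
print(load_session(cookie))  # round-trips; any modified byte raises ValueError
```

Note this is signing, not encryption: the payload is readable by the client, but any change to it (say, user_id 123 → 999) breaks the HMAC check.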

Follow-up: How does HMAC signing prevent tampering? What's the difference between signing and encryption? Should session data ever be encrypted in cookies?

Your API logs user input for debugging: `logger.info(f"User submitted: {user_input}")`. An attacker submits a payload containing credit card numbers, and they're logged in plaintext. How do you prevent sensitive data in logs?

Logging sensitive data creates a secondary vulnerability.

Issues: (1) logs are often stored unencrypted on disk or in cloud storage and read by many developers, (2) a breached log store exposes everything ever logged, (3) compliance regimes (PCI-DSS, GDPR) forbid logging certain data, (4) detecting sensitive data in logs is hard—card numbers, SSNs, and tokens can hide in structured or unstructured payloads.

Solutions: (1) design code so sensitive data never reaches the logger in the first place, (2) redact at the logging layer: use structured logging (JSON) and strip known sensitive fields before writing, (3) use logging filters: subclass `logging.Filter`, rewrite `record.msg` through a redaction function in `filter()`, and attach it with `logging.getLogger().addFilter(...)`, (4) scan for patterns: `re.sub(r'\d{16}', 'XXXX-XXXX-XXXX-XXXX', log_message)` catches unformatted card numbers, (5) use PII detection libraries to identify and redact automatically, (6) never log request/response bodies verbatim—extract only the fields you need, (7) encrypt logs at rest (filesystem encryption, encrypted S3 buckets), (8) restrict log access with RBAC.

Example: `def redact(msg): return re.sub(r'\d{16}', 'XXXX-XXXX-XXXX-XXXX', msg); logger.info(redact(f"User data: {data}"))`. Better: structured logging that records only explicit fields: `logger.info("user_data", user_id=123)`—never pass the card through at all. Testing: log a test card number and verify it's redacted in the output. Audit: periodically search production logs for card/SSN patterns and confirm none appear.
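The filter approach from point (3) can be shown end to end (the 13-16 digit regex is a deliberately crude stand-in for real PII detection; it also only handles plain-string messages, not `%`-style args):

```python
import io
import logging
import re

CARD = re.compile(r"\b\d{13,16}\b")  # crude pattern: bare card-length digit runs

class RedactFilter(logging.Filter):
    def filter(self, record):
        # Rewrite the message in place, then let the record through.
        record.msg = CARD.sub("[REDACTED]", str(record.msg))
        return True

# Wire a logger to an in-memory stream so the effect is visible.
stream = io.StringIO()
logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler(stream))
logger.addFilter(RedactFilter())
logger.setLevel(logging.INFO)

logger.info("User submitted: 4111111111111111")
print(stream.getvalue())  # User submitted: [REDACTED]
```

Because the filter sits on the logger, every handler downstream (file, syslog, cloud shipper) sees only the redacted message.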

Follow-up: How do you implement automatic PII redaction in logging? What patterns should you detect? Should sensitive data ever be logged for debugging?

Your data processing pipeline reads Python config files with `exec(open('config.py').read())`. This is flexible and works well. Security team demands it be replaced. Why is exec() on config files dangerous?

exec() runs arbitrary code, even when the config file is nominally "trusted."

Issues: (1) anyone who can write the config file can run any code your process can—a compromised deploy artifact or an accidentally committed dev file becomes RCE, (2) config files are sometimes generated from user input (templates, GUIs, APIs), which turns config generation into code injection, (3) the config can import modules, so a malicious or compromised dependency executes at load time, (4) config-as-code is hard to audit: reviewers must reason about execution, not just data.

Solutions: (1) use data-only config formats: JSON, YAML, TOML—parsers deserialize data structures, never code, (2) Python example: `import json; config = json.load(open('config.json'))` only parses JSON, safe from code execution, (3) for Python literal syntax, `ast.literal_eval()` parses a single literal expression (dicts, lists, strings, numbers) and rejects calls and names—note it will not parse a typical config.py full of assignments, so the file must contain one literal (e.g., a dict), (4) use dedicated config libraries (hydra, pydantic) for parsing plus validation, (5) express conditional behavior with environment variables or flags, not executable config, (6) keep config data-only; if code must vary, generate it from the data.

Example vulnerable: `exec(open('config.py').read())`. Example safe: `import json; config = json.load(open('config.json'))` or `import toml; config = toml.load('config.toml')`. Testing: inject code into a config file and verify it doesn't execute; use security scanning (bandit) to catch exec() calls. Migration: convert the config to JSON/YAML/TOML, load it with a parser, and pass the resulting dict to the application.
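A short sketch of the data-only alternatives (the config keys and values here are made up for illustration): both parsers can only produce data structures, so code smuggled into the file never runs.

```python
import ast
import json

# JSON: parses pure data; there is no way to express executable code.
cfg_text = '{"debug": false, "workers": 4, "hosts": ["a.example", "b.example"]}'
config = json.loads(cfg_text)
print(config["workers"])  # 4

# ast.literal_eval: accepts one Python literal expression (dicts, lists,
# tuples, strings, numbers, booleans, None)...
print(ast.literal_eval("{'workers': 4}"))  # {'workers': 4}

# ...and rejects names and calls, so injection attempts fail to parse:
try:
    ast.literal_eval("__import__('os').system('id')")
except ValueError:
    print("rejected: not a literal")
```

This is the practical difference from `exec()`: a hostile edit to the file can at worst change configuration values, never run commands.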

Follow-up: What's the difference between exec() and eval()? What does ast.literal_eval() allow/disallow? How do you securely parse Python-like config syntax?
