Your Redis production cluster runs with requirepass "plaintext-password". An attacker sniffs the network, captures the password, SSHes to a Redis node, and authenticates with redis-cli AUTH using the stolen password. What went wrong, and how do you fix it?
Multiple failures: (1) the password crosses the network in plaintext (a sniffer's dream), (2) a single shared password for all users means no granular access control, (3) no TLS encryption.

Fix with ACL + TLS: (1) ACL (access control lists): instead of requirepass, define users with scoped permissions. ACL rules are applied left to right, so the deny-all must come first: ACL SETUSER app_user on >strong-password -@all +get +set ~*. This limits app_user to GET/SET only and blocks destructive commands like FLUSHALL. (2) TLS encryption: set tls-port 6380 in redis.conf and provide tls-cert-file/tls-key-file (plus tls-ca-cert-file if you want mutual auth). Clients then use an encrypted channel, so sniffing fails; set port 0 to disable the plaintext listener entirely. (3) disable the default user: ACL SETUSER default off (prevents legacy requirepass-style access). (4) rotate passwords periodically: ACL SETUSER app_user >new-password adds a password, <old-password removes the old one. (5) audit access: ACL LOG records denied commands and failed authentications; export it, then clear with ACL LOG RESET.

Prevention: (1) network isolation: Redis should only be reachable from trusted clients (same VPC/private network); use iptables or security groups to block external access. (2) separate users per app: instead of a single password, create app_user, reporting_user, admin_user, each with minimal permissions. (3) require TLS for everything: disable the plaintext port (port 0) and set tls-replication yes in redis.conf so replication links are also encrypted. (4) implement fail2ban: block IPs after N failed auth attempts. (5) monitor AUTH failures: use redis-cli ACL LOG and alert on repeated failures.

Implementation: (1) generate TLS certs: openssl genrsa -out redis.key 2048 && openssl req -x509 -new -key redis.key -out redis.crt -days 365. (2) update redis.conf: tls-port 6380, tls-cert-file /path/redis.crt, tls-key-file /path/redis.key. (3) create users (deny first, then grant): redis-cli ACL SETUSER app_user on -@all +@read +@write ~app:* (only keys matching app:*). (4) disable default: redis-cli ACL SETUSER default off. (5) test: redis-cli -p 6380 --tls --cert /path/client.crt --key /path/client.key --cacert /path/ca.crt, then AUTH app_user <password>
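The pieces above can be combined into a redis.conf plus aclfile sketch (the file paths and user names are illustrative assumptions, not the only valid layout):

```conf
# redis.conf — TLS-only listener with an external ACL file (illustrative paths)
port 0                                    # disable the plaintext listener
tls-port 6380
tls-cert-file /etc/redis/tls/redis.crt
tls-key-file  /etc/redis/tls/redis.key
tls-ca-cert-file /etc/redis/tls/ca.crt    # CA used to verify client certs
tls-replication yes                       # replication links also use TLS
aclfile /etc/redis/users.acl

# /etc/redis/users.acl — rules apply left to right: deny all, then grant
# user default off
# user app_user on >strong-password -@all +get +set ~app:*
```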
Follow-up: If you rotate TLS certificates and clients connect using old certs, what happens to active connections?
You implement ACL for your Redis cluster. One ACL rule is: "allow user app_user to run all commands except FLUSHALL". You write: ACL SETUSER app_user on +@all -FLUSHALL. But a colleague tests it and discovers app_user can still run FLUSHALL! The ACL rule seems broken. What's wrong?
The suspicion that "@all might not include FLUSHALL" is a red herring: @all really does include every command, and since ACL rules are evaluated left to right, +@all -FLUSHALL does deny FLUSHALL at the moment it is set. The likely causes are elsewhere: (1) the colleague authenticated as the default user, not app_user (a one-argument AUTH <password> against requirepass logs you in as default, which typically has full permissions). (2) ACL SETUSER calls are cumulative: a later ACL SETUSER app_user +@all (e.g., from a config-management rerun) appends after the old -FLUSHALL and re-grants it. (3) the runtime ACL and the aclfile/redis.conf diverged, and a restart or ACL LOAD restored a permissive definition. Diagnose with ACL WHOAMI (which user is the session actually using?), ACL GETUSER app_user (inspect the effective rules), and ACL DRYRUN app_user FLUSHALL (does the current rule set deny it?).
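The left-to-right evaluation can be sketched with a toy model (this is an illustration of the ordering semantics, not Redis's actual implementation) to show why a later cumulative +@all silently undoes an earlier -FLUSHALL:

```python
# Toy model of Redis ACL command rules: rules are applied in order,
# and the last rule that matches a command wins.
def can_run(rules, command):
    allowed = False
    for rule in rules:
        if rule in ("+@all", "-@all"):
            allowed = rule == "+@all"       # category rule resets the decision
        elif rule == "+" + command:
            allowed = True
        elif rule == "-" + command:
            allowed = False
    return allowed

# The intended rule set does deny FLUSHALL:
print(can_run(["+@all", "-flushall"], "flushall"))           # False
# A later cumulative +@all (a second ACL SETUSER call) re-grants it:
print(can_run(["+@all", "-flushall", "+@all"], "flushall"))  # True
```

This is why deny-all-then-grant (`-@all +get +set`) is the safer pattern: re-applying the same rule string is idempotent, whereas grant-all-then-deny breaks as soon as anything appends another +@all.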
Follow-up: If you need different users to have different permissions per key prefix (e.g., app_a can access app_a:* but not app_b:*), how would you set this up at scale?
Your Redis cluster uses TLS with client certificates for mutual authentication. During certificate renewal, old certs are replaced with new ones on Redis. But existing client connections using old certs are abruptly disconnected (EOF). This causes production outage. How do you rotate TLS certificates without disrupting connections?
TLS certificate rotation in Redis does not require a restart: since Redis 6, tls-* parameters can be changed at runtime (e.g., CONFIG SET tls-cert-file /path/new.crt), and because the TLS handshake happens only at connection time, swapping the server certificate does not tear down established connections; only new handshakes see the new cert. Rotation outages usually come from restarting nodes, or from replacing the CA used for mutual auth so that clients still presenting old certs fail on reconnect. Zero-downtime rotation: (1) dual-CA trust window: concatenate the old and new CA certs into the file referenced by tls-ca-cert-file, so Redis accepts client certs signed by either CA; roll clients to new certs gradually, then remove the old CA from the bundle. (2) hot reload instead of restart: apply new server certs with CONFIG SET tls-cert-file / tls-key-file and persist with CONFIG REWRITE. (3) blue/green migration: deploy a new Redis instance with new certs alongside the old one, move clients over gradually, then shut down the old instance. (4) client-side resilience: use client libraries with connection pooling and auto-reconnect on TLS errors, so any forced reconnects are fast rather than requiring an app restart. (5) terminate TLS at a load balancer/proxy: clients connect to the LB, the LB connects to Redis, and certs on each hop rotate independently. Prevention: (1) automate rotation well before expiry so it never becomes an emergency; long validity periods (2-3 years) reduce rotation frequency, at the cost of exercising the procedure less often. (2) plan rotations: schedule a maintenance window and notify client teams. (3) after new certs are loaded, drain old connections gradually (e.g., CLIENT KILL in batches, or rolling node restarts in a cluster) so clients reconnect to healthy nodes. (4) use certificate pinning on clients carefully: pinned clients must learn the new fingerprint before rotation, or they will refuse to reconnect. Implementation: (1) in staging, rehearse the rotation: apply new certs, verify clients reconnect successfully, measure any error window. (2) in production, during rotation: (a) update LB/proxy certs first, (b) then rotate Redis certs node by node (if clustered), (c) monitor client error rates and alert if connection failures exceed 0.1%.
(3) test client resilience: run redis-benchmark --tls --cert/--key/--cacert against a node during a rehearsal rotation and compare throughput to baseline.
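Expiry monitoring (and hence rotation scheduling) can be scripted with only the standard library; a minimal sketch, assuming you already have the cert's notAfter string (e.g., from `openssl x509 -noout -enddate` or a parsed peer certificate):

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after is an OpenSSL-style timestamp such as 'Jun  9 12:00:00 2031 GMT'
    (the format of the notAfter field in a parsed certificate)."""
    expiry = ssl.cert_time_to_seconds(not_after)  # parse to epoch seconds (UTC)
    now = time.time() if now is None else now
    return (expiry - now) / 86400.0

# Fixed "now" so the example is deterministic:
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2031 GMT")
print(days_until_expiry("Jan 31 00:00:00 2031 GMT", now))  # 30.0
```

A cron job comparing this value against a threshold (say, 30 days) and alerting gives you the early warning that makes the gradual dual-CA rotation above possible.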
Follow-up: If a client cert expires and is not renewed, how would you detect and force reconnection?
Your Redis cluster uses ACL with per-command logging (ACL LOG). After 1 week, ACL LOG has 10M entries (every denied command is logged). This is creating significant memory overhead. But you need audit trails for compliance. How do you balance logging and performance?
ACL LOG is useful for auditing but consumes memory: it records every denied command and failed authentication, and 10M entries at ~100 bytes each is on the order of 1GB. Problems: (1) memory overhead, (2) older events are lost once the log is trimmed, which conflicts with compliance. Solutions: (1) bound the log: CONFIG SET acl-log-max-len 1000 keeps only the most recent 1000 entries (the log is a capped buffer, so memory is bounded). (2) external audit log: periodically read entries with ACL LOG <count>, stream them to an external system (ELK, Splunk, CloudWatch) or export to a file and upload to S3, then clear with ACL LOG RESET. Note the export must run client-side (e.g., a cron job using redis-cli); a Lua script inside Redis cannot push entries to an external system. (3) sampling: if even the export volume is too high, ship only 1 in N denied events (accepting some data loss), with the sampling implemented in the exporter. (4) export on a schedule: fetch, ship, RESET every 10-60 minutes so memory stays bounded between runs. (5) Redis Enterprise offers managed audit logging with external storage; for standard Redis, combine (1) + (2). Prevention: (1) set acl-log-max-len early: CONFIG SET acl-log-max-len 1000. (2) monitor growth: periodically fetch ACL LOG and alert if the denial rate spikes. (3) test the impact: benchmark with the ACL log enabled vs disabled and measure the latency difference; if it is under ~5%, keep it on, otherwise rely entirely on external logging. Implementation: (1) in redis.conf: acl-log-max-len 1000. (2) exporter cron job: redis-cli ACL LOG 1000 > audit-export.txt && redis-cli ACL LOG RESET, then ship the file to the external store. (3) verify: ACL LOG should show only recent entries, with memory bounded at roughly 1000 * 100 bytes = ~100KB.
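The sampling idea in (3) works best when it is deterministic, so a given event is either always exported or always dropped regardless of which exporter instance sees it. A minimal sketch (the 1-in-100 rate and the entry-ID format are illustrative assumptions):

```python
import hashlib

def should_export(entry_id, rate=100):
    """Keep roughly 1 in `rate` events, chosen by hashing the entry's
    identity rather than calling a random generator, so the decision
    is stable across exporter restarts and replicas."""
    digest = hashlib.sha256(entry_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0

# Roughly 1% of a large set of distinct entry IDs should pass the filter:
kept = sum(should_export(f"acl-entry-{i}") for i in range(10_000))
print(kept)  # close to 100
```

For compliance, record the sampling rate alongside the exported events so auditors can extrapolate totals from the sampled stream.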
Follow-up: If you need to audit which user accessed which keys, how would you implement comprehensive access logging?
Your Redis cluster spans 2 data centers (DC1, DC2) connected via VPN. You configure TLS and ACL, but during a network partition, DC1 and DC2 lose connectivity. DC1 Redis continues running (replicas lose connection). However, TLS handshake times out during partition (cert validation times out?). After partition heals, some replicas can't reconnect to primary (TLS errors). Why?
TLS failures around a partition usually come down to: (1) clock skew: if DC1 and DC2 clocks drift far enough apart, certificate validity checks fail because the verifying node's local time falls outside the cert's notBefore/notAfter window. (2) cert rotation during the partition: if certs (or the CA) were rotated in DC1 while DC2 was cut off, DC2's replicas come back presenting certs the primary no longer accepts. (3) slow handshakes: a degraded VPN can push handshakes past replication timeouts (stock Redis does not perform online CRL/OCSP revocation checks during the handshake, so revocation lookups are unlikely to be the culprit). Fix: (1) clock sync: ensure NTP is healthy across DCs; ntpstat (or chronyc tracking) should show all nodes synchronized. (2) widen timeouts for the degraded link: raise repl-timeout so replication handshakes survive a slow network. (3) verify cert validity on each node: openssl x509 -in redis.crt -noout -dates, and confirm notBefore < now < notAfter against each node's local clock. (4) if certs diverged during the partition, redistribute a consistent cert/CA bundle to both DCs before reconnecting. (5) manual reconnect after the partition heals: on a stuck replica, redis-cli -p 6380 --tls REPLICAOF NO ONE, then REPLICAOF <primary-host> <port> to force a fresh handshake.
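The clock-skew failure mode is easy to reason about with a small validity-window check (a toy model; real validation happens inside the TLS library, but the rule it applies is this one):

```python
from datetime import datetime, timedelta

def cert_valid_at(not_before, not_after, local_now):
    """A cert is accepted only if the *verifier's* clock falls inside
    [notBefore, notAfter] — so skew on the verifying node, not the
    issuing node, is what breaks the handshake."""
    return not_before <= local_now <= not_after

not_before = datetime(2031, 1, 1)
not_after = datetime(2032, 1, 1)
true_now = datetime(2031, 1, 1, 0, 2)   # cert was issued two minutes ago

print(cert_valid_at(not_before, not_after, true_now))  # True
# A node whose clock runs 5 minutes slow "sees" a cert from the future:
print(cert_valid_at(not_before, not_after, true_now - timedelta(minutes=5)))  # False
```

This is why a freshly rotated cert plus modest clock drift is enough to strand a replica: the cert is valid everywhere except on the node whose clock lags its notBefore time.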
Follow-up: If TLS certs use self-signed certificates and clients need to validate them, how would you distribute the CA cert to all clients?
You discover a security issue: a former employee still has access to Redis (old credentials on their laptop). They could potentially access or modify data. You immediately revoke their ACL user with ACL DELUSER. Is that enough to cut off their access?
Revoking a user with ACL DELUSER doesn't kill existing connections—it only prevents new authentications. The former employee's already-authenticated connection stays alive and can keep executing commands. To fully revoke: (1) delete (or disable) the user: ACL DELUSER <user>, or ACL SETUSER <user> off. (2) kill their live connections: CLIENT KILL USER <user> (available in recent Redis versions) terminates every connection authenticated as that user; alternatively, find their connections with CLIENT LIST (each entry includes a user= field) and kill them by id or addr. (3) rotate any shared credentials they may have seen: requirepass, other users' passwords, TLS client keys. (4) review ACL LOG and application logs for suspicious activity, and persist the ACL change (ACL SAVE with an aclfile, or CONFIG REWRITE) so it survives a restart.
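Finding the revoked user's live connections means parsing CLIENT LIST, whose output is one line per connection of space-separated key=value fields. A minimal parser sketch (the sample line below mimics that format; real output has many more fields):

```python
def connections_for_user(client_list_output, user):
    """Return (id, addr) pairs for every connection authenticated as `user`,
    given the raw text output of CLIENT LIST."""
    matches = []
    for line in client_list_output.strip().splitlines():
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if fields.get("user") == user:
            matches.append((fields["id"], fields["addr"]))
    return matches

# Abridged sample of the key=value format produced by CLIENT LIST:
sample = (
    "id=7 addr=10.0.0.5:52114 name= user=app_user cmd=get\n"
    "id=9 addr=10.0.0.9:41822 name= user=ex_employee cmd=keys\n"
)
print(connections_for_user(sample, "ex_employee"))  # [('9', '10.0.0.9:41822')]
```

Each returned id or addr can then be fed to CLIENT KILL ID <id> / CLIENT KILL ADDR <addr> on versions that lack the USER filter.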
Follow-up: If you need to maintain detailed audit trails of who accessed which keys and when, how would you implement this without massive overhead?