Your Redis production cluster runs with requirepass "plaintext-password". An attacker sniffs the network, captures the password, SSHes to a Redis node, and authenticates with redis-cli AUTH using the stolen password. What went wrong, and how do you fix it?
Multiple failures: (1) the password crosses the network in plaintext (a sniffer's dream), (2) a single shared password for all users means no granular access control, (3) no TLS encryption.

Fix with ACL + TLS: (1) ACL (access control lists): instead of requirepass, define users with scoped permissions. ACL rules are applied left to right, so the deny-all must come first: ACL SETUSER app_user on >strong-password -@all +get +set ~*. This limits app_user to GET/SET only and blocks destructive commands like FLUSHALL. (2) TLS encryption: set tls-port 6380 in redis.conf and provide tls-cert-file/tls-key-file (plus tls-ca-cert-file if you want mutual auth). Clients then use an encrypted channel, so sniffing fails; set port 0 to disable the plaintext listener entirely. (3) disable the default user: ACL SETUSER default off (prevents legacy requirepass-style access). (4) rotate passwords periodically: ACL SETUSER app_user >new-password adds a password, <old-password removes the old one. (5) audit access: ACL LOG records denied commands and failed authentications; export it, then clear with ACL LOG RESET.

Prevention: (1) network isolation: Redis should only be reachable from trusted clients (same VPC/private network); use iptables or security groups to block external access. (2) separate users per app: instead of a single password, create app_user, reporting_user, admin_user, each with minimal permissions. (3) require TLS for everything: disable the plaintext port (port 0) and set tls-replication yes in redis.conf so replication links are also encrypted. (4) implement fail2ban: block IPs after N failed auth attempts. (5) monitor AUTH failures: use redis-cli ACL LOG and alert on repeated failures.

Implementation: (1) generate TLS certs: openssl genrsa -out redis.key 2048 && openssl req -x509 -new -key redis.key -out redis.crt -days 365. (2) update redis.conf: tls-port 6380, tls-cert-file /path/redis.crt, tls-key-file /path/redis.key. (3) create users (deny first, then grant): redis-cli ACL SETUSER app_user on -@all +@read +@write ~app:* (only keys matching app:*). (4) disable default: redis-cli ACL SETUSER default off. (5) test: redis-cli -p 6380 --tls --cert /path/client.crt --key /path/client.key --cacert /path/ca.crt, then AUTH app_user <password>
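The pieces above can be combined into a redis.conf plus aclfile sketch (the file paths and user names are illustrative assumptions, not the only valid layout):

```conf
# redis.conf — TLS-only listener with an external ACL file (illustrative paths)
port 0                                    # disable the plaintext listener
tls-port 6380
tls-cert-file /etc/redis/tls/redis.crt
tls-key-file  /etc/redis/tls/redis.key
tls-ca-cert-file /etc/redis/tls/ca.crt    # CA used to verify client certs
tls-replication yes                       # replication links also use TLS
aclfile /etc/redis/users.acl

# /etc/redis/users.acl — rules apply left to right: deny all, then grant
# user default off
# user app_user on >strong-password -@all +get +set ~app:*
```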
Follow-up: If you rotate TLS certificates and clients connect using old certs, what happens to active connections?
You implement ACL for your Redis cluster. One ACL rule is: "allow user app_user to run all commands except FLUSHALL". You write: ACL SETUSER app_user on +@all -FLUSHALL. But a colleague tests it and discovers app_user can still run FLUSHALL! The ACL rule seems broken. What's wrong?
The suspicion that "@all might not include FLUSHALL" is a red herring: @all really does include every command, and since ACL rules are evaluated left to right, +@all -FLUSHALL does deny FLUSHALL at the moment it is set. The likely causes are elsewhere: (1) the colleague authenticated as the default user, not app_user (a one-argument AUTH <password> against requirepass logs you in as default, which typically has full permissions). (2) ACL SETUSER calls are cumulative: a later ACL SETUSER app_user +@all (e.g., from a config-management rerun) appends after the old -FLUSHALL and re-grants it. (3) the runtime ACL and the aclfile/redis.conf diverged, and a restart or ACL LOAD restored a permissive definition. Diagnose with ACL WHOAMI (which user is the session actually using?), ACL GETUSER app_user (inspect the effective rules), and ACL DRYRUN app_user FLUSHALL (does the current rule set deny it?).
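The left-to-right evaluation can be sketched with a toy model (this is an illustration of the ordering semantics, not Redis's actual implementation) to show why a later cumulative +@all silently undoes an earlier -FLUSHALL:

```python
# Toy model of Redis ACL command rules: rules are applied in order,
# and the last rule that matches a command wins.
def can_run(rules, command):
    allowed = False
    for rule in rules:
        if rule in ("+@all", "-@all"):
            allowed = rule == "+@all"       # category rule resets the decision
        elif rule == "+" + command:
            allowed = True
        elif rule == "-" + command:
            allowed = False
    return allowed

# The intended rule set does deny FLUSHALL:
print(can_run(["+@all", "-flushall"], "flushall"))           # False
# A later cumulative +@all (a second ACL SETUSER call) re-grants it:
print(can_run(["+@all", "-flushall", "+@all"], "flushall"))  # True
```

This is why deny-all-then-grant (`-@all +get +set`) is the safer pattern: re-applying the same rule string is idempotent, whereas grant-all-then-deny breaks as soon as anything appends another +@all.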
Follow-up: If you need different users to have different permissions per key prefix (e.g., app_a can access app_a:* but not app_b:*), how would you set this up at scale?
Your Redis cluster uses TLS with client certificates for mutual authentication. During certificate renewal, old certs are replaced with new ones on Redis. But existing client connections using old certs are abruptly disconnected (EOF). This causes production outage. How do you rotate TLS certificates without disrupting connections?
TLS certificate rotation in Redis does not require a restart: since Redis 6, tls-* parameters can be changed at runtime (e.g., CONFIG SET tls-cert-file /path/new.crt), and because the TLS handshake happens only at connection time, swapping the server certificate does not tear down established connections; only new handshakes see the new cert. Rotation outages usually come from restarting nodes, or from replacing the CA used for mutual auth so that clients still presenting old certs fail on reconnect. Zero-downtime rotation: (1) dual-CA trust window: concatenate the old and new CA certs into the file referenced by tls-ca-cert-file, so Redis accepts client certs signed by either CA; roll clients to new certs gradually, then remove the old CA from the bundle. (2) hot reload instead of restart: apply new server certs with CONFIG SET tls-cert-file / tls-key-file and persist with CONFIG REWRITE. (3) blue/green migration: deploy a new Redis instance with new certs alongside the old one, move clients over gradually, then shut down the old instance. (4) client-side resilience: use client libraries with connection pooling and auto-reconnect on TLS errors, so any forced reconnects are fast rather than requiring an app restart. (5) terminate TLS at a load balancer/proxy: clients connect to the LB, the LB connects to Redis, and certs on each hop rotate independently. Prevention: (1) automate rotation well before expiry so it never becomes an emergency; long validity periods (2-3 years) reduce rotation frequency, at the cost of exercising the procedure less often. (2) plan rotations: schedule a maintenance window and notify client teams. (3) after new certs are loaded, drain old connections gradually (e.g., CLIENT KILL in batches, or rolling node restarts in a cluster) so clients reconnect to healthy nodes. (4) use certificate pinning on clients carefully: pinned clients must learn the new fingerprint before rotation, or they will refuse to reconnect. Implementation: (1) in staging, rehearse the rotation: apply new certs, verify clients reconnect successfully, measure any error window. (2) in production, during rotation: (a) update LB/proxy certs first, (b) then rotate Redis certs node by node (if clustered), (c) monitor client error rates and alert if connection failures exceed 0.1%.
(3) test client resilience: run redis-benchmark --tls --cert/--key/--cacert against a node during a rehearsal rotation and compare throughput to baseline.
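Expiry monitoring (and hence rotation scheduling) can be scripted with only the standard library; a minimal sketch, assuming you already have the cert's notAfter string (e.g., from `openssl x509 -noout -enddate` or a parsed peer certificate):

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """not_after is an OpenSSL-style timestamp such as 'Jun  9 12:00:00 2031 GMT'
    (the format of the notAfter field in a parsed certificate)."""
    expiry = ssl.cert_time_to_seconds(not_after)  # parse to epoch seconds (UTC)
    now = time.time() if now is None else now
    return (expiry - now) / 86400.0

# Fixed "now" so the example is deterministic:
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2031 GMT")
print(days_until_expiry("Jan 31 00:00:00 2031 GMT", now))  # 30.0
```

A cron job comparing this value against a threshold (say, 30 days) and alerting gives you the early warning that makes the gradual dual-CA rotation above possible.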
Follow-up: If a client cert expires and is not renewed, how would you detect and force reconnection?
Your Redis cluster uses ACL with per-command logging (ACL LOG). After 1 week, ACL LOG has 10M entries (every denied command is logged). This is creating significant memory overhead. But you need audit trails for compliance. How do you balance logging and performance?
ACL LOG is useful for auditing but consumes memory: it records every denied command and failed authentication, and 10M entries at ~100 bytes each is on the order of 1GB. Problems: (1) memory overhead, (2) older events are lost once the log is trimmed, which conflicts with compliance. Solutions: (1) bound the log: CONFIG SET acl-log-max-len 1000 keeps only the most recent 1000 entries (the log is a capped buffer, so memory is bounded). (2) external audit log: periodically read entries with ACL LOG <count>, stream them to an external system (ELK, Splunk, CloudWatch) or export to a file and upload to S3, then clear with ACL LOG RESET. Note the export must run client-side (e.g., a cron job using redis-cli); a Lua script inside Redis cannot push entries to an external system. (3) sampling: if even the export volume is too high, ship only 1 in N denied events (accepting some data loss), with the sampling implemented in the exporter. (4) export on a schedule: fetch, ship, RESET every 10-60 minutes so memory stays bounded between runs. (5) Redis Enterprise offers managed audit logging with external storage; for standard Redis, combine (1) + (2). Prevention: (1) set acl-log-max-len early: CONFIG SET acl-log-max-len 1000. (2) monitor growth: periodically fetch ACL LOG and alert if the denial rate spikes. (3) test the impact: benchmark with the ACL log enabled vs disabled and measure the latency difference; if it is under ~5%, keep it on, otherwise rely entirely on external logging. Implementation: (1) in redis.conf: acl-log-max-len 1000. (2) exporter cron job: redis-cli ACL LOG 1000 > audit-export.txt && redis-cli ACL LOG RESET, then ship the file to the external store. (3) verify: ACL LOG should show only recent entries, with memory bounded at roughly 1000 * 100 bytes = ~100KB.
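The sampling idea in (3) works best when it is deterministic, so a given event is either always exported or always dropped regardless of which exporter instance sees it. A minimal sketch (the 1-in-100 rate and the entry-ID format are illustrative assumptions):

```python
import hashlib

def should_export(entry_id, rate=100):
    """Keep roughly 1 in `rate` events, chosen by hashing the entry's
    identity rather than calling a random generator, so the decision
    is stable across exporter restarts and replicas."""
    digest = hashlib.sha256(entry_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0

# Roughly 1% of a large set of distinct entry IDs should pass the filter:
kept = sum(should_export(f"acl-entry-{i}") for i in range(10_000))
print(kept)  # close to 100
```

For compliance, record the sampling rate alongside the exported events so auditors can extrapolate totals from the sampled stream.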
Follow-up: If you need to audit which user accessed which keys, how would you implement comprehensive access logging?
Your Redis cluster spans 2 data centers (DC1, DC2) connected via VPN. You configure TLS and ACL, but during a network partition, DC1 and DC2 lose connectivity. DC1 Redis continues running (replicas lose connection). However, TLS handshake times out during partition (cert validation times out?). After partition heals, some replicas can't reconnect to primary (TLS errors). Why?
TLS failures around a partition usually come down to: (1) clock skew: if DC1 and DC2 clocks drift far enough apart, certificate validity checks fail because the verifying node's local time falls outside the cert's notBefore/notAfter window. (2) cert rotation during the partition: if certs (or the CA) were rotated in DC1 while DC2 was cut off, DC2's replicas come back presenting certs the primary no longer accepts. (3) slow handshakes: a degraded VPN can push handshakes past replication timeouts (stock Redis does not perform online CRL/OCSP revocation checks during the handshake, so revocation lookups are unlikely to be the culprit). Fix: (1) clock sync: ensure NTP is healthy across DCs; ntpstat (or chronyc tracking) should show all nodes synchronized. (2) widen timeouts for the degraded link: raise repl-timeout so replication handshakes survive a slow network. (3) verify cert validity on each node: openssl x509 -in redis.crt -noout -dates, and confirm notBefore < now < notAfter against each node's local clock. (4) if certs diverged during the partition, redistribute a consistent cert/CA bundle to both DCs before reconnecting. (5) manual reconnect after the partition heals: on a stuck replica, redis-cli -p 6380 --tls REPLICAOF NO ONE, then REPLICAOF <primary-host> <port> to force a fresh handshake.
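The clock-skew failure mode is easy to reason about with a small validity-window check (a toy model; real validation happens inside the TLS library, but the rule it applies is this one):

```python
from datetime import datetime, timedelta

def cert_valid_at(not_before, not_after, local_now):
    """A cert is accepted only if the *verifier's* clock falls inside
    [notBefore, notAfter] — so skew on the verifying node, not the
    issuing node, is what breaks the handshake."""
    return not_before <= local_now <= not_after

not_before = datetime(2031, 1, 1)
not_after = datetime(2032, 1, 1)
true_now = datetime(2031, 1, 1, 0, 2)   # cert was issued two minutes ago

print(cert_valid_at(not_before, not_after, true_now))  # True
# A node whose clock runs 5 minutes slow "sees" a cert from the future:
print(cert_valid_at(not_before, not_after, true_now - timedelta(minutes=5)))  # False
```

This is why a freshly rotated cert plus modest clock drift is enough to strand a replica: the cert is valid everywhere except on the node whose clock lags its notBefore time.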
Follow-up: If TLS certs use self-signed certificates and clients need to validate them, how would you distribute the CA cert to all clients?
You discover a security issue: a former employee still has access to Redis (old credentials on their laptop). They could potentially access or modify data. You immediately revoke their ACL user with ACL DELUSER. Is that enough to cut off their access?
Revoking a user with ACL DELUSER doesn't kill existing connections—it only prevents new authentications. The former employee's already-authenticated connection stays alive and can keep executing commands. To fully revoke: (1) delete (or disable) the user: ACL DELUSER <user>, or ACL SETUSER <user> off. (2) kill their live connections: CLIENT KILL USER <user> (available in recent Redis versions) terminates every connection authenticated as that user; alternatively, find their connections with CLIENT LIST (each entry includes a user= field) and kill them by id or addr. (3) rotate any shared credentials they may have seen: requirepass, other users' passwords, TLS client keys. (4) review ACL LOG and application logs for suspicious activity, and persist the ACL change (ACL SAVE with an aclfile, or CONFIG REWRITE) so it survives a restart.
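Finding the revoked user's live connections means parsing CLIENT LIST, whose output is one line per connection of space-separated key=value fields. A minimal parser sketch (the sample line below mimics that format; real output has many more fields):

```python
def connections_for_user(client_list_output, user):
    """Return (id, addr) pairs for every connection authenticated as `user`,
    given the raw text output of CLIENT LIST."""
    matches = []
    for line in client_list_output.strip().splitlines():
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if fields.get("user") == user:
            matches.append((fields["id"], fields["addr"]))
    return matches

# Abridged sample of the key=value format produced by CLIENT LIST:
sample = (
    "id=7 addr=10.0.0.5:52114 name= user=app_user cmd=get\n"
    "id=9 addr=10.0.0.9:41822 name= user=ex_employee cmd=keys\n"
)
print(connections_for_user(sample, "ex_employee"))  # [('9', '10.0.0.9:41822')]
```

Each returned id or addr can then be fed to CLIENT KILL ID <id> / CLIENT KILL ADDR <addr> on versions that lack the USER filter.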
Follow-up: If you need to maintain detailed audit trails of who accessed which keys and when, how would you implement this without massive overhead?