Redis Interview Questions

Keyspace Notifications and Event-Driven Patterns


Your system uses Redis keyspace notifications (CONFIG SET notify-keyspace-events KEA) to detect when keys expire. A handler subscribes to __keyevent@0__:expired and processes cleanup (e.g., delete from database, invalidate cache). After 1 week, you discover the expiration handler is processing only 10% of actual expirations. Some keys expire silently without triggering handlers. Why?

Keyspace notifications are fire-and-forget Pub/Sub, and the expired event is generated when Redis actually deletes the key, not when its TTL elapses. Causes: (1) Redis uses lazy deletion for performance: a key is only checked for expiration when it is accessed, so a key that just sits in memory past its TTL is not deleted (and fires no event) until something touches it or the active cycle finds it. (2) active expiration is sampling-based: a background cycle samples a small batch of volatile keys per run, so with millions of keys carrying TTLs, deletion, and therefore the notification, can lag far behind the nominal expiry time. (3) Pub/Sub has no persistence: if the notification handler is slow, disconnects briefly, or hits its output buffer limit, every event published in that window is simply gone.

Fix: (1) raise the active expiration frequency: CONFIG SET hz 50 (the default is 10). More cycles per second means expired keys are found and deleted sooner, at some CPU cost. (2) don't rely solely on notifications: supplement with active polling. Periodically SCAN the keyspace and run TTL on each key; merely accessing a logically expired key triggers lazy deletion, which in turn fires the expired event. (3) combine with application-side tracking: when storing a key, also ZADD it into a sorted set with the expiration timestamp as the score. Periodically run ZRANGEBYSCORE on that index from 0 to the current time to find due keys and process them. This gives guaranteed delivery because it never depends on a Pub/Sub message arriving. (4) use a background worker: instead of relying on real-time notifications, drain the sorted-set index every few minutes and process the due keys in batches.

Prevention: (1) monitor expirations: compare expired_keys in INFO stats against the handler-side processed count; alert if the delta exceeds 5% (a sign that notifications are being missed). (2) test notifications end to end: set a TTL on a test key, subscribe to __keyevent@0__:expired, wait past the TTL, verify the notification fires.

Implementation: (1) raise hz: CONFIG SET hz 50, then CONFIG REWRITE to persist. (2) add polling: redis-cli --scan --pattern "key:*", then TTL each returned key (the access alone forces lazy expiry and the event).
(3) batch cleanup: a Lua script that atomically drains the sorted-set index of entries whose expiration time has passed: EVAL 'local due = redis.call("ZRANGEBYSCORE", KEYS[1], 0, ARGV[1]); if #due > 0 then redis.call("ZREM", KEYS[1], unpack(due)) end; return due' 1 expiry:index <now-timestamp>.
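The application-side sorted-set tracking described above can be sketched as follows. This is a minimal, self-contained simulation: a plain dict stands in for the Redis sorted set so the example runs without a server, and the names (ExpiryIndex, expiry:index) are illustrative. With redis-py, add() would map to ZADD and drain_expired() to ZRANGEBYSCORE followed by ZREM.

```python
import time


class ExpiryIndex:
    """In-memory stand-in for a Redis sorted set keyed by expiration time.

    With a real Redis client the same operations would be roughly:
      add            -> r.zadd("expiry:index", {key: expire_at})
      drain_expired  -> r.zrangebyscore("expiry:index", 0, now) + r.zrem(...)
    """

    def __init__(self):
        self._scores = {}  # member -> score (absolute expiration timestamp)

    def add(self, key, ttl_seconds, now=None):
        now = time.time() if now is None else now
        # ZADD expiry:index <expire_at> <key>
        self._scores[key] = now + ttl_seconds

    def drain_expired(self, now=None):
        now = time.time() if now is None else now
        # ZRANGEBYSCORE expiry:index 0 <now>
        expired = [k for k, score in self._scores.items() if score <= now]
        for k in expired:  # ZREM expiry:index <key> ...
            del self._scores[k]
        return expired


idx = ExpiryIndex()
idx.add("session:1", ttl_seconds=10, now=100.0)
idx.add("session:2", ttl_seconds=60, now=100.0)
print(idx.drain_expired(now=115.0))  # prints ['session:1']
```

Because the index is driven by a stored timestamp rather than a delivered message, a periodic worker calling drain_expired() catches every expiration even if every Pub/Sub notification was lost.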

Follow-up: If you use keyspace notifications for cache invalidation and the notification is delayed (arrives 5 seconds after key expires), how would you handle stale cache?

You configure keyspace notifications with CONFIG SET notify-keyspace-events KEA (to detect all events). Within 1 day, Redis memory grows by 50% despite no data being added. The subscriptions and internal data structures are consuming extra memory. Why, and how do you optimize?

Keyspace notifications create overhead, and the growth is not in the data set but in client output buffers: (1) with KEA, every command that touches a key generates __keyspace__ and __keyevent__ messages; these are ordinary Pub/Sub messages queued in each subscriber's output buffer until the subscriber reads them. (2) at a high event rate (100K ops/sec) with subscribers that can't keep up, those buffers grow rapidly. (3) each event is roughly 100-500 bytes (channel name + payload); 100K events/sec * 500 bytes is ~50MB/sec of buffered data per slow subscriber, which quickly dominates memory.

Solution: (1) filter events: instead of KEA (all events), enable only the classes you need, e.g. CONFIG SET notify-keyspace-events "Ex" for expired keyevent notifications only, or "K$" for keyspace events on string commands. Fewer classes means far fewer messages. (2) optimize subscribers: make notification handlers fast; slow handlers are what cause server-side buffering. (3) scale subscribers: run multiple handlers in parallel to drain events faster. (4) cap the damage with client-output-buffer-limit: CONFIG SET client-output-buffer-limit "pubsub 50mb 25mb 60" disconnects a subscriber immediately if its buffer exceeds 50MB, or if it stays above 25MB continuously for 60 seconds (the default for the pubsub class is 32mb 8mb 60). (5) disable if not critical: if you're not actively using notifications, turn them off: CONFIG SET notify-keyspace-events "". (6) use an alternative: Redis Streams store events in the keyspace (XADD events MAXLEN ~ 100000 * ...) where consumers read at their own pace and the MAXLEN cap bounds memory, instead of pushing into per-client buffers.

Prevention: (1) monitor memory growth: check used_memory regularly; growth of more than ~1% per day without data additions warrants investigation. (2) measure the event rate: INFO commandstats shows cmdstat_publish call counts, and CLIENT LIST shows each client's output buffer usage (omem); very high rates argue for filtering or disabling. (3) test the config: enable notifications on staging with realistic load and measure the memory impact before production.

Implementation: (1) switch to filtered events: CONFIG SET notify-keyspace-events "Ex" (expirations only). (2) tighten the buffer limit: CONFIG SET client-output-buffer-limit "pubsub 50mb 25mb 60". (3) benchmark: measure throughput before/after. (4) monitor: set an alert on memory growth >1%/day.
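The hard/soft semantics of client-output-buffer-limit are easy to misread, so here is a small self-contained model of the policy (this simulates the check; it is not Redis code, and the class name is illustrative): disconnect immediately above the hard limit, or when the buffer has stayed above the soft limit for the full soft window.

```python
class OutputBufferPolicy:
    """Models Redis' client-output-buffer-limit check for the pubsub class:
    disconnect immediately when the buffer exceeds the hard limit, or when it
    stays above the soft limit continuously for soft_seconds."""

    def __init__(self, hard_bytes, soft_bytes, soft_seconds):
        self.hard = hard_bytes
        self.soft = soft_bytes
        self.soft_seconds = soft_seconds
        self._soft_since = None  # when the buffer first crossed the soft limit

    def observe(self, buffer_bytes, now):
        if buffer_bytes > self.hard:
            return "disconnect"            # hard limit: immediate
        if buffer_bytes > self.soft:
            if self._soft_since is None:
                self._soft_since = now     # start the soft-limit clock
            elif now - self._soft_since >= self.soft_seconds:
                return "disconnect"        # soft limit held for the window
        else:
            self._soft_since = None        # dipped below soft: clock resets
        return "ok"


# "pubsub 50mb 25mb 60" from the answer above:
policy = OutputBufferPolicy(50 * 2**20, 25 * 2**20, 60)
print(policy.observe(30 * 2**20, now=0))    # above soft, clock starts: ok
print(policy.observe(30 * 2**20, now=61))   # above soft for 60s: disconnect
print(policy.observe(60 * 2**20, now=62))   # above hard: disconnect
```

The key point the model makes explicit: a buffer that hovers just under the soft limit, or dips below it periodically, never triggers a disconnect, which is exactly how a persistently slow subscriber can accumulate memory.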

Follow-up: If you disable keyspace notifications to save memory but still need expiration events, what's an alternative mechanism?

You implement an event-driven architecture where Redis keyspace notifications trigger external webhooks. When a key is modified, a __keyspace@0__ event fires, and your handler calls an external API (webhook). If the API is slow or down, the handler blocks. During this time, more notifications arrive but aren't processed (the backlog grows). The system becomes unresponsive. How do you decouple notifications from API calls?

Synchronous event processing (notification -> webhook call) is fragile: if the webhook is slow or down, the entire pipeline blocks. Solution: (1) asynchronous processing: the notification handler only enqueues the event (Redis List or Stream) and returns. Workers pull from the queue and call the webhook; if the webhook is slow the queue grows, but the notification subscriber stays responsive. (2) use Redis Streams: XADD events * key <name>; separate consumer processes read from the stream and call the webhook, and the stream buffers internally. (3) or a List-based queue: on notification, RPUSH webhook:queue <event>; a separate worker loops on BRPOP webhook:queue 1, calls the API, then continues. (4) timeout and retry: if the webhook call times out (>5 seconds), re-queue the event for retry; don't block on failure. (For true delayed retries, a sorted set scored by next-retry time works better than a plain list.) (5) circuit breaker: if the webhook fails repeatedly, stop calling it temporarily (circuit open) and resume after a backoff. This prevents cascading failures. (6) dead-letter queue: if a webhook call still fails after N retries, move the event to a dead-letter queue for manual inspection.

Implementation: (1) keyspace notification subscriber (minimal): on event, RPUSH webhook:queue <event> and return immediately. (2) async worker: loop: BRPOP webhook:queue 1 -> event; call webhook(event) with a 5-second timeout; on success, continue; on timeout/failure, re-queue for retry. (3) dead-letter: after 3 retries, RPUSH webhook:queue:dead-letter <event>. (4) monitor queue depth: LLEN webhook:queue; alert if > 10K (backlog). (5) test: simulate a slow webhook (sleep 10s before responding) and verify the notification path stays responsive.

For production: (1) deploy the async workers on separate machines (not the main app). (2) scale workers: one worker can handle ~1K webhooks/sec assuming ~1ms call latency; at 10K events/sec, deploy 10 workers. (3) implement exponential backoff for retries: retry after 1s, 2s, 4s, 8s (max). This prevents a thundering herd if the API recovers slowly.
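The subscriber/worker split above can be sketched in a few lines. Plain deques stand in for the Redis lists (webhook:queue, webhook:queue:dead-letter) so the example is self-contained and runs without a server; in production the subscriber would RPUSH and the worker would BRPOP, and all names here are illustrative.

```python
from collections import deque


class WebhookDispatcher:
    """Decouples notifications from webhook calls via a queue with retries
    and a dead-letter queue, as described in the answer above."""

    def __init__(self, call_webhook, max_retries=3):
        self.queue = deque()       # stands in for webhook:queue
        self.dead = deque()        # stands in for webhook:queue:dead-letter
        self.call_webhook = call_webhook
        self.max_retries = max_retries

    def on_notification(self, event):
        # The Pub/Sub subscriber does only this, then returns immediately.
        self.queue.append((event, 0))

    def work_once(self):
        """One worker iteration; returns False when the queue is empty."""
        if not self.queue:
            return False           # BRPOP would block here instead
        event, attempts = self.queue.popleft()
        try:
            self.call_webhook(event)   # production: enforce a 5s timeout
        except Exception:
            if attempts + 1 >= self.max_retries:
                self.dead.append(event)                   # dead-letter
            else:
                self.queue.append((event, attempts + 1))  # re-queue to retry
        return True


calls = []
def flaky_webhook(event):
    calls.append(event)
    raise RuntimeError("upstream down")   # simulate a dead API

d = WebhookDispatcher(flaky_webhook, max_retries=3)
d.on_notification({"key": "user:42", "op": "set"})
while d.work_once():
    pass
print(len(calls), len(d.dead))  # prints: 3 1
```

Note the design point: on_notification never touches the network, so a dead webhook can only grow the queue, never stall the subscriber.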

Follow-up: If you need guaranteed delivery of webhook events (no losses even if system crashes), how would you implement this?

You use Redis keyspace notifications for analytics: every SET on key pattern "analytics:*" triggers a subscriber that increments counters in a time-series database. After 1 week, you discover that counters are off by ~5-10% (actual events vs counted events). The set operations are happening, but some aren't triggering notifications. Why the data loss?

5-10% data loss in keyspace notifications means events are being dropped. Root causes: (1) subscriber buffering: if counter increments are slow, notifications pile up in the subscriber's output buffer; when the pubsub client-output-buffer-limit is hit, Redis disconnects the client, and everything buffered (plus everything published while it reconnects) is lost. (2) fire-and-forget delivery: Pub/Sub has no acknowledgements or persistence, so any network hiccup or processing delay loses messages. (3) Redis restarts: if Redis crashes, all in-flight notifications are lost. (4) subscriber crashes: events published while the counter service is down are gone. (5) misconfigured notify-keyspace-events: for SET you need both a channel class and the string-command class, e.g. "K$" (keyspace channel) or "E$" (keyevent channel), not just "K" or "E".

Fix: (1) measure the loss: modify a test key a known number of times and compare Redis-side modifications against application-side notifications received. (2) use Redis Streams instead: XADD persists events in the keyspace, so consumers can replay what they missed, unlike Pub/Sub. (3) verify the subscription: from the notifier, PUBLISH test-event "test" and verify the subscriber receives it. (4) raise the buffer limit: CONFIG SET client-output-buffer-limit "pubsub 100mb 50mb 60" to tolerate longer processing stalls. (5) decide what accuracy you need: if 90-95% is acceptable, keep notifications; for 99.99%+, switch to Streams.

Prevention: (1) implement an end-to-end test: set a key, verify the notification arrives and the counter increments. Run daily; alert on failure. (2) reconcile counters: periodically SCAN with MATCH analytics:* and count, then compare against the time-series counter; alert if the delta exceeds 1%. (3) persist before publishing: XADD analytics-events * ... before PUBLISH, so if Pub/Sub drops the message you still have the Stream as a backup.

Implementation: (1) wrap SET: before returning, XADD to the stream, then PUBLISH. (2) the subscriber reads both SUBSCRIBE (real-time) and the Stream (replay of missed events). (3) persist the last-read stream ID and resume from it on restart.
Test: kill the subscriber, generate events, restart it, and verify the missed events are replayed from the Stream.
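The dual-path design above (real-time Pub/Sub plus a Stream for replay) hinges on the consumer persisting its last-read ID. Here is a self-contained simulation, with a plain list standing in for the Stream so it runs without a server; the names (EventLog, Counter) and the integer IDs are illustrative, since real Streams use <ms>-<seq> IDs and XRANGE with an exclusive "(last_id" start.

```python
class EventLog:
    """In-memory stand-in for a Redis Stream: xadd appends an entry with a
    monotonically increasing id, and read_after replays everything past a
    consumer's last-read id (like XRANGE events (last_id +)."""

    def __init__(self):
        self.entries = []  # list of (entry_id, payload)

    def xadd(self, payload):
        entry_id = len(self.entries) + 1  # real Streams use <ms>-<seq> ids
        self.entries.append((entry_id, payload))
        return entry_id

    def read_after(self, last_id):
        return [(i, p) for i, p in self.entries if i > last_id]


class Counter:
    """Consumes both the real-time path (Pub/Sub) and the replay path."""

    def __init__(self, log):
        self.log = log
        self.last_id = 0   # persisted in production so restarts can resume
        self.count = 0

    def on_pubsub(self, entry_id, payload):
        self._process(entry_id, payload)  # real-time; may miss events

    def catch_up(self):
        # On startup (or periodically), replay anything after last_id.
        for entry_id, payload in self.log.read_after(self.last_id):
            self._process(entry_id, payload)

    def _process(self, entry_id, payload):
        if entry_id <= self.last_id:
            return  # already counted: dedupe makes replay idempotent
        self.count += 1
        self.last_id = entry_id


log = EventLog()
counter = Counter(log)
counter.on_pubsub(log.xadd("analytics:a"), "set")  # delivered in real time
log.xadd("analytics:b")                            # Pub/Sub message lost
log.xadd("analytics:c")                            # subscriber was down
counter.catch_up()                                 # replays b and c
print(counter.count)  # prints 3
```

The ID check in _process is what keeps the two paths consistent: a replayed event that already arrived via Pub/Sub is skipped, so combining both channels never double-counts.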

Follow-up: If you use Streams for persistence but also need real-time Pub/Sub, how would you synchronize both?
