You're using Redis Streams with a consumer group. One consumer processes messages and crashes after XREADGROUP (before XACK). The message is stuck in the pending entries list (PEL). After 30 minutes, the message is still unack'd and no other consumer picks it up. Your app loses messages silently. How do you prevent this?
A crashed consumer leaves its message in the PEL indefinitely: Redis never reassigns pending entries on its own, and other consumers only see them if they explicitly claim them. Note that BLOCK on the read command is only a blocking-read timeout; it has nothing to do with ack deadlines and does not mark a consumer dead. Prevention: (1) run a periodic reclaim pass with XAUTOCLAIM (Redis 6.2+), which scans the PEL and reassigns entries idle longer than a threshold: XAUTOCLAIM mystream mygroup rescue-consumer 30000 0-0 claims every entry pending for more than 30 seconds and delivers it to rescue-consumer. (2) on older Redis, combine XPENDING (to find idle entries) with XCLAIM and a min-idle-time. (3) watch the delivery counter XPENDING reports: a message that keeps getting reclaimed is a poison message and should be dead-lettered rather than retried forever.
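The reclaim selection can be sketched in Python. The tuples mimic XPENDING's extended reply (id, consumer, idle-ms, delivery count), and `select_stuck` is a hypothetical helper that mirrors what XAUTOCLAIM does server-side; in production you would just call XAUTOCLAIM.

```python
# Sketch: pick pending entries idle past a threshold, as XAUTOCLAIM does.
# Entry format mirrors XPENDING's extended reply:
# (message_id, consumer, idle_ms, delivery_count).

def select_stuck(entries, min_idle_ms):
    """Return IDs of pending entries idle at least min_idle_ms."""
    return [msg_id for msg_id, _consumer, idle, _n in entries if idle >= min_idle_ms]

# consumer-1 died 30 minutes ago; consumer-2 is actively working.
pending = [
    ("1700000000000-0", "consumer-1", 1_800_000, 1),
    ("1700000000001-0", "consumer-2", 2_000, 1),
]
print(select_stuck(pending, min_idle_ms=60_000))  # -> ['1700000000000-0']
```

A live rescue consumer would XCLAIM the returned IDs (min-idle-time matching the threshold, so a race with a recovering consumer fails safely) and reprocess them.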
Follow-up: If a consumer acknowledges a message but crashes before actually processing it, how would you detect and replay that message?
You have 10 consumer group consumers processing a high-volume stream (100K messages/sec). Load is unevenly distributed: consumer-1 has 50K pending messages, consumer-10 has 500. consumer-1 is falling behind, building up lag. Meanwhile, new consumers need to be added but won't catch up to the backlog. How do you rebalance?
Uneven consumer load typically results from: (1) consumer speed differences: a slow consumer (network, computation) keeps requesting work but acks slowly, so its PEL grows. (2) oversized batches: a consumer reading with a large COUNT grabs more entries per call than it can process. (3) missing rebalancing logic: nothing ever redistributes an existing backlog. Note that consumer groups deliver to whichever consumer asks next; Redis Streams has no key-affinity routing. Fix: (1) add new consumers to the group: XGROUP CREATECONSUMER mystream mygroup consumer-11 (or just have consumer-11 call XREADGROUP, which creates it implicitly). They'll pick up new messages but not the existing backlog. (2) rebalance the backlog: list consumer-1's pending entries with XPENDING mystream mygroup - + 100 consumer-1, then reassign them: XCLAIM mystream mygroup consumer-2 0 <id> [<id> ...] (min-idle-time 0 claims regardless of idle time). XAUTOCLAIM can do the same in batches.
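The rebalancing step can be sketched as a plan computed from XPENDING's per-consumer counts; `rebalance_plan` is a hypothetical helper whose output you would translate into XPENDING + XCLAIM calls per move.

```python
import math

def rebalance_plan(pending_counts):
    """Greedy plan: move pending entries from overloaded to underloaded
    consumers until each holds at most ceil(average). Returns a list of
    (from_consumer, to_consumer, n_messages) moves to drive XCLAIM."""
    target = math.ceil(sum(pending_counts.values()) / len(pending_counts))
    surplus = {c: n - target for c, n in pending_counts.items() if n > target}
    deficit = {c: target - n for c, n in pending_counts.items() if n < target}
    moves = []
    for src, extra in surplus.items():
        for dst in list(deficit):
            if extra == 0:
                break
            n = min(extra, deficit[dst])
            moves.append((src, dst, n))
            extra -= n
            deficit[dst] -= n
            if deficit[dst] == 0:
                del deficit[dst]
    return moves

print(rebalance_plan({"consumer-1": 50_000, "consumer-10": 500}))
```

For the scenario above this yields one move of 24,750 messages from consumer-1 to consumer-10, leaving both near the 25,250 average.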
Follow-up: If you add new consumers but they miss the initial stream content, how would you replay messages for them without reprocessing old messages in existing consumers?
Your stream stores 1M messages, each 1KB. You discover the stream is being used by 2 different consumer groups: group-A (payment processing) and group-B (analytics). group-A has processed up to message 500K, group-B is only at 100K. You want to delete old messages (< 100K) to save space, but group-B will lose access to messages it hasn't processed. How do you handle this?
Streams don't support per-group retention: once you trim a stream, every consumer group loses access to the trimmed entries. Conflict: group-A is done with messages below 500K, but group-B still needs everything above 100K. Solutions: (1) trim only below the slowest group's position: find the minimum last-delivered-id across groups (XINFO GROUPS mystream) and trim up to it with XTRIM mystream MINID <min-id> (Redis 6.2+), so nothing unprocessed is lost. (2) archive before trimming: export the range group-B hasn't processed to cheaper storage (S3, DynamoDB) with a paging loop over XRANGE mystream <start-id> <end-id> COUNT 1000, then trim.
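Computing the safe trim point can be sketched as follows; the dicts mimic XINFO GROUPS output, and `safe_trim_minid` is a hypothetical helper. Note that entries already delivered but still pending sit below last-delivered-id, so a fully safe version would also take the minimum over each group's oldest pending ID.

```python
def parse_id(stream_id):
    """Stream IDs are '<ms>-<seq>'; compare them numerically, not as strings."""
    ms, seq = stream_id.split("-")
    return (int(ms), int(seq))

def safe_trim_minid(group_infos):
    """Lowest last-delivered-id across groups (XINFO GROUPS style dicts).
    XTRIM ... MINID <this id> keeps everything the slowest group hasn't seen."""
    return min((g["last-delivered-id"] for g in group_infos), key=parse_id)

groups = [
    {"name": "group-A", "last-delivered-id": "1700000500000-0"},
    {"name": "group-B", "last-delivered-id": "1700000100000-0"},
]
print(safe_trim_minid(groups))  # -> 1700000100000-0 (group-B's position)
```

Run this periodically and trim with MINID; group-B's lag then bounds memory growth, which is itself a signal to alert on.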
Follow-up: If trimming accidentally deleted messages that group-B hasn't processed, how would you recover without losing data?
Your consumer group is consuming a stream at 1K messages/sec. Consumer processes each message with redis.call() from a Lua script. The script is fast (1ms per message), but during peak traffic, you see SLOWLOG entries for XREAD commands taking 100-500ms. The consumer is blocking on XREAD waiting for other consumers to finish their scripts. How do you improve throughput?
The blocking is server-side: Redis executes commands on a single thread, and a Lua script blocks everything else while it runs. At 1K messages/sec with 1ms of script time each, scripts occupy the server nearly 100% of the time, so XREADGROUP calls queue behind them and show up in SLOWLOG. Improvements: (1) move processing out of Lua: do the work in the consumer process with plain (pipelined) commands so no single call monopolizes the server. (2) batch reads: XREADGROUP GROUP mygroup consumer-1 COUNT 100 BLOCK 1000 STREAMS mystream > amortizes round trips across 100 messages. (3) batch acks: one XACK mystream mygroup id1 id2 ... per batch instead of one per message. (4) add consumers only after the scripts stop serializing everything; more consumers don't help while the server itself is the bottleneck.
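The ack-batching step can be sketched as building one XACK argument list per round trip; `xack_commands` is a hypothetical helper, and a real consumer would send each list through a pipeline.

```python
def xack_commands(stream, group, ids, batch_size=100):
    """Build batched XACK commands: one argument list per round trip,
    since XACK accepts multiple message IDs."""
    return [
        ["XACK", stream, group, *ids[i:i + batch_size]]
        for i in range(0, len(ids), batch_size)
    ]

print(xack_commands("mystream", "mygroup", ["1-0", "2-0", "3-0"], batch_size=2))
```

With batch_size=100 this turns 1K acks/sec into ~10 round trips/sec, which matters far more than per-command cost once the server is saturated.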
Follow-up: If adding more consumers is not possible (resource-constrained), how would you optimize the single consumer to process more messages per second?
Your consumer group reads messages with XREADGROUP, processes each message, then calls XACK to mark it done. But you accidentally deploy buggy code that reads without ever calling XACK. After 1 hour, the stream has 3.6M pending messages (1K messages/sec for 1 hour, all unack'd). Memory usage spikes to 5GB. Redis becomes unresponsive. How do you recover?
3.6M unack'd messages = 3.6M entries in the PEL (pending entries list), which lives in memory on top of the stream data itself. Recovery: (1) immediately deploy the fix so new reads are followed by XACK. (2) if the backlog was actually processed and only the acks are missing, ack it in batches: XACK accepts multiple IDs, so page through XPENDING mystream mygroup - + 1000 and ack each page. A server-side variant: EVAL 'local p = redis.call("XPENDING", KEYS[1], ARGV[1], "-", "+", 1000) for _, e in ipairs(p) do redis.call("XACK", KEYS[1], ARGV[1], e[1]) end return #p' 1 mystream mygroup, looped until it returns 0 (note: this blindly acks, so any genuinely unprocessed message is dropped). (3) if the backlog must be reprocessed, claim it to a recovery consumer: XAUTOCLAIM mystream mygroup recovery 0 0-0 COUNT 1000 in a loop (min-idle-time 0 claims everything), then process and ack there. There is no FORCE option on XAUTOCLAIM; FORCE belongs to XCLAIM, where it creates PEL entries for IDs not currently pending. (4) restarting Redis does not clear the PEL: group state is part of the persisted dataset and reloads with it. Prevention: (1) monitor PEL size: XPENDING mystream mygroup (summary form) returns the count, min/max pending IDs, and per-consumer counts; alert if pending-count stays above ~100K for more than 5 minutes. (2) check message age: XPENDING mystream mygroup IDLE 3600000 - + 10 lists entries idle over an hour; any hit means acks have stopped and needs investigation. (3) verify after every deploy: run XINFO GROUPS mystream and confirm pending is decreasing, not just low.
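A drain policy over the backlog can be sketched as a pure decision function; the tuples mimic XPENDING's extended reply, and `drain_decisions` with its ack-vs-claim thresholds is a hypothetical policy, not the only safe one.

```python
def drain_decisions(entries, max_idle_ms, max_deliveries):
    """For each pending entry decide 'ack' (give up and dead-letter: too
    old or redelivered too many times) or 'claim' (hand to a recovery
    consumer for reprocessing). Entries are XPENDING extended-reply
    tuples: (message_id, consumer, idle_ms, delivery_count)."""
    decisions = {}
    for msg_id, _consumer, idle, deliveries in entries:
        if deliveries >= max_deliveries or idle >= max_idle_ms:
            decisions[msg_id] = "ack"
        else:
            decisions[msg_id] = "claim"
    return decisions

backlog = [
    ("1700000000000-0", "consumer-1", 7_200_000, 5),  # 2h old, 5 deliveries
    ("1700000000001-0", "consumer-1", 60_000, 1),     # recent, first delivery
]
print(drain_decisions(backlog, max_idle_ms=3_600_000, max_deliveries=3))
```

Messages routed to "ack" should be copied to a dead-letter stream first (XADD to e.g. mystream:dlq) so nothing is silently dropped.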
Follow-up: If you can't identify which messages are safe to ack (some might still be processing), how would you safely drain the PEL?
Your stream is replicated: primary pushes messages (XADD), consumer group on replica reads with XREAD. But the replica's consumer group last-delivered-id is ahead of primary's offset (replica somehow read messages that primary hasn't confirmed yet). Now primary and replica are inconsistent. How did this happen and how do you fix?
This shouldn't happen in a healthy setup: consumer-group state is mutated by write commands (XGROUP, XREADGROUP, XACK, XCLAIM), which a read-only replica rejects, so group state should only change on the primary and replicate down. Likely causes: (1) the replica accepted writes at some point: replica-read-only was disabled, or the node was a primary before a failover/split-brain and kept its own group state. (2) the group was created directly on the node before it became a replica. (3) a replication bug or corrupted resync. Fix: (1) resync the replica from the primary: REPLICAOF <host> <port> (a full sync replaces the replica's dataset, including group state). (2) if the group itself is wrong on the primary, destroy and recreate it there: XGROUP DESTROY mystream mygroup, then XGROUP CREATE mystream mygroup $; the change replicates automatically. (3) if the replica holds messages the primary lacks, that's data divergence: back up the replica, decide which side is authoritative, and rebuild the other from it. Prevention: keep replica-read-only yes, run all group commands on the primary only, and monitor: periodically compare XINFO GROUPS mystream on primary vs replica and alert if any group's last-delivered-id diverges. Verify after recovery: XINFO GROUPS mystream on both nodes should report identical last-delivered-id values.
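The periodic consistency check can be sketched as a diff over XINFO GROUPS output from both nodes; `diverged_groups` is a hypothetical helper, and the dicts mirror the fields XINFO GROUPS returns.

```python
def diverged_groups(primary_groups, replica_groups):
    """Compare per-group last-delivered-id between primary and replica
    (XINFO GROUPS style dicts); return names whose positions differ
    or that exist on only one side."""
    p = {g["name"]: g["last-delivered-id"] for g in primary_groups}
    r = {g["name"]: g["last-delivered-id"] for g in replica_groups}
    return sorted(name for name in p.keys() | r.keys() if p.get(name) != r.get(name))

primary = [
    {"name": "mygroup", "last-delivered-id": "1700000000005-0"},
    {"name": "audit", "last-delivered-id": "1700000000007-0"},
]
replica = [
    {"name": "mygroup", "last-delivered-id": "1700000000005-0"},
    {"name": "audit", "last-delivered-id": "1700000000009-0"},
]
print(diverged_groups(primary, replica))  # -> ['audit']
```

Wire this into a cron/monitoring job; transient divergence equal to replication lag is normal, so alert only when it persists across checks.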
Follow-up: If you must consume on replicas (to reduce primary load) but maintain consistency, how would you safely coordinate consumer group state?
Your stream stores user activity events. Each message is ~2KB. After 3 months, the stream has 2.7B messages (80GB). Most queries (analytics) use XREAD with a date range filter, but filters must be done client-side (Redis Streams doesn't have server-side filtering). Reading 80GB to client-side filter is impossibly slow. What's the architectural solution?
An 80GB stream with client-side filtering is impractical (throughput bottleneck). Architectural solutions: (1) shard by date: write to one stream per day (stream:2024-01-01, stream:2024-01-02, ...). A date-range query only reads the relevant shards, and retention becomes trivial (delete or expire old shards). (2) exploit the ID encoding: stream IDs start with a millisecond timestamp, so a pure time-range filter maps directly to an ID range: XRANGE mystream 1704067200000-0 1704153599999-0 COUNT 1000 reads only that window server-side, no secondary index needed. (3) for non-time predicates, maintain a secondary index, e.g. a sorted set scored by timestamp mapping to message IDs: ZADD user-events-index 1704067200 <message-id>, then ZRANGEBYSCORE to get IDs and XRANGE to fetch the entries.
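The date-sharding scheme can be sketched by enumerating the shard keys a range query must touch; `shard_keys` and the `stream:` prefix are hypothetical, matching the naming used above.

```python
from datetime import date, timedelta

def shard_keys(start, end, prefix="stream:"):
    """Stream keys covering [start, end] inclusive, one shard per day,
    named like 'stream:2024-01-01'."""
    keys = []
    d = start
    while d <= end:
        keys.append(f"{prefix}{d.isoformat()}")
        d += timedelta(days=1)
    return keys

print(shard_keys(date(2024, 1, 1), date(2024, 1, 3)))
# -> ['stream:2024-01-01', 'stream:2024-01-02', 'stream:2024-01-03']
```

A query for a 3-day window then issues XRANGE against 3 small streams instead of scanning one 80GB stream, and retention is a DEL of expired shard keys.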
Follow-up: If you shard streams by date but queries span multiple shards, how would you efficiently combine results without pulling everything to client?