Kafka Interview Questions

Dead Letter Queue Patterns


Your consumer processes events from topic "orders". 5% of messages are malformed (invalid JSON, missing required fields) and crash the consumer app with exceptions. Throughput drops from 10K msg/sec to 0. Design a DLQ pattern to handle poison pills without blocking the main pipeline.

Pattern 1 - Try-catch DLQ: Wrap message processing in try-catch. On exception, send to DLQ topic instead of crashing:

try {
    Order order = parseJson(record.value());
    processOrder(order);
} catch (Exception e) {
    // Keep the original payload intact for replay; carry error context in a
    // header rather than appending it to the value (which would corrupt it).
    ProducerRecord<String, String> dlq = new ProducerRecord<>("orders-dlq", record.key(), record.value());
    dlq.headers().add("error", String.valueOf(e.getMessage()).getBytes(StandardCharsets.UTF_8));
    producer.send(dlq);
}
consumer.commitSync(); // commit either way: processed, or parked in the DLQ

Pattern 2 - Kafka Connect dead letter queue: For Connect pipelines, sink connectors have built-in DLQ routing (error-handling configuration, not an SMT): set errors.tolerance=all and errors.deadletterqueue.topic.name=orders-dlq, and records that fail deserialization or transformation are sent to the DLQ topic automatically; errors.deadletterqueue.context.headers.enable=true adds error context as headers.

Pattern 3 - Processing timeout + skip: If processing a single message takes longer than a bound (say 5 seconds), abort it, send it to the DLQ, and commit the offset. This prevents a slow or hanging poison pill from stalling the partition in an infinite retry loop.
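Pattern 3 can be sketched by bounding each message's processing with a deadline; a minimal sketch (class and method names are illustrative, not a Kafka API):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSkip {
    /** Runs processing under a deadline. Returns true on success; false means
        the caller should send the record to the DLQ and commit the offset
        instead of retrying forever. */
    static boolean processWithTimeout(Runnable processing, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(processing).get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // took too long: treat as a poison pill
        } catch (InterruptedException | ExecutionException e) {
            return false; // failed outright: also DLQ-bound
        } finally {
            pool.shutdownNow(); // interrupt the hung task
        }
    }
}
```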

Key design decisions: (1) Commit offset after DLQ send? Yes, otherwise consumer restarts, retries forever. (2) Store original message in DLQ or include error? Both. Preserve original for replay; add error details for debugging. (3) Monitor DLQ? Yes, alert if DLQ grows >1K messages/hour (indicates systemic issue). (4) Replay strategy? Manual human review, fix root cause, then replay good messages from DLQ back to main topic.

Production example at PayPal: 0.1% poison pills (corrupted payment messages). DLQ pattern prevents cascading failures. DLQ is monitored; when rate exceeds threshold, alerts trigger investigation (usually upstream schema mismatch).

Follow-up: If you send bad messages to DLQ and commit the offset, but the fix takes 2 weeks, how do you replay without re-processing good messages that came after the poison pill?

Your DLQ receives 1000 messages/hour. Engineers manually review and fix issues. They want to replay fixed messages automatically. Design a DLQ replay system.

Replay architecture: (1) DLQ messages are stored with original payload + error metadata. (2) Engineer reviews message in UI, identifies root cause (e.g., schema mismatch, timezone bug). (3) Engineer marks message as "ready for replay". (4) Automated job re-publishes marked messages to main topic. (5) Consumer processes re-published message. If successful, message is deleted from DLQ. If fails again, returned to DLQ with new error info.

Implementation: (1) Add metadata to the DLQ message: { original_payload: {...}, error: "...", replay_count: 0, marked_for_replay: true }; (2) Separate consumer group for the DLQ: consumer-group = "orders-dlq-replay"; (3) Replay service: fetches messages marked for replay, sends them to the main topic; (4) Main consumer handles replays gracefully (idempotence key derived from the original offset to detect duplicates).

Idempotence key: Use key = original topic + partition + offset of the DLQ record, e.g. "orders-0-1234". Note that Kafka brokers do not deduplicate records by key, so the main consumer (or a downstream store) must track keys it has already processed and drop duplicate replays.
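A minimal sketch of consumer-side deduplication under this key scheme (an in-memory set here; a real system would persist seen keys in a store such as RocksDB or a database):

```java
import java.util.HashSet;
import java.util.Set;

public class ReplayDedup {
    // In production this set would live in RocksDB/Redis/a DB, not the heap.
    private final Set<String> seen = new HashSet<>();

    /** Key includes topic and partition so offsets from different
        partitions (or mirrored topics) never collide. */
    static String dedupKey(String topic, int partition, long offset) {
        return topic + "-" + partition + "-" + offset;
    }

    /** Returns true the first time a key is seen; false for a duplicate replay. */
    boolean firstDelivery(String topic, int partition, long offset) {
        return seen.add(dedupKey(topic, partition, offset));
    }
}
```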

Edge cases: (1) If replay fails 3 times, move to "unresolvable" topic for deep investigation. (2) If original message was malformed JSON, DLQ can't parse it to replay. Store as blob + hexdump, engineer can manually fix or trigger reparse with updated schema. (3) Time-sensitive messages: if original had timestamp = 1 week ago and you replay now, downstream systems might reject (expired). Add replay_at timestamp to bypass freshness checks.
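The replay lifecycle, including the three-strikes edge case, can be modeled as a small state machine (state names are hypothetical):

```java
public class ReplayLifecycle {
    enum State { FAILED, MARKED_FOR_REPLAY, REPLAYED, UNRESOLVABLE }

    static final int MAX_REPLAYS = 3; // per edge case (1) above

    /** State after a replay attempt: success resolves the message, a third
        failure parks it for deep investigation, and otherwise it returns
        to FAILED awaiting another review/replay cycle. */
    static State afterReplayAttempt(boolean success, int replayCount) {
        if (success) return State.REPLAYED;
        return (replayCount >= MAX_REPLAYS) ? State.UNRESOLVABLE : State.FAILED;
    }
}
```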

Production at Shopify: DLQ replay system processes 10K messages/day. 99% replay success (only 100 unresolvable). Manual review takes 1-5 minutes per issue.

Follow-up: If an engineer marks 1000 messages for replay but the root cause (schema mismatch) is still not fixed, what prevents infinite replay loops? How do you add a circuit breaker?

You have multiple services producing to the same Kafka cluster. Each produces to its own topic, but they all share a single DLQ topic. DLQ messages are mixed: order-processing errors, payment errors, shipping errors. How do you route replays to the right service?

Unified DLQ challenge: All errors go to one "global-dlq" topic. Messages lose origin metadata. When replaying, you don't know which service should process the replay.

Solution 1 - Add origin header: Every message includes header origin-service: "order-processor". DLQ preserves headers. Replay consumer reads header and routes to correct service. Example:

Header h = record.headers().lastHeader("origin-service"); // may be null on old messages
String originService = (h == null) ? "unknown" : new String(h.value(), StandardCharsets.UTF_8);
if (originService.equals("order-processor")) { producer.send(new ProducerRecord<>("orders-main", record.key(), record.value())); }

Solution 2 - Service-specific DLQs: Each service has its own DLQ: "orders-dlq", "payments-dlq", "shipping-dlq". Simpler routing logic. Trade-off: more topics to manage.

Solution 3 - DLQ routing topic: A router consumes the global DLQ and forwards to service-specific topics based on error type, e.g. if (error.contains("JSON")) { producer.send(new ProducerRecord<>("orders-dlq", record.key(), record.value())); }. Fragile heuristic; it only works if error messages are distinctive.

Best practice: Solution 1 + Solution 2 hybrid. Each service produces with origin-service header. DLQ partitions by origin-service (key = origin-service). Replay consumer subscribes to global DLQ, filters by origin, routes accordingly.
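The hybrid's routing step can be sketched as a static table from origin-service header value to replay target (service and topic names are assumptions for illustration):

```java
import java.util.Map;

public class DlqRouter {
    // Hypothetical origin-service header value -> replay target topic.
    private static final Map<String, String> ROUTES = Map.of(
            "order-processor", "orders",
            "payment-processor", "payments",
            "shipping-processor", "shipping");

    /** Resolves the replay topic; null means no known route,
        so the message should be parked for manual review. */
    static String replayTopic(String originService) {
        return originService == null ? null : ROUTES.get(originService);
    }
}
```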

Production at Amazon: AWS uses origin headers + metadata. DLQ messages include (origin-service, original-topic, error-code, timestamp). Automated routing determines which team should handle the replay.

Follow-up: If origin-service header is missing (old message format), how do you handle backward compatibility? Should you infer origin from the message content or require header?

Your consumer processes payment messages. Poison pill message causes consumer to crash. You fix the bug and restart the consumer. It immediately re-encounters the same poison pill and crashes again. Design a mechanism to skip poison pills after N failures.

Auto-skip pattern: Implement a "skip after N errors" mechanism. Track (topic, partition, offset, error-count) in a state store (RocksDB, cache, database). If same message fails N times, skip it (commit offset without processing) and send to DLQ.

Implementation:

// key = "topic-partition-offset", value = consecutive failure count
Map<String, Integer> failureMap = new HashMap<>();

String key = record.topic() + "-" + record.partition() + "-" + record.offset();
try {
    processMessage(record);
    failureMap.remove(key); // success: clear the failure count
} catch (Exception e) {
    int count = failureMap.getOrDefault(key, 0) + 1;
    if (count >= 3) {
        // Third strike: park in the DLQ and move on (offset gets committed)
        producer.send(new ProducerRecord<>("payments-dlq", record.key(), record.value()));
        failureMap.remove(key);
    } else {
        failureMap.put(key, count);
        throw e; // re-throw; the consumer retries this offset on the next poll
    }
}

Persistence: If consumer restarts, failure map is lost. To persist: store in a compacted state topic (offset-store). Every failure increment persists. On restart, replay state topic to rebuild failure map.
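Rebuilding the failure map from a compacted state topic amounts to replaying updates with last-write-wins semantics; a broker-free sketch (a count of 0 stands in for a tombstone record):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FailureStateRebuild {
    /** Replays (key, count) updates in order with last-write-wins per key,
        which is what reading a compacted state topic yields on restart.
        A count of 0 plays the role of a tombstone and clears the entry. */
    static Map<String, Integer> rebuild(List<Map.Entry<String, Integer>> updates) {
        Map<String, Integer> state = new HashMap<>();
        for (Map.Entry<String, Integer> u : updates) {
            if (u.getValue() == 0) {
                state.remove(u.getKey());
            } else {
                state.put(u.getKey(), u.getValue());
            }
        }
        return state;
    }
}
```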

Trade-off: After skipping a poison pill, it's gone forever (sent to DLQ). If the bug is later fixed, the message isn't replayed automatically. Solution: DLQ retention is infinite (or very long), engineers can manually replay.

Production at Stripe: Payment processor uses skip-after-3 pattern. Prevents cascading failures. 99.9% of poison pills are skipped after 3 failures; only 0.1% require manual investigation.

Follow-up: If you set N too low (N=1), every transient error sends to DLQ. If N too high (N=10), consumer stays crashed for long. How do you choose N dynamically based on error type?

Your DLQ topic has been growing for 3 months. 50K unresolved messages. Storage cost is high. Audit team needs to retain DLQ for 1 year for compliance. Design a tiered retention strategy.

Tiered retention pattern:

(1) Hot tier (active DLQ topic): 7-day retention. Recently failed messages are here. Engineers review and fix quickly. retention.ms=604800000.

(2) Warm tier (archive topic): 90-day retention. Resolved messages + messages pending investigation. retention.ms=7776000000.

(3) Cold tier (compliance storage): 1-year retention on S3 or data lake. Annual export for audit.

Implementation: (1) Hot topic stores live DLQ messages; (2) Resolved messages (marked as "fixed") are moved to warm topic: consumer subscribes to hot DLQ, if message.is_fixed, send to warm-dlq; (3) After 90 days in warm topic, export to S3: daily batch job exports cold topic to s3://dlq-archive/2026/04/07/; (4) Delete from Kafka after successful S3 export.
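The hot/warm retention.ms values follow directly from the day counts; a small helper to derive and sanity-check them (topic names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class TieredRetention {
    /** Converts a tier's day count to the retention.ms value brokers expect. */
    static long retentionMs(int days) {
        return TimeUnit.DAYS.toMillis(days);
    }

    // Topic names are illustrative; cold-tier data lives in S3, not Kafka.
    static final Map<String, Long> TIER_RETENTION = Map.of(
            "orders-dlq", retentionMs(7),         // hot: 604800000
            "orders-dlq-warm", retentionMs(90));  // warm: 7776000000
}
```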

Cost breakdown (rough): Hot (7 days): 50K messages × 1KB avg ≈ 50MB of raw data; with replication factor 3, ≈150MB, though provisioned broker storage makes the effective cost higher, on the order of $25/month. Warm (90 days): ~$100/month. Cold (S3 Glacier): ~$10/month. Total: ~$135/month vs $1000+/month for 1-year hot retention.

Compliance proof: S3 versioning + AWS CloudTrail provides audit trail. Audit team can query archive dates, confirm messages were retained.

Production at Capital One: Regulatory DLQ (PII data). Hot topic (7 days) on Kafka, warm (30 days), cold (7 years on S3 with encryption). Complies with data retention regulations.

Follow-up: If a message needs to be replayed from cold storage (S3), how long does retrieval take? Can you support real-time replay from S3, or is batch replay only?

Your DLQ has a message that was malformed due to a downstream schema change. You fix the schema, but now DLQ can't deserialize the old message (old schema no longer supported). How do you handle this?

Schema evolution issue: Original message was encoded in schema v1. You upgraded to schema v2, which removed a field. When deserializing from DLQ, schema registry rejects (field mismatch). Message stays unresolvable.

Solution 1 - Schema compatibility mode: Confluent Schema Registry enforces compatibility checks. Set the subject's compatibility level to BACKWARD, so consumers using the new schema can still read data written with the previous one. Keep changes backward-compatible (e.g., any added fields carry default values).

Solution 2 - Custom deserializer: Implement fallback deserializer: try schema v2, if fails, try schema v1, store both versions in message object. Consumer logic handles both versions.
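Solution 2 can be sketched generically as a chain of decoders tried newest-first (the decoders here are plain functions standing in for real schema-specific deserializers):

```java
import java.util.List;
import java.util.function.Function;

public class FallbackDeserializer<T> {
    private final List<Function<byte[], T>> decoders; // newest schema first

    FallbackDeserializer(List<Function<byte[], T>> decoders) {
        this.decoders = decoders;
    }

    /** Tries each decoder in order, falling back to older schemas on failure.
        Returns null if no registered schema can parse the payload. */
    T deserialize(byte[] payload) {
        for (Function<byte[], T> decode : decoders) {
            try {
                return decode.apply(payload);
            } catch (RuntimeException e) {
                // parse failure under this schema; try the next (older) one
            }
        }
        return null;
    }
}
```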

Solution 3 - DLQ message format: Store DLQ messages as raw bytes plus a schema version tag, e.g. { payload: [bytes], schema_version: 1, error: "..." }. When replaying, look up the tagged schema version to deserialize.

Best practice: Store DLQ messages in format-agnostic way. Never assume future schema can deserialize. Option 3 is safest.

Production at Confluent: DLQ messages include metadata: { original_bytes, schema_id, error, timestamp }. Schema registry is queried by schema_id to deserialize. If schema is deleted, engineers can manually reconstruct from audit logs.

Follow-up: If you have 100 unresolved DLQ messages with mixed schema versions, how do you batch-replay them without triggering cascading deserialize failures?

Your consumer group has 5 instances. Instance A encounters a poison pill, sends to DLQ, but crashes before committing offset. Instance B takes over partition, encounters same poison pill. It also sends to DLQ, then crashes. Duplicates accumulate in DLQ. Design a solution.

Root cause: Crash happens between DLQ send and offset commit. Offset is not committed, so next consumer (Instance B) re-processes same message, re-sends to DLQ. Result: N instances × multiple crashes = N duplicate DLQ messages.

Solution 1 - Transactional DLQ send: Make the DLQ send and the offset commit atomic with Kafka transactions. Configure the producer with a transactional.id (which implies enable.idempotence=true) and downstream consumers with isolation.level=read_committed. If a crash occurs mid-transaction, the transaction is aborted: the DLQ record never becomes visible to read_committed consumers and the offset is not committed. On restart the message is re-processed and the DLQ send is retried in a new transaction, so no visible duplicate accumulates.

Implementation: the transaction is driven by the producer (the plain Kafka consumer has no transaction API); consumer offsets are committed through it:

producer.initTransactions();
producer.beginTransaction();
producer.send(new ProducerRecord<>("orders-dlq", record.key(), record.value()));
producer.sendOffsetsToTransaction(offsetsToCommit, consumer.groupMetadata());
producer.commitTransaction();

Alternatively, use Kafka Streams: with processing.guarantee=exactly_once_v2 it makes state updates, output records (including DLQ sends), and offset commits atomic automatically.

Solution 2 - Keyed DLQ + compaction: When sending to the DLQ, set the record key to the original topic-partition-offset. The broker does not deduplicate by key on a normal topic, but if the DLQ topic uses log compaction (cleanup.policy=compact), compaction eventually retains a single record per key; alternatively, the DLQ consumer deduplicates by key on read.

producer.send(new ProducerRecord<>("dlq", record.topic() + "-" + record.partition() + "-" + record.offset(), record.value()));

Production pattern at Netflix: They use Kafka Streams for consumer applications requiring exactly-once guarantees. Automatic deduplication prevents DLQ corruption from crashes.

Follow-up: If you use deduplication key = offset, and the same message appears in multiple topics (replicated via MirrorMaker2), will dedup keys collide across topics? Should keys include topic name?