Kafka Interview Questions

Tiered Storage and Infinite Retention


Your application requires 90-day retention for regulatory compliance. Current disk cost is $50K/month for 50 brokers × 40TB each. Evaluate tiered storage: hot (SSD, 7 days), warm (HDD, 90 days), cold (S3, 1 year). Design the tier architecture and estimate cost savings.

Current setup cost: 50 brokers × 40TB = 2,000TB. At roughly $25/TB/month fully loaded (hardware amortization, power, and operations — typical enterprise storage), that's 2,000TB × $25 ≈ $50K/month.

Tiered storage architecture: (1) Hot tier (7 days, SSD): 50 brokers × 3TB of recent data = 150TB. SSD hardware: 150TB × $1,000/TB = $150K upfront, ≈ $4.2K/month amortized over 3 years. (2) Warm tier (90 days, HDD): ~1.1TB/day × 90 days ≈ 100TB. Raw HDD at $5/TB/year: ≈ $500/year ($42/month). (3) Cold tier (up to 1 year, S3): holds data aged 91-365 days, ≈ 300TB at steady state. S3 Standard at $0.023/GB/month: 300,000GB × $0.023 ≈ $6,900/month (expensive). S3 Intelligent-Tiering at ~$0.0125/GB/month ≈ $3,800/month. Glacier at ~$0.004/GB/month ≈ $1,210/month (retrieval takes minutes to hours).

Estimated monthly cost after tiering: (1) Hot tier (SSD, 7 days): ≈ $4.2K. (2) Warm tier (HDD, days 8-90): ≈ $42. (3) Cold tier (Glacier, days 91-365): ≈ $1.2K. Total: ≈ $5.4K/month (vs $50K/month, roughly 89% savings).
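The arithmetic above can be checked with a short script. The rates (SSD at $1,000/TB amortized over 3 years, HDD at $5/TB/year, Glacier at ~$0.004/GB-month) are the assumptions stated in this answer, not vendor quotes:

```shell
# Monthly cost per tier, using the assumed rates from the text above.
hot_monthly=$(awk 'BEGIN { printf "%.0f", 150000 / 36 }')                # $150K SSD over 36 months
warm_monthly=$(awk 'BEGIN { printf "%.0f", 100 * 5 / 12 }')              # 100 TB HDD at $5/TB/year
cold_monthly=$(awk 'BEGIN { printf "%.0f", 275 * 1.1 * 1000 * 0.004 }')  # days 91-365 (~302 TB) at $0.004/GB-mo
total=$(( hot_monthly + warm_monthly + cold_monthly ))
savings=$(awk -v t="$total" 'BEGIN { printf "%.0f", (1 - t / 50000) * 100 }')
echo "hot=\$${hot_monthly} warm=\$${warm_monthly} cold=\$${cold_monthly} total=\$${total}/month savings=${savings}%"
```

The total comes out around $5.4K/month, i.e. roughly 89% below the $50K/month baseline.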

Implementation: Use Kafka tiered storage (KIP-405, available from Kafka 3.6): set remote.log.storage.system.enable=true on brokers and remote.storage.enable=true on the topic, configure local retention (local.retention.ms, 7 days), and plug in an S3-backed RemoteStorageManager (Apache Kafka doesn't ship one; Aiven's open-source plugin is a common choice). Brokers automatically offload closed log segments from local disk to remote storage once local retention is exceeded.
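A minimal configuration sketch, assuming Kafka 3.6+ with a third-party S3 RemoteStorageManager plugin. The `com.example` class name and the topic name are placeholders; only `TopicBasedRemoteLogMetadataManager` is part of Apache Kafka itself:

```shell
# Broker side (server.properties): enable the tiered storage subsystem.
cat >> server.properties <<'EOF'
remote.log.storage.system.enable=true
# Plugin-specific class -- placeholder; Apache Kafka ships no S3 implementation:
remote.log.storage.manager.class.name=com.example.S3RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
EOF

# Topic side: tier this topic, keep 7 days local, retain 90 days overall.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --topic compliance-events \
  --add-config remote.storage.enable=true,local.retention.ms=604800000,retention.ms=7776000000
```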

Trade-off: Cold retrieval is slow (1-5 minutes for expedited Glacier retrieval, hours for standard). If compliance requires instant access to 1-year-old data, use a cheaper HDD tier instead of Glacier. Cost vs latency.

Follow-up: If a consumer needs to replay data from 6 months ago (currently in Glacier), what's the retrieval latency? How do you provide SLA for time-travel consumption?

Kafka tiered storage is enabled. A consumer requests a message from 60 days ago (currently in remote tier). Broker fetches from S3. Fetch takes 500ms (vs 5ms from local disk). Consumer experiences timeout. Explain and optimize.

Tiered storage fetch mechanism: When consumer requests offset in remote tier: (1) Broker queries remote index (S3); (2) Downloads log segment from S3; (3) Serves message. Total: 500ms (S3 API latency 100ms + network 200ms + disk read 200ms). Consumer timeout is typically 30 seconds, so 500ms shouldn't cause timeout unless network is congested or S3 is overloaded.

Root cause of timeout: (1) Consumer is behind by 60 days, triggering many S3 reads per second. Broker is throttled by S3 rate limits; (2) S3 bucket doesn't have right access tier (Standard vs Intelligent-Tiering); (3) Network bandwidth is saturated (fetching large segments).

Optimization 1 - Local segment caching: Keep recent data local (log.local.retention.ms=604800000, 7 days) and cache downloaded remote segments on local disk so frequently accessed old segments are served from SSD while rarely accessed ones stay only in S3. Note that Apache Kafka itself does not cache remote segments; common RemoteStorageManager plugins (e.g., Aiven's) maintain a configurable local cache (10-50GB) of recently fetched remote segments.

Optimization 2 - Batch prefetch: Instead of fetching message-by-message from S3, batch-prefetch segments. If consumer is replaying, fetch multiple segments at once and cache.

Optimization 3 - S3 Transfer Acceleration: Routes downloads through edge locations, at roughly $0.04 per GB on top of standard transfer charges. It mainly helps cross-region or long-haul paths, so measure the actual speedup before paying for it.

Optimization 4 - Segment size tuning: Reduce segment size from 1GB to 256MB. Smaller segments = faster S3 download. Trade-off: more segments in remote store, more index entries.
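Applied per topic, the segment-size change above might look like this (topic name illustrative):

```shell
# 256 MB segments instead of the 1 GiB default: smaller objects, faster remote fetches,
# at the cost of more segments and index entries in the remote store.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --topic compliance-events \
  --add-config segment.bytes=268435456
```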

Production at Uber: Tiered storage on Kafka with hot/warm/cold tiers. Consumer accessing 90-day-old data experiences 100-300ms latency (vs 5ms for local). Acceptable for historical analytics (not real-time). They use S3 caching to bring frequent queries down to 20ms.

Follow-up: If tiered storage cache fills up (20GB available, 1M messages = 50GB needed), what's the eviction policy? LRU? Can you improve cache hit ratio?

Your topic has 10-year infinite retention requirement (regulatory archive). Estimate storage cost and design archival strategy.

Cost calculation: Assume 100M messages/day at 1KB each → 100GB/day. 10 years = 3,650 days ≈ 365TB total. Steady-state cost per option: (1) Keep everything on active Kafka: 365TB × $25/TB/month (fully loaded) ≈ $9,125/month. (2) Move to cold tier (Glacier, ~$0.004/GB/month): 365,000GB × $0.004 ≈ $1,460/month. Roughly an 84% cost reduction.
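The same numbers as a checkable script; the $25/TB-month fully loaded Kafka rate and ~$0.004/GB-month Glacier rate are the assumptions used throughout this answer:

```shell
daily_gb=100                                   # 100M messages/day x 1 KB
total_tb=$(( daily_gb * 3650 / 1000 ))         # 10 years of data
kafka_monthly=$(( total_tb * 25 ))             # $25/TB-month fully loaded on active Kafka
glacier_monthly=$(awk -v tb="$total_tb" 'BEGIN { printf "%.0f", tb * 1000 * 0.004 }')
reduction=$(awk -v k="$kafka_monthly" -v g="$glacier_monthly" \
  'BEGIN { printf "%.0f", (1 - g / k) * 100 }')
echo "kafka=\$${kafka_monthly}/mo glacier=\$${glacier_monthly}/mo reduction=${reduction}%"
```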

Archival strategy: (1) Set local retention = 30 days (enough for active consumers). (2) Move 30+ day-old data to cold tier (S3 Glacier). (3) Keep index of archived segments. When consumer requests old data, fetch from Glacier (slow, hours). (4) For instant access to archives, maintain "hot archive" on cheap HDD (365TB on HDD = $50K upfront + $2K/year. Still cheaper than Kafka's 10-year cost).

Implementation: Use tiered storage + bucket lifecycle policies: (1) S3 uploads new segments to the Standard tier; (2) After 90 days, transition to Intelligent-Tiering; (3) After 1 year, transition to Glacier Deep Archive ($0.00099/GB/month). (4) For retrieval SLAs, choose between Glacier Flexible Retrieval (3-5 hours standard) and Glacier Instant Retrieval (millisecond access, higher storage cost).
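The lifecycle steps above can be expressed as a bucket lifecycle configuration; the bucket and rule names are illustrative:

```shell
# Age objects through the tiers: Standard -> Intelligent-Tiering at 90 days,
# -> Glacier Deep Archive at 1 year.
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "kafka-archive-aging",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 90,  "StorageClass": "INTELLIGENT_TIERING" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ]
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket kafka-archive --lifecycle-configuration file://lifecycle.json
```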

Compliance validation: Use AWS CloudTrail + S3 Object Lock to prove immutability and access history for regulatory audits.

Production at JPMorgan: 7-year retention for trading data. They use tiered storage: 30 days local (Kafka), 1 year on SSD (warm), 6 years on Glacier (cold). Monthly cost: $5K vs $50K for all-local. Compliance team has instant access to 1-year data, archival team can retrieve 7-year data within hours.

Follow-up: If regulatory audit requires proof that data from 2020 wasn't modified, how do you prove immutability with tiered storage?

You're using tiered storage with S3 as the remote tier. The bucket's delete permissions are misconfigured. A developer accidentally runs aws s3 rm s3://kafka-archive --recursive. 6 months of archived data is deleted. Recover.

Damage assessment: If S3 Versioning was enabled, the recursive rm only wrote delete markers and every object is recoverable. If versioning was not enabled, the deleted objects are gone from S3. If S3 Object Lock was active, the deletion would have been blocked outright (objects are immutable for the retention period).

Prevention that should have been enabled: (1) S3 Versioning: keeps deleted objects as non-current versions; recover by removing the delete markers. (2) S3 Object Lock (requires versioning): blocks any delete/overwrite for the retention period, in Governance or Compliance mode. (3) IAM policy: restrict s3:DeleteObject to admins; developers can't delete. (4) MFA Delete: requires an MFA token to delete versions. (5) An explicit deny on s3:PutBucketPolicy / s3:DeleteBucketPolicy (or an organization-level SCP) so the bucket policy itself can't be loosened.

Recovery if versioning is on: a recursive rm on a versioned bucket creates delete markers rather than destroying data. (1) List them: aws s3api list-object-versions --bucket kafka-archive; (2) Remove each delete marker with aws s3api delete-object --version-id <marker-id>, which makes the previous version current again; (3) For millions of objects, use S3 Batch Operations rather than a shell loop.
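On a versioned bucket, restoring means removing the delete markers the `rm` created, which makes the prior versions current again. A sketch (bucket name from the scenario; fine for thousands of objects, use S3 Batch Operations at larger scale):

```shell
# List every current delete marker, then remove each one to restore the object.
aws s3api list-object-versions --bucket kafka-archive \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' --output text |
while read -r key version_id; do
  aws s3api delete-object --bucket kafka-archive \
    --key "$key" --version-id "$version_id"
done
```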

Recovery if no versioning (data loss): Kafka has replicas (if replication factor >= 2). Surviving brokers still have local copies of 6-month-old segments (if retention allowed). Rebuild remote tier from local replicas: migrate local segments to S3 again.

Root cause: Misconfigured IAM policy. Fix: (1) Set S3 bucket policy to deny DeleteObject for all except administrators; (2) Enable S3 Object Lock (Governance mode); (3) Enable versioning for all buckets.
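A sketch of fix (1), a bucket policy denying object deletion to everyone but an admin role; the account ID and role ARN are placeholders:

```shell
# Deny DeleteObject/DeleteObjectVersion for every principal except the storage-admin role.
cat > deny-delete.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyObjectDeletionExceptAdmins",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::kafka-archive/*",
    "Condition": {
      "ArnNotLike": { "aws:PrincipalArn": "arn:aws:iam::111122223333:role/storage-admin" }
    }
  }]
}
EOF
aws s3api put-bucket-policy --bucket kafka-archive --policy file://deny-delete.json
```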

Production incident at Dropbox: Accidentally deleted 1TB of archived Kafka data. Had versioning on, recovered within 1 hour. Data loss: 0. Now they enforce Object Lock for all compliance buckets.

Follow-up: If you have Object Lock with 10-year retention, can you still delete objects if you discover they contain PII and need to comply with right-to-be-forgotten GDPR?

Tiered storage is enabled. A broker crashes and doesn't restart. Its local disk (hot tier) is lost. Remote tier (S3) has only 30+ day-old data (younger data wasn't uploaded yet). New broker joins cluster. Can it recover the missing recent data?

Recovery mechanism: Kafka replicas (replication factor >= 2) store data on multiple brokers. If broker A crashes, brokers B and C still have copies of all partitions (hot and remote tier data). New broker joins, rebalances partitions, and fetches missing segments from leader broker.

If replication factor = 1 (dangerous): No replicas, data loss is permanent. The missing recent data (not yet uploaded to S3) is gone forever. Only 30+ day-old data from S3 survives. This is why replication factor >= 2 is essential for critical topics.

If replication factor >= 2: (1) New broker becomes replica of partition. (2) Leader sends recent segments (hot tier data not yet on S3) to new broker. (3) Process: leader fetches segment from its local disk or remote S3, sends to new broker. New broker writes to its local disk. (4) New broker stores these segments locally (hot tier). Older segments (beyond 30 days) are fetched from S3 if needed. (5) After resync, new broker has full partition copy.

Resync time: For a 1TB partition with replication factor 2, the raw network transfer alone is 1TB / 100MB/sec = 10,000 seconds ≈ 3 hours. In practice it takes longer: replication throttling, leader load, and rebalancing overhead all add time. Budget hours, not minutes, for rebuilding a large broker.
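The transfer-time arithmetic, checkable:

```shell
partition_gb=1000          # 1 TB partition
throughput_mb_s=100        # assumed sustained replica-fetch throughput
seconds=$(( partition_gb * 1000 / throughput_mb_s ))
hours=$(awk -v s="$seconds" 'BEGIN { printf "%.1f", s / 3600 }')
echo "raw transfer: ${seconds}s (~${hours}h), before throttling and rebalance overhead"
```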

Prevention: (1) Always use replication factor >= 2 for production topics. (2) Increase hot tier retention to 30+ days (keep recent data local longer). (3) Enable tiered storage for all critical topics. (4) Monitor broker disk health; alert before disk fails.

Production best practice: Use replication factor = 3 for critical data, replication factor = 2 for standard topics.

Follow-up: If a broker crashes and all its disks are corrupted (no data), and all its replica leaders are on that broker, can the cluster survive?

Your tiered storage setup has hot (7 days, local SSD), warm (90 days, EBS), cold (1 year, S3 Glacier). A query accesses data from day 50 (in warm tier on EBS). Fetch latency is 300ms. Warm tier is becoming a bottleneck. Optimize.

Warm tier bottleneck diagnosis: EBS has IOPS limit (typically 3000-16000 IOPS depending on provisioned throughput). If 100 concurrent consumers query day-50 data, each triggering 10 segment fetches = 1000 fetches/sec. EBS IOPS throttling kicks in, each fetch waits 200-300ms.

Optimization 1 - Increase EBS IOPS: Upgrade to io2 (High-IOPS SSD): up to 64000 IOPS. Cost: $0.065/IOPS/month. 10K IOPS = $650/month. Fetches now complete in <10ms. But expensive for cold data.

Optimization 2 - Warm tier caching: Add read-through cache (Redis, Memcached) in front of EBS. Cache frequently accessed segments (day 50, 51, ...). Cache hit ratio: 80% of queries hit cache (10ms latency). Cache miss: fall back to EBS (300ms). Average latency: 0.8 × 10 + 0.2 × 300 = 68ms (vs 300ms).
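The expected-latency calculation above, checkable (hit ratio and latencies are the assumed figures from the text):

```shell
hit_ratio=0.8   # assumed cache hit ratio
hit_ms=10       # cache hit latency
miss_ms=300     # EBS fallback latency
avg_ms=$(awk -v h="$hit_ratio" -v a="$hit_ms" -v b="$miss_ms" \
  'BEGIN { printf "%.0f", h * a + (1 - h) * b }')
echo "expected fetch latency: ${avg_ms} ms"
```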

Optimization 3 - Promote frequently accessed data to hot tier: Monitor access patterns. If day-50 is accessed frequently (daily), promote to hot tier (local SSD, 5ms latency). Adjust retention: hot tier = 90 days (not 7 days) for warm data that's accessed. Trade-off: more SSD capacity needed.

Optimization 4 - Move to S3 Intelligent-Tiering: Instead of an EBS warm tier, use S3 with Intelligent-Tiering. S3 automatically moves objects between the Frequent Access and Infrequent Access tiers based on access patterns. Cost scales with actual usage; no manual tuning.

Production at Databricks: They use S3 Intelligent-Tiering for warm/cold data. Eliminates need for intermediate EBS tier. Latency: 50-100ms for warm data (S3 + network). Cost: $0.0125/GB/month (vs EBS $0.1/GB/month).

Follow-up: If S3 Intelligent-Tiering auto-moves data between tiers based on access, how does this affect consumer offset tracking? If segment moves from Frequent to Infrequent tier mid-fetch, does fetch fail?

Your compliance requirement: "Retain audit logs for 10 years". Using tiered storage, you've stored 10 years on Glacier. Regulatory auditor requests all logs be retrieved and verified within 1 hour. Glacier retrieval takes 12 hours. How do you handle this?

Glacier retrieval times: Flexible Retrieval: 3-5 hours standard (expedited retrievals take 1-5 minutes, at extra cost). Deep Archive: ~12 hours. Instant Retrieval: milliseconds, but higher storage cost. For a 1-hour SLA across the full archive, Flexible and Deep Archive don't work.

Solution 1 - Use Glacier Instant Retrieval: More expensive to store (~$0.004/GB/month vs $0.00099/GB for Deep Archive) but supports millisecond retrieval. Cost increase: roughly 4x for the cold tier, but acceptable for compliance.

Solution 2 - Maintain hot archive copy: Keep 10-year archive on cheaper HDD (NAS, not S3). NAS retrieval: <1ms. Cost: 365TB HDD = $50K upfront + $2K/year. Total 10-year cost: $70K (vs Glacier $120K + retrieval cost $1K). Hot archive is actually cheaper for instant access.

Solution 3 - Graduated retrieval: Auditor requests audit logs. Immediately start Glacier retrieval (12 hours). In parallel, retrieve all logs from last 1 year (hot/warm tiers, <1 hour). Provide auditor with 1-year logs within 1 hour as interim result. After 12 hours, provide remaining 9-year logs. Satisfies SLA for recent data, full audit in 12 hours.

Solution 4 - Query API vs bulk retrieval: Instead of retrieving entire 10-year archive, implement query API. Auditor specifies date range, query returns only matching logs. This reduces retrieval volume and time. Example: "Show all logs from Jan 2023" retrieves only 30 days instead of 10 years.

Compliance best practice: Combine solutions 2 (1-year hot archive) + 1 (Glacier Instant Retrieval for older data). Satisfies most SLAs and is cost-effective.

Follow-up: If an auditor requests 10-year audit logs and S3 billing shows $10K in retrieval costs, is that acceptable? Should you pre-warm the archive before audit?
