Kafka Interview Questions

Kafka Security: ACLs, SSL, and SASL

Your production Kafka cluster runs with no security (PLAINTEXT). You need to enable mTLS (mutual TLS) between brokers and all clients. 50 brokers, 200 clients, live traffic. Design the rollout without downtime.

Phase 1 - Generate certificates: (1) Create CA cert; (2) Generate broker certificates (CN=broker-X.kafka.company.com for each broker); (3) Generate client certificates. Use a tool like cfssl or OpenSSL. (4) Store all certs in /etc/kafka/secrets/ on each broker.

Phase 2 - Configure dual listeners: Add both PLAINTEXT and SSL listeners to each broker: listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093, and advertise both: advertised.listeners=PLAINTEXT://broker-1:9092,SSL://broker-1:9093. Apply with a rolling restart; brokers and clients can then connect via either port. This dual-listener window is the key to a zero-downtime rollout.
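A minimal server.properties sketch for this phase (hostnames, paths, and passwords are placeholders):

```properties
# Dual listeners during migration: existing clients stay on 9092,
# migrated clients move to 9093.
listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
advertised.listeners=PLAINTEXT://broker-1.kafka.company.com:9092,SSL://broker-1.kafka.company.com:9093

# Broker identity (keystore) and trusted CA (truststore)
ssl.keystore.location=/etc/kafka/secrets/broker-1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/secrets/truststore.jks
ssl.truststore.password=changeit

# Require client certificates on the SSL listener (mutual TLS)
ssl.client.auth=required
```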

Phase 3 - Client rollout: Update clients one by one to use SSL: configure security.protocol=SSL, ssl.truststore.location=/path/truststore.jks (containing CA cert). Clients connected to SSL port (9093) are protected. Remaining PLAINTEXT clients (9092) continue unaffected.
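The per-client change is a small properties diff; a sketch (paths and passwords are placeholders):

```properties
security.protocol=SSL
ssl.truststore.location=/etc/kafka/secrets/truststore.jks
ssl.truststore.password=changeit
# For mTLS the client also presents its own certificate:
ssl.keystore.location=/etc/kafka/secrets/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```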

Phase 4 - Switch inter-broker traffic and disable PLAINTEXT: Once all clients are on SSL, first move inter-broker traffic to the SSL listener (set inter.broker.listener.name=SSL) with one rolling restart, then remove the PLAINTEXT listener (listeners=SSL://0.0.0.0:9093) with a second rolling restart. No downtime, because clients and brokers are already using SSL.

Production example: Netflix's Kafka migration ran dual listeners for two weeks. Old clients connected to 9092 (PLAINTEXT), new clients to 9093 (SSL). Both ports were monitored; when PLAINTEXT traffic dropped to zero, the listener was decommissioned.

Follow-up: If a broker crashes and its SSL certificate is invalid, how do you detect this before clients fail? Should you implement certificate expiration monitoring?

Your cluster has SSL enabled. A client fails with the certificate verification error "unable to find valid certification path to requested target", yet the broker cert is valid. Diagnose the root cause.

Root causes (in order of likelihood): (1) Client's truststore doesn't contain the CA certificate. The truststore is a JKS file with trusted CAs. If the CA that signed the broker cert isn't in the truststore, verification fails; (2) Hostname mismatch: broker cert CN or SAN (Subject Alternative Name) doesn't match the hostname the client is connecting to. E.g., broker cert CN=broker-1.internal.company.com, but client connects to broker-1.external.company.com; (3) Certificate chain incomplete: broker cert is valid, but intermediate CA certs are missing from broker's SSL config. Broker must present full chain; (4) Expired cert: broker cert expired, clock skew between broker and client, or cert not yet valid.

Debugging: (1) List truststore contents: keytool -list -v -keystore /path/truststore.jks; check if broker's CA is present; (2) Check broker cert: keytool -list -v -keystore /path/keystore.jks, verify CN matches expected hostname; (3) Test connectivity: openssl s_client -connect broker:9093 -CAfile ca-cert.pem; (4) Check client logs for exact hostname mismatch error.

Fix: If the CA is missing from the truststore, export the CA cert and import it: keytool -import -alias ca -file ca-cert.pem -keystore truststore.jks. If it's a hostname mismatch, either regenerate the broker cert with the correct CN/SAN, or set ssl.endpoint.identification.algorithm= (empty string) to disable hostname verification (not recommended for production).

Production incident: Databricks had clients fail after broker cert renewal. New cert had CN=broker-1, old cert had CN=kafka-broker-1. Clients were hardcoded to old hostname. Fix: use SAN (Subject Alternative Names) to support multiple hostnames.

Follow-up: If you have 50 brokers and need to renew all SSL certificates, how do you minimize downtime? Can you roll the certs without restarting brokers?

You've enabled ACLs: Producer application needs to produce to topic "orders", consumer to topic "events". Design the principle-of-least-privilege ACL rules.

ACL model: a Kafka ACL binds (Principal, Host, Operation, PermissionType) to a resource pattern (ResourceType, name, PatternType). Example: Principal="User:producer-app", Resource="Topic:orders", Operation="Write", PatternType="Literal", PermissionType="Allow".

Producer ACL: kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:producer-app --operation Write --topic orders. Also grant Describe (needed for metadata fetch): --operation Describe --topic orders. And IdempotentWrite if using idempotence: --operation IdempotentWrite --cluster.

Consumer ACL: --allow-principal User:consumer-app --operation Read --topic events, --operation Describe --topic events, and --operation Read --group consumer-group-1 (for offset commits).

Principle of least privilege: Don't grant blanket "All" operations. Grant only Read, Write, Describe, Create, Delete, etc. as needed. Use resource patterns: --resource-pattern-type Prefixed --topic events_* to grant all topics starting with "events_".

Production ACLs at Uber: They use service accounts (Principal=User:service-name). Each microservice has a unique principal. Producer service can Write only to its output topics. Consumer service can Read only from its input topics. This prevents accidental data corruption.

Example full setup: (1) Producer app: Write+Describe on topic "orders", IdempotentWrite on cluster; (2) Consumer app: Read+Describe on topic "events", Read on consumer group "my-group"; (3) Schema registry: Read on internal topics, Write for schema updates.
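The full setup above can be kept as data and linted before it is applied; a minimal sketch (the principal, topic, and group names are illustrative):

```python
# The ACL sets above as (principal, resource, operation) tuples.
PRODUCER_ACLS = {
    ("User:producer-app", "Topic:orders", "Write"),
    ("User:producer-app", "Topic:orders", "Describe"),
    ("User:producer-app", "Cluster:kafka-cluster", "IdempotentWrite"),
}
CONSUMER_ACLS = {
    ("User:consumer-app", "Topic:events", "Read"),
    ("User:consumer-app", "Topic:events", "Describe"),
    ("User:consumer-app", "Group:my-group", "Read"),
}

def lint(acls):
    """Flag blanket 'All' grants, which violate least privilege."""
    return [a for a in acls if a[2] == "All"]

assert lint(PRODUCER_ACLS | CONSUMER_ACLS) == []
```

The same tuples can later be fed to a script that emits the corresponding kafka-acls.sh invocations.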

Follow-up: If a consumer app needs to create a new consumer group at runtime, what ACLs are required? How do you handle dynamic group names?

You have 100 microservices, each needing different Kafka permissions. Managing 100 individual ACL rules is complex. Design a centralized ACL management system.

Approach 1 - ACL as code (GitOps): Store all ACL rules in a YAML file in Git. Use a controller service (e.g., Python script) that reads the YAML, compares against current cluster ACLs, and applies changes. Example YAML:

```yaml
services:
  - name: payment-processor
    principal: User:payment-processor
    permissions:
      - resource: Topic:payments
        operations: [Read, Write]
      - resource: ConsumerGroup:payment-group
        operations: [Read]
```
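The controller's core reconcile step can be sketched as follows, assuming the desired state has been parsed from the YAML above and the current state fetched from the cluster, both reduced to (principal, resource, operation) tuples (the fetch and apply calls are omitted):

```python
def desired_bindings(services):
    """Flatten the YAML 'services' list into (principal, resource, operation) tuples."""
    out = set()
    for svc in services:
        for perm in svc["permissions"]:
            for op in perm["operations"]:
                out.add((svc["principal"], perm["resource"], op))
    return out

def reconcile(desired, current):
    """Return (ACLs to add, ACLs to remove) so the cluster matches Git."""
    return desired - current, current - desired

services = [{
    "name": "payment-processor",
    "principal": "User:payment-processor",
    "permissions": [
        {"resource": "Topic:payments", "operations": ["Read", "Write"]},
        {"resource": "ConsumerGroup:payment-group", "operations": ["Read"]},
    ],
}]

# Against an empty cluster, all three bindings are missing and get added.
to_add, to_remove = reconcile(desired_bindings(services), current=set())
assert len(to_add) == 3 and not to_remove
```

Because both sides are plain sets, a Git revert simply changes `desired` and the next reconcile pass removes the unwanted grants.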

Approach 2 - Service mesh integration: Use Kafka ACLs as projections of a service mesh config (e.g., Istio, Consul). When a service is registered in the mesh, the mesh controller automatically creates Kafka ACL rules matching the service's ingress/egress policies. Reduces manual config.

Approach 3 - Dynamic ACL provisioning: Each microservice onboards via API. API validates request, generates unique principal (User:service-name-UUID), grants minimal permissions, stores state in a control database. Service receives credentials (keystore, truststore). Platform team can audit all permissions in a web dashboard.

Production at Shopify: They use Approach 1 (GitOps). ACLs are versioned in Git, reviewed via PRs, applied by a controller. Audit trail: every change is a Git commit. Rollback: revert the commit, controller re-applies old ACLs.

Follow-up: If a service needs to produce to topic "events" but not read it, how do you prevent misconfiguration in your centralized system? Should you add validation or linting?

Your cluster has SSL enabled. A client is connecting via SASL/SCRAM (username/password). During peak traffic, broker authentication latency increases from 5ms to 500ms. Diagnose and optimize.

Root cause: SASL/SCRAM authentication happens on every new connection. The client performs PBKDF2 key derivation (CPU-intensive, default 4096 iterations) to compute its proof, the broker verifies it with HMAC operations, and the exchange adds round trips on top of the TCP and TLS handshakes. Under a connection storm, handshake work saturates CPU on both sides. If authentication is backed by an external system (LDAP, a database), its latency adds on top.

Optimization 1 - Reuse connections: Clients should maintain persistent connections, not open a new one per request. TCP handshake + TLS handshake + SASL auth can easily cost 100ms; reusing one connection for 10K requests amortizes that to roughly 0.01ms per request.
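On the client side this mostly means not letting idle connections get torn down between traffic bursts; one relevant knob (value shown is the default, check your client version):

```properties
# Producer/consumer setting: how long an idle connection is kept open
# before the client closes it. Raising it keeps connections (and their
# completed SASL handshakes) alive across quiet periods.
connections.max.idle.ms=540000
```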

Optimization 2 - Tune SCRAM iterations: the iteration count is fixed when the credential is created, e.g. kafka-configs.sh --alter --entity-type users --entity-name svc --add-config 'SCRAM-SHA-256=[iterations=4096,password=...]'. Kafka enforces a minimum of 4096 iterations for SCRAM-SHA-256, so there is little headroom, and lowering iterations trades security for speed. A better option is SASL/OAUTHBEARER (OAuth2/OIDC): the expensive step, token issuance, happens once per token lifetime rather than per connection.

Optimization 3 - Delegate auth to external service: Use PLAIN mechanism with a custom auth plugin. Keep credential store (database) close to broker (same DC, low-latency). Cache validation results: after successful auth, store (principal, expiry) in broker's memory for 1 hour. Reduces per-connection auth overhead.
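The caching idea in Optimization 3 can be sketched as a small TTL cache wrapped around the expensive validation call (a toy model, not broker plugin code; a real implementation would key on a salted hash, not the raw password):

```python
import time

class AuthCache:
    """Remember successful validations for ttl seconds so repeat
    connections skip the expensive credential-store check."""

    def __init__(self, validate, ttl=3600.0, clock=time.monotonic):
        self._validate = validate   # expensive check against the credential store
        self._ttl = ttl
        self._clock = clock
        self._cache = {}            # (user, password) -> expiry timestamp

    def authenticate(self, user, password):
        now = self._clock()
        expiry = self._cache.get((user, password))
        if expiry is not None and now < expiry:
            return True             # cache hit: no external lookup
        if self._validate(user, password):
            self._cache[(user, password)] = now + self._ttl
            return True
        return False                # failures are never cached
```

Note that failures are not cached here, so a revoked credential only stays usable until its cached entry expires, which is exactly the consistency window the follow-up question below asks about.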

Optimization 4 - Use mTLS (if applicable): mTLS cert validation is faster than password hashing. If all clients are internal and can support cert rotation, use SSL instead of SASL/SCRAM.

Production at Confluent Cloud: OAuth2 over mTLS. Clients authenticate once via OAuth2 to obtain a token, then present that token on TLS connections; subsequent connections reuse the cached token, so per-connection auth overhead is minimal.

Follow-up: If you cache auth validation results for 1 hour, what happens if a user's permissions change? Is there a race condition or eventual consistency window?

Your ACLs are stored in ZooKeeper (Kafka classic mode). During recovery, ZK is restored from a 2-week-old snapshot. ACLs are reverted. Broker startup hangs because it can't fetch ACLs from stale ZK. Design recovery.

ACL storage: In ZK mode, ACLs are stored under the /kafka-acl znode (and /kafka-acl-extended for prefixed patterns); SCRAM credentials live under /config/users. In KRaft mode, ACLs are part of the (compacted) metadata log. If ZK is restored from an old snapshot, the ACL data is stale: grants for new services and recent revocations won't be reflected.

Recovery steps: (1) Before broker startup, manually restore the latest ACL snapshot. If using Git-based ACL-as-code, re-apply current ACLs: kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:service-name --operation Read --topic events (idempotent, safe to re-run); (2) If using external ACL system, re-sync: call your sync script to fetch current ACLs from auth system (LDAP, database) and apply to Kafka; (3) Verify ACLs were applied: kafka-acls.sh --list --bootstrap-server localhost:9092; compare to expected state.

Prevention: (1) Migrate to KRaft + versioned metadata log (no external ZK dependency for ACLs); (2) Implement ACL backup: export all ACLs daily, store in Git or S3. On recovery, restore from latest backup; (3) Use declarative ACL config: store all ACLs in a YAML config, controller applies on startup.

Incident at Airbnb: ZK restored old ACLs. New service couldn't produce (permissions missing). Caused 2-hour data pipeline outage. Fix: migrated to KRaft + Git-based ACL config.

Follow-up: If you're migrating from ZK ACLs to KRaft ACLs, how do you ensure no ACLs are lost or duplicated during the transition?

You're implementing TLS certificate pinning for extra security: clients pin the broker's exact certificate (not just trust the CA). Design the rollout without breaking existing clients during cert rotation.

Challenge: If a client pins a specific cert (e.g., sha256=ABC123...), and you rotate to a new cert with sha256=XYZ789, the client immediately fails to connect. You can't gradually roll out without either (a) accepting 2 certs temporarily, or (b) forcing all clients to update simultaneously.

Solution 1 - Certificate pinning with a backup pin: the client pins both the current cert (sha256=ABC123) and the next cert (sha256=XYZ789). You pre-generate the next cert and distribute its pin to clients. When you rotate, clients accept the new cert because they pinned it ahead of time. Before each subsequent rotation, distribute the next pin (cert3, and so on) before that cert goes live.

Solution 2 - Public key pinning (less strict): Instead of pinning the full certificate, pin the public key or the CA cert (more stable). Public key remains the same across cert renewals if you use the same key. Easier to rotate without client updates.

Solution 3 - Gradual client update: If you can't pre-distribute pins, perform client update first (deploy new client code that supports new cert), then rotate broker certs. Requires coordinated deployment with clients.

Production at Google: They use certificate transparency + public key pinning. Clients pin the root CA public key (stable for years). Leaf certs rotate monthly; clients don't need updates. Transparency log provides audit trail of all cert issuance.

Gotcha: if a pinned cert is replaced before clients have received the new pin, they hard-fail with no fallback. Implement strong alerting at least 60 days before certificate expiration.
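The expiration alert itself is a trivial date comparison once the cert's notAfter timestamp has been extracted (parsing raw openssl/keytool output is left out of this sketch; the 60-day threshold matches the advice above):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after_iso: str, now: datetime) -> int:
    """not_after_iso: the cert's notAfter, pre-converted to ISO 8601,
    e.g. '2025-06-30T00:00:00+00:00'."""
    return (datetime.fromisoformat(not_after_iso) - now).days

def should_alert(not_after_iso: str, now: datetime, threshold_days: int = 60) -> bool:
    return days_until_expiry(not_after_iso, now) < threshold_days

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
assert should_alert("2025-02-01T00:00:00+00:00", now)       # 31 days out: alert
assert not should_alert("2025-12-01T00:00:00+00:00", now)   # ~11 months out: fine
```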

Follow-up: If a client pins a cert and the private key is compromised, how do you revoke the cert without breaking the client connection? Can you use OCSP (Online Certificate Status Protocol)?
