You deployed a custom Kubernetes controller that manages database backups. It's falling behind. Your monitoring shows: workqueue depth = 10,000 items. Reconciliation latency is 45 seconds per object. You have 1 controller replica processing serially. Walk through diagnosis and recovery.
Controller falling behind = queue depth growing faster than reconciliation rate. Debug systematically:
1. Check controller metrics (exported via Prometheus):
kubectl exec -n controller-system controller-pod -- /bin/sh -c "curl localhost:8080/metrics | grep workqueue"
# Key metrics:
# workqueue_depth = 10000 (items waiting)
# workqueue_adds_total = rate of new items being queued
# workqueue_queue_duration_seconds = how long items spend in queue
2. Identify the bottleneck. Is it the workqueue adding too fast or reconciliation too slow?
# If adds_total is high but queue is also growing, you're adding faster than processing
# If adds_total is low but queue is still high, processing is too slow
3. Check reconciliation latency per object:
kubectl logs -n controller-system controller-pod | grep -i "reconciliation\|duration"
# Look for patterns: "Reconciled object=backup-123 duration=45000ms"
4. Common culprits for slow reconciliation:
- External API calls (AWS, database) are slow
kubectl logs -n controller-system controller-pod | grep -i "AWS\|timeout\|backoff"
- Excessive etcd/Kubernetes API calls in reconciliation loop
kubectl logs -n controller-system controller-pod | grep -c "GET /api\|POST /api"
- Inefficient resource listing (querying all backups instead of indexing)
5. Immediate fix: Add concurrency to the controller (process multiple items in parallel):
// In controller code:
// Was: workers = 1
// Now: workers = 5
// MaxConcurrentReconciles is a per-controller option (sigs.k8s.io/controller-runtime/pkg/controller), not a manager option:
err := ctrl.NewControllerManagedBy(mgr).
For(&backupv1.Backup{}).
WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
Complete(r)
// Rebuild and redeploy
docker build -t backup-controller:v2 .
kubectl set image deployment/backup-controller controller=backup-controller:v2 -n controller-system
6. Monitor recovery:
kubectl exec -n controller-system controller-pod -- /bin/sh -c "curl -s localhost:8080/metrics | grep workqueue_depth"
Queue should start draining
7. Once queue is empty, address root cause (slow API calls, caching, etc.)
Quick Fix Priority: Increase workers to 5 and reassess. If still slow, optimize per-object reconciliation time.
Follow-up: Design a monitoring dashboard that alerts when workqueue depth exceeds healthy thresholds. What metrics would you track?
Your controller reconciliation loop is making 50 API calls per object. Each call is to list all backups, filter, and update. You have 100K backup objects. At 1 reconciliation per second, you'll take 100K seconds (28 hours) to process all objects once. How do you optimize this?
50 API calls per object is catastrophic inefficiency. This is an algorithmic problem, not a scaling problem:
1. Audit the current reconciliation logic:
// BAD: Reconcile function for a single Backup object
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
r.Get(ctx, req.NamespacedName, backup)
// BUG: List ALL backups to find related ones (call 1)
allBackups := &backupv1.BackupList{}
r.List(ctx, allBackups, client.InNamespace(backup.Namespace))
// BUG: For each backup, list every snapshot and filter in memory (calls 2-50)
for _, b := range allBackups.Items {
snapshots := &backupv1.SnapshotList{}
r.List(ctx, snapshots, client.InNamespace(b.Namespace)) // full list, then filter by b.Name
}
}
// Result: 50+ API calls per Backup
2. Rewrite using indexed fields and direct lookups:
// GOOD: Use field indexing
func (r *BackupReconciler) SetupWithManager(mgr ctrl.Manager) error {
// Create index: backup.spec.parentBackup -> child backups
indexKey := "spec.parentBackup"
indexFunc := func(rawObj client.Object) []string {
backup := rawObj.(*backupv1.Backup)
var owners []string
if backup.Spec.ParentBackup != "" {
owners = append(owners, backup.Spec.ParentBackup)
}
return owners
}
if err := mgr.GetFieldIndexer().IndexField(context.Background(), &backupv1.Backup{}, indexKey, indexFunc); err != nil {
return err
}
return ctrl.NewControllerManagedBy(mgr).
For(&backupv1.Backup{}).
Complete(r)
}
// Reconcile: Use cached index instead of listing
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
if err := r.Get(ctx, req.NamespacedName, backup); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// Use the index: an O(1) cache lookup instead of an O(n) list
children := &backupv1.BackupList{}
r.List(ctx, children,
client.InNamespace(backup.Namespace),
client.MatchingFields{"spec.parentBackup": backup.Name})
// With the cached client these are in-memory reads, not API server calls
return ctrl.Result{}, nil
}
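The payoff of the field index can be seen with a plain map, which is essentially what the controller-runtime cache maintains internally (a stdlib sketch, not the real cache):

```go
package main

import "fmt"

type Backup struct {
	Name   string
	Parent string
}

// buildIndex does one O(n) pass so every later child lookup is O(1),
// the same trade the field indexer makes inside the informer cache.
func buildIndex(backups []Backup) map[string][]string {
	idx := make(map[string][]string)
	for _, b := range backups {
		if b.Parent != "" {
			idx[b.Parent] = append(idx[b.Parent], b.Name)
		}
	}
	return idx
}

func main() {
	backups := []Backup{
		{Name: "daily-1", Parent: "weekly-1"},
		{Name: "daily-2", Parent: "weekly-1"},
		{Name: "weekly-1"},
	}
	idx := buildIndex(backups)
	// Children of weekly-1 without scanning the whole list per reconcile
	fmt.Println(idx["weekly-1"])
}
```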
3. Use owner references and event-based triggers instead of polling:
// Instead of reconciling ALL backups every time:
// Reconcile only when a backup changes, and only related objects
// In controller setup, use EnqueueRequestsFromMapFunc
ctrl.NewControllerManagedBy(mgr).
For(&backupv1.Backup{}).
Watches(&backupv1.Snapshot{},
handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, obj client.Object) []ctrl.Request {
snapshot := obj.(*backupv1.Snapshot)
return []ctrl.Request{
{NamespacedName: types.NamespacedName{
Name: snapshot.Spec.BackupRef,
Namespace: snapshot.Namespace,
}},
}
})).
Complete(r)
4. Test the optimized version:
# Rough benchmark: create the objects, then watch the backlog drain
time kubectl apply -f 100k-backups.yaml
# Watch workqueue_depth on the metrics endpoint; the backlog should now
# drain in minutes instead of hours
5. Monitor API call rate:
kubectl logs -n controller-system -f deployment/backup-controller | grep "API call"
# (assumes the controller logs each outbound call; client-go's
# rest_client_requests_total metric is an alternative)
# Expect ~2-3 calls per reconcile, not 50
Key Insight: Most controller bottlenecks are due to inefficient reconciliation logic (N+1 queries, full list scans), not lack of parallelism. Fix the algorithm first.
Follow-up: How would you implement a pre-reconcile audit that detects N+1 API call patterns in custom controllers?
Your custom controller watches Backup objects and creates corresponding Snapshot objects. You deploy the controller, and suddenly a cascade: 1 Backup → 10 Snapshots, each Snapshot triggers reconcile, which updates the parent Backup, which triggers reconcile again. Infinite loop. Your etcd is being hammered. How do you break the cycle?
Reconciliation loop detected. This is a feedback cycle. Debug and fix:
1. Identify the loop pattern in controller logs:
kubectl logs -n controller-system deployment/backup-controller -f | head -50
# Pattern: Backup backup-1 → created Snapshot snap-1 → Backup backup-1 reconciled → created Snapshot snap-1 (again)
2. Check the reconciliation logic for unintended side effects:
// BUGGY RECONCILE:
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
r.Get(ctx, req.NamespacedName, backup)
// Always create snapshot regardless of state
snapshot := &backupv1.Snapshot{}
r.Create(ctx, snapshot) // PROBLEM: Creates new snapshot every reconcile!
// This triggers a watch event on the snapshot
// The snapshot’s owner reference is the backup
// Watch event triggers Reconcile for the backup AGAIN
// Loop begins
}
3. Immediate mitigation: Scale the controller to zero to stop the loop:
kubectl scale deployment/backup-controller -n controller-system --replicas=0
This stops the cascade immediately (deleting the pod alone won't help; the Deployment would just restart it)
4. Fix the reconciliation logic with idempotency:
// FIXED RECONCILE:
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
if err := r.Get(ctx, req.NamespacedName, backup); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// Check if snapshot already exists
snapshot := &backupv1.Snapshot{}
err := r.Get(ctx, types.NamespacedName{
Name: backup.Name + "-snapshot",
Namespace: backup.Namespace,
}, snapshot)
if err != nil && errors.IsNotFound(err) {
// Only create if doesn’t exist
snapshot = &backupv1.Snapshot{
ObjectMeta: metav1.ObjectMeta{
Name: backup.Name + "-snapshot",
Namespace: backup.Namespace,
OwnerReferences: []metav1.OwnerReference{
*metav1.NewControllerRef(backup, backupv1.GroupVersion.WithKind("Backup")),
},
},
Spec: backupv1.SnapshotSpec{BackupRef: backup.Name},
}
r.Create(ctx, snapshot)
} else if err != nil {
return ctrl.Result{}, err
}
// Update snapshot status via the status subresource, don't recreate
if snapshot.Status.Phase != "Complete" {
snapshot.Status.Phase = "Complete"
r.Status().Update(ctx, snapshot)
}
return ctrl.Result{}, nil
}
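Idempotency can be verified without a cluster: run the create-if-missing logic repeatedly against an in-memory store (a map standing in for the API server) and confirm the state is unchanged:

```go
package main

import "fmt"

// ensureSnapshot is the create-if-missing pattern from the fixed Reconcile,
// with a map standing in for the API server. Calling it N times leaves
// exactly one snapshot: the definition of an idempotent reconcile.
func ensureSnapshot(store map[string]string, backupName string) {
	key := backupName + "-snapshot"
	if _, exists := store[key]; exists {
		return // already created; do nothing
	}
	store[key] = "Pending"
}

func main() {
	store := map[string]string{}
	for i := 0; i < 3; i++ { // three reconciles of the same Backup
		ensureSnapshot(store, "backup-1")
	}
	fmt.Println(len(store)) // still one snapshot, not three
}
```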
5. Add a reconciliation guard to prevent rapid re-queuing:
// Enforce a minimum interval between reconciles of the same object
if backup.Status.LastReconcileTime != nil && time.Since(backup.Status.LastReconcileTime.Time) < 5*time.Second {
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
backup.Status.LastReconcileTime = &metav1.Time{Time: time.Now()}
r.Status().Update(ctx, backup)
// Note: this status write itself triggers a watch event; pair it with
// predicate.GenerationChangedPredicate{} so status-only updates don't re-enqueue
6. Redeploy the fixed controller:
docker build -t backup-controller:v2-fix .
kubectl set image deployment/backup-controller controller=backup-controller:v2-fix -n controller-system
kubectl logs -n controller-system -f deployment/backup-controller | grep -i "reconcile"
Reconcile lines should appear at a steady rate, not explosively
Best Practice: Always make reconciliation idempotent. Running the same reconciliation twice should have the same effect as running it once.
Follow-up: Design a lint/test tool that detects potential infinite reconciliation loops before deployment to production.
Your controller is reconciling correctly now, but you notice every 2 hours, the workqueue depth spikes to 5000 items for 30 minutes, then drains back to normal. This pattern repeats. You suspect a large object is being modified every 2 hours, causing a cascade of reconciles. How do you find and fix this?
Periodic spike = likely a scheduled operation or a cascading update. Find the trigger:
1. Correlate the spike with system events:
kubectl get events -n backup-system --sort-by='.lastTimestamp' | grep -i "update\|patch\|create"
# Look for patterns at the 2-hour mark
2. Enable controller debug logging to capture what's being reconciled:
kubectl set env deployment/backup-controller -n controller-system LOG_LEVEL=debug
kubectl logs -n controller-system -f deployment/backup-controller | grep -i "reconcile\|enqueue" | head -100
# Sample output: "Reconciling Backup=backup-123, Snapshot=snap-456, ..."
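Bucketing reconcile log lines by hour makes the 2-hour pattern obvious. A stdlib sketch (the sample lines are illustrative; in practice feed it `kubectl logs` output):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bucketByHour counts reconcile log lines per hour so a periodic spike
// stands out against the baseline.
func bucketByHour(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		if len(line) < 13 || !strings.Contains(line, "Reconciling") {
			continue
		}
		counts[line[:13]]++ // "2024-05-01T12": timestamp truncated to the hour
	}
	return counts
}

func main() {
	logs := []string{
		"2024-05-01T10:02:11Z Reconciling Backup=backup-1",
		"2024-05-01T12:01:03Z Reconciling Backup=backup-2",
		"2024-05-01T12:01:05Z Reconciling Backup=backup-3",
		"2024-05-01T12:02:09Z Reconciling Backup=backup-4",
	}
	counts := bucketByHour(logs)
	hours := make([]string, 0, len(counts))
	for h := range counts {
		hours = append(hours, h)
	}
	sort.Strings(hours)
	for _, h := range hours {
		fmt.Println(h, counts[h]) // the 12:00 bucket shows the spike
	}
}
```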
3. Identify the spike: Check if it's a single large object or cascading from a parent:
// In controller code, add detailed logging
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log.WithValues("backup", req.Name).Info("Starting reconcile")
// Count child objects being created/updated
children := &backupv1.SnapshotList{}
r.List(ctx, children, client.MatchingFields{"backupRef": req.Name})
log.WithValues("backup", req.Name, "children", len(children.Items)).Info("Found children")
// If the children count is unexpectedly high, log it (logr has no Warn
// level; use Info with a descriptive message)
if len(children.Items) > 100 {
log.Info("unusual number of children for backup", "backup", req.Name, "count", len(children.Items))
}
return ctrl.Result{}, nil
}
4. Check for external triggers (cron jobs, scheduled tasks):
kubectl get cronjobs -A
kubectl describe cronjob backup-sync -n backup-system
Check if schedule aligns with the 2-hour spike
5. Check if an automated process is updating objects every 2 hours:
kubectl logs -n backup-system deployment/backup-sync | grep -E "sync|patch|update"
Look for timestamps that match the spike pattern
6. If it’s a cascading update (object A is updated, triggers updates to B, C, D, etc.), add debouncing:
// In the controller, requeue with a delay instead of retrying immediately
if err := r.Update(ctx, snapshot); err != nil {
log.Error(err, "Failed to update snapshot")
// Instead of immediate retry, requeue with backoff
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// Don’t immediately queue all related objects; let them be picked up by next watch event
7. Or, batch the updates to reduce cascade:
// Instead of updating each child immediately:
// Collect updates and batch them
snapshots := []*backupv1.Snapshot{}
for _, s := range backup.Spec.Snapshots {
snapshot := &backupv1.Snapshot{}
r.Get(ctx, types.NamespacedName{Name: s, Namespace: backup.Namespace}, snapshot)
snapshot.Status.Phase = "Processing"
snapshots = append(snapshots, snapshot)
}
// Apply all updates in one pass at the end: still N calls, but no interleaved reads of half-updated state
for _, s := range snapshots {
r.Update(ctx, s)
}
8. Verify the spike is gone:
kubectl exec -n controller-system controller-pod -- /bin/sh -c "curl -s localhost:8080/metrics | grep workqueue_depth"
Depth should hold steady, with no 2-hour spikes
Follow-up: Design a monitoring alerting system that detects abnormal reconciliation patterns (cascades, loops, spikes) automatically.
Your controller uses a finalizer to ensure graceful cleanup when a Backup object is deleted. But deletion is hanging for 10 minutes. The finalizer is trying to delete data from an external system (S3, database), but the API is experiencing a temporary outage. The Backup object is stuck in Terminating state. Your SLA says objects must be deletable in < 30 seconds. What's your strategy?
Finalizer stuck on external API timeout = blocking deletion. Balance graceful cleanup with responsiveness:
1. Check the stuck object and finalizer:
kubectl get backup backup-123 -o yaml
# metadata.finalizers should show your controller's finalizer
# deletionTimestamp should be set
# Check object age: should reflect how long it's been terminating
2. Check the controller logs for the external API call:
kubectl logs -n controller-system deployment/backup-controller | grep -i "delete.*S3\|timeout\|external"
# Look for: "Failed to delete S3 backup: timeout after 30s"
3. Immediate solution: Set a timeout on the external API call and proceed anyway:
// BEFORE: no timeout; a hung external API blocks deletion indefinitely
func (r *BackupReconciler) deleteExternalResources(ctx context.Context, backup *backupv1.Backup) error {
return r.s3Client.DeleteBucket(ctx, backup.Name)
}
// AFTER: Timeout + fallback
func (r *BackupReconciler) deleteExternalResources(ctx context.Context, backup *backupv1.Backup) error {
// Bound the external call to 10 seconds
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
err := r.s3Client.DeleteBucket(ctx, backup.Name)
if errors.Is(err, context.DeadlineExceeded) {
log.Info("external delete timed out, removing finalizer anyway", "backup", backup.Name)
// Still remove the finalizer so the object can be deleted;
// the S3 bucket is cleaned up later by a background process
return nil
}
return err
}
4. Implement a background cleanup job for failed deletions:
// A separate controller periodically sweeps for orphaned external resources
type OrphanedResourceReconciler struct {
client.Client
s3Client S3Client // whatever S3 client interface the controller already uses
}
func (r *OrphanedResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// Periodically check for S3 buckets no longer backed by a Backup object
buckets, _ := r.s3Client.ListBuckets(ctx)
for _, bucket := range buckets {
backup := &backupv1.Backup{}
err := r.Get(ctx, types.NamespacedName{Name: bucket.Name}, backup)
if err != nil && errors.IsNotFound(err) {
// Backup was deleted but the S3 bucket remains
r.s3Client.DeleteBucket(ctx, bucket.Name)
}
}
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
5. Add a finalizer removal deadline to prevent permanent blocking:
// If finalizer has been present for > 5 minutes, force removal
if backup.DeletionTimestamp != nil && time.Since(backup.DeletionTimestamp.Time) > 5*time.Minute {
// Remove finalizer by force
controllerutil.RemoveFinalizer(backup, "backup.example.com/finalizer")
r.Update(ctx, backup)
log.Warn("Removing finalizer due to timeout", "backup", backup.Name)
return ctrl.Result{}, nil
}
6. Test the fix:
kubectl delete backup backup-123
Should complete within 15 seconds (timeout + controller reconcile cycle)
Best Practice: Finalizers should have timeouts. Always provide a safety valve to prevent permanent blocking.
Follow-up: Design a framework for finalizers that guarantees removal within a configurable timeout while ensuring eventual cleanup of external resources.
Your controller reconciles 10,000 Backup objects. On cluster startup, all 10,000 are queued for reconciliation. The reconciliation loads data from S3, processes it, and updates the status. Total time: 3 hours. During this time, no other workloads get cluster capacity (high CPU, memory, API bandwidth). New changes to Backups are also delayed. How do you make reconciliation on startup less disruptive?
Full cluster reconciliation on startup is a classic problem. Use rate limiting and prioritization:
1. Monitor the startup behavior:
kubectl logs -n controller-system deployment/backup-controller | head -50
# Should show: "Starting reconciliation of 10000 objects"
kubectl top nodes
# Should show high CPU/memory during reconciliation
2. Implement rate limiting in the controller's workqueue:
import (
"golang.org/x/time/rate"
"k8s.io/client-go/util/workqueue"
)
// Cap the controller at ~10 reconciles/second overall while keeping
// per-item exponential backoff for failures:
rateLimiter := workqueue.NewMaxOfRateLimiter(
workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(10, 1)}, // 10 per second
)
// Wire it in via the per-controller options:
err := ctrl.NewControllerManagedBy(mgr).
For(&backupv1.Backup{}).
WithOptions(controller.Options{RateLimiter: rateLimiter}).
Complete(r)
3. On startup, defer non-critical reconciliations:
// In controller, detect startup phase
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
r.Get(ctx, req.NamespacedName, backup)
// Check if this is a startup reconciliation
// (isStartupPhase is a helper you'd implement, e.g. based on time since process start)
if r.isStartupPhase() && backup.Status.Phase == "Ready" {
// Don't reprocess already-ready backups on startup;
// only process pending/failed ones
return ctrl.Result{}, nil
}
// Heavy processing only if needed
if backup.Status.LastUpdateTime != nil && time.Since(backup.Status.LastUpdateTime.Time) < 1*time.Hour {
// Recently reconciled, defer
return ctrl.Result{RequeueAfter: 1 * time.Hour}, nil
}
// Actual reconciliation
return r.processBackup(ctx, backup)
}
4. Implement smart reconciliation that prioritizes dirty/pending objects:
// Use a priority queue for the workqueue
type BackupReconcileRequest struct {
ctrl.Request
priority int // 1 = high (needs reconciliation), 0 = low (already reconciled)
}
// When enqueuing, set priority based on object state
func (r *BackupReconciler) enqueueBackup(backup *backupv1.Backup) {
priority := 0
if backup.Status.Phase != "Ready" {
priority = 1 // High priority: needs reconciliation
}
// Note: client-go's workqueue has no AddWithPriority; this assumes a small
// priority-queue wrapper you implement yourself
r.workqueue.AddWithPriority(BackupReconcileRequest{
Request: ctrl.Request{NamespacedName: client.ObjectKeyFromObject(backup)},
priority: priority,
})
}
5. Use informer resync interval to prevent full re-reconciliation on startup:
// Declare the resync period before using it, then hand it to the manager
syncPeriod := 10 * time.Minute
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
SyncPeriod: &syncPeriod,
})
// A long resync period avoids periodic full re-reconciliation. Note the
// initial cache sync on startup still enqueues every object once, which is
// why the startup-phase check above matters.
6. Test the optimized startup:
kubectl rollout restart deployment/backup-controller -n controller-system
The new pod should come up without hammering the cluster
kubectl logs -n controller-system -f deployment/backup-controller | grep "Reconciled"
Should show a steady rate (~10/sec), not a burst of 10,000
7. Monitor improvement:
kubectl top nodes
CPU/memory should remain stable during startup
Key Insight: Startup reconciliation should be smart, not brute-force. Only reprocess objects that actually need it.
Follow-up: Design a leader election system for controllers that ensures only the active replica does expensive startup reconciliation, not all replicas.
You're running 3 replicas of your custom controller for redundancy. Each replica has its own workqueue and reconciliation loop. You notice they're all reconciling the SAME object simultaneously. This causes race conditions: Backup object is being updated by all 3 controllers at once, etcd revision conflicts, and reconciliation failures spike. How do you prevent controller collisions?
Multiple controller replicas reconciling the same object = classic distributed system race condition. Implement proper coordination:
1. Check if this is actually happening:
kubectl get pods -n controller-system | grep backup-controller
# Should show 3 replicas
kubectl logs -n controller-system backup-controller-0 -f | grep "Reconciling" | head -5
kubectl logs -n controller-system backup-controller-1 -f | grep "Reconciling" | head -5
kubectl logs -n controller-system backup-controller-2 -f | grep "Reconciling" | head -5
If you see the SAME object name across different controller logs, they’re racing
2. Implement leader election so only one controller actively reconciles:
// controller-runtime manages leader election itself; no extra import needed
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
LeaderElection: true,
LeaderElectionNamespace: "controller-system",
LeaderElectionID: "backup-controller-leader",
})
// Now only 1 of the 3 replicas becomes leader and runs the reconcilers;
// the others are hot standbys
3. Verify leader election is working:
kubectl get leases -n controller-system
Should show "backup-controller-leader" lease
kubectl describe lease backup-controller-leader -n controller-system
holderIdentity should be one of the pod names
renewTime should be recent
4. If you want all 3 replicas to actively reconcile (not just 1), implement object-level locking:
// Each object carries a lock annotation; only the holder reconciles it.
// Caveat: this sketch has no lock TTL, so a crashed replica leaks its
// lock; production code needs a lease-style expiry.
import "os"
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
backup := &backupv1.Backup{}
r.Get(ctx, req.NamespacedName, backup)
// Try to acquire a lock by setting a lock annotation
if backup.Annotations == nil {
backup.Annotations = make(map[string]string)
}
lockKey := "backup.io/lock-holder"
myID := os.Getenv("POD_NAME")
if backup.Annotations[lockKey] != "" && backup.Annotations[lockKey] != myID {
// Another controller is reconciling this object, back off
return ctrl.Result{RequeueAfter: 1*time.Second}, nil
}
// Set lock
backup.Annotations[lockKey] = myID
if err := r.Update(ctx, backup); err != nil {
// Update failed, another controller got the lock first
return ctrl.Result{RequeueAfter: 1*time.Second}, nil
}
// I have the lock, proceed with reconciliation
defer func() {
// Release the lock (in real code, re-fetch the object first; this copy may
// be stale and the Update would then fail on a resourceVersion conflict)
backup.Annotations[lockKey] = ""
r.Update(ctx, backup)
}()
return r.processBackup(ctx, backup)
}
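Why a failed Update safely means "someone else holds the lock": the API server rejects writes made against a stale resourceVersion. A minimal simulation with a version-checked store:

```go
package main

import (
	"fmt"
	"sync"
)

// versionedStore mimics the API server's optimistic concurrency: an update
// only succeeds if the caller read the latest version of the object.
type versionedStore struct {
	mu      sync.Mutex
	version int
	holder  string
}

// tryLock is the annotation-lock pattern: read, then conditionally write.
// A stale read (someone else wrote first) means the lock is lost.
func (s *versionedStore) tryLock(readVersion int, who string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version || s.holder != "" {
		return false // conflict: another replica updated the object first
	}
	s.holder = who
	s.version++
	return true
}

func main() {
	s := &versionedStore{}
	v := s.version // both replicas read version 0
	a := s.tryLock(v, "replica-a")
	b := s.tryLock(v, "replica-b") // stale version: rejected
	fmt.Println(a, b)
}
```

Exactly one of two racing writers wins; the loser requeues and retries, which is what the `RequeueAfter` branch above does.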
5. Alternative: track the active reconciler with a status Condition, relying on the API server's optimistic concurrency:
// Better: Use Conditions in the status to track who’s reconciling
if backup.Status.Conditions == nil {
backup.Status.Conditions = []metav1.Condition{}
}
// Check if another controller is reconciling
for _, cond := range backup.Status.Conditions {
if cond.Type == "Reconciling" && cond.Status == "True" {
holderID := cond.Message
if holderID != myID && time.Since(cond.LastTransitionTime.Time) < 30*time.Second {
// Another controller is active, back off
return ctrl.Result{RequeueAfter: 1 * time.Second}, nil
}
}
}
}
// Set reconciling condition
meta.SetStatusCondition(&backup.Status.Conditions, metav1.Condition{
Type: "Reconciling",
Status: "True",
Reason: "ReconciliationStarted",
Message: myID,
ObservedGeneration: backup.Generation,
})
r.Status().Update(ctx, backup)
// Process
r.processBackup(ctx, backup)
// Clear condition
meta.SetStatusCondition(&backup.Status.Conditions, metav1.Condition{
Type: "Reconciling",
Status: "False",
Reason: "ReconciliationFinished", // Reason is required on metav1.Condition
})
r.Status().Update(ctx, backup)
6. Test with leader election enabled:
kubectl logs -n controller-system -f deployment/backup-controller | grep -E "Leader election|Reconciling"
Should see only 1 replica doing reconciliation
Recommendation: Use leader election for simplicity. Only one controller is active, reducing operational complexity.
Follow-up: Compare leader election vs. object-level locking for multi-controller systems. When would you use each approach?