A pod is Pending for 20 minutes. kubectl describe shows no events. The cluster has 10 nodes, all appear to have free resources. What's your first debugging step?
Pending with no events usually means either the scheduler has not attempted the pod at all, or the events have already expired—events have a TTL (one hour by default), so older failures silently disappear. Watch for fresh events as they arrive: kubectl get events --watch --field-selector involvedObject.name=<pod> (kubectl describe has no watch flag).
While waiting for events, check the scheduler logs: kubectl logs -n kube-system -l component=kube-scheduler --tail=50 (on kubeadm clusters the scheduler is a static pod, not a Deployment; managed clusters may not expose it at all). Grep for the pod name to see why the scheduler is rejecting it.
If there are no logs, the scheduler might not even be trying to schedule the pod. Check if the pod has a spec.nodeName field: kubectl get pod -o yaml | grep nodeName. If it's populated but the node doesn't exist, the pod is stuck. If it's empty, the scheduler hasn't tried to place it yet.
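For reference, here is a sketch of a pod that pins itself to a node via spec.nodeName (the node name "node-42" is hypothetical). Such a pod never enters the scheduler at all, which is one way to end up Pending with zero scheduler events:

```yaml
# Hypothetical pod that bypasses the scheduler entirely: with spec.nodeName
# set, the kubelet on that node starts the pod directly. If "node-42" does
# not exist, the pod sits in Pending forever and the scheduler logs nothing.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeName: node-42        # bypasses scheduling; must be an existing node
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```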
Check that the scheduler itself is healthy: kubectl get pods -n kube-system -l component=kube-scheduler (it typically runs as a static pod, not a Deployment). Then check leader election in its logs: kubectl logs -n kube-system -l component=kube-scheduler | grep -i "leader\|elected". If no instance holds the leader lease, nothing is actively scheduling.
If the scheduler is running, trigger a requeue: any update to the pod object—for example kubectl annotate pod <name> debug/requeue=now --overwrite—can move an unschedulable pod back into the scheduler's active queue (the annotation key itself is meaningless; it's the update event that matters). The more reliable workaround is to delete the pod and let its controller recreate it: kubectl delete pod <name>.
Follow-up: You found events—"no nodes are suitable." All 10 nodes show free CPU and memory in kubectl top. Why does the scheduler still say no suitable nodes?
You add a nodeSelector to a pod: nodeSelector: disk=ssd. The pod becomes Pending. You verify the node label exists. The scheduler says "no suitable nodes." What's wrong?
Verify the label is spelled correctly and the value matches exactly. Labels are case-sensitive. Check the node's actual labels: kubectl get nodes --show-labels and look for disk=ssd. If you see disk=SSD (capital), that won't match your selector.
Verify the pod's nodeSelector is spelled correctly in the spec: kubectl get pod -o yaml | grep -A5 nodeSelector. Is it nodeSelector (not nodeSelectorTerm, not nodeAffinity)? A simple nodeSelector is just a key-value match.
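For comparison, a minimal sketch of the correct nodeSelector form—a key/value map under spec, not a string:

```yaml
# Sketch of a plain nodeSelector. The key and value must match the node
# label exactly; label matching is case-sensitive (disk=SSD != disk=ssd).
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  nodeSelector:
    disk: ssd
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```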
Check if the node you labeled has enough resources for the pod. Even with the right label, if the node is full, the scheduler rejects it. Look at the describe output for the exact reason: kubectl describe pod <name> | grep -A10 "Events:". The message will say "no nodes match selector" or "insufficient resources."
Count the nodes with the label: kubectl get nodes -l disk=ssd. If it returns no nodes, the label doesn't exist or is misspelled. If it returns nodes, check if those nodes have available resources: kubectl describe node <node> | grep -A10 "Allocated resources".
As a test, create a pod without the nodeSelector and see if it schedules: kubectl run test --image=busybox --restart=Never -- sleep 3600 (the sleep keeps busybox from exiting immediately). If it schedules elsewhere, the label constraint is the issue. If it stays Pending too, the cluster is likely full.
Follow-up: The nodeSelector matches correctly. The node has free resources. But the pod still won't schedule. The describe output says "no suitable nodes" but doesn't give a specific reason. What now?
You have a node with a taint: key=value:NoSchedule. You add a matching toleration to a pod, and it schedules to that node. But a pod without the toleration also schedules onto a second node that supposedly carries the same taint. Why?
Taints come with effects: NoSchedule, NoExecute, and PreferNoSchedule. If the second node has PreferNoSchedule (not NoSchedule), pods without the toleration can still schedule there—it's just a preference, not a hard block.
Check both node taints: kubectl describe node <node1> | grep Taints and kubectl describe node <node2> | grep Taints. If node1 shows "key=value:NoSchedule" and node2 shows "key=value:PreferNoSchedule", that explains it: the second node's taint is only a preference, so even a pod without any toleration can land there.
Or, the second node might not have the taint at all. You thought you tainted it but didn't finish the command, or the taint was removed. Verify all node taints at once: kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints' to see the full picture.
Note: NoExecute means existing pods are evicted if they don't tolerate the taint. NoSchedule means new pods can't schedule; existing pods stay. PreferNoSchedule is just a weight in scheduling—the scheduler prefers not to schedule there but will if necessary.
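A minimal sketch of a toleration that satisfies the key=value:NoSchedule taint described above:

```yaml
# Toleration matching a key=value:NoSchedule taint. Note: PreferNoSchedule
# taints never require a toleration—they only lower a node's score—so this
# only changes behavior on NoSchedule/NoExecute-tainted nodes.
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```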
Follow-up: Both nodes have identical taints and effects. The pod has a matching toleration for one but not the other. Yet the pod schedules to both nodes. How?
You have 3 nodes. Node A and B have 8 cores free, Node C has 2 cores free. You deploy a pod requesting 4 cores. The scheduler picks Node C. Why would it pack the pod onto the smallest-available node?
That's counter-intuitive. By default, the kube-scheduler's scoring favors spreading: the NodeResourcesFit plugin with the LeastAllocated strategy scores nodes with more free resources higher. But the exact behavior depends on which scoring plugins are enabled and their weights.
Check the scheduler config: kubectl get configmap -n kube-system kube-scheduler-config -o yaml (or similar; the config location varies by setup). Look at the enabled plugins and their weights. If your scheduler has NodeResourcesFit with LeastAllocated strategy, it tries to spread workload. If it has MostAllocated, it consolidates onto fewer nodes (bin-packing).
In newer K8s (1.19+), the default is LeastAllocated for spreading. But if you're on an older cluster or someone customized the scheduler, MostAllocated would explain bin-packing behavior—it tries to fit pods onto the smallest-available-resource node to minimize fragmentation.
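For clusters where you control the scheduler, the scoring strategy lives in the KubeSchedulerConfiguration file passed via --config. A sketch, assuming the kubescheduler.config.k8s.io/v1 API available in newer releases:

```yaml
# Sketch of a scheduler config selecting bin-packing (MostAllocated) versus
# spreading (LeastAllocated) for the NodeResourcesFit scoring plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated      # bin-packing; LeastAllocated spreads
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```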
Another factor: affinity rules. If there's pod affinity or node affinity that prefers Node C, it would schedule there. Check the pod spec: kubectl get pod -o yaml | grep -A15 "affinity".
Or, if Node A or B have taints and the pod doesn't tolerate them, they're filtered out and only Node C is available. Always check taints as a first filter in the scheduler's decision.
Follow-up: You change the scheduler to LeastAllocated. The pod still goes to Node C. What else could prefer Node C?
Two pods compete for a single available node slot. Both have the same priority. The scheduler has to pick one; the other becomes Pending. How does the scheduler break the tie?
If priorities are equal, the scheduler's queue falls back to arrival order: the pod created first is dequeued first. Check creation timestamps: kubectl get pods --sort-by=.metadata.creationTimestamp. The oldest pod is typically attempted first.
But there's more nuance. The scheduler doesn't just pick the pod; it scores all nodes for each pod. For a given pod, it calculates a score for every node (based on resources, affinity, taints, etc.). It picks the node with the highest score. If multiple pods are competing, they're evaluated in order (oldest first), and the first pod to get a node "wins" and is scheduled.
In practice, if both pods have equal priority and the node scores are similar, whichever pod the scheduler evaluates first gets the node. That's usually the oldest pod.
But if one pod has a nodeAffinity or anti-affinity to the available node, that affects the score. Pod A might score 80 for Node X; Pod B scores 90 for Node X. Pod B gets Node X even if Pod A is older.
To control tie-breaking explicitly, use priorityClassName. Pods with higher priority class are scheduled first. So if you want Pod A to win, give it a higher priority: priorityClassName: important on Pod A and a lower priority on Pod B.
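A sketch of that setup—the class name "important" and value are illustrative, not built-ins:

```yaml
# PriorityClass for pods that must win scheduling ties; higher value =
# dequeued (and, if needed, preempting) ahead of lower-priority pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important
value: 100000
globalDefault: false
description: "Pods that should be scheduled ahead of equal-age competitors"
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  priorityClassName: important
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```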
Follow-up: Pod A has older creation time and higher priority, but Pod B still gets scheduled first. What's happening?
You have a pod with podAntiAffinity set to spread replicas across nodes. 3 nodes available, 3 replicas desired. But only 2 replicas schedule; 1 is Pending. Why?
Check the anti-affinity topology key: kubectl get pod -o yaml | grep -A10 "podAntiAffinity". Look at the topologyKey—it's usually kubernetes.io/hostname (one per node) or topology.kubernetes.io/zone (one per availability zone).
If topologyKey=kubernetes.io/hostname and you have 3 schedulable nodes, 3 replicas should fit—unless one node is filtered out for another reason (a taint the pod doesn't tolerate, insufficient resources, or a conflicting affinity rule). If topologyKey=topology.kubernetes.io/zone and your 3 nodes span only 2 zones, then Pod 1 and Pod 2 occupy both zones and Pod 3 has no distinct zone left, so it stays Pending.
Check the anti-affinity rules more carefully: is it requiredDuringSchedulingIgnoredDuringExecution (hard constraint) or preferredDuringSchedulingIgnoredDuringExecution (soft, best-effort)? If it's required and there aren't enough distinct topology domains, the pod stays Pending indefinitely.
Verify the pod count: kubectl get pods --selector=app=myapp -o wide. Are there really only 2 scheduled? If so, describe one of the scheduled pods and one of the Pending pods to compare their anti-affinity constraints.
As a test, relax the topology key or switch to soft anti-affinity (preferred) instead of required. Then redeploy and see if all 3 replicas schedule. If they do, your topology (zone/region/hostname) doesn't have enough distinct values for your anti-affinity constraint.
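A sketch of the soft (preferred) form, assuming an app=myapp label on the replicas:

```yaml
# Soft pod anti-affinity: the scheduler prefers distinct hostnames but will
# co-locate replicas rather than leave them Pending. The hard form swaps in
# requiredDuringSchedulingIgnoredDuringExecution (a list of terms, no weight).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: myapp
          topologyKey: kubernetes.io/hostname
```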
Follow-up: You change to soft anti-affinity. All 3 replicas schedule, but 2 end up on the same node. Is this expected?
You add PodTopologySpread constraints to ensure replicas are spread across zones. But the first pod schedules in zone-a, and all subsequent replicas also schedule in zone-a instead of spreading. Why isn't the spread working?
Check the maxSkew value and the constraint type: kubectl get pod -o yaml | grep -A15 "topologySpreadConstraints". maxSkew=1 means the difference between the zone with the most matching pods and the zone with the fewest may not exceed 1 (max - min <= maxSkew).
Example: 3 replicas, 3 zones, maxSkew=1. The scheduler tries to give each zone 1 replica, but if it places all 3 in zone-a, the skew is 3-0 (max - min replicas per zone), which exceeds maxSkew=1. This should violate the constraint.
Unless the constraint is soft. For topology spread, softness is controlled by the whenUnsatisfiable field: DoNotSchedule is hard; ScheduleAnyway is soft, meaning the scheduler tries to spread but tolerates violations when there's no better option. If it's ScheduleAnyway, clumping in one zone is allowed.
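A sketch of the hard form, assuming an app=myapp label on the replicas:

```yaml
# Hard zone-spread constraint: no zone may exceed the emptiest zone by more
# than 1 matching pod; excess replicas stay Pending rather than violate it.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # ScheduleAnyway makes it best-effort
    labelSelector:
      matchLabels:
        app: myapp
```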
Also check the topology key—is it topology.kubernetes.io/zone? If nodes don't have this label, the scheduler can't group them by zone, so "spreading across zones" doesn't work. Verify node labels: kubectl get nodes --show-labels | grep topology. If no zone labels exist, every node is its own topology, and spread means "different node" not "different zone."
If the constraint is correct and nodes are labeled, but pods still clump, describe the pod and check for other constraints (affinity rules) that might override the spread. Sometimes affinity to a specific node conflicts with spreading constraints, and the scheduler picks the higher-priority constraint.
Follow-up: PodTopologySpread is set to hard (DoNotSchedule). The replicas should spread across zones but don't. You inspect the nodes and they all have zone labels. The spread constraint looks correct. Why is it still not working?
A critical pod gets preempted and moved off a node to make room for a higher-priority pod. Now the critical pod is Pending on another node. How do you prevent this scenario?
Preemption happens when the scheduler has no available node for a high-priority pod and decides to evict lower-priority pods to make room. The evicted pod is moved to Pending, and if no other node can fit it, it stays Pending.
To reduce preemption risk: (1) set priorityClassName on your critical pod at least as high as any pod that might preempt it, (2) add a PodDisruptionBudget—the scheduler tries to choose preemption victims that don't violate PDBs, though this is best-effort, (3) where acceptable, give would-be preemptors a PriorityClass with preemptionPolicy: Never so they queue instead of evicting.
Check the priority classes in your cluster: kubectl get priorityclasses. Assign a high priority class to critical pods. Then, even if a higher-priority pod comes in, the scheduler won't preempt your pod if there's an alternative (other nodes, scale-up, etc.).
To see if a pod was preempted: kubectl describe pod <name> | grep -i "preempt\|evict". If the event says "preempted," that's what happened.
Note: a PodDisruptionBudget does not guarantee protection from preemption—it firmly blocks eviction-API disruptions (drains, autoscaler scale-down), but the scheduler only honors PDBs best-effort when picking preemption victims and will violate one if that's the only way to place the high-priority pod. Also, preemptionPolicy: Never on a PriorityClass controls whether pods of that class preempt others; it is not a shield for the pods themselves.
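A sketch of a non-preempting priority class (the name and value are illustrative):

```yaml
# High-priority class whose pods never preempt others: they get dequeued
# ahead of lower-priority pods but wait for capacity instead of evicting.
# This governs whether pods OF this class preempt, not whether they can
# BE preempted by still-higher-priority pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-non-preempting
value: 1000000
preemptionPolicy: Never
description: "High priority without evicting lower-priority workloads"
```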
Follow-up: You set preemptionPolicy: Never on your high-priority pod. A critical pod still gets preempted. How is this possible?