Workload Rebalancing (Descheduler)
Kubernetes scheduling is a point-in-time decision. When a Pod is created, the kube-scheduler selects the most appropriate node based on the cluster state at that moment. As the cluster changes, that original decision can become stale: nodes can become over- or under-utilized, taints or labels can change, affinity rules can be updated, failed nodes can recover, and new nodes can be added.
The Alauda Build of Descheduler plugin helps rebalance your cluster by identifying running Pods that violate scheduling policy or are placed on less suitable nodes, evicting those Pods, and allowing the default scheduler to place the replacement Pods on more appropriate nodes.
The descheduler does not schedule replacement Pods itself. It only calls the Kubernetes eviction API. Replacement Pods are created by the owning controller, such as a Deployment, ReplicaSet, StatefulSet, or Job, and are then placed by the default scheduler.
Typical scenarios for using the descheduler include:
- Nodes are underutilized or overutilized.
- Node labels, node taints, Pod affinity, or Pod anti-affinity requirements changed after Pods were scheduled.
- Node failures moved Pods onto a smaller set of nodes, and recovered nodes should receive workload again.
- New nodes were added and existing workloads should be redistributed.
- Pods have restarted too many times or have been running for longer than the expected lifecycle.
Pod Eviction Rules and Eligibility
To maintain cluster stability, the descheduler must be conservative about which Pods it evicts. Keep these protections in place unless you have verified that the workload will be recreated safely and that eviction is operationally acceptable.
- Protected Pods:
  - Pods in system or platform namespaces, such as kube-system and ACP platform namespaces. If you add namespace include or exclude rules, keep platform namespaces excluded.
  - Critical Pods with priorityClassName set to system-cluster-critical or system-node-critical.
  - Static Pods, mirror Pods, or standalone Pods that are not managed by a controller, because these Pods will not be recreated automatically.
  - Pods associated with DaemonSets.
  - Pods with local storage, unless the policy explicitly disables the PodsWithLocalStorage protection.
- Pod Disruption Budgets (PDBs): The descheduler uses the eviction subresource. If evicting a Pod would violate its PDB, the Pod is not evicted.
- Eviction Order: When multiple Pods are eligible, lower-priority Pods are selected before higher-priority Pods. For Pods with the same priority, BestEffort Pods are evicted before Burstable and Guaranteed Pods.
- Explicit Eviction Override: A Pod annotated with descheduler.alpha.kubernetes.io/evict is eligible for eviction even when some internal descheduler checks would normally skip it. This annotation does not bypass PDB protection. Use it only when you know how the Pod will be recreated.
- Eviction Preference: A Pod annotated with descheduler.alpha.kubernetes.io/prefer-no-eviction asks the descheduler to avoid evicting it. Whether this is advisory or mandatory depends on the DefaultEvictor noEvictionPolicy setting.
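The annotations above are read from the Pod itself, so for controller-managed workloads they belong in the Pod template rather than on the controller object. The following is a minimal sketch, assuming a hypothetical Deployment named example-app in a hypothetical namespace example-ns; the image reference and the annotation value "true" are illustrative assumptions, not values taken from the plugin documentation.

```yaml
# Hypothetical workload for illustration only; names, namespace, and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: example-ns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Opts these Pods in to eviction; PDBs are still enforced.
        # The value "true" is a convention assumed here.
        descheduler.alpha.kubernetes.io/evict: "true"
        # Alternative (do not combine with the line above): ask the
        # descheduler to avoid evicting these Pods.
        # descheduler.alpha.kubernetes.io/prefer-no-eviction: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0
```

Because the annotation sits in the Pod template, changing it triggers a rollout, and every replacement Pod created by the controller carries the same annotation.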
Understanding Descheduler Strategies
The descheduler policy enables strategy plugins. Use the strategy that matches the operational goal; avoid enabling strategies that pull the cluster in opposite directions, such as spreading workloads with LowNodeUtilization and compacting workloads with HighNodeUtilization in the same policy profile.
Common strategy groups:
Installing Descheduler in ACP
The Alauda Build of Descheduler is packaged and managed as a Cluster Plugin in ACP.
- Upload the Plugin Package:
  - Obtain the Alauda Build of Descheduler plugin package from the Alauda Customer Portal.
  - Publish it to the platform using the violet tool. For detailed CLI instructions, refer to CLI Tools.
  - Navigate to Administrator > Marketplace > Upload Packages and verify the package is present under the Cluster Plugin tab.
- Install the Plugin:
  - Navigate to Administrator > Marketplace > Cluster Plugins.
  - Select the target cluster, find the Alauda Build of Descheduler plugin, and click Install.
  - Adjust installation-time configuration options in the dynamic form if required, then confirm the installation.
Configuring Scheduling Policies
After the plugin is installed, do not edit Helm values directly. The installed plugin renders Kubernetes resources, and the runtime descheduler policy is stored as YAML in the descheduler ConfigMap.
Runtime Mode
- CronJob (Recommended): The descheduler runs periodically as a Job. This mode avoids running a persistent agent when the cluster state is stable. Updated policy YAML is loaded by the next scheduled Job.
- Deployment: The descheduler runs as a continuous Pod and reconciles the cluster at the interval configured during plugin installation. After updating the policy YAML, restart the descheduler Pod so the running process reloads the configuration.
Editing the Policy YAML
Locate the descheduler policy ConfigMap:
Back up the current policy before editing it:
Edit the ConfigMap returned by the previous command:
In data.policy.yaml, keep the existing policy and merge only the fields you need to change. Do not replace the whole policy with an example, because doing so can remove existing protections, enabled strategies, namespace filters, or eviction limits.
Example: Rebalancing Overutilized Nodes
This example enables LowNodeUtilization with a threshold pattern commonly used for spread-style descheduling: a node is treated as underutilized when its CPU, memory, and Pod count are all below 20%, and as overutilized when any of those resources is above 50%.
Relevant policy fragment:
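A minimal sketch of such a fragment, assuming the plugin uses the upstream descheduler/v1alpha2 policy format. The profile name default is a hypothetical placeholder; keep whatever profile name and other settings already exist in your ConfigMap and merge only the LowNodeUtilization entries.

```yaml
# Sketch only: assumes the upstream descheduler/v1alpha2 policy schema.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default              # hypothetical profile name
    pluginConfig:
      - name: LowNodeUtilization
        args:
          # A node is underutilized only if it is below ALL of these.
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          # A node is overutilized if it is above ANY of these.
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
```

Nodes that fall between the two sets of values are considered appropriately utilized: they are neither a source of evictions nor counted as underutilized.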
Notes:
- thresholds and targetThresholds must define the same resource keys.
- The valid percentage range is 0 to 100.
- thresholds must not be greater than targetThresholds for the same resource.
- The strategy only runs when at least one underutilized node and one overutilized node exist.
- By default, node utilization is calculated from Pod resource requests and node allocatable resources. If you need actual-usage-based descheduling, configure supported metrics providers and metricsUtilization in the policy.
Common Policy Customizations
Example: Lifecycle and Health Cleanup
This fragment evicts Pods older than 24 hours and Pods whose containers have restarted more than 100 times:
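A sketch of what such a fragment could look like, assuming the upstream descheduler/v1alpha2 policy format and the upstream PodLifeTime and RemovePodsHavingTooManyRestarts plugins. The profile name default and the includingInitContainers choice are assumptions for illustration.

```yaml
# Sketch only: assumes the upstream descheduler/v1alpha2 policy schema.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default              # hypothetical profile name
    pluginConfig:
      - name: PodLifeTime
        args:
          maxPodLifeTimeSeconds: 86400   # 24 hours
      - name: RemovePodsHavingTooManyRestarts
        args:
          # Restart-count threshold; verify in your descheduler version
          # whether the comparison at exactly 100 is inclusive.
          podRestartThreshold: 100
          # Assumption: also count init container restarts.
          includingInitContainers: true
    plugins:
      deschedule:
        enabled:
          - PodLifeTime
          - RemovePodsHavingTooManyRestarts
```

A 24-hour lifetime applies to every eligible Pod in scope, so pair it with namespace filters or label selectors unless periodic recycling of all workloads is intended.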
Example: Affinity, Taints, and Topology Drift
This fragment evicts Pods that no longer match updated node taints, required node affinity, inter-Pod anti-affinity, or hard topology spread constraints:
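A sketch of what such a fragment could look like, assuming the upstream descheduler/v1alpha2 policy format and upstream plugin names. The profile name default is a hypothetical placeholder.

```yaml
# Sketch only: assumes the upstream descheduler/v1alpha2 policy schema.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default              # hypothetical profile name
    pluginConfig:
      - name: RemovePodsViolatingNodeAffinity
        args:
          # Only required (hard) node affinity is checked.
          nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          # Only hard constraints (whenUnsatisfiable: DoNotSchedule).
          constraints:
            - DoNotSchedule
    plugins:
      deschedule:
        enabled:
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
      balance:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint
```

In the upstream schema the topology spread plugin is registered under the balance extension point, while the taint, node affinity, and anti-affinity plugins are registered under deschedule, which is why the sketch lists them separately.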
Verifying Installation and Evictions
1. Check Plugin Installation Status
Verify that the ModuleInfo has transitioned to the Running state:
2. Verify the Policy ConfigMap
Check that the policy YAML contains the expected strategies and protections:
3. View Descheduler Runs and Logs
If running as a CronJob, list the completed or running Jobs:
If running as a Deployment, confirm the running Pod and restart it after policy changes:
Retrieve descheduler logs to check node evaluation, skipped Pods, and eviction actions:
Example eviction log:
4. Check Pod Events
When a Pod is evicted by the descheduler, inspect Pod events and the Pod description. The event reason can vary by strategy, so do not rely on a single reason=Descheduled filter.
Operational Guidance
- Start in a narrow scope: limit namespaces, set conservative eviction limits, and verify logs before broadening the policy.
- Ensure workloads have controllers and sufficient replicas before enabling strategies that may evict many Pods.
- Keep PDBs current for critical applications. The descheduler respects PDBs, but missing PDBs provide no disruption budget.
- Use HighNodeUtilization only when the scheduler or autoscaler is configured to compact Pods. Otherwise, evicted Pods may be spread again.
- Do not disable local-storage, DaemonSet, system-critical, or standalone-Pod protections unless the workload has an explicit, tested recovery path.
- For actual-usage-based decisions, confirm that metrics providers are configured and that the selected strategy consumes those metrics; otherwise utilization strategies are based on requests.
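The narrow-scope advice above can be sketched as a policy that caps evictions per run and limits one strategy to a single namespace. This assumes the upstream descheduler/v1alpha2 policy format; the namespace example-ns, the profile name default, and the specific limit values are hypothetical and should be tuned for your cluster.

```yaml
# Sketch only: assumes the upstream descheduler/v1alpha2 policy schema.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
# Conservative caps per descheduling run (illustrative values).
maxNoOfPodsToEvictPerNode: 2
maxNoOfPodsToEvictPerNamespace: 5
profiles:
  - name: default              # hypothetical profile name
    pluginConfig:
      - name: PodLifeTime
        args:
          maxPodLifeTimeSeconds: 86400
          # Restrict this strategy to one hypothetical namespace.
          namespaces:
            include:
              - example-ns
    plugins:
      deschedule:
        enabled:
          - PodLifeTime
```

Once the logs show only the expected Pods being evicted, the namespace list can be widened and the caps raised in small steps.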