[ https://issues.apache.org/jira/browse/YUNIKORN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mit Desai updated YUNIKORN-3122:
--------------------------------
Summary: Optimize Node Evaluation by Pre-filtering Tainted Nodes (was:
Optimize Node Evaluation by Pre-filtering Tainted Nodes Based on Pod
Tolerations)
> Optimize Node Evaluation by Pre-filtering Tainted Nodes
> -------------------------------------------------------
>
> Key: YUNIKORN-3122
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3122
> Project: Apache YuniKorn
> Issue Type: New Feature
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
>
> h3. Summary
> Implement node pre-filtering in YuniKorn so that nodes whose taints the pod
> does not tolerate are skipped before predicate evaluation in each scheduling
> cycle. This optimization should reduce scheduling overhead in large clusters
> that mix tainted and untainted nodes.
> h3. Background
> In large Kubernetes clusters with node taints for workload isolation (e.g.,
> different nodepools for security, compliance, or resource requirements),
> YuniKorn currently evaluates all nodes during each scheduling cycle
> regardless of whether the pod being scheduled has the necessary tolerations.
> This leads to significant performance overhead as the scheduler:
> # {*}Evaluates Incompatible Nodes{*}: Processes nodes that will ultimately
> be rejected due to taint mismatches
> # {*}Wastes CPU Cycles{*}: Performs unnecessary predicate checks on
> unsuitable nodes
> # {*}Increases Scheduling Latency{*}: Adds overhead proportional to the
> number of tainted nodes in the cluster
> h4. Real-World Impact
> * {*}Cluster Size{*}: 700+ nodes with ~50% having nodepool taints
> * {*}Current Behavior{*}: Scheduler evaluates all 700+ nodes per cycle
> * {*}Optimal Behavior{*}: Should evaluate only the ~350 compatible nodes per
> cycle for a pod that does not tolerate the nodepool taints
> * {*}Performance Gain{*}: Potential ~50% reduction in node evaluation
> overhead for such pods
> YuniKorn's current node evaluation logic:
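> {code:go}
> // Simplified sketch of the current approach (not the actual YuniKorn code):
> // every node goes through the full predicate evaluation, including nodes
> // whose taints the pod cannot tolerate.
> for _, node := range cluster.Nodes {
>     // Predicates include the taint/toleration check.
>     if evaluateNodePredicates(node, pod) {
>         attemptAllocation(node, pod)
>     }
> }
> {code}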
> This approach evaluates nodes that are guaranteed to fail taint/toleration
> checks, wasting computational resources.
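
The pre-filtering idea described above could look roughly like the following. This is a minimal sketch, not YuniKorn's implementation: the `Taint`, `Toleration`, and `Node` types are simplified stand-ins for the Kubernetes API types, and `tolerates`, `schedulable`, and `prefilterNodes` are hypothetical helper names. The matching rules are assumed to follow the usual Kubernetes taint/toleration semantics, with `PreferNoSchedule` treated as a soft taint that never excludes a node.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes Taint and Toleration types
// (assumption: only the fields needed for matching are modeled).
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// Node is a hypothetical, minimal view of a cluster node.
type Node struct {
	Name   string
	Taints []Taint
}

// tolerates reports whether one toleration matches one taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal": // Equal is the default operator
		return tol.Value == t.Value
	}
	return false
}

// schedulable reports whether the tolerations cover every hard taint
// (NoSchedule, NoExecute) on the node.
func schedulable(n Node, tols []Toleration) bool {
	for _, t := range n.Taints {
		if t.Effect == "PreferNoSchedule" {
			continue // soft taint: does not exclude the node
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// prefilterNodes returns only the nodes worth running full predicates on.
func prefilterNodes(nodes []Node, tols []Toleration) []Node {
	var out []Node
	for _, n := range nodes {
		if schedulable(n, tols) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "general-1"},
		{Name: "secure-1", Taints: []Taint{{Key: "nodepool", Value: "secure", Effect: "NoSchedule"}}},
		{Name: "gpu-1", Taints: []Taint{{Key: "nodepool", Value: "gpu", Effect: "NoSchedule"}}},
	}
	tols := []Toleration{{Key: "nodepool", Operator: "Equal", Value: "secure", Effect: "NoSchedule"}}
	for _, n := range prefilterNodes(nodes, tols) {
		fmt.Println(n.Name) // candidates for full predicate evaluation
	}
}
```

With the sample data in `main`, only `general-1` and `secure-1` reach the full predicate evaluation and `gpu-1` is dropped early. In a real scheduler the filtered set would probably need to be cached per distinct toleration set, so that the filter itself does not become a per-pod scan of all nodes.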