[ 
https://issues.apache.org/jira/browse/YUNIKORN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-3122:
-----------------------------------
    Target Version: 1.9.0  (was: 1.8.0)

> Optimize Node Evaluation by Pre-filtering Tainted Nodes
> -------------------------------------------------------
>
>                 Key: YUNIKORN-3122
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3122
>             Project: Apache YuniKorn
>          Issue Type: New Feature
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major
>
> h3. Summary
> Implement node pre-filtering in YuniKorn so that nodes whose taints are not 
> tolerated by the pod are skipped before they are evaluated in a scheduling 
> cycle. This optimization reduces per-cycle scheduling work in large 
> clusters with a mix of tainted and untainted nodes.
> h3. Background
> In large Kubernetes clusters with node taints for workload isolation (e.g., 
> different nodepools for security, compliance, or resource requirements), 
> YuniKorn currently evaluates all nodes during each scheduling cycle 
> regardless of whether the pod being scheduled has the necessary tolerations. 
> This leads to significant performance overhead as the scheduler:
>  # {*}Evaluates Incompatible Nodes{*}: Processes nodes that will ultimately 
> be rejected due to taint mismatches
>  # {*}Wastes CPU Cycles{*}: Performs unnecessary predicate checks on 
> unsuitable nodes
>  # {*}Increases Scheduling Latency{*}: Adds overhead proportional to the 
> number of tainted nodes in the cluster
> h4. Real-World Impact
>  * {*}Cluster Size{*}: 700+ nodes with ~50% having nodepool taints
>  * {*}Current Behavior{*}: Scheduler evaluates all 700+ nodes per cycle
>  * {*}Optimal Behavior{*}: Should evaluate only the ~350 compatible nodes 
> per cycle
>  * {*}Performance Gain{*}: Potential ~50% reduction in node evaluation 
> overhead
> YuniKorn's current node evaluation logic:
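> {code:go}
> // Current inefficient approach (illustrative sketch; helper names are
> // placeholders, not actual YuniKorn functions)
> for _, node := range cluster.Nodes {
>     // Predicate evaluation includes the taint/toleration check
>     if !evaluateNodePredicates(node, pod) {
>         continue
>     }
>     attemptAllocation(node, pod)
> }
> {code}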
> This approach evaluates nodes that are guaranteed to fail taint/toleration 
> checks, wasting computational resources.
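
For illustration, the pre-filtering idea described above could be sketched roughly as follows. This is a minimal, self-contained example and not YuniKorn's implementation: the `Taint`, `Toleration`, and `Node` types are simplified stand-ins for the Kubernetes core/v1 types, and `tolerates`, `nodeCompatible`, and `PreFilterNodes` are hypothetical names. The matching rules are assumed to follow the usual Kubernetes semantics (an empty toleration key or effect acts as a wildcard, `Exists` ignores the value, and `PreferNoSchedule` taints never exclude a node).

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes core/v1
// types; only the fields needed for this sketch are modelled.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Equal" (also the default when empty) or "Exists"
	Value    string
	Effect   string // empty matches every effect
}

// Node is a hypothetical, minimal node record.
type Node struct {
	Name   string
	Taints []Taint
}

// tolerates reports whether one toleration covers one taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal":
		return tol.Value == t.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// nodeCompatible reports whether every hard taint on the node is tolerated.
// PreferNoSchedule is a soft preference, so it never excludes a node.
func nodeCompatible(n Node, tols []Toleration) bool {
	for _, t := range n.Taints {
		if t.Effect == "PreferNoSchedule" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// PreFilterNodes returns only the nodes worth passing to full predicate
// evaluation for a pod with the given tolerations.
func PreFilterNodes(nodes []Node, tols []Toleration) []Node {
	var out []Node
	for _, n := range nodes {
		if nodeCompatible(n, tols) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "general-1"},
		{Name: "secure-1", Taints: []Taint{{Key: "nodepool", Value: "secure", Effect: "NoSchedule"}}},
		{Name: "gpu-1", Taints: []Taint{{Key: "nodepool", Value: "gpu", Effect: "NoSchedule"}}},
	}
	tols := []Toleration{{Key: "nodepool", Operator: "Equal", Value: "gpu", Effect: "NoSchedule"}}
	for _, n := range PreFilterNodes(nodes, tols) {
		fmt.Println(n.Name) // prints general-1, then gpu-1
	}
}
```

In this sketch the filter runs once per pod before the evaluation loop, so the more expensive predicate checks only see nodes that can pass the taint/toleration check. A real implementation would likely cache the result per distinct toleration set rather than rescanning all nodes for every pod.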


