[ 
https://issues.apache.org/jira/browse/YUNIKORN-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-3118:
--------------------------------------------
    Target Version: 1.9.0  (was: 1.8.0)

> Implement Parallel TryNode Evaluation for Improved Scheduling Performance
> -------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3118
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3118
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Summary
> Implement parallel evaluation of nodes during the scheduling process to 
> significantly improve scheduling latency in large clusters. This enhancement 
> introduces configurable parallelization of the TryNode evaluation process 
> while maintaining backward compatibility.
> h3. Background
> In large Kubernetes clusters with many nodes, the current sequential node 
> evaluation process can become a bottleneck during scheduling. Each allocation 
> request must evaluate nodes one by one, leading to increased scheduling 
> latency, especially when dealing with multiple pending pods.
> h3. Proposed Solution
> Add a new configuration parameter `trynodesthreadcount` that sets the 
> number of parallel threads used for node evaluation during scheduling.
> h3. Key Features
> 1. {*}Configurable Parallelism{*}: New `trynodesthreadcount` parameter in 
> partition configuration
> 2. {*}Backward Compatibility{*}: Defaults to sequential behavior (value = 1) 
> when not configured
> 3. {*}Thread Safety{*}: Evaluations run in goroutines synchronized with semaphores
> 4. {*}Performance Optimization{*}: Implements dry-run evaluation before 
> actual allocation attempts
> Configuration Example:
> {code:yaml}
> partitions:
>   - name: default
>     trynodesthreadcount: 20 # Enable parallel evaluation with 20 threads
>     queues:
>       - name: root
> {code}
> h3. Expected Benefits
>  - _4-5x lower_ scheduling latency for clusters with many nodes
>  - Better resource utilization during scheduling cycles
>  - Improved user experience with faster pod startup times
>  - Scalability improvements for large clusters
> h4. Performance Testing Results
> Based on internal testing:
>  - Small workloads (5 pods): 4.5x faster scheduling
>  - Medium workloads (25 pods): 5x faster scheduling
>  - Large batch scenarios: 4x improvement in processing time
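
The Key Features listed in the description (a configurable thread count, a dry-run check before allocation, goroutines bounded by a semaphore) can be illustrated with a minimal Go sketch. This is not the YuniKorn implementation: the `Node` type and the `fits` and `tryNodesParallel` functions are hypothetical names made up for illustration, and the sketch assumes a single integer resource and evaluates every node rather than stopping early.

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a hypothetical stand-in for a scheduler node (not a YuniKorn type).
type Node struct {
	Name      string
	Available int // free resource units
}

// fits is the dry-run check: it only reads node state and changes nothing.
func fits(n Node, ask int) bool { return n.Available >= ask }

// tryNodesParallel runs the dry-run check on all nodes, with at most
// threadCount checks in flight, then returns the first node (in the
// original order) that fits. threadCount = 1 gives sequential evaluation.
func tryNodesParallel(nodes []Node, ask, threadCount int) (Node, bool) {
	if threadCount < 1 {
		threadCount = 1
	}
	results := make([]bool, len(nodes))     // one slot per node, no shared writes
	sem := make(chan struct{}, threadCount) // counting semaphore
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while threadCount checks run
		go func(i int, n Node) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = fits(n, ask)
		}(i, n)
	}
	wg.Wait()
	// The actual allocation attempt would happen here, sequentially,
	// on the first node that passed the dry run.
	for i, ok := range results {
		if ok {
			return nodes[i], true
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{Name: "n1", Available: 2},
		{Name: "n2", Available: 8},
		{Name: "n3", Available: 16},
	}
	if n, ok := tryNodesParallel(nodes, 4, 2); ok {
		fmt.Println("allocate on", n.Name)
	}
}
```

Picking the winner from the ordered result slice keeps the choice deterministic regardless of which goroutine finishes first, which is one plausible way to keep parallel behavior consistent with the sequential default.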



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
