[
https://issues.apache.org/jira/browse/YUNIKORN-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-3118:
--------------------------------------------
Target Version: 1.9.0 (was: 1.8.0)
> Implement Parallel TryNode Evaluation for Improved Scheduling Performance
> -------------------------------------------------------------------------
>
> Key: YUNIKORN-3118
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3118
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
> Labels: pull-request-available
>
> h3. Summary
> Implement parallel evaluation of nodes during the scheduling process to
> reduce scheduling latency in large clusters. This enhancement introduces
> configurable parallelization of the TryNode evaluation process while
> maintaining backward compatibility.
> h3. Background
> In large Kubernetes clusters with many nodes, the current sequential node
> evaluation process can become a bottleneck during scheduling. Each allocation
> request must evaluate nodes one by one, leading to increased scheduling
> latency, especially when dealing with multiple pending pods.
> h3. Proposed Solution
> Add a new configuration parameter `trynodesthreadcount` that sets the
> number of parallel threads used for node evaluation during scheduling.
> h3. Key Features
> 1. {*}Configurable Parallelism{*}: New `trynodesthreadcount` parameter in
> partition configuration
> 2. {*}Backward Compatibility{*}: Defaults to sequential behavior (value = 1)
> when not configured
> 3. {*}Thread Safety{*}: Node evaluation runs in goroutines, with semaphores
> used for synchronization
> 4. {*}Performance Optimization{*}: Implements dry-run evaluation before
> actual allocation attempts
> Configuration Example:
> {code:yaml}
> partitions:
>   - name: default
>     trynodesthreadcount: 20 # Enable parallel evaluation with 20 threads
>     queues:
>       - name: root
> {code}
> h3. Expected Benefits
> - _4-5x reduction_ in scheduling latency for clusters with many nodes
> - Better resource utilization during scheduling cycles
> - Improved user experience with faster pod startup times
> - Scalability improvements for large clusters
> h4. Performance Testing Results
> Based on internal testing:
> - Small workloads (5 pods): 4.5x faster scheduling
> - Medium workloads (25 pods): 5x faster scheduling
> - Large batch scenarios: 4x improvement in processing time
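
The approach listed under Key Features (a bounded number of goroutines, a semaphore, and a dry-run check before the real allocation) could look roughly like the Go sketch below. It is an illustration only and not the code from the pull request: `Node`, `fits`, and `tryNodes` are hypothetical names, and a node is reduced to a single integer of free capacity. A thread count of 1 or less falls back to the sequential loop, mirroring the default described above.

```go
package main

import "sync"

// Node is a hypothetical stand-in for a scheduler node.
// Only free capacity is modelled here.
type Node struct {
	Name      string
	Available int
}

// fits is the dry-run check: it only reads the node and changes nothing.
func fits(n Node, ask int) bool {
	return n.Available >= ask
}

// tryNodes returns the index of the first node, in iteration order, that
// passes the dry-run check for ask, or -1 if none does.
// threadCount <= 1 keeps the sequential behaviour.
func tryNodes(nodes []Node, ask, threadCount int) int {
	if threadCount <= 1 {
		for i, n := range nodes {
			if fits(n, ask) {
				return i
			}
		}
		return -1
	}

	sem := make(chan struct{}, threadCount) // counting semaphore
	results := make([]bool, len(nodes))     // one slot per goroutine, so no lock is needed
	var wg sync.WaitGroup

	for i := range nodes {
		wg.Add(1)
		sem <- struct{}{} // blocks while threadCount evaluations are in flight
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the semaphore slot
			results[i] = fits(nodes[i], ask)
		}(i)
	}
	wg.Wait()

	// Pick the first passing node so the outcome matches the sequential order.
	for i, ok := range results {
		if ok {
			return i
		}
	}
	return -1
}
```

In this sketch the caller would perform the actual allocation on the returned node after the dry run, on a single goroutine, so only the read-only evaluation runs in parallel. Scanning the results in index order keeps the choice of node identical to the sequential path.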
--
This message was sent by Atlassian Jira
(v8.20.10#820010)