[
https://issues.apache.org/jira/browse/YUNIKORN-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mit Desai updated YUNIKORN-3118:
--------------------------------
Description:
h3. Summary
Implement parallel evaluation of nodes during the scheduling process to
significantly improve scheduling latency in large clusters. This enhancement
introduces configurable parallelization of the TryNode evaluation process while
maintaining backward compatibility.
h3. Background
In large Kubernetes clusters with many nodes, the current sequential node
evaluation process can become a bottleneck during scheduling. Each allocation
request must evaluate nodes one by one, leading to increased scheduling
latency, especially when dealing with multiple pending pods.
h3. Proposed Solution
Add a new configuration parameter `trynodesthreadcount` that allows us to
configure the number of parallel threads used for node evaluation during
scheduling.
h3. Key Features:
1. {*}Configurable Parallelism{*}: New `trynodesthreadcount` parameter in
partition configuration
2. {*}Backward Compatibility{*}: Defaults to sequential behavior (value = 1)
when not configured
3. {*}Thread Safety{*}: Proper synchronization using goroutines and semaphores
4. {*}Performance Optimization{*}: Implements dry-run evaluation before actual
allocation attempts
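The thread-safety and dry-run points above can be illustrated with a minimal Go sketch. It is an illustration only, not the YuniKorn implementation: the `node` type, the `fits` dry-run check, and the `tryNodesParallel` function are hypothetical names, and a buffered channel stands in for the semaphore. Every candidate is checked in parallel without mutating any state, and the first fitting node in input order is then selected, so the outcome matches what sequential evaluation would pick.
{code:go}
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a scheduler node.
type node struct {
	name string
	free int // free resource units, simplified to a single dimension
}

// fits is a hypothetical dry-run check: it only reads, it never allocates.
func fits(n node, ask int) bool {
	return n.free >= ask
}

// tryNodesParallel dry-runs the ask against every node using at most
// threadCount concurrent goroutines. It returns the index of the first
// node (in input order) that fits, or -1 if none does.
func tryNodesParallel(nodes []node, ask, threadCount int) int {
	if threadCount < 1 {
		threadCount = 1 // fall back to sequential behavior
	}
	results := make([]bool, len(nodes))
	sem := make(chan struct{}, threadCount) // counting semaphore
	var wg sync.WaitGroup
	for i := range nodes {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when threadCount are busy
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// Each goroutine writes only its own slice element, so no lock is needed.
			results[i] = fits(nodes[i], ask)
		}(i)
	}
	wg.Wait()
	// The real allocation attempt would happen here, on the chosen node only.
	for i, ok := range results {
		if ok {
			return i
		}
	}
	return -1
}

func main() {
	nodes := []node{{"node-1", 2}, {"node-2", 8}, {"node-3", 16}}
	if idx := tryNodesParallel(nodes, 4, 2); idx >= 0 {
		fmt.Println("selected", nodes[idx].name)
	}
}
{code}
With the sample input, the first node cannot hold the ask and the second can, so the program prints `selected node-2`.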
Configuration Example:
{code:yaml}
partitions:
  - name: default
    trynodesthreadcount: 20 # Enable parallel evaluation with 20 threads
    queues:
      - name: root
{code}
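How a scheduler might interpret this setting can be sketched in Go. The `partitionConfig` struct and the `effectiveThreadCount` helper are hypothetical names, not taken from the YuniKorn code base. The sketch assumes an unset value is read as zero and treats anything below 1 as the sequential default of 1.
{code:go}
package main

import "fmt"

// partitionConfig is a hypothetical, reduced view of a partition's settings.
type partitionConfig struct {
	Name                string
	TryNodesThreadCount int // zero when the parameter is not configured (assumption)
}

// effectiveThreadCount returns the number of workers to use for node
// evaluation, falling back to 1 (sequential) for unset or invalid values.
func effectiveThreadCount(p partitionConfig) int {
	if p.TryNodesThreadCount < 1 {
		return 1
	}
	return p.TryNodesThreadCount
}

func main() {
	unset := partitionConfig{Name: "default"}
	tuned := partitionConfig{Name: "default", TryNodesThreadCount: 20}
	fmt.Println(effectiveThreadCount(unset)) // prints 1
	fmt.Println(effectiveThreadCount(tuned)) // prints 20
}
{code}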
h3. Expected Benefits
- {*}4-5x improvement{*} in scheduling latency for clusters with many nodes
- Better resource utilization during scheduling cycles
- Improved user experience with faster pod startup times
- Scalability improvements for large clusters
h3. Performance Testing Results
Based on internal testing:
- Small workloads (5 pods): 4.5x faster scheduling
- Medium workloads (25 pods): 5x faster scheduling
- Large batch scenarios: 4x improvement in processing time
was:
h3. Summary
Implement parallel evaluation of nodes during the scheduling process to
significantly improve scheduling latency in large clusters. This enhancement
introduces configurable parallelization of the TryNode evaluation process while
maintaining backward compatibility.
h3. Background
In large Kubernetes clusters with many nodes, the current sequential node
evaluation process can become a bottleneck during scheduling. Each allocation
request must evaluate nodes one by one, leading to increased scheduling
latency, especially when dealing with multiple pending pods.
h3. Proposed Solution
Add a new configuration parameter `trynodesthreadcount` that allows us to
configure the number of parallel threads used for node evaluation during
scheduling.
h3. Key Features:
1. {*}Configurable Parallelism{*}: New `trynodesthreadcount` parameter in
partition configuration
2. {*}Backward Compatibility{*}: Defaults to sequential behavior (value = 1)
when not configured
3. {*}Thread Safety{*}: Proper synchronization using goroutines and semaphores
4. {*}Performance Optimization{*}: Implements dry-run evaluation before actual
allocation attempts
Configuration Example:
{code:yaml}
partitions:
  - name: default
    trynodesthreadcount: 20 # Enable parallel evaluation with 20 threads
    queues:
      - name: root
{code}
> Implement Parallel TryNode Evaluation for Improved Scheduling Performance
> -------------------------------------------------------------------------
>
> Key: YUNIKORN-3118
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3118
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major