[ https://issues.apache.org/jira/browse/YUNIKORN-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mit Desai updated YUNIKORN-3118:
--------------------------------
    Description: 
h3. Summary

Implement parallel evaluation of nodes during the scheduling process to 
significantly improve scheduling latency in large clusters. This enhancement 
introduces configurable parallelization of the TryNode evaluation process while 
maintaining backward compatibility.
h3. Background

In large Kubernetes clusters with many nodes, the current sequential node 
evaluation process can become a bottleneck during scheduling. Each allocation 
request must evaluate nodes one by one, leading to increased scheduling 
latency, especially when dealing with multiple pending pods.
h3. Proposed Solution

Add a new configuration parameter, `trynodesthreadcount`, that sets the number 
of parallel threads used for node evaluation during scheduling.
h3. Key Features:

1. {*}Configurable Parallelism{*}: New `trynodesthreadcount` parameter in 
partition configuration
2. {*}Backward Compatibility{*}: Defaults to sequential behavior (value = 1) 
when not configured
3. {*}Thread Safety{*}: Node evaluations run in goroutines and are synchronized 
with semaphores
4. {*}Performance Optimization{*}: Implements dry-run evaluation before actual 
allocation attempts
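
To illustrate features 3 and 4, here is a minimal Go sketch of bounded parallel node evaluation followed by a sequential allocation. It is not the code from the actual patch: the `node` type, `fits`, `tryNodesParallel`, and `allocate` are hypothetical names, and modeling a node as a single integer of free capacity is a simplifying assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a scheduler node; only free
// capacity is modeled here.
type node struct {
	name string
	free int
}

// fits is the dry-run check: it only reads node state and never allocates.
func (n *node) fits(ask int) bool { return n.free >= ask }

// tryNodesParallel runs the dry-run check on every node with at most
// threadCount checks in flight, then returns the index of the first
// fitting node in iteration order, or -1 if none fits.
func tryNodesParallel(nodes []*node, ask, threadCount int) int {
	if threadCount < 1 {
		threadCount = 1 // assumption: fall back to sequential evaluation
	}
	sem := make(chan struct{}, threadCount) // counting semaphore
	results := make([]bool, len(nodes))     // one slot per goroutine, no sharing
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		sem <- struct{}{} // blocks while threadCount checks are running
		go func(i int, n *node) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = n.fits(ask)
		}(i, n)
	}
	wg.Wait()
	for i, ok := range results {
		if ok {
			return i
		}
	}
	return -1
}

// allocate performs the real allocation sequentially, after the dry run.
func allocate(nodes []*node, ask, threadCount int) (string, bool) {
	idx := tryNodesParallel(nodes, ask, threadCount)
	if idx < 0 {
		return "", false
	}
	nodes[idx].free -= ask
	return nodes[idx].name, true
}

func main() {
	nodes := []*node{
		{name: "node-1", free: 1},
		{name: "node-2", free: 4},
		{name: "node-3", free: 8},
	}
	name, ok := allocate(nodes, 3, 2)
	fmt.Println(name, ok) // prints "node-2 true"
}
```

Because each goroutine writes only to its own slot in `results` and the state-changing step runs after `wg.Wait()`, the sketch needs no locking around node state. Picking the first fitting node in iteration order keeps the outcome the same as a sequential pass.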

Configuration Example:
{code:yaml}
partitions:
  - name: default
    trynodesthreadcount: 20 # Enable parallel evaluation with 20 threads
    queues:
      - name: root
{code}
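
The backward-compatible default could be resolved along the following lines. This is a hedged Go sketch, not the patch itself: `partitionConfig` and `effectiveThreadCount` are hypothetical names, and treating zero or negative values the same as "not configured" is an assumption, since the description only specifies the default of 1.

```go
package main

import "fmt"

// partitionConfig is a hypothetical struct mirroring the fields used in
// the configuration example; the real YuniKorn type may differ.
type partitionConfig struct {
	Name                string
	TryNodesThreadCount int // zero value means "not configured"
}

// effectiveThreadCount returns the number of threads to use for node
// evaluation. Unset (0) and, by assumption, negative values fall back
// to 1, which keeps the existing sequential behavior.
func effectiveThreadCount(p partitionConfig) int {
	if p.TryNodesThreadCount < 1 {
		return 1
	}
	return p.TryNodesThreadCount
}

func main() {
	legacy := partitionConfig{Name: "default"}
	tuned := partitionConfig{Name: "default", TryNodesThreadCount: 20}
	fmt.Println(effectiveThreadCount(legacy), effectiveThreadCount(tuned)) // prints "1 20"
}
```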
h3. Expected Benefits

- {*}4-5x improvement{*} in scheduling latency for clusters with many nodes
- Better resource utilization during scheduling cycles
- Improved user experience with faster pod startup times
- Scalability improvements for large clusters
h3. Performance Testing Results

Based on internal testing:

- Small workloads (5 pods): 4.5x faster scheduling
- Medium workloads (25 pods): 5x faster scheduling
- Large batch scenarios: 4x improvement in processing time

> Implement Parallel TryNode Evaluation for Improved Scheduling Performance
> -------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3118
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3118
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major
>


