[ 
https://issues.apache.org/jira/browse/YUNIKORN-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YUNIKORN-3120:
--------------------------------
    Description: 
h3. Summary

Enhance the existing scheduling latency metrics by adding state labels that 
distinguish scheduling cycles that result in a successful pod allocation from 
cycles that find no suitable allocation. This makes scheduling performance 
issues easier to debug.
h3. Background

Currently, YuniKorn's {{yunikorn_scheduler_scheduling_latency_milliseconds}} 
metric aggregates all scheduling cycles together, making it difficult to 
distinguish between:
 # {*}Allocation cycles{*}: Cycles where the scheduler successfully finds and 
allocates resources for pending applications
 # {*}Non-allocation cycles{*}: Cycles where the scheduler runs but cannot find 
suitable allocations due to resource constraints, policy restrictions, or other 
factors

This lack of distinction makes it challenging to debug scheduling latency 
issues, as operators cannot easily identify whether high latency is due to 
complex allocation decisions or repeated failed allocation attempts.
h3. Implementation Details
 # {*}Metric Enhancement{*}: Add a state label to the existing histogram metric
 # {*}Cycle Tracking{*}: Track whether each cycle of the scheduling loop 
produced an allocation
 # {*}Threshold Logging{*}: Add a configurable threshold for detailed logging 
of non-allocation cycles
 # {*}Documentation{*}: Update the monitoring guides and dashboard examples

h3. Backward Compatibility
 * Existing metric queries continue to work unchanged
 * Additive enhancement that doesn't break existing monitoring setups
 * Optional detailed logging that can be configured based on operational needs


> Enhance Scheduling Latency Metrics with Allocation State Labels
> ---------------------------------------------------------------
>
>                 Key: YUNIKORN-3120
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3120
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major


