[
https://issues.apache.org/jira/browse/YUNIKORN-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mit Desai updated YUNIKORN-3120:
--------------------------------
Description:
h3. Summary
Enhance the existing scheduling latency metrics by adding a state label that
distinguishes scheduling cycles that result in a successful pod allocation
from cycles that find no suitable allocation. This lets operators separate the
latency of productive cycles from the latency of cycles that allocate nothing
when debugging scheduling performance issues.
h3. Background
Currently, YuniKorn's {{yunikorn_scheduler_scheduling_latency_milliseconds}}
metric aggregates all scheduling cycles together, making it difficult to
distinguish between:
# {*}Allocation cycles{*}: Cycles where the scheduler successfully finds and
allocates resources for pending applications
# {*}Non-allocation cycles{*}: Cycles where the scheduler runs but cannot find
suitable allocations due to resource constraints, policy restrictions, or other
factors
This lack of distinction makes it challenging to debug scheduling latency
issues, as operators cannot easily identify whether high latency is due to
complex allocation decisions or repeated failed allocation attempts.
h3. Implementation Details
# {*}Metric Enhancement{*}: Add a state label to the existing histogram metric
# {*}Cycle Tracking{*}: Track allocation success or failure in the scheduling
loop
# {*}Threshold Logging{*}: Add a configurable threshold for detailed logging of
non-allocation cycles
# {*}Documentation{*}: Update the monitoring guides and dashboard examples
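The first three items above can be sketched as follows. This is a minimal, standard-library-only illustration and not YuniKorn code: the label values {{allocated}} and {{not_allocated}}, the bucket bounds, the threshold value, and the names {{LabeledLatencyHistogram}} and {{record_cycle}} are all assumptions made for the example.

```python
import bisect
import logging
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds; not taken from YuniKorn.
BUCKETS_MS = [1, 5, 10, 50, 100, 500]
# Hypothetical threshold above which a non-allocation cycle is logged in detail.
NON_ALLOC_LOG_THRESHOLD_MS = 50.0

log = logging.getLogger("scheduler")


class LabeledLatencyHistogram:
    """Latency histogram keyed by a 'state' label, one bucket array per value."""

    def __init__(self, buckets):
        self.buckets = list(buckets)
        # One extra slot counts observations above the last bound (+Inf bucket).
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, state, latency_ms):
        # bisect_left places a value equal to a bound into that bound's bucket,
        # matching the usual "less than or equal" bucket semantics.
        idx = bisect.bisect_left(self.buckets, latency_ms)
        self.counts[state][idx] += 1
        self.sums[state] += latency_ms

    def count(self, state):
        return sum(self.counts[state])


def record_cycle(hist, allocated, latency_ms):
    """Record one scheduling cycle under the label matching its outcome."""
    state = "allocated" if allocated else "not_allocated"
    hist.observe(state, latency_ms)
    # Detailed logging only for slow cycles that allocated nothing.
    if not allocated and latency_ms > NON_ALLOC_LOG_THRESHOLD_MS:
        log.warning("non-allocation cycle took %.1f ms", latency_ms)
    return state
```

Keeping the outcome as a label on one metric, rather than introducing a second metric, is what allows the two kinds of cycle to be compared or summed in a single query.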
h3. Backward Compatibility
* Existing metric queries continue to work unchanged
* Additive enhancement that doesn't break existing monitoring setups
* Optional detailed logging that can be configured based on operational needs
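One way to read the compatibility points above: a query that aggregates over all series of the metric keeps returning the same totals, because the per-state counts add up to what the unlabelled metric would have reported. The sketch below illustrates that with made-up numbers; the bucket names, label values, and counts are hypothetical, and whether a specific existing query is unaffected depends on how it aggregates.

```python
from collections import Counter


def total_by_bucket(per_state_counts):
    """Sum per-state bucket counts, as an aggregation that ignores the label."""
    total = Counter()
    for buckets in per_state_counts.values():
        total.update(buckets)  # adds counts bucket by bucket
    return dict(total)


# Hypothetical cumulative bucket counts for each state label value.
labeled = {
    "allocated": {"le_10": 4, "le_100": 6, "le_inf": 6},
    "not_allocated": {"le_10": 1, "le_100": 3, "le_inf": 5},
}

# Equivalent to what a single unlabelled histogram would have recorded.
combined = total_by_bucket(labeled)
```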
was:
h2. Summary
Enhance the existing scheduling latency metrics by adding state labels to
distinguish between scheduling cycles that result in successful pod allocation
versus cycles that don't find suitable allocations. This improvement will
significantly enhance debugging capabilities for scheduling performance issues.
h2. Background
Currently, YuniKorn's {{yunikorn_scheduler_scheduling_latency_milliseconds}}
metric aggregates all scheduling cycles together, making it difficult to
distinguish between:
# {*}Allocation cycles{*}: Cycles where the scheduler successfully finds and
allocates resources for pending applications
# {*}Non-allocation cycles{*}: Cycles where the scheduler runs but cannot find
suitable allocations due to resource constraints, policy restrictions, or other
factors
This lack of distinction makes it challenging to debug scheduling latency
issues, as operators cannot easily identify whether high latency is due to
complex allocation decisions or repeated failed allocation attempts.
> Enhance Scheduling Latency Metrics with Allocation State Labels
> ---------------------------------------------------------------
>
> Key: YUNIKORN-3120
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3120
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major