[ 
https://issues.apache.org/jira/browse/YUNIKORN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-3119:
--------------------------------------------
    Target Version: 1.9.0  (was: 1.8.0)

> Add Metrics for Monitoring Applications and Nodes Attempted in Each 
> Scheduling Cycle
> ------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3119
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3119
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Summary
> Add new observability metrics to track the number of applications and nodes 
> attempted during each scheduling cycle. This enhancement will improve 
> debugging capabilities for scheduling latency issues by providing visibility 
> into scheduling cycle efficiency and application processing patterns.
> h2. Background
> When debugging YuniKorn scheduling performance issues, it's important to 
> understand not just how long scheduling takes, but also how many applications 
> are processed in each cycle and how many node evaluations it took to reach 
> a decision. Currently, YuniKorn logs timing information but lacks 
> visibility into the number of applications and nodes attempted per scheduling 
> cycle, making it difficult to correlate scheduling latency with workload 
> characteristics.
> h2. Proposed Solution
> Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track 
> and report the number of applications and nodes attempted during each 
> scheduling cycle. These metrics will be integrated into the existing logging 
> and monitoring infrastructure.
> h3. Key Features:
>  # {*}Applications and Nodes Attempted Counters{*}: Track the number of 
> applications and nodes attempted in each scheduling cycle
>  # {*}Integration with Existing Metrics{*}: Seamlessly integrate with current 
> timing and allocation metrics
>  # {*}Debugging Support{*}: Provide correlation data between application 
> count and scheduling latency
>  # {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect 
> scheduling performance
> h3. Implementation Details
>  # {*}Counter Integration{*}: Add the application and node counters in the 
> main scheduling loop
>  # {*}Metrics Collection{*}: Integrate with existing Prometheus metrics 
> infrastructure
>  # {*}Logging Enhancement{*}: Include the metrics in structured logging output
>  # {*}Documentation{*}: Update monitoring and debugging documentation
> h3. Monitoring Integration
>  * Add new Prometheus metrics:
>  ** {{yunikorn_scheduler_applications_attempted_per_cycle}}
>  ** {{yunikorn_scheduler_nodes_attempted_per_cycle}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
