[ https://issues.apache.org/jira/browse/YUNIKORN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg updated YUNIKORN-3119:
--------------------------------------------
Target Version: 1.9.0 (was: 1.8.0)
> Add Metrics for Monitoring Applications and Nodes Attempted in Each
> Scheduling Cycle
> ------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3119
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3119
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
> Labels: pull-request-available
>
> h2. Summary
> Add new observability metrics to track the number of applications and nodes
> attempted during each scheduling cycle. This enhancement will improve
> debugging capabilities for scheduling latency issues by providing visibility
> into scheduling cycle efficiency and application processing patterns.
> h2. Background
> When debugging YuniKorn scheduling performance issues, it's important to
> understand not just how long scheduling takes, but also how many applications
> are processed in each cycle and how many node evaluations it took to reach a
> scheduling decision. Currently, YuniKorn logs timing information but lacks
> visibility into the number of applications and nodes attempted per scheduling
> cycle, making it difficult to correlate scheduling latency with workload
> characteristics.
> h2. Proposed Solution
> Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and
> report the number of applications and nodes attempted during each scheduling
> cycle. These metrics will be integrated into the existing logging and
> monitoring infrastructure.
> h3. Key Features:
> # {*}Applications and Nodes Attempted Counters{*}: Track the number of
> applications and nodes attempted in each scheduling cycle
> # {*}Integration with Existing Metrics{*}: Seamlessly integrate with current
> timing and allocation metrics
> # {*}Debugging Support{*}: Provide correlation data between application
> count and scheduling latency
> # {*}Minimal Performance Impact{*}: Lightweight counters that don't affect
> scheduling performance
> h3. Implementation Details
> # {*}Counter Integration{*}: Add application and node counters in the main
> scheduling loop
> # {*}Metrics Collection{*}: Integrate with existing Prometheus metrics
> infrastructure
> # {*}Logging Enhancement{*}: Include the new metrics in structured logging output
> # {*}Documentation{*}: Update monitoring and debugging documentation
> h3. Monitoring Integration
> * Add new Prometheus metrics:
> ** {{yunikorn_scheduler_applications_attempted_per_cycle}}
> ** {{yunikorn_scheduler_nodes_attempted_per_cycle}}
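
The per-cycle counting described in the proposal could look roughly like the following Go sketch. It is only an illustration under stated assumptions: the types {{app}}, {{node}} and {{CycleStats}} and the function {{scheduleCycle}} are hypothetical stand-ins, not YuniKorn's actual scheduler code, and the loop assumes one goroutine per cycle, so plain integers are enough. The resulting values are what would be reported through the proposed per-cycle metrics and log fields.

```go
package main

import "fmt"

// app and node are hypothetical, heavily simplified stand-ins for the
// scheduler's real application and node objects.
type app struct {
	name    string
	pending int // number of outstanding requests
	ask     int // resource units needed per request
}

type node struct {
	name string
	free int // resource units still available
}

// CycleStats holds the counters gathered during one scheduling cycle.
type CycleStats struct {
	ApplicationsTried int
	NodesTried        int
	Allocations       int
}

// String formats the counters the way they might appear in a log line.
func (s CycleStats) String() string {
	return fmt.Sprintf("applicationsTried=%d nodesTried=%d allocations=%d",
		s.ApplicationsTried, s.NodesTried, s.Allocations)
}

// scheduleCycle runs one simplified cycle: every application with pending
// requests is attempted once, and nodes are evaluated in order until one fits.
func scheduleCycle(apps []*app, nodes []*node) CycleStats {
	var stats CycleStats
	for _, a := range apps {
		if a.pending == 0 {
			continue // nothing to schedule, so not counted as attempted
		}
		stats.ApplicationsTried++
		for _, n := range nodes {
			stats.NodesTried++ // each node evaluation is counted
			if n.free >= a.ask {
				n.free -= a.ask
				a.pending--
				stats.Allocations++
				break
			}
		}
	}
	return stats
}

func main() {
	apps := []*app{
		{name: "app-1", pending: 1, ask: 4},
		{name: "app-2", pending: 0, ask: 1},
		{name: "app-3", pending: 1, ask: 2},
	}
	nodes := []*node{
		{name: "node-1", free: 2},
		{name: "node-2", free: 4},
	}
	// Prints: applicationsTried=2 nodesTried=3 allocations=2
	fmt.Println(scheduleCycle(apps, nodes))
}
```

Keeping the counters in a value returned from the cycle, rather than in shared state, means the scheduling loop itself only pays for integer increments; exporting the values to Prometheus or the logger can happen once per cycle, outside the loop.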