[
https://issues.apache.org/jira/browse/YUNIKORN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mit Desai updated YUNIKORN-3119:
--------------------------------
Description:
h2. Summary
Add new observability metrics to track the number of applications and nodes
attempted during each scheduling cycle. This enhancement will improve debugging
capabilities for scheduling latency issues by providing visibility into
scheduling cycle efficiency and application processing patterns.
h2. Background
When debugging YuniKorn scheduling performance issues, it's important to
understand not just how long scheduling takes, but also how many applications
are processed in each cycle and how many node evaluations it takes to reach a
scheduling decision. Currently, YuniKorn logs timing information but lacks
visibility into the number of applications and nodes attempted per scheduling
cycle, making it difficult to correlate scheduling latency with workload
characteristics.
h2. Proposed Solution
Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and
report the number of applications and nodes attempted during each scheduling
cycle. These metrics will be integrated into the existing logging and
monitoring infrastructure.
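As a rough illustration of the counting described above, the following Go sketch increments two counters inside a simplified scheduling loop. Only the names {{applicationsTried}} and {{nodesTried}} come from this issue; the types {{cycleStats}}, {{app}}, {{node}} and the function {{scheduleCycle}} are hypothetical stand-ins and do not reflect the actual YuniKorn core code.

```go
package main

import "sync/atomic"

// cycleStats holds hypothetical per-cycle counters; the field names mirror
// the applicationsTried and nodesTried metrics proposed in this issue.
type cycleStats struct {
	applicationsTried atomic.Int64
	nodesTried        atomic.Int64
}

// app is a stand-in for a scheduler application with one pending request.
type app struct {
	name    string
	request int // resource units requested
}

// node is a stand-in for a cluster node with some free capacity.
type node struct {
	name string
	free int // resource units still available
}

// scheduleCycle tries each application against the nodes in order, counting
// every application attempted and every node evaluated. It returns the
// number of allocations made in this cycle.
func scheduleCycle(apps []app, nodes []node, stats *cycleStats) int {
	allocated := 0
	for _, a := range apps {
		stats.applicationsTried.Add(1)
		for i := range nodes {
			stats.nodesTried.Add(1)
			if nodes[i].free >= a.request {
				// First fit: place the request and stop evaluating nodes.
				nodes[i].free -= a.request
				allocated++
				break
			}
		}
	}
	return allocated
}
```

A caller would create a fresh {{cycleStats}} for each cycle, pass it to {{scheduleCycle}}, and read the totals with {{Load()}} once the cycle ends. Atomic counters are used here on the assumption that the values may be read from another goroutine, such as a metrics exporter.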
h3. Key Features:
# {*}Applications Attempted Counter{*}: Track the number of applications
processed in each scheduling cycle
# {*}Integration with Existing Metrics{*}: Seamlessly integrate with current
timing and allocation metrics
# {*}Debugging Support{*}: Provide correlation data between application count
and scheduling latency
# {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect
scheduling performance
h3. Implementation Details
# {*}Counter Integration{*}: Add application counter in the main scheduling
loop
# {*}Metrics Collection{*}: Integrate with existing Prometheus metrics
infrastructure
# {*}Logging Enhancement{*}: Include metric in structured logging output
# {*}Documentation{*}: Update monitoring and debugging documentation
h3. Monitoring Integration
* Add new Prometheus metrics:
** {{yunikorn_scheduler_applications_attempted_per_cycle}}
** {{yunikorn_scheduler_nodes_attempted_per_cycle}}
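To show what a scrape of these two metrics might look like, the Go sketch below renders values in the Prometheus text exposition format using only the standard library. The issue does not state the metric type, so exposing them as gauges holding the most recent cycle's counts is an assumption (a histogram would be another reasonable choice). The {{gauge}} type, the {{renderGauges}} helper, and the help strings are hypothetical; a real implementation would more likely register the metrics through the existing Prometheus client integration.

```go
package main

import (
	"fmt"
	"strings"
)

// gauge is a hypothetical in-memory representation of one metric sample.
type gauge struct {
	name  string
	help  string
	value int64
}

// renderGauges formats the samples in the Prometheus text exposition
// format: a HELP line, a TYPE line, and the sample line for each metric.
func renderGauges(gauges []gauge) string {
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.name, g.help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.name)
		fmt.Fprintf(&b, "%s %d\n", g.name, g.value)
	}
	return b.String()
}
```

Calling {{renderGauges}} with one entry per metric name listed above yields three lines per metric, which is the shape a dashboard or alert rule would consume when correlating attempts per cycle with scheduling latency.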
was:
h2. Summary
Add new observability metrics to track the number of applications and nodes
attempted during each scheduling cycle. This enhancement will improve debugging
capabilities for scheduling latency issues by providing visibility into
scheduling cycle efficiency and application processing patterns.
h2. Background
When debugging YuniKorn scheduling performance issues, it's important to
understand not just how long scheduling takes, but also how many applications
are processed in each cycle and how many node evaluations it takes to reach a
scheduling decision. Currently, YuniKorn logs timing information but lacks
visibility into the number of applications and nodes attempted per scheduling
cycle, making it difficult to correlate scheduling latency with workload
characteristics.
h2. Proposed Solution
Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and
report the number of applications and nodes attempted during each scheduling
cycle. These metrics will be integrated into the existing logging and
monitoring infrastructure.
h3. Key Features:
# {*}Applications Attempted Counter{*}: Track the number of applications
processed in each scheduling cycle
# {*}Integration with Existing Metrics{*}: Seamlessly integrate with current
timing and allocation metrics
# {*}Debugging Support{*}: Provide correlation data between application count
and scheduling latency
# {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect
scheduling performance
> Add Metrics for Monitoring Applications and Nodes Attempted in Each
> Scheduling Cycle
> ------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3119
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3119
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Mit Desai
> Priority: Major
>
> h2. Summary
> Add new observability metrics to track the number of applications and nodes
> attempted during each scheduling cycle. This enhancement will improve
> debugging capabilities for scheduling latency issues by providing visibility
> into scheduling cycle efficiency and application processing patterns.
> h2. Background
> When debugging YuniKorn scheduling performance issues, it's important to
> understand not just how long scheduling takes, but also how many applications
> are processed in each cycle and how many node evaluations it takes to reach a
> scheduling decision. Currently, YuniKorn logs timing information but lacks
> visibility into the number of applications and nodes attempted per scheduling
> cycle, making it difficult to correlate scheduling latency with workload
> characteristics.
> h2. Proposed Solution
> Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and
> report the number of applications and nodes attempted during each scheduling
> cycle. These metrics will be integrated into the existing logging and
> monitoring infrastructure.
> h3. Key Features:
> # {*}Applications Attempted Counter{*}: Track the number of applications
> processed in each scheduling cycle
> # {*}Integration with Existing Metrics{*}: Seamlessly integrate with current
> timing and allocation metrics
> # {*}Debugging Support{*}: Provide correlation data between application
> count and scheduling latency
> # {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect
> scheduling performance
> h3. Implementation Details
> # {*}Counter Integration{*}: Add application counter in the main scheduling
> loop
> # {*}Metrics Collection{*}: Integrate with existing Prometheus metrics
> infrastructure
> # {*}Logging Enhancement{*}: Include metric in structured logging output
> # {*}Documentation{*}: Update monitoring and debugging documentation
> h3. Monitoring Integration
> * Add new Prometheus metrics:
> ** {{yunikorn_scheduler_applications_attempted_per_cycle}}
> ** {{yunikorn_scheduler_nodes_attempted_per_cycle}}