[ 
https://issues.apache.org/jira/browse/YUNIKORN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YUNIKORN-3119:
--------------------------------
    Description: 
h2. Summary

Add new observability metrics to track the number of applications and nodes 
attempted during each scheduling cycle. This enhancement will improve debugging 
capabilities for scheduling latency issues by providing visibility into 
scheduling cycle efficiency and application processing patterns.
h2. Background

When debugging YuniKorn scheduling performance issues, it is important to 
understand not just how long scheduling takes, but also how many applications 
are processed in each cycle and how many node evaluations it took to reach a 
scheduling decision. Currently, YuniKorn logs timing information but lacks 
visibility into the number of applications and nodes attempted per scheduling 
cycle, making it difficult to correlate scheduling latency with workload 
characteristics.
h2. Proposed Solution

Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and 
report the number of applications and nodes attempted during each scheduling 
cycle. These metrics will be integrated into the existing logging and 
monitoring infrastructure.
h3. Key Features
 # {*}Applications Attempted Counter{*}: Track the number of applications 
processed in each scheduling cycle
 # {*}Integration with Existing Metrics{*}: Seamlessly integrate with current 
timing and allocation metrics
 # {*}Debugging Support{*}: Provide correlation data between application count 
and scheduling latency
 # {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect 
scheduling performance

h3. Implementation Details
 # {*}Counter Integration{*}: Add application counter in the main scheduling 
loop
 # {*}Metrics Collection{*}: Integrate with existing Prometheus metrics 
infrastructure
 # {*}Logging Enhancement{*}: Include metric in structured logging output
 # {*}Documentation{*}: Update monitoring and debugging documentation
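
The counter integration and logging steps above could look roughly like the following Go sketch. It is illustrative only and not taken from the YuniKorn code base: the {{cycleStats}}, {{app}}, {{runCycle}} and {{logFields}} names are hypothetical, and the loop is a simplified stand-in for the real scheduling loop. Only the {{applicationsTried}} and {{nodesTried}} names come from this issue.

```go
package main

import "fmt"

// cycleStats is a hypothetical holder for the two per-cycle counters
// named in this issue (applicationsTried and nodesTried).
type cycleStats struct {
	ApplicationsTried int
	NodesTried        int
}

// app is a simplified stand-in for a schedulable application and the
// nodes that would be evaluated for it.
type app struct {
	name       string
	candidates []string
}

// runCycle walks every application once. It counts each application that
// is attempted and each node that is evaluated, stopping the node scan
// for an application as soon as one node fits.
func runCycle(apps []app, fits func(appName, node string) bool) cycleStats {
	var stats cycleStats
	for _, a := range apps {
		stats.ApplicationsTried++
		for _, n := range a.candidates {
			stats.NodesTried++
			if fits(a.name, n) {
				break
			}
		}
	}
	return stats
}

// logFields renders the counters as key=value pairs that could be
// appended to an existing timing log line.
func logFields(stats cycleStats) string {
	return fmt.Sprintf("applicationsTried=%d nodesTried=%d",
		stats.ApplicationsTried, stats.NodesTried)
}
```

Plain integer increments keep the cost per attempt negligible. This assumes a single goroutine drives one scheduling cycle; if that does not hold, atomic counters would be needed instead.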

h3. Monitoring Integration
 * Add new Prometheus metrics:
 ** {{yunikorn_scheduler_applications_attempted_per_cycle}}
 ** {{yunikorn_scheduler_nodes_attempted_per_cycle}}
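
As a rough illustration of what a scrape of these two metrics might return, the Go sketch below renders them in the Prometheus text exposition format using only the standard library. The issue does not specify the metric type or help text, so the gauge type, the help strings, and the {{gauge}}, {{cycleGauges}} and {{render}} names are all assumptions. A real implementation would most likely register the metrics through the Prometheus client library that YuniKorn already uses rather than format the text by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// gauge is a hypothetical, minimal representation of one exported metric.
type gauge struct {
	name  string
	help  string
	value int
}

// cycleGauges builds the two metrics named in this issue from the
// counters of the most recent scheduling cycle. Treating them as gauges
// is an assumption; the issue does not state the metric type.
func cycleGauges(applicationsTried, nodesTried int) []gauge {
	return []gauge{
		{
			name:  "yunikorn_scheduler_applications_attempted_per_cycle",
			help:  "Applications attempted in the last scheduling cycle.",
			value: applicationsTried,
		},
		{
			name:  "yunikorn_scheduler_nodes_attempted_per_cycle",
			help:  "Nodes attempted in the last scheduling cycle.",
			value: nodesTried,
		},
	}
}

// render writes each gauge as HELP, TYPE and sample lines in the
// Prometheus text exposition format.
func render(gauges []gauge) string {
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.name, g.help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.name)
		fmt.Fprintf(&b, "%s %d\n", g.name, g.value)
	}
	return b.String()
}
```

A gauge only exposes the latest cycle. If the distribution across cycles matters for latency debugging, a histogram may be a better fit, at the cost of more time series.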

  was:
h2. Summary

Add new observability metrics to track the number of applications and nodes 
attempted during each scheduling cycle. This enhancement will improve debugging 
capabilities for scheduling latency issues by providing visibility into 
scheduling cycle efficiency and application processing patterns.
h2. Background

When debugging YuniKorn scheduling performance issues, it is important to 
understand not just how long scheduling takes, but also how many applications 
are processed in each cycle and how many node evaluations it took to reach a 
scheduling decision. Currently, YuniKorn logs timing information but lacks 
visibility into the number of applications and nodes attempted per scheduling 
cycle, making it difficult to correlate scheduling latency with workload 
characteristics.
h2. Proposed Solution

Add two new metrics, {{applicationsTried}} and {{nodesTried}}, that track and 
report the number of applications and nodes attempted during each scheduling 
cycle. These metrics will be integrated into the existing logging and 
monitoring infrastructure.
h3. Key Features
 # {*}Applications Attempted Counter{*}: Track the number of applications 
processed in each scheduling cycle
 # {*}Integration with Existing Metrics{*}: Seamlessly integrate with current 
timing and allocation metrics
 # {*}Debugging Support{*}: Provide correlation data between application count 
and scheduling latency
 # {*}Minimal Performance Impact{*}: Lightweight counter that doesn't affect 
scheduling performance


> Add Metrics for Monitoring Applications and Nodes Attempted in Each 
> Scheduling Cycle
> ------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3119
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3119
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>            Priority: Major


