zhuzhurk commented on a change in pull request #10082: [FLINK-14164][runtime] Add a meter ‘numberOfRestarts’ to show number of restarts as well as its rate
URL: https://github.com/apache/flink/pull/10082#discussion_r342911786
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
 ##########
 @@ -193,6 +197,11 @@ public SchedulerBase(
                this.failoverTopology = executionGraph.getFailoverTopology();
 
                this.inputsLocationsRetriever = new ExecutionGraphToInputsLocationsRetrieverAdapter(executionGraph);
+
+               // Use the counter from execution graph to avoid modifying execution graph interfaces
+               // Can be a new SimpleCounter created here after the legacy scheduler is removed.
+               this.numberOfRestartsCounter = executionGraph.getNumberOfRestartsCounter();
+               jobManagerJobMetricGroup.meter(NUMBER_OF_RESTARTS, new MeterView(numberOfRestartsCounter));
 
 Review comment:
   Yes, the rate is awkward when the event happens at a very low frequency.
   I think a counter `numberOfRestarts` is needed so that users can build alerts in a more flexible way.
   The question, then, is whether to also introduce a meter `numberOfRestartsPerSecond`.
   - Pros: The meter lets users build alerts for restarts even if their monitoring system cannot compute changes in a metric's value over time.
   - Cons: Integrating the rate does not give an accurate restart count, so users cannot build reliable alerts on it other than ">0". The accuracy is limited by the interval at which metrics are sampled in Flink, as well as in the external metric collection system.
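
To make the "Cons" point concrete, here is a minimal sketch. It assumes a single restart at second 10, a 60-second rate window, and an external system that scrapes at a fixed interval. The class and method names are hypothetical, and the windowed rate is a simplified stand-in rather than Flink's `MeterView`.

```java
// Hypothetical simulation: compares what a scraper recovers from a rate
// (by summing rate * scrape interval) with what it recovers from a counter.
final class RestartRateSketch {

    /** Cumulative restarts up to the given second: one restart at second 10 (assumed). */
    static long restartsAt(int second) {
        return second >= 10 ? 1 : 0;
    }

    /** Simplified windowed rate: (count now - count one window ago) / window length. */
    static double rateAt(int second, int windowSeconds) {
        long now = restartsAt(second);
        long before = restartsAt(Math.max(0, second - windowSeconds));
        return (now - before) / (double) windowSeconds;
    }

    /** Restart count estimated by integrating the scraped rate over time. */
    static double integrateRate(int horizonSeconds, int scrapeInterval, int windowSeconds) {
        double total = 0.0;
        for (int t = scrapeInterval; t <= horizonSeconds; t += scrapeInterval) {
            total += rateAt(t, windowSeconds) * scrapeInterval;
        }
        return total;
    }

    /** Restart count obtained from the counter: last scraped value minus the initial value. */
    static long counterDelta(int horizonSeconds, int scrapeInterval) {
        long first = restartsAt(0);
        long last = first;
        for (int t = scrapeInterval; t <= horizonSeconds; t += scrapeInterval) {
            last = restartsAt(t);
        }
        return last - first;
    }

    public static void main(String[] args) {
        int horizon = 300; // simulated seconds
        int window = 60;   // assumed rate window in seconds
        for (int interval : new int[] {20, 45}) {
            System.out.printf(
                "scrape every %ds: integrated rate = %.2f, counter delta = %d%n",
                interval,
                integrateRate(horizon, interval, window),
                counterDelta(horizon, interval));
        }
    }
}
```

Under these assumptions, a 20-second scrape interval happens to divide the window evenly and the integral recovers one restart, while a 45-second interval recovers only about 0.75. The counter delta is 1 in both cases, which is why a threshold other than ">0" on the rate is hard to make reliable.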
