zhuzhurk commented on a change in pull request #10082: [FLINK-14164][runtime] Add a meter ‘numberOfRestarts’ to show number of restarts as well as its rate
URL: https://github.com/apache/flink/pull/10082#discussion_r342911786
##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
##########
@@ -193,6 +197,11 @@ public SchedulerBase(
 		this.failoverTopology = executionGraph.getFailoverTopology();
 		this.inputsLocationsRetriever = new ExecutionGraphToInputsLocationsRetrieverAdapter(executionGraph);
+
+		// Use the counter from execution graph to avoid modifying execution graph interfaces
+		// Can be a new SimpleCounter created here after the legacy scheduler is removed.
+		this.numberOfRestartsCounter = executionGraph.getNumberOfRestartsCounter();
+		jobManagerJobMetricGroup.meter(NUMBER_OF_RESTARTS, new MeterView(numberOfRestartsCounter));

Review comment:
Yes, the rate is awkward when the event occurs at a very low frequency.
I think a counter `numberOfRestarts` is needed so that users can build alerts more flexibly.

The question is whether to also introduce a meter `numberOfRestartsPerSecond`:
- Pros: The meter lets users build alerts on restarts even if their monitoring system cannot compute the change in a value between samples.
- Cons: The integral of the rate is not accurate, so users cannot use it to build reliable alerts other than ">0". The accuracy is limited by the interval at which metrics are sampled, both in Flink and in the external metric collection system.
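
To make the accuracy concern concrete, here is a minimal sketch of why integrating a sampled rate may not recover the true restart count. It is plain Java and does not use Flink's `MeterView`; the class `RestartRateSketch`, its methods, the 60-second window, and the scrape intervals are all hypothetical values chosen for illustration.

```java
/**
 * Hypothetical stand-in for a windowed rate meter. This is NOT Flink's MeterView;
 * it only illustrates how sampling affects the integral of a rate.
 */
final class RestartRateSketch {

    /** Rate at time {@code now}: restarts within the last {@code windowSeconds}, per second. */
    static double rateAt(long[] restartTimes, long now, long windowSeconds) {
        long inWindow = 0;
        for (long t : restartTimes) {
            if (t > now - windowSeconds && t <= now) {
                inWindow++;
            }
        }
        return (double) inWindow / windowSeconds;
    }

    /**
     * Estimates the restart count the way an external system might:
     * scrape the rate every {@code scrapeIntervalSeconds} and sum rate * interval.
     */
    static double integrateSampledRate(
            long[] restartTimes, long endSeconds, long windowSeconds, long scrapeIntervalSeconds) {
        double estimate = 0.0;
        for (long now = scrapeIntervalSeconds; now <= endSeconds; now += scrapeIntervalSeconds) {
            estimate += rateAt(restartTimes, now, windowSeconds) * scrapeIntervalSeconds;
        }
        return estimate;
    }

    public static void main(String[] args) {
        // Assumed scenario: three restarts in quick succession (seconds since job start).
        long[] restarts = {10, 12, 14};

        // A counter reports the exact number of restarts.
        System.out.println("counter value: " + restarts.length);

        // Scrape interval equal to the rate window: the integral matches the count.
        System.out.println("integral, 60s scrape: " + integrateSampledRate(restarts, 300, 60, 60));

        // Scrape interval that does not line up with the window: the integral undercounts.
        System.out.println("integral, 40s scrape: " + integrateSampledRate(restarts, 300, 60, 40));
    }
}
```

In this scenario the 60-second scrape recovers roughly 3 restarts, while the 40-second scrape yields roughly 2. A threshold such as "more than 2 restarts" would therefore behave differently depending on the scrape interval, whereas a check on the counter's value does not have this problem.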