[ https://issues.apache.org/jira/browse/FLINK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701030#comment-17701030 ]
Martijn Visser commented on FLINK-31482:
----------------------------------------

Wouldn't you normally detect this type of thing in your metrics system, such as Prometheus or Grafana?

> support count jobmanager-failed failover times
> ----------------------------------------------
>
>                 Key: FLINK-31482
>                 URL: https://issues.apache.org/jira/browse/FLINK-31482
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.16.1
>            Reporter: Fei Feng
>            Priority: Major
>
> We have a metric `numRestarts` that indicates how many times a job has
> failed over, but we have no metric that indicates that a job was recovered
> from HA (high availability).
> There are two problems:
> 1. When a JobManager process crashes, the metric system gives us no way of
> knowing that the JobManager crashed and that the job was recovered.
> 2. When a new JobManager becomes leader, `numRestarts` starts from zero
> again.
> This sometimes misleads our users. Most users think that a failover is the
> same whether it is caused by a JM failure or by a job failure; the effect,
> at least, is the same.
>
> I suggest we:
> 1. Add a new metric that indicates how many times the job was recovered
> from HA.
> 2. Make the `numRestarts` metric also count recoveries from HA.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)