[jira] [Commented] (FLINK-31482) support count jobmanager-failed failover times

Fei Feng (Jira) Sun, 19 Mar 2023 20:50:37 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702429#comment-17702429
 ]


Fei Feng commented on FLINK-31482:
----------------------------------

[~martijnvisser] Of course we monitor the job's running metrics. What I mean is
that we currently cannot tell how many times a job has failed over through HA.

If a job's `uptime` and `numRestarts` metrics drop to zero and start counting
again, we may guess that the job's JobManager failed over through HA. However,
the same change can also be caused by the user restarting the job.

So I think we need a more direct and accurate indicator that the job's
JobManager failed over through HA.
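
To make the ambiguity concrete, here is a minimal sketch of the heuristic described above. It is plain Java and not a Flink API: `MetricSample` and `ResetHeuristic` are hypothetical names, and the rule "uptime went backwards while `numRestarts` did not grow" is only one assumed way to read "both metrics start counting again".

```java
// Hypothetical helper, not a Flink API: it only looks at sampled values of the
// existing `uptime` and `numRestarts` metrics.
final class MetricSample {
    final long uptimeMillis;
    final long numRestarts;

    MetricSample(long uptimeMillis, long numRestarts) {
        this.uptimeMillis = uptimeMillis;
        this.numRestarts = numRestarts;
    }
}

final class ResetHeuristic {
    private MetricSample previous; // null until the first sample arrives
    private int suspectedResets;

    /**
     * Feeds one sample and returns true when both metrics appear to have
     * started over: uptime went backwards and numRestarts did not grow.
     */
    boolean observe(MetricSample current) {
        boolean reset = previous != null
                && current.uptimeMillis < previous.uptimeMillis
                && current.numRestarts <= previous.numRestarts;
        if (reset) {
            suspectedResets++;
        }
        previous = current;
        return reset;
    }

    int suspectedResets() {
        return suspectedResets;
    }
}
```

An HA failover and a manual restart by the user can both produce the same sample sequence, for example (60000, 2) followed by (1000, 0), so `observe` returns true in either case and cannot say which one happened.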

> support count jobmanager-failed failover times
> ----------------------------------------------
>
>                 Key: FLINK-31482
>                 URL: https://issues.apache.org/jira/browse/FLINK-31482
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.16.1
>            Reporter: Fei Feng
>            Priority: Major
>
> We have a metric `numRestarts` that indicates how many times a job has failed
> over, but we don't have a metric that indicates the job was recovered through
> HA (high availability).
> There are two problems:
> 1. When a JobManager process crashes, the metric system gives us no way of
> knowing that the JobManager crashed and the job was recovered.
> 2. When a new JobManager becomes leader, `numRestarts` starts from zero again,
> which sometimes misleads our users. Most users consider a failover caused by
> a JM failure and one caused by a job failure to be the same; the effect, at
> least, is the same.
>  
> I suggest we:
> 1. Add a new metric that indicates how many times the job was recovered
> through HA.
> 2. Have the `numRestarts` metric also count recoveries through HA.
>  
>  
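
The two suggestions in the issue description can be sketched as follows. This is plain Java under stated assumptions, not existing Flink code: `FailoverCounters`, `numHaRecoveries`, and `numRestartsIncludingHa` are hypothetical names, and the sketch assumes the recovery count survives a JobManager crash (for instance by being kept in durable HA storage), which the issue does not specify.

```java
// Hypothetical sketch of the two suggestions; none of these names exist in Flink.
final class FailoverCounters {
    // Assumed to survive a JobManager crash, e.g. via durable HA storage.
    private final long numHaRecoveries;
    // In-memory count of restarts under the current JobManager leader.
    private long numRestarts;

    private FailoverCounters(long numHaRecoveries) {
        this.numHaRecoveries = numHaRecoveries;
        this.numRestarts = 0L;
    }

    /** Counters for a job that was just submitted and never recovered. */
    static FailoverCounters freshSubmission() {
        return new FailoverCounters(0L);
    }

    /**
     * Counters as seen by a new JobManager leader: the in-memory restart count
     * is lost, while the recovery count continues from the persisted value.
     */
    FailoverCounters afterHaRecovery() {
        return new FailoverCounters(numHaRecoveries + 1L);
    }

    /** Records an ordinary job restart under the current leader. */
    void onJobRestart() {
        numRestarts++;
    }

    /** Behaviour described in the issue: starts from zero after a leader change. */
    long numRestarts() {
        return numRestarts;
    }

    /** Suggestion 1: a separate count of recoveries through HA. */
    long numHaRecoveries() {
        return numHaRecoveries;
    }

    /** Suggestion 2, in one possible reading: restarts plus HA recoveries. */
    long numRestartsIncludingHa() {
        return numRestarts + numHaRecoveries;
    }
}
```

With a counter like `numHaRecoveries`, a monitoring system could tell an HA recovery apart from a restart triggered by the user without having to infer it from `uptime`.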



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
