Zhu Zhu created FLINK-14164:
-------------------------------

             Summary: Add a metric to show failover count regarding fine 
grained recovery
                 Key: FLINK-14164
                 URL: https://issues.apache.org/jira/browse/FLINK-14164
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.9.0, 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0


Previously, Flink used the restart-all strategy to recover jobs from failures, 
and the metric "fullRestart" showed the number of failovers.

However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" 
metric only reveals how many times the entire graph has been restarted. It does 
not include fine-grained failure recoveries.

As many users want to build their job alerting on failovers, I propose adding 
a new metric, {{numberOfFailures}} or {{numberOfRestarts}}, which also counts 
fine-grained recoveries.
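
To illustrate the intended semantics, here is a minimal sketch in plain Java. 
It is not Flink's scheduler or metrics API: the class RestartCounters and its 
method names are hypothetical, and only the metric names come from this issue. 
The assumption is that a full restart counts toward both metrics, while a 
fine-grained (region) restart counts only toward the new one.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not part of Flink.
final class RestartCounters {
    // Existing behavior: counts restarts of the entire graph.
    private final AtomicLong fullRestarts = new AtomicLong();
    // Proposed behavior: counts every recovery, full or fine-grained.
    private final AtomicLong numberOfRestarts = new AtomicLong();

    // Called when the whole job graph is restarted.
    void onFullRestart() {
        fullRestarts.incrementAndGet();
        numberOfRestarts.incrementAndGet();
    }

    // Called when only the affected region is restarted (fine-grained recovery).
    void onRegionRestart() {
        numberOfRestarts.incrementAndGet();
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }
}
```

With one full restart and two region restarts, the first counter would read 1 
and the second 3, so an alert built on the second counter would also fire for 
fine-grained recoveries.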



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
