Zhu Zhu created FLINK-14164:
-------------------------------

             Summary: Add a metric to show failover count regarding fine grained recovery
                 Key: FLINK-14164
                 URL: https://issues.apache.org/jira/browse/FLINK-14164
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.9.0, 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0

Previously, Flink used the restart-all strategy to recover jobs from failures, and the "fullRestarts" metric showed the number of failovers. However, with fine grained recovery introduced in 1.9.0, the "fullRestarts" metric only reveals how many times the entire graph has been restarted; it does not include fine grained failure recoveries.

As many users build their job alerting on failovers, I propose adding a new metric, {{numberOfFailures}}/{{numberOfRestarts}}, that also accounts for fine grained recoveries.
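
To illustrate the gap described above, here is a minimal sketch of how the two counts could diverge. It is only an illustration under assumptions: the class {{RestartCountersSketch}} and its methods are hypothetical names, not Flink classes or APIs, and the sketch says nothing about how the metric would actually be wired into the scheduler or the metric system.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not Flink code: tracks full-graph restarts separately
 * from a total that also counts fine grained (partial) recoveries.
 */
final class RestartCountersSketch {

    // Counts only restarts of the entire graph (what a full-restart metric reports).
    private final AtomicLong fullRestarts = new AtomicLong();

    // Counts every recovery, full or fine grained (the proposed metric).
    private final AtomicLong numberOfRestarts = new AtomicLong();

    /** Assumed callback: the whole job graph was restarted. */
    void onFullRestart() {
        fullRestarts.incrementAndGet();
        numberOfRestarts.incrementAndGet();
    }

    /** Assumed callback: only part of the graph was restarted. */
    void onFineGrainedRestart() {
        numberOfRestarts.incrementAndGet();
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }
}
```

In this sketch, a job that recovers three times through fine grained restarts and once through a full restart would report 1 for the full-restart count but 4 for the total, which is the number an alert on failovers would want to see.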