[ https://issues.apache.org/jira/browse/FLINK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Zhen Wu updated FLINK-17531: ----------------------------------- Description: like to discuss the value of a new checkpoint Guage metric: `elapsedTimeSinceLastCompletedCheckpoint`. Main motivation is for alerting. I know reasons below are somewhat related to our setup. Hence want to explore the interest of the community. *What do we want to achieve?* We want to alert if no successful checkpoint happened for a specific period. With this new metric, we can set up a simple alerting rule like `alert if elapsedTimeSinceLastCompletedCheckpoint > N minutes`. It is a good alerting pattern of `time since last success`. We found `elapsedTimeSinceLastCompletedCheckpoint` metric very easy and intuitive to set up alert against. *What out existing checkpoint metrics?* `numberOfCompletedCheckpoints`. We can set up an alert like `alert if derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, it is an anti-pattern for our alerting system, as it is looking for lack of good signal (vs explicit bad signal). Such an anti-pattern is easier to suffer false alarm problem when there is occasional metric drop or alerting system processing issue. `numberOfFailedCheckpoints`. That is an explicit failure signal, which is good. We can set up alert like `alert if derivative(numberOfFailedCheckpoints) > 0 in X out Y minutes`. We have some high-parallelism large-state jobs. Their normal checkpoint duration is <1-2 minutes. However, when recovering from an outage with large backlog, sometimes subtasks from one or a few containers experienced super high back pressure. It took checkpoint barrier sometimes more than an hour to travel through the DAG to those heavy back pressured subtasks. Causes of the back pressure are likely due to multi-tenancy environment and performance variation among containers. Instead of letting checkpoint to time out in this case, we decided to increase checkpoint timeout value to crazy long value (like 2 hours). With that, we kind of missed the explicit "bad" signal of failed/timed out checkpoint. In theory, one could argue that we can set checkpoint timeout to infinity. It is always better to have a long but completed checkpoint than a timed out checkpoint, as timed out checkpoint basically give up its positions in the queue and new checkpoint just reset the positions back to the end of the queue . Note that we are using at least checkpoint semantics. So there is no barrier alignment concern. FLIP-76 (unaligned checkpoints) can help checkpoint dealing with back pressure better. It is not ready now and also has its limitations. That is a separate discussion. was: like to discuss the value of a new checkpoint Guage metric: `elapsedTimeSinceLastCompletedCheckpoint`. Main motivation is for alerting. I know reasons below are somewhat related to our setup. Hence want to explore the interest of the community. *What do we want to achieve?* We want to alert if no successful checkpoint happened for a specific period. With this new metric, we can set up a simple alerting rule like `alert if elapsedTimeSinceLastCompletedCheckpoint > N minutes`. It is a good alerting pattern of `time since last success`. We found `elapsedTimeSinceLastCompletedCheckpoint` very intuitive to set up alert against. *What out existing checkpoint metrics?* `numberOfCompletedCheckpoints`. We can set up an alert like `alert if derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, it is an anti-pattern for our alerting system, as it is looking for lack of good signal (vs explicit bad signal). Such an anti-pattern is easier to suffer false alarm problem when there is occasional metric drop or alerting system processing issue. `numberOfFailedCheckpoints`. That is an explicit failure signal, which is good. We can set up alert like `alert if derivative(numberOfFailedCheckpoints) > 0 in X out Y minutes`. We have some high-parallelism large-state jobs. Their normal checkpoint duration is <1-2 minutes. However, when recovering from an outage with large backlog, sometimes subtasks from one or a few containers experienced super high back pressure. It took checkpoint barrier sometimes more than an hour to travel through the DAG to those heavy back pressured subtasks. Causes of the back pressure are likely due to multi-tenancy environment and performance variation among containers. Instead of letting checkpoint to time out in this case, we decided to increase checkpoint timeout value to crazy long value (like 2 hours). With that, we kind of missed the explicit "bad" signal of failed/timed out checkpoint. In theory, one could argue that we can set checkpoint timeout to infinity. It is always better to have a long but completed checkpoint than a timed out checkpoint, as timed out checkpoint basically give up its positions in the queue and new checkpoint just reset the positions back to the end of the queue . Note that we are using at least checkpoint semantics. So there is no barrier alignment concern. FLIP-76 (unaligned checkpoints) can help checkpoint dealing with back pressure better. It is not ready now and also has its limitations. That is a separate discussion. > Add a new checkpoint Guage metric: elapsedTimeSinceLastCompletedCheckpoint > -------------------------------------------------------------------------- > > Key: FLINK-17531 > URL: https://issues.apache.org/jira/browse/FLINK-17531 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing > Affects Versions: 1.10.0 > Reporter: Steven Zhen Wu > Priority: Major > > like to discuss the value of a new checkpoint Guage metric: > `elapsedTimeSinceLastCompletedCheckpoint`. Main motivation is for alerting. I > know reasons below are somewhat related to our setup. Hence want to explore > the interest of the community. > > *What do we want to achieve?* > We want to alert if no successful checkpoint happened for a specific period. > With this new metric, we can set up a simple alerting rule like `alert if > elapsedTimeSinceLastCompletedCheckpoint > N minutes`. It is a good alerting > pattern of `time since last success`. We found > `elapsedTimeSinceLastCompletedCheckpoint` metric very easy and intuitive to > set up alert against. > > *What out existing checkpoint metrics?* > `numberOfCompletedCheckpoints`. We can set up an alert like `alert if > derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, it is > an anti-pattern for our alerting system, as it is looking for lack of good > signal (vs explicit bad signal). Such an anti-pattern is easier to suffer > false alarm problem when there is occasional metric drop or alerting system > processing issue. > > `numberOfFailedCheckpoints`. That is an explicit failure signal, which is > good. We can set up alert like `alert if > derivative(numberOfFailedCheckpoints) > 0 in X out Y minutes`. We have some > high-parallelism large-state jobs. Their normal checkpoint duration is <1-2 > minutes. However, when recovering from an outage with large backlog, > sometimes subtasks from one or a few containers experienced super high back > pressure. It took checkpoint barrier sometimes more than an hour to travel > through the DAG to those heavy back pressured subtasks. Causes of the back > pressure are likely due to multi-tenancy environment and performance > variation among containers. Instead of letting checkpoint to time out in this > case, we decided to increase checkpoint timeout value to crazy long value > (like 2 hours). With that, we kind of missed the explicit "bad" signal of > failed/timed out checkpoint. > > In theory, one could argue that we can set checkpoint timeout to infinity. It > is always better to have a long but completed checkpoint than a timed out > checkpoint, as timed out checkpoint basically give up its positions in the > queue and new checkpoint just reset the positions back to the end of the > queue . Note that we are using at least checkpoint semantics. So there is no > barrier alignment concern. FLIP-76 (unaligned checkpoints) can help > checkpoint dealing with back pressure better. It is not ready now and also > has its limitations. That is a separate discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)