[ https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicolas Fraison updated FLINK-34131:
------------------------------------
    Priority: Minor  (was: Major)

> Checkpoint check window should take in account checkpoint job configuration
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-34131
>                 URL: https://issues.apache.org/jira/browse/FLINK-34131
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Nicolas Fraison
>            Priority: Minor
>
> When the checkpoint progress check
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is
> enabled to define cluster health, the operator relies on detecting whether a
> checkpoint has been performed during the
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, the window must be bigger than the checkpointing
> interval.
> But this is a manual configuration, which can lead to misconfiguration and
> unwanted restarts of the Flink cluster if the checkpointing interval is bigger
> than the window.
> The operator must check that the config is healthy before relying on this
> check. If it is not set correctly, it should not execute the check (return true in
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
> and log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into
> account for this window configuration:
> execution.checkpointing.timeout and
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that
> kubernetes.operator.cluster.health-check.checkpoint-progress.window is >=
> (execution.checkpointing.interval + execution.checkpointing.timeout) *
> execution.checkpointing.tolerable-failed-checkpoints.
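
The proposed validation can be sketched in a few lines of Java. This is only an illustration of the formula stated in the issue, not the operator's actual code: the class CheckpointWindowCheck and its methods are hypothetical names, and the values are passed in as plain java.time.Duration arguments rather than read from the Flink configuration. Following the formula literally, a tolerable-failed-checkpoints value of 0 yields a minimum window of zero, so any window passes; whether that case needs special handling is left open here.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the proposed check; not part of the operator code base. */
final class CheckpointWindowCheck {

    private CheckpointWindowCheck() {}

    /**
     * Minimum window according to the formula in the issue:
     * (interval + timeout) * tolerableFailedCheckpoints.
     */
    static Duration minimumWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        return interval.plus(timeout).multipliedBy(tolerableFailedCheckpoints);
    }

    /**
     * Returns true when the configured window is >= the minimum window.
     * A caller could skip the checkpoint progress check and log a WARN
     * message when this returns false.
     */
    static boolean isWindowSufficient(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints) {
        Duration minimum = minimumWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(minimum) >= 0;
    }
}
```

For example, with a 1 minute interval, a 10 minute timeout, and 2 tolerable failed checkpoints, the minimum window is (1 + 10) * 2 = 22 minutes, so a 5 minute window would be treated as misconfigured.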