Nicolas Fraison created FLINK-34131:
---------------------------------------
Summary: Checkpoint check window should take in account checkpoint
job configuration
Key: FLINK-34131
URL: https://issues.apache.org/jira/browse/FLINK-34131
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Reporter: Nicolas Fraison
When the checkpoint progress check
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is
enabled to define cluster health, the operator relies on detecting whether a
checkpoint has been performed during the
kubernetes.operator.cluster.health-check.checkpoint-progress.window.
As indicated in the documentation, this window must be larger than the
checkpointing interval.
However, this is a manual setting, which can lead to misconfiguration and
unwanted restarts of the Flink cluster if the checkpointing interval is larger
than the window.
The operator must check that the configuration is valid before relying on this
check. If the window is not set correctly, the operator should skip the check
(return true from
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
and log a WARN message.
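The guard described above could look roughly like the following sketch. It is not the operator's actual code: the class CheckpointWindowGuard, its method names, and the use of java.util.logging are all assumptions made to keep the example self-contained, and the real evaluateCheckpoints method has a different signature.

```java
import java.time.Duration;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of flink-kubernetes-operator. */
final class CheckpointWindowGuard {
    private static final Logger LOG =
            Logger.getLogger(CheckpointWindowGuard.class.getName());

    /** The window is usable only if it is strictly larger than the checkpoint interval. */
    static boolean isWindowUsable(Duration window, Duration checkpointInterval) {
        return window.compareTo(checkpointInterval) > 0;
    }

    /**
     * Stand-in for the checkpoint health evaluation. When the window is
     * misconfigured, the check is skipped and the cluster is reported healthy
     * (true) after a warning is logged. Otherwise the result of the actual
     * progress check (passed in here as a boolean) is returned.
     */
    static boolean evaluate(
            Duration window, Duration checkpointInterval, boolean checkpointSeenInWindow) {
        if (!isWindowUsable(window, checkpointInterval)) {
            LOG.warning(String.format(
                    "Checkpoint progress window %s is not larger than the checkpointing "
                            + "interval %s; skipping the checkpoint progress check.",
                    window, checkpointInterval));
            return true;
        }
        return checkpointSeenInWindow;
    }

    public static void main(String[] args) {
        // Window (5 min) smaller than interval (10 min): the check is skipped.
        boolean healthy = evaluate(Duration.ofMinutes(5), Duration.ofMinutes(10), false);
        System.out.println("healthy = " + healthy);
    }
}
```

Returning true on a misconfigured window trades a possibly missed unhealthy cluster for the absence of spurious restarts, which is the behavior this issue asks for.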
Flink jobs also have other checkpointing parameters that should be taken into
account for this window configuration: execution.checkpointing.timeout
and execution.checkpointing.tolerable-failed-checkpoints.
The idea would be to check that
kubernetes.operator.cluster.health-check.checkpoint-progress.window is >=
(execution.checkpointing.interval + execution.checkpointing.timeout) *
execution.checkpointing.tolerable-failed-checkpoints
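
As a rough illustration of the proposed bound, the sketch below computes the minimum window from the three job settings and compares it with the configured window. The class MinimumCheckpointWindow and its method names are hypothetical, and the formula is taken as written in this issue. Note that with a tolerable-failed-checkpoints value of 0 the formula yields a minimum of 0, so an implementation may need to handle that case separately.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of flink-kubernetes-operator. */
final class MinimumCheckpointWindow {

    /** Computes (interval + timeout) * tolerableFailedCheckpoints, as proposed in the issue. */
    static Duration minimumWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        return interval.plus(timeout).multipliedBy(tolerableFailedCheckpoints);
    }

    /** Returns true when the configured window is >= the computed minimum. */
    static boolean isWindowSufficient(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints) {
        Duration minimum = minimumWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(minimum) >= 0;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        Duration interval = Duration.ofMinutes(1);
        Duration timeout = Duration.ofMinutes(10);
        int tolerableFailures = 3;

        // (1 min + 10 min) * 3 = 33 min
        System.out.println(minimumWindow(interval, timeout, tolerableFailures));
        System.out.println(
                isWindowSufficient(Duration.ofMinutes(30), interval, timeout, tolerableFailures));
    }
}
```

With these example values a 30 minute window would be rejected, because three consecutive failed checkpoints, each running to its timeout, could take up to 33 minutes.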