[jira] [Updated] (FLINK-34131) Checkpoint check window should take in account checkpoint job configuration

Nicolas Fraison (Jira) Thu, 18 Jan 2024 00:26:04 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicolas Fraison updated FLINK-34131:
------------------------------------
    Priority: Minor  (was: Major)

> Checkpoint check window should take in account checkpoint job configuration
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-34131
>                 URL: https://issues.apache.org/jira/browse/FLINK-34131
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Nicolas Fraison
>            Priority: Minor
>
> When the checkpoint progress check 
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is 
> enabled to define cluster health, the operator relies on detecting whether a 
> checkpoint has been performed during the 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, this window must be larger than the checkpointing interval.
> But this is a manual configuration, which can lead to misconfiguration and 
> unwanted restarts of the Flink cluster if the checkpointing interval is larger 
> than the window.
> The operator must check that the configuration is valid before relying on this 
> check. If it is not set correctly, it should not execute the check (return true in 
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
>  and should log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into 
> account for this window configuration, namely 
> execution.checkpointing.timeout and 
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window is >= 
> (execution.checkpointing.interval + execution.checkpointing.timeout) * 
> execution.checkpointing.tolerable-failed-checkpoints
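
The proposed validation could be sketched roughly as follows. This is only an illustration of the formula above, not the operator's actual code: the class name CheckpointWindowCheck and both method names are hypothetical, and the configuration values are passed in as plain java.time.Duration and int arguments rather than read from Flink configuration. Note that, as the formula is written, the bound evaluates to zero when tolerable-failed-checkpoints is 0.

```java
import java.time.Duration;

// Hypothetical helper illustrating the check proposed in the ticket.
public class CheckpointWindowCheck {

    // Lower bound from the ticket:
    // (interval + timeout) * tolerable-failed-checkpoints
    static Duration minimumWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        return interval.plus(timeout).multipliedBy(tolerableFailedCheckpoints);
    }

    // True when the configured window is at least the lower bound,
    // i.e. the checkpoint progress check can be relied upon.
    static boolean isWindowSufficient(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints) {
        Duration required = minimumWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(required) >= 0;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        Duration window = Duration.ofMinutes(5);
        Duration interval = Duration.ofMinutes(1);
        Duration timeout = Duration.ofMinutes(10);
        int tolerableFailedCheckpoints = 3;

        if (!isWindowSufficient(window, interval, timeout, tolerableFailedCheckpoints)) {
            // In the operator this would be a WARN log, and the health
            // check would be skipped (treated as healthy).
            System.out.println(
                    "WARN: checkpoint progress window " + window
                            + " is smaller than required "
                            + minimumWindow(interval, timeout, tolerableFailedCheckpoints)
                            + "; skipping checkpoint progress check");
        }
    }
}
```

With these example values the required window is (1 min + 10 min) * 3 = 33 minutes, so a 5-minute window would be reported as too small and the check skipped.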



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
