ashangit commented on code in PR #758:
URL:
https://github.com/apache/flink-kubernetes-operator/pull/758#discussion_r1457417705
##########
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java:
##########
@@ -174,6 +175,25 @@ private boolean evaluateCheckpoints(
return true;
}
+ var completedCheckpointsCheckWindow =
+
configuration.get(OPERATOR_CLUSTER_HEALTH_CHECK_CHECKPOINT_PROGRESS_WINDOW);
+
+ CheckpointConfig checkpointConfig = new CheckpointConfig();
+ checkpointConfig.configure(configuration);
+ var checkpointingInterval = checkpointConfig.getCheckpointInterval();
+ var checkpointingTimeout = checkpointConfig.getCheckpointTimeout();
+ var tolerationFailureNumber =
checkpointConfig.getTolerableCheckpointFailureNumber() + 1;
+ if (completedCheckpointsCheckWindow.toMillis()
+ < Math.max(
+ checkpointingInterval * tolerationFailureNumber,
+ checkpointingTimeout * tolerationFailureNumber)) {
+ LOG.warn(
+ "cluster.health-check.checkpoint-progress.window is not
long enough. "
+ + "It must be greater than
max(execution.checkpointing.interval *
execution.checkpointing.tolerable-failed-checkpoints, "
+ + "execution.checkpointing.timeout *
execution.checkpointing.tolerable-failed-checkpoints)");
+ return true;
+ }
Review Comment:
Looks fine to me to default to appropriate value but I'm not sure to
understand why defaulting to `2 * cpInterval` instead of
`Math.max(checkpointingInterval * tolerationFailureNumber,checkpointingTimeout
* tolerationFailureNumber)`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]