gaborgsomogyi opened a new pull request, #513: URL: https://github.com/apache/flink-kubernetes-operator/pull/513
## What is the purpose of the change

Some workloads get stuck in such a way that they stay in the RUNNING state most of the time but cannot make progress or complete checkpoints. The operator must detect such cases. This PR adds the option to have the operator watch the number of successful checkpoints. If the feature is enabled via `cluster.health-check.completed-checkpoints.enabled` and no checkpoint completes successfully within the window defined by `cluster.health-check.completed-checkpoints.window`, the operator considers the deployment unhealthy and re-creates it.

## Brief change log

* Added config `cluster.health-check.completed-checkpoints.enabled`
* Added config `cluster.health-check.completed-checkpoints.window`
* Added watching of the number of successful checkpoints

## Verifying this change

Changed/added automated tests and verified manually on Minikube (a stateless job without checkpoints was restarted repeatedly).

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: no

## Documentation

- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? docs
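
## Illustrative sketch of the check

For readers who want a concrete picture of the rule described under "What is the purpose of the change", here is a minimal sketch. It is not the code in this PR: the class `CheckpointProgressCheck` and its method `observe` are hypothetical names invented for illustration, and treating the deployment as unhealthy once exactly one full window has elapsed is an assumption. The sketch only models the rule "no increase in the completed checkpoint count within the window means unhealthy".

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical illustration of a completed-checkpoint health check.
 * Not the operator's actual implementation.
 */
class CheckpointProgressCheck {
    // Corresponds to cluster.health-check.completed-checkpoints.window.
    private final Duration window;
    // Highest number of completed checkpoints seen so far.
    private long lastCompletedCount;
    // Time at which the completed count last increased (or the check started).
    private Instant lastProgress;

    CheckpointProgressCheck(Duration window, long initialCompletedCount, Instant now) {
        this.window = window;
        this.lastCompletedCount = initialCompletedCount;
        this.lastProgress = now;
    }

    /**
     * Records one observation of the completed checkpoint count.
     *
     * @return true if the deployment is considered healthy, false if no
     *     checkpoint has completed for at least one full window
     */
    boolean observe(long completedCount, Instant now) {
        if (completedCount > lastCompletedCount) {
            // A new checkpoint completed: remember it and restart the window.
            lastCompletedCount = completedCount;
            lastProgress = now;
            return true;
        }
        // No progress: healthy only while less than one window has elapsed
        // (assumption: the boundary itself counts as unhealthy).
        return Duration.between(lastProgress, now).compareTo(window) < 0;
    }
}
```

In this sketch the caller would invoke `observe` on each observation cycle with the latest completed checkpoint count and trigger re-creation of the deployment when it returns `false`. A job that never checkpoints would therefore be restarted once per window, which matches the Minikube behavior noted under "Verifying this change".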