gaborgsomogyi opened a new pull request, #513:
URL: https://github.com/apache/flink-kubernetes-operator/pull/513

   ## What is the purpose of the change
   
   Some workloads get stuck in such a way that they stay in RUNNING state 
most of the time but cannot make progress or complete checkpoints. The operator 
must detect such cases. This PR adds the option to have the operator watch the 
number of successful checkpoints. If the feature is enabled via 
`cluster.health-check.completed-checkpoints.enabled` and no checkpoint 
completes successfully within the window defined by 
`cluster.health-check.completed-checkpoints.window`, the operator considers 
the deployment unhealthy and re-creates it.
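   
   A minimal sketch of the window check described above, for illustration 
only. The class name `CheckpointProgressCheck` and the method `observe` are 
hypothetical and are not the operator's actual code. The sketch also assumes 
that any change in the completed-checkpoint counter (including a reset after 
re-creation) counts as progress.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper, not part of the operator: tracks checkpoint progress over a window. */
public class CheckpointProgressCheck {
    // Corresponds conceptually to cluster.health-check.completed-checkpoints.window.
    private final Duration window;
    private long lastCompletedCount = -1;
    private Instant lastProgress;

    public CheckpointProgressCheck(Duration window) {
        this.window = window;
    }

    /**
     * Records one observation of the completed-checkpoint counter and
     * returns true if the deployment is still considered healthy.
     */
    public boolean observe(long completedCheckpoints, Instant now) {
        if (lastProgress == null || completedCheckpoints != lastCompletedCount) {
            // First observation, or the counter changed: restart the window.
            lastCompletedCount = completedCheckpoints;
            lastProgress = now;
            return true;
        }
        // No new checkpoint: healthy only while the window has not elapsed.
        return Duration.between(lastProgress, now).compareTo(window) <= 0;
    }
}
```

   With a 5-minute window, an unchanged counter is tolerated for up to 5 
minutes after the last observed progress. After that, `observe` returns 
`false`, which is the point at which the deployment would be treated as 
unhealthy.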
   
   ## Brief change log
   
   * Added config `cluster.health-check.completed-checkpoints.enabled`
   * Added config `cluster.health-check.completed-checkpoints.window`
   * Added monitoring of the number of successful checkpoints
   
   ## Verifying this change
   
   Changed/added automated tests, plus manual verification on Minikube (a 
stateless job without checkpoints was restarted repeatedly).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: 
no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? docs
   

