Maximilian Michels created FLINK-36717:
------------------------------------------

             Summary: Add health check to detect tasks stuck in DEPLOYING state
                 Key: FLINK-36717
                 URL: https://issues.apache.org/jira/browse/FLINK-36717
             Project: Flink
          Issue Type: New Feature
          Components: Kubernetes Operator
            Reporter: Maximilian Michels


The operator has an opt-in feature for monitoring Flink cluster health. 
To enable it, set kubernetes.operator.cluster.health-check.enabled: true.

If enabled, the ClusterHealthObserver, triggered by the ApplicationReconciler, 
collects various health-related metrics from the Flink cluster, such as the 
number of restarts, the last restart timestamp, the number of completed 
checkpoints, and the last completed checkpoint timestamp.

The ClusterHealthEvaluator then analyzes this information to determine whether 
the Flink cluster is healthy or not.

Recently, users have reported an issue where tasks on some TaskManagers get 
stuck in the DEPLOYING state because a faulty network connection causes 
extremely slow TCP reads while fetching the user jar from S3. Restarting the 
affected TaskManager pods resolves the issue.

The goal of this ticket is to add a feature to the operator that automatically 
restarts TaskManagers with tasks stuck in the DEPLOYING state. To achieve 
this, we can monitor how long tasks remain in DEPLOYING and restart the 
TaskManagers once a configured timeout is exceeded. We must be careful not to 
include jobs with large state restores, which can take a long time. 
Fortunately, a task is in the INITIALIZING state during state restoration, 
which makes it easy to distinguish from DEPLOYING, when the task is still 
being set up.
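
A rough sketch of the timeout tracking described above, for discussion only. 
Everything here is hypothetical: DeployingTimeoutTracker, its methods, and the 
local TaskState enum are illustrative names and not existing operator classes. 
It only shows the idea of remembering when a task was first seen in DEPLOYING 
and treating INITIALIZING (state restore) as healthy progress.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not part of the operator code base. */
final class DeployingTimeoutTracker {

    /** Illustrative subset of task states; only these three matter here. */
    enum TaskState { DEPLOYING, INITIALIZING, RUNNING }

    private final Duration timeout;

    // Task id -> time the task was first observed in DEPLOYING.
    private final Map<String, Instant> deployingSince = new HashMap<>();

    DeployingTimeoutTracker(Duration timeout) {
        this.timeout = timeout;
    }

    /** Records one observation of a task's state at the given time. */
    void observe(String taskId, TaskState state, Instant now) {
        if (state == TaskState.DEPLOYING) {
            // Keep the earliest timestamp so repeated observations
            // do not reset the clock.
            deployingSince.putIfAbsent(taskId, now);
        } else {
            // INITIALIZING (state restore) or RUNNING: the task left
            // DEPLOYING, so it is no longer a candidate.
            deployingSince.remove(taskId);
        }
    }

    /** Returns ids of tasks that have been in DEPLOYING longer than the timeout. */
    Set<String> stuckTasks(Instant now) {
        Set<String> stuck = new HashSet<>();
        for (Map.Entry<String, Instant> entry : deployingSince.entrySet()) {
            Duration inDeploying = Duration.between(entry.getValue(), now);
            if (inDeploying.compareTo(timeout) > 0) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

The sketch keys on task ids and leaves out how a stuck task maps back to its 
TaskManager pod and how the restart itself is triggered. Both would depend on 
what the ClusterHealthObserver can collect from the cluster.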



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
