Maximilian Michels created FLINK-36717:
------------------------------------------

Summary: Add health check to detect tasks stuck in DEPLOYING state
Key: FLINK-36717
URL: https://issues.apache.org/jira/browse/FLINK-36717
Project: Flink
Issue Type: New Feature
Components: Kubernetes Operator
Reporter: Maximilian Michels

The operator has an opt-in feature for monitoring Flink cluster health. To enable it, set kubernetes.operator.cluster.health-check.enabled: true. When enabled, the ClusterHealthObserver, triggered by the ApplicationReconciler, collects health-related metrics from the Flink cluster: the number of restarts, the last restart timestamp, the number of completed checkpoints, and the last completed checkpoint timestamp. The ClusterHealthEvaluator then analyzes this information to determine whether the Flink cluster is healthy.

Recently, users have reported that some TaskManagers have tasks stuck in the DEPLOYING state. The cause is a faulty network connection that makes TCP reads extremely slow while the user jar is fetched from S3. Restarting the affected TaskManager pods resolves the issue.

The goal of this ticket is to have the operator automatically restart TaskManagers that have tasks stuck in the DEPLOYING state. To achieve this, we can monitor how long tasks remain in DEPLOYING and restart their TaskManagers once a configured timeout is exceeded.

We must be careful not to include jobs with large state restores, which can legitimately take a long time. Fortunately, tasks are in the INITIALIZING state during state restoration, which makes them easy to distinguish from tasks in DEPLOYING, where the task is still being set up.
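
The timeout check described above could be sketched roughly as follows. This is only an illustration of the idea, not operator code: the class DeployingTimeoutSketch, the TaskObservation type, and the way the timeout is passed in are all hypothetical, and the ticket does not define a configuration key for the timeout. The state names mirror Flink's ExecutionState values.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not operator code: flags TaskManagers whose tasks
// stay in DEPLOYING longer than a configured timeout.
class DeployingTimeoutSketch {

    // Subset of task states; the names mirror Flink's ExecutionState values.
    enum TaskState { DEPLOYING, INITIALIZING, RUNNING }

    // Hypothetical observation of a single task at one point in time.
    static final class TaskObservation {
        final String taskId;
        final String taskManagerId;
        final TaskState state;

        TaskObservation(String taskId, String taskManagerId, TaskState state) {
            this.taskId = taskId;
            this.taskManagerId = taskManagerId;
            this.state = state;
        }
    }

    // Assumed to come from operator configuration; no key is defined in the ticket.
    private final Duration timeout;

    // First time each task was observed in DEPLOYING; cleared when it leaves that state.
    private final Map<String, Instant> deployingSince = new HashMap<>();

    DeployingTimeoutSketch(Duration timeout) {
        this.timeout = timeout;
    }

    // Returns the IDs of TaskManagers hosting tasks that exceeded the DEPLOYING timeout.
    Set<String> observe(List<TaskObservation> tasks, Instant now) {
        Set<String> stuckTaskManagers = new HashSet<>();
        Set<String> seenTaskIds = new HashSet<>();
        for (TaskObservation task : tasks) {
            seenTaskIds.add(task.taskId);
            if (task.state != TaskState.DEPLOYING) {
                // INITIALIZING (state restore) and RUNNING never count toward the timeout.
                deployingSince.remove(task.taskId);
                continue;
            }
            Instant since = deployingSince.computeIfAbsent(task.taskId, id -> now);
            if (Duration.between(since, now).compareTo(timeout) > 0) {
                stuckTaskManagers.add(task.taskManagerId);
            }
        }
        // Forget tasks that no longer appear in the observation.
        deployingSince.keySet().retainAll(seenTaskIds);
        return stuckTaskManagers;
    }

    public static void main(String[] args) {
        DeployingTimeoutSketch check = new DeployingTimeoutSketch(Duration.ofMinutes(5));
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        List<TaskObservation> tasks = List.of(
                new TaskObservation("task-1", "tm-1", TaskState.DEPLOYING),
                new TaskObservation("task-2", "tm-2", TaskState.INITIALIZING));
        check.observe(tasks, start);
        // Six minutes later task-1 is still DEPLOYING, so tm-1 is reported;
        // tm-2 is not, because INITIALIZING is exempt.
        System.out.println(check.observe(tasks, start.plus(Duration.ofMinutes(6))));
    }
}
```

The sketch measures time from the first observation in DEPLOYING rather than from the actual state transition, so a real implementation would probably prefer a timestamp reported by the cluster if one is available.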