[ https://issues.apache.org/jira/browse/FLINK-36717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora reassigned FLINK-36717:
----------------------------------

    Assignee: Swapna Marru  (was: Maximilian Michels)

> Add health check to detect tasks stuck in DEPLOYING state
> ---------------------------------------------------------
>
>                 Key: FLINK-36717
>                 URL: https://issues.apache.org/jira/browse/FLINK-36717
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kubernetes Operator
>            Reporter: Maximilian Michels
>            Assignee: Swapna Marru
>            Priority: Major
>
> The operator has an opt-in feature for monitoring Flink cluster health. To
> enable it, set kubernetes.operator.cluster.health-check.enabled: true.
> If enabled, the ClusterHealthObserver, triggered by the
> ApplicationReconciler, collects various health-related metrics from the
> Flink cluster, such as the number of restarts, the last restart timestamp,
> the number of completed checkpoints, and the last completed checkpoint
> timestamp. The ClusterHealthEvaluator then analyzes this information to
> determine whether the Flink cluster is healthy.
> Recently, users have reported an issue where some TaskManagers get stuck
> with tasks in the DEPLOYING state due to a faulty network connection, which
> causes extremely slow TCP reads while fetching the user jar from S3.
> Restarting the TaskManager pods resolves the issue.
> The goal of this ticket is to add a feature to the operator that
> automatically restarts TaskManagers which have tasks stuck in the DEPLOYING
> state. To achieve this, we can monitor how long tasks remain in DEPLOYING
> and restart the affected TaskManagers after a configured timeout. We must
> be careful not to include jobs with large state restores, which can take a
> long time. Fortunately, a task is in the INITIALIZING state during state
> restoration, which makes it easy to distinguish from DEPLOYING, where the
> task is still being set up.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
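
The timeout check proposed in the ticket could be sketched roughly as follows. This is an illustration of the logic only, not operator code: the class DeployingTimeoutTracker, its observe method, and the (task_id, task_manager_id, state) tuple shape are hypothetical names chosen for this example, and the ticket does not say how the operator would obtain task states or trigger the restart. Only the DEPLOYING and INITIALIZING state names come from the ticket.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DeployingTimeoutTracker:
    """Hypothetical helper: tracks how long each task has been in DEPLOYING."""

    timeout: timedelta
    # task id -> time the task was first observed in DEPLOYING
    first_seen: dict = field(default_factory=dict)

    def observe(self, tasks, now):
        """Record one observation and return the TaskManagers to restart.

        `tasks` is an iterable of (task_id, task_manager_id, state) tuples
        (an assumed shape for this sketch).
        """
        stuck_task_managers = set()
        still_deploying = set()
        for task_id, tm_id, state in tasks:
            if state != "DEPLOYING":
                # INITIALIZING (state restore), RUNNING, etc. are never counted.
                continue
            still_deploying.add(task_id)
            since = self.first_seen.setdefault(task_id, now)
            if now - since >= self.timeout:
                stuck_task_managers.add(tm_id)
        # Forget tasks that left DEPLOYING so a later redeploy starts a new timer.
        for task_id in list(self.first_seen):
            if task_id not in still_deploying:
                del self.first_seen[task_id]
        return stuck_task_managers


if __name__ == "__main__":
    tracker = DeployingTimeoutTracker(timeout=timedelta(minutes=5))
    start = datetime(2024, 11, 14, 12, 0)
    snapshot = [
        ("task-1", "tm-a", "DEPLOYING"),
        ("task-2", "tm-b", "INITIALIZING"),  # long state restore, ignored
    ]
    tracker.observe(snapshot, start)
    print(tracker.observe(snapshot, start + timedelta(minutes=6)))
```

Keying the timer on the first observation in DEPLOYING, and clearing it as soon as the task leaves that state, keeps a task that is restoring a large state in INITIALIZING from ever being counted toward the timeout.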