[ 
https://issues.apache.org/jira/browse/FLINK-36717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora reassigned FLINK-36717:
----------------------------------

    Assignee: Swapna Marru  (was: Maximilian Michels)

> Add health check to detect tasks stuck in DEPLOYING state
> ---------------------------------------------------------
>
>                 Key: FLINK-36717
>                 URL: https://issues.apache.org/jira/browse/FLINK-36717
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kubernetes Operator
>            Reporter: Maximilian Michels
>            Assignee: Swapna Marru
>            Priority: Major
>
> The operator has an opt-in feature for monitoring Flink cluster health. To 
> enable it, set kubernetes.operator.cluster.health-check.enabled: true.
> If enabled, the ClusterHealthObserver, triggered by the 
> ApplicationReconciler, collects various health-related metrics from the Flink 
> cluster, such as the number of restarts, the last restart timestamp, the 
> number of completed checkpoints, and the last completed checkpoint timestamp.
> The ClusterHealthEvaluator then analyzes this information to determine 
> whether the Flink cluster is healthy or not.
> Recently, users have reported an issue where tasks on some TaskManagers get 
> stuck in the DEPLOYING state due to a faulty network connection that causes 
> extremely slow TCP reads while fetching the user jar from S3. Restarting the 
> TaskManager pods resolves this issue.
> The goal of this ticket is to add a feature to the operator to automatically 
> restart TaskManagers that have tasks stuck in the DEPLOYING state. To achieve 
> this, we can monitor how long tasks remain in the DEPLOYING state and restart 
> the TaskManagers after a configured timeout. We must be careful not to 
> include jobs with large state restores, which can take a long time. 
> Fortunately, a task is in the INITIALIZING state during state restoration, 
> which makes it easy to distinguish from DEPLOYING, when the task is still 
> being set up.
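
The timeout check described in the ticket could be sketched roughly as
follows. This is only an illustration under assumptions:
DeployingTimeoutTracker, TaskSnapshot, and observe are hypothetical names that
do not come from the operator's code base, and task states are passed as plain
strings instead of Flink's own types.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; not the operator's actual implementation. */
public class DeployingTimeoutTracker {

    /** Minimal stand-in for one task as seen in a single observation. */
    public static final class TaskSnapshot {
        final String taskId;
        final String taskManagerId;
        final String state;

        public TaskSnapshot(String taskId, String taskManagerId, String state) {
            this.taskId = taskId;
            this.taskManagerId = taskManagerId;
            this.state = state;
        }
    }

    private final Duration timeout;
    // Task id -> time at which the task was first observed in DEPLOYING.
    private final Map<String, Instant> deployingSince = new HashMap<>();

    public DeployingTimeoutTracker(Duration timeout) {
        this.timeout = timeout;
    }

    /**
     * Records one observation and returns the ids of TaskManagers that host a
     * task which has been in DEPLOYING for at least the configured timeout.
     * Tasks in any other state, including INITIALIZING, are ignored.
     */
    public Set<String> observe(Iterable<TaskSnapshot> tasks, Instant now) {
        Set<String> stillDeploying = new HashSet<>();
        Set<String> stuckTaskManagers = new TreeSet<>();
        for (TaskSnapshot task : tasks) {
            if (!"DEPLOYING".equals(task.state)) {
                continue;
            }
            stillDeploying.add(task.taskId);
            Instant previous = deployingSince.putIfAbsent(task.taskId, now);
            Instant since = (previous == null) ? now : previous;
            if (Duration.between(since, now).compareTo(timeout) >= 0) {
                stuckTaskManagers.add(task.taskManagerId);
            }
        }
        // Forget tasks that left DEPLOYING so that a later redeploy starts a fresh timer.
        deployingSince.keySet().retainAll(stillDeploying);
        return stuckTaskManagers;
    }
}
```

In this sketch the caller would invoke observe once per health-check cycle and
restart the returned TaskManagers. How the real feature obtains task states and
triggers restarts is left open here.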



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
