
Maximilian Michels reassigned FLINK-36717:

    Assignee: Maximilian Michels

> Add health check to detect tasks stuck in DEPLOYING state
> ---------------------------------------------------------
>                 Key: FLINK-36717
>                 URL: https://issues.apache.org/jira/browse/FLINK-36717
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kubernetes Operator
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
> The operator has an opt-in feature for monitoring Flink cluster health. 
> To enable it, set kubernetes.operator.cluster.health-check.enabled: 
> true.
> If enabled, the ClusterHealthObserver, triggered by the 
> ApplicationReconciler, collects various health-related metrics from the Flink 
> cluster, such as the number of restarts, the last restart timestamp, the 
> number of completed checkpoints, and the last completed checkpoint timestamp.
> The ClusterHealthEvaluator then analyzes this information to determine 
> whether the Flink cluster is healthy or not.
> Recently, users have reported an issue where some TaskManagers get stuck in 
> the task state DEPLOYING due to a faulty network connection, causing 
> extremely slow TCP reads while fetching the user jar from S3. Restarting the 
> TaskManager pods resolves this issue.
> The goal of this ticket is to add a feature to the operator to automatically 
> restart TaskManagers that have tasks stuck in the DEPLOYING state. To achieve 
> this, we can monitor how long tasks remain in the DEPLOYING state and decide 
> to restart the TaskManagers after a configured timeout. We must be careful to 
> ensure that we don't include jobs with large state restores, which can take a 
> long time. Fortunately, the task state is INITIALIZING during state 
> restoration, making it easily distinguishable from DEPLOYING, when the task 
> is still being set up.
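
A minimal sketch of the timeout check described above, assuming the operator can
obtain a map of task id to task state on each observation. The class name
DeployingTimeoutTracker, the observe method, and the use of plain strings for
task states are hypothetical and are not part of the operator's existing API.
The tracker remembers when each task was first seen in DEPLOYING and reports it
once the configured timeout is exceeded. Tasks in INITIALIZING are never
reported, so long state restores do not trigger a restart.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; not the operator's actual implementation. */
public class DeployingTimeoutTracker {
    private final Duration timeout;
    // Time at which each task was first observed in DEPLOYING.
    private final Map<String, Instant> deployingSince = new HashMap<>();

    public DeployingTimeoutTracker(Duration timeout) {
        this.timeout = timeout;
    }

    /**
     * Records one observation of task states (task id -> state name) and
     * returns the ids of tasks that have stayed in DEPLOYING longer than
     * the timeout.
     */
    public Set<String> observe(Map<String, String> taskStates, Instant now) {
        // Forget tasks that left DEPLOYING (for example to INITIALIZING or
        // RUNNING) or that no longer exist.
        deployingSince.keySet()
                .removeIf(id -> !"DEPLOYING".equals(taskStates.get(id)));

        Set<String> stuck = new TreeSet<>();
        for (Map.Entry<String, String> entry : taskStates.entrySet()) {
            if (!"DEPLOYING".equals(entry.getValue())) {
                continue; // INITIALIZING and other states are ignored
            }
            Instant since =
                    deployingSince.computeIfAbsent(entry.getKey(), id -> now);
            if (Duration.between(since, now).compareTo(timeout) > 0) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

A caller would invoke observe on every health check and restart the
TaskManagers hosting the returned tasks. How tasks map to TaskManager pods is
left out of this sketch.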

This message was sent by Atlassian Jira
