[jira] [Commented] (FLINK-35145) Add timeout for cluster termination

Nishita Pattanayak (Jira) Tue, 04 Mar 2025 07:17:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-35145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932348#comment-17932348
 ]


Nishita Pattanayak commented on FLINK-35145:
--------------------------------------------

Is anyone working on this? If not, I would like to give it a go. We have 
also seen a related problem during cluster cleanup when a FlinkSessionJob is 
already in a terminal state (FAILED). The operator first tries to cancel the 
FlinkSessionJobs, and the cancel is rejected with a message that the job is 
already in a terminal state. The FlinkDeployment still has the FlinkSessionJob 
tied to it, so cluster cleanup cannot proceed until the FlinkSessionJob 
resource is completely deleted.

> Add timeout for cluster termination
> -----------------------------------
>
>                 Key: FLINK-35145
>                 URL: https://issues.apache.org/jira/browse/FLINK-35145
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.20.0
>            Reporter: Zhanghao Chen
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Currently, cluster termination may be blocked forever because there is no 
> timeout for it. For example, for an Application cluster with ZK HA enabled, 
> when the ZK cluster is down, the cluster will reach termination status, but 
> the termination process will block while trying to clean up HA data on ZK, 
> because the ZK client retries connecting to ZK forever. A similar phenomenon 
> can be observed when an HDFS outage occurs.
> I propose adding a timeout for the cluster termination process in the 
> ClusterEntryPoint#shutDownAsync method.
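
For illustration only, the proposal could be sketched with plain Java futures 
roughly as follows. This is not Flink code: the class name, the Outcome enum 
and terminateWithTimeout are hypothetical names made up for this sketch, and 
the never-completing future merely stands in for an HA cleanup that keeps 
retrying against an unreachable ZK. It assumes Java 9 or later for 
CompletableFuture#orTimeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TerminationTimeoutSketch {

    /** Hypothetical result type for this sketch; not part of Flink. */
    enum Outcome { CLEAN, TIMED_OUT }

    /**
     * Bounds an asynchronous cleanup with a timeout so that termination
     * cannot block forever. A timeout is mapped to TIMED_OUT; any other
     * failure of the cleanup is propagated unchanged.
     */
    static CompletableFuture<Outcome> terminateWithTimeout(
            CompletableFuture<Void> cleanup, long timeout, TimeUnit unit) {
        return cleanup
                .thenApply(ignored -> Outcome.CLEAN)
                .orTimeout(timeout, unit)
                .exceptionally(t -> {
                    // Failures propagated from an earlier stage arrive wrapped.
                    Throwable cause =
                            (t instanceof CompletionException && t.getCause() != null)
                                    ? t.getCause()
                                    : t;
                    if (cause instanceof TimeoutException) {
                        return Outcome.TIMED_OUT;
                    }
                    throw new CompletionException(cause);
                });
    }

    public static void main(String[] args) {
        // Stand-in for a cleanup that never finishes (e.g. ZK unreachable).
        CompletableFuture<Void> stuck = new CompletableFuture<>();
        System.out.println(
                terminateWithTimeout(stuck, 200, TimeUnit.MILLISECONDS).join());

        // Stand-in for a cleanup that finishes immediately.
        CompletableFuture<Void> done = CompletableFuture.completedFuture(null);
        System.out.println(
                terminateWithTimeout(done, 200, TimeUnit.MILLISECONDS).join());
    }
}
```

Mapping the timeout to a distinct outcome, rather than treating it like any 
other failure, would let the caller decide whether to exit anyway and leave 
the HA data behind. Whether that is the right behaviour for Flink is a design 
question for this ticket, not something the sketch answers.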



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
