[ 
https://issues.apache.org/jira/browse/FLINK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939247#comment-17939247
 ] 

Alan Zhang commented on FLINK-37567:
------------------------------------

Thanks for your suggestion [~gyfora]! The limitation around supporting the 
autoscaler with standalone mode is critical for helping people make trade-offs 
between native and standalone mode, so we should highlight it in the docs. I 
may have missed it, but I didn't see this limitation called out in the Flink 
K8s operator docs.

>Generally requires setting number of TMs all the time by the user (as you 
>cannot determine it from config)
This is not true if you are referring to using the job parallelism and task 
slot count to infer the TM count. I'm pretty sure standalone mode has this feature.
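To make the inference concrete, here is a minimal sketch (my own illustration, not the operator's actual code) of how the TM replica count can be derived from the job parallelism and the configured slots per TaskManager, i.e. just enough TMs so that the total slot count covers the parallelism:

```java
// Hypothetical helper illustrating the inference:
// replicas = ceil(jobParallelism / slotsPerTaskManager),
// where slotsPerTaskManager corresponds to taskmanager.numberOfTaskSlots.
public class TmReplicaInference {

    static int inferTaskManagerReplicas(int jobParallelism, int slotsPerTaskManager) {
        if (jobParallelism <= 0 || slotsPerTaskManager <= 0) {
            throw new IllegalArgumentException("parallelism and slots must be positive");
        }
        // Ceiling division without floating point: smallest replica count
        // whose total slots (replicas * slotsPerTaskManager) >= jobParallelism.
        return (jobParallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // Parallelism 10 with 4 slots per TM needs 3 TMs (12 slots total).
        System.out.println(inferTaskManagerReplicas(10, 4));
    }
}
```

With this kind of inference in place, the user only needs to set the job parallelism and slot count, not the TM count itself.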

>Only the JM should stay there and this is normal
I just wanted to share that in some edge cases, keeping the JM around can 
cause issues. We observed one recently: a Flink job of ours had logic that 
sends authentication requests, and this logic was triggered in the JM. After we 
"suspend"ed the FlinkDeployment, the job was canceled successfully but the Flink 
cluster (JM and TM) was still running, and we observed that the JM kept 
sending the requests. We had to shut down the JM manually to fix the issue. For 
this kind of scenario, it would be safer to delete the running JM as well.

What is the intention behind keeping the JM running during the "suspend" process?


--------------------------------------------------------------------------------------------------
I can share a little bit of context on why we decided to start with standalone 
mode, and one major value I see in why standalone mode exists:

In my company, we have our own customizations on top of K8s; for example, we 
have our own K8s Deployment and StatefulSet implementations (e.g. XxDeployment, 
XxStatefulSet). To integrate with our tech stack, the easiest (or the only) way 
is to use standalone mode, so that we can customize the Flink K8s operator to 
use "XxDeployment" instead of the K8s primitive "Deployment".

Another consideration for us is that we have our own Flink K8s operator 
implementation which uses standalone mode, and we planned to replace it with 
the official Flink K8s operator. From a migration perspective, keeping feature 
parity (e.g. UX) makes the migration easier.

> Flink clusters are not cleaned up when using job cancellation as the suspend 
> mechanism
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37567
>                 URL: https://issues.apache.org/jira/browse/FLINK-37567
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.10.0, kubernetes-operator-1.11.0
>            Reporter: Alan Zhang
>            Priority: Major
>
> In general, for application mode, the Flink cluster lifecycle should be tied 
> to the Flink job lifecycle, which means we should delete the Flink cluster 
> when the job stops.
> However, I noticed that Flink clusters are not deleted when I try to 
> suspend a FlinkDeployment with "job-cancel" enabled. The CR shows the job in 
> the "CANCELED" state, but the underlying Flink cluster is still running.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
