[ 
https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267821#comment-17267821
 ] 

Till Rohrmann commented on FLINK-21008:
---------------------------------------

I see. An alternative solution would then be to signal the external system to 
shut down only after the whole Flink cleanup has finished. The problem here is 
that the communication logic with the external client is encapsulated in the 
{{ResourceManager}}, which is already shut down at that point.
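
To make the proposed ordering concrete, here is a minimal sketch. It is not 
Flink code: the class {{ShutdownOrderSketch}}, the helper {{step}} and the step 
name {{deregisterFromExternalSystem}} are hypothetical, and each step only 
records its name instead of doing real work. It also assumes the 
deregistration call can be made without the {{ResourceManager}}, which is 
exactly the open problem described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the proposed ordering; none of this is Flink API.
class ShutdownOrderSketch {

    // Records the order in which the shutdown steps actually ran.
    private final List<String> steps =
            Collections.synchronizedList(new ArrayList<>());

    // Stand-in for one asynchronous shutdown step; it only logs its name.
    private CompletableFuture<Void> step(String name) {
        return CompletableFuture.runAsync(() -> steps.add(name));
    }

    // Chains the steps so that the external system is signalled last,
    // i.e. only after all Flink-side cleanup has completed.
    public CompletableFuture<List<String>> shutDownAsync() {
        return step("closeClusterComponent")
                .thenCompose(ignored -> step("stopClusterServices"))
                .thenCompose(ignored -> step("cleanupDirectories"))
                // Assumption: deregistration is reachable outside the ResourceManager.
                .thenCompose(ignored -> step("deregisterFromExternalSystem"))
                .thenApply(ignored -> new ArrayList<>(steps));
    }

    public static void main(String[] args) {
        // Blocks until the whole chain is done, then prints the recorded order.
        System.out.println(new ShutdownOrderSketch().shutDownAsync().join());
    }
}
```

Because every step is composed onto the previous future, the external system 
cannot be asked to terminate the process while the cleanup is still running.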

> ClusterEntrypoint#shutDownAsync may not be fully executed
> ---------------------------------------------------------
>
>                 Key: FLINK-21008
>                 URL: https://issues.apache.org/jira/browse/FLINK-21008
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Yang Wang
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> Recently, in our internal use case for the native K8s integration with K8s HA 
> enabled, we found that the leader-related ConfigMaps can be left behind in 
> some corner cases.
> After some investigation, I think it is possibly caused by an inappropriate 
> shutdown process.
> In {{ClusterEntrypoint#shutDownAsync}}, we first call 
> {{closeClusterComponent}}, which also deregisters the Flink 
> application from the cluster management system (e.g. Yarn, K8s). Then we call 
> {{stopClusterServices}} and {{cleanupDirectories}}. If the cluster 
> management system handles the deregistration very quickly, the JobManager 
> process may receive SIGTERM (signal 15) before or while it is executing 
> {{stopClusterServices}} and {{cleanupDirectories}}. The JVM process then 
> exits immediately, so the two methods may not be fully executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)