Thank you, Yang. We have found the root cause. In its stop logic, the Flink Kubernetes operator first calls Flink's REST API to cancel the job and then calls the Kubernetes API to delete the Flink JobManager deployment. However, it took more than one minute for Kubernetes to actually delete that deployment. So after the JobManager's main container had been shut down cleanly via the REST call, the pod still existed, and the pod's restart policy restarted the container. That is why we observed `Jobmanager restart after it has been requested to stop`.
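For anyone curious about the sequence involved, below is a minimal, hypothetical Java sketch of the two calls described above: cancelling the job through Flink's REST API and then deleting the JobManager Deployment with the fabric8 Kubernetes client. This is not the operator's actual code; the namespace, deployment name, and REST address are made-up placeholders. The point is only that the Deployment delete returns quickly while the Pod can linger, and during that window the kubelet's restartPolicy can restart the exited JobManager container.

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StopFlinkJobSketch {
    public static void main(String[] args) throws Exception {
        String namespace = "flink";               // assumed namespace
        String deploymentName = "my-flink-job";   // assumed JM deployment name
        String jobId = args[0];                   // Flink job id to cancel

        // 1) Ask the JobManager to cancel the job via Flink's REST API
        //    (PATCH /jobs/:jobid?mode=cancel). The REST address below is a placeholder.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest cancel = HttpRequest.newBuilder()
                .uri(URI.create("http://" + deploymentName + "-rest." + namespace
                        + ":8081/jobs/" + jobId + "?mode=cancel"))
                .method("PATCH", HttpRequest.BodyPublishers.noBody())
                .build();
        http.send(cancel, HttpResponse.BodyHandlers.ofString());

        // 2) Delete the JobManager Deployment. The API call returns quickly, but the
        //    Pod may take a while to actually terminate; until it is gone, the kubelet's
        //    restartPolicy can restart the JM container that just exited with code 0.
        try (KubernetesClient k8s = new KubernetesClientBuilder().build()) {
            k8s.apps().deployments()
               .inNamespace(namespace)
               .withName(deploymentName)
               .delete();
        }
    }
}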
________________________________
From: Yang Wang <wangyang0...@apache.org>
Sent: February 2, 2024 17:56
To: Liting Liu (litiliu) <liti...@cisco.com>
Cc: user <user@flink.apache.org>
Subject: Re: Jobmanager restart after it has been requested to stop

If you could find "Deregistering Flink Kubernetes cluster, clusterId" in the JobManager log, then this is not the expected behavior. Having the full logs of the JobManager Pod before it restarted would help a lot.

Best,
Yang

On Fri, Feb 2, 2024 at 1:26 PM Liting Liu (litiliu) via user <user@flink.apache.org> wrote:

Hi, community:
  I'm running a Flink 1.14.3 job with flink-kubernetes-operator-1.6.0 on AWS. I found that my Flink JobManager container's main process restarted after the FlinkDeployment had been requested to stop. Here is the JobManager log:

2024-02-01 21:57:48,977 tn="flink-akka.actor.default-dispatcher-107478" INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:353) ~[flink-dist_2.11-1.14.3.jar:1.14.3]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) ~[?:1.8.0_322]
2024-02-01 21:57:48,984 tn="flink-akka.actor.default-dispatcher-107484" INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2024-02-01 21:57:49,103 tn="flink-akka.actor.default-dispatcher-107478" INFO org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent [] - Closing components.
2024-02-01 21:57:49,105 tn="flink-akka.actor.default-dispatcher-107484" INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped dispatcher akka.tcp://flink@
2024-02-01 21:57:49,112 tn="AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping Akka RPC service.
2024-02-01 21:57:49,286 tn="flink-metrics-15" INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
2024-02-01 21:57:49,387 tn="main" INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 0.
2024-02-01 21:57:53,828 tn="main" INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
2024-02-01 21:57:54,287 tn="main" INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Starting KubernetesApplicationClusterEntrypoint.

I found that the JM main container's containerId remained the same after the JM auto-restarted. Why did this process start to run again after it had been requested to stop?