I am seeing a weird issue where I do a cancel with savepoint on a job on k8s: it hangs for 5 minutes (no INFO logs) and then exits with code 2.
2019-04-23 17:36:31,372 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-04-23 17:36:31,374 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-04-23 17:36:31,377 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Suspending SlotPool.
2019-04-23 17:36:31,378 DEBUG org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 181a4fd61044033a2ea32e384096247f.
org.apache.flink.util.FlinkException: JobManager is shutting down.
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:365)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:504)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:170)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-04-23 17:36:31,381 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Stopping SlotPool.
2019-04-23 17:36:31,381 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a41a5dceae5ad3664ff1f0b79f3e47ef @akka.tcp://flink@kafka-to-prometheus:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
2019-04-23 17:36:31,385 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

and after 5 minutes:

2019-04-23 17:41:32,187 DEBUG org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 8 thread-local buffer(s) from thread: Finalizer
2019-04-23 17:41:32,198 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-04-23 17:41:32,200 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 2.
java.util.concurrent.TimeoutException
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:942)
    at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:360)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

In the interim, I get this at a regular clip:

2019-04-23 17:37:02,452 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 3752235c49428b94a0520f04266973eb because it exceeded the idle timeout.
2019-04-23 17:37:02,453 DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 68c5fbd67ac2bbe6fc35ed068ce1c4b1 could not be stopped.

Any ideas as to whether this inability to shut down the Worker is what causes the issue?

Regards.