Hi Weike, could you share the complete logs with us? Attachments are being filtered out by the Apache mail server, but it works if you upload the logs somewhere (e.g. https://gist.github.com/) and then share the link with us. Ideally, run the cluster with DEBUG log settings.
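Raising the root logger level should be enough for that. A minimal sketch, assuming the default Log4j 1.x configuration that ships with Flink 1.10 (the file lives in the conf/ directory of the Flink distribution):

# conf/log4j.properties: change the root level from INFO to DEBUG
log4j.rootLogger=DEBUG, file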
I assume that you are running Flink 1.10, right? My suspicion is that this behaviour was introduced with FLINK-15116 [1]. It looks as if we complete the shutdown future in MiniDispatcher#cancelJob before we return the response to the RestClusterClient. My guess is that this triggers the shutdown of the RestServer, which is then no longer able to serve the response to the client.

I'm pulling in Aljoscha and Tison, who introduced this change. They might be able to verify my theory and propose a solution for it.
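To make the suspected ordering a bit more concrete, here is a simplified, hypothetical sketch (illustrative names only, not the actual MiniDispatcher/RestServer code):

import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of the suspected race, not Flink's actual code.
class MiniDispatcherSketch {

    private final CompletableFuture<Void> shutDownFuture = new CompletableFuture<>();

    // During startup the REST server is wired to go down together with the dispatcher.
    void wireShutdown(RestServerSketch restServer) {
        shutDownFuture.thenRun(restServer::close);
    }

    CompletableFuture<String> cancelJob() {
        // Problematic ordering: the shutdown future is completed *before* the
        // acknowledgement travels back through the REST handler ...
        shutDownFuture.complete(null);
        // ... so by the time this response should be written out, the Netty event
        // loop of the REST server may already be terminated, which would explain
        // both the RejectedExecutionException in the JobManager logs and the
        // TimeoutException on the client side.
        return CompletableFuture.completedFuture("cancel acknowledged");
    }

    static class RestServerSketch {
        void close() {
            // shut down the Netty event loops; pending responses are dropped
        }
    }
}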
[1] https://issues.apache.org/jira/browse/FLINK-15116

Cheers,
Till

On Fri, Mar 13, 2020 at 7:50 AM DONG, Weike <kyled...@connect.hku.hk> wrote:

> Hi Yangze and all,
>
> I have tried numerous times, and this behavior persists.
>
> Below is the tail log of taskmanager.log:
>
> 2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.503gb (1613968148 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.403gb (1505922928 bytes), networkMemory=359.040mb (376480732 bytes)}, allocationId: d3acaeac3db62454742e800b5410adfd, jobId: d0a674795be98bd2574d9ea3286801cb).
> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job d0a674795be98bd2574d9ea3286801cb from job leader monitoring.
> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job d0a674795be98bd2574d9ea3286801cb because it is not registered.
> 2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2020-03-13 12:06:19.745 [PermanentBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
> 2020-03-13 12:06:19.749 [FileChannelManagerImpl-netty-shuffle shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-netty-shuffle-65cd4ebb-51f4-48a9-8e3c-43e431bca46d
> 2020-03-13 12:06:19.750 [TaskExecutorLocalStateStoresManager shutdown hook] INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
> 2020-03-13 12:06:19.750 [TransientBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
> 2020-03-13 12:06:19.751 [FileChannelManagerImpl-io shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-io-67ad5c3a-aec6-42be-ab1f-0ce3841fc4bd
> 2020-03-13 12:06:19.752 [FileCache shutdown hook] INFO org.apache.flink.runtime.filecache.FileCache - removed file cache directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-dist-cache-65075ee3-e009-4978-a9d8-ec010e6f4b31
>
> As the tail log of jobmanager.log is kind of lengthy, I have attached it to this mail.
>
> From what I have seen, the TaskManager and JobManager shut down by themselves; however, I have noticed some Netty exceptions (from the stack trace, they come from the REST handler) like:
>
> ERROR org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
> java.util.concurrent.RejectedExecutionException: event executor terminated
>
> Thus I suppose that these exceptions might be the actual cause of the premature termination of the REST server, and I am still looking into the real cause of this.
>
> Best,
> Weike
>
> On Fri, Mar 13, 2020 at 1:45 PM Yangze Guo <karma...@gmail.com> wrote:
>
>> Would you mind sharing more information about why the task executor is killed? If it is killed by Yarn, you might find such info in the Yarn NM/RM logs.
>>
>> Best,
>> Yangze Guo
>>
>> On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>> >
>> > Hi,
>> >
>> > Recently I have encountered a strange behavior of Flink on YARN: when I try to cancel a Flink job running in per-job mode on YARN using a command like
>> >
>> > "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae"
>> >
>> > the client happily found and connected to the ResourceManager and then got stuck at
>> >
>> > Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.
>> >
>> > And after one minute, an exception is thrown at the client side:
>> >
>> > Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
>> >     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>> >     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>> >     at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>> >     at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
>> >     at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
>> >     at java.security.AccessController.doPrivileged(Native Method)
>> >     at javax.security.auth.Subject.doAs(Subject.java:422)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>> >     ... 20 more
>> > Caused by: java.util.concurrent.TimeoutException
>> >     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>> >     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>> >     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>> >     ... 27 more
>> >
>> > Then I discovered that the YARN app had already terminated with a FINISHED state and a KILLED final status, like below.
>> >
>> > And after digging into the logs of this finished YARN app, I found that the TaskManager had already received the SIGTERM signal and terminated gracefully:
>> >
>> > org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>> >
>> > Also, the log of the JobManager shows that it terminated with exit code 0:
>> >
>> > Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0
>> >
>> > However, the JobManager did not return anything to the client before its shutdown, which is different from previous versions (like Flink 1.9).
>> >
>> > I wonder if this is a new bug in the flink-clients or flink-yarn module?
>> >
>> > Thank you : )
>> >
>> > Sincerely,
>> > Weike
>>
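The client-side stack trace quoted above fits this picture: the CLI blocks on a future that is never completed, because the acknowledgement is lost when the REST server shuts down, and after the timeout it gives up. A minimal, hypothetical sketch of that waiting pattern (illustrative names, not the actual CliFrontend/RestClusterClient code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Future for the cancel acknowledgement; in the failure case nothing
        // ever completes it, because the response was never delivered.
        CompletableFuture<String> cancelAck = new CompletableFuture<>();
        try {
            cancelAck.get(60, TimeUnit.SECONDS); // roughly the one-minute wait seen above
        } catch (TimeoutException e) {
            // surfaces in the CLI as "Could not cancel job ..." caused by a TimeoutException
            System.out.println("Cancel request timed out: " + e);
        }
    }
}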