Hi Yangze and all,

I have tried numerous times, and this behavior persists.
Below is the tail log of taskmanager.log:

2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.503gb (1613968148 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.403gb (1505922928 bytes), networkMemory=359.040mb (376480732 bytes)}, allocationId: d3acaeac3db62454742e800b5410adfd, jobId: d0a674795be98bd2574d9ea3286801cb).
2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job d0a674795be98bd2574d9ea3286801cb from job leader monitoring.
2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job d0a674795be98bd2574d9ea3286801cb because it is not registered.
2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-03-13 12:06:19.745 [PermanentBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2020-03-13 12:06:19.749 [FileChannelManagerImpl-netty-shuffle shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-netty-shuffle-65cd4ebb-51f4-48a9-8e3c-43e431bca46d
2020-03-13 12:06:19.750 [TaskExecutorLocalStateStoresManager shutdown hook] INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2020-03-13 12:06:19.750 [TransientBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2020-03-13 12:06:19.751 [FileChannelManagerImpl-io shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-io-67ad5c3a-aec6-42be-ab1f-0ce3841fc4bd
2020-03-13 12:06:19.752 [FileCache shutdown hook] INFO org.apache.flink.runtime.filecache.FileCache - removed file cache directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-dist-cache-65075ee3-e009-4978-a9d8-ec010e6f4b31

As the tail log of jobmanager.log is kind of lengthy, I have attached it to this mail.

From what I have seen, the TaskManager and the JobManager shut down by themselves; however, I have noticed some Netty exceptions (judging from the stack traces, they come from the REST handler), such as:

ERROR org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated

Thus I suspect that these exceptions might be the actual cause of the premature termination of the REST server, and I am still looking into the root cause. (Two minimal standalone sketches illustrating this Netty error and the client-side timeout are pasted below the quoted thread.)

Best,
Weike

On Fri, Mar 13, 2020 at 1:45 PM Yangze Guo <karma...@gmail.com> wrote:

> Would you mind sharing more information about why the task executor
> is killed? If it is killed by Yarn, you might find such info in the Yarn
> NM/RM logs.
>
> Best,
> Yangze Guo
>
> On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
> >
> > Hi,
> >
> > Recently I have encountered a strange behavior of Flink on YARN: when I try to cancel a Flink job running in per-job mode on YARN with a command like
> >
> > "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae"
> >
> > the client happily found and connected to the ResourceManager and then got stuck at
> > Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.
> >
> > And after one minute, an exception is thrown on the client side:
> > Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
> >     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
> >     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
> >     at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
> >     at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
> >     at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:422)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> >     ... 20 more
> > Caused by: java.util.concurrent.TimeoutException
> >     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
> >     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
> >     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
> >     ... 27 more
> >
> > Then I discovered that the YARN app had already terminated with FINISHED state and KILLED final status, as shown below.
> >
> > After digging into the log of this finished YARN app, I found that the TaskManager had already received the SIGTERM signal and terminated gracefully:
> > org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> >
> > Also, the log of the JobManager shows that it terminated with exit code 0:
> > Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0
> >
> > However, the JobManager did not return anything to the client before its shutdown, which is different from previous versions (such as Flink 1.9).
> >
> > I wonder if this is a new bug in the flink-clients or flink-yarn module?
> >
> > Thank you : )
> >
> > Sincerely,
> > Weike
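For anyone trying to make sense of that Netty error, here is a minimal, self-contained Java sketch of the general mechanism behind it. It uses plain Netty rather than Flink's shaded copy, and the class name is made up for the demo; it only shows that completing a promise after its event executor has terminated produces exactly this log line, and does not claim to reproduce Flink's actual REST code path.

import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;
import io.netty.util.concurrent.FutureListener;
import io.netty.util.concurrent.Promise;

public class EventLoopShutdownDemo {
    public static void main(String[] args) {
        // Stand-in for the Netty event executor that backs a REST server.
        EventExecutorGroup group = new DefaultEventExecutorGroup(1);

        // A promise bound to that executor, with a listener attached.
        Promise<String> promise = group.next().newPromise();
        promise.addListener((FutureListener<String>) f ->
                System.out.println("listener ran, success = " + f.isSuccess()));

        // Shut the executor down first, analogous to the cluster entrypoint terminating.
        group.shutdownGracefully().syncUninterruptibly();

        // Completing the promise now forces DefaultPromise to notify the listener on a
        // terminated executor. Netty's internal logger then prints
        //   "Failed to submit a listener notification task. Event loop shut down?"
        // with java.util.concurrent.RejectedExecutionException: event executor terminated
        // as the cause, which is the same pair of messages as in the jobmanager.log.
        promise.setSuccess("done");
    }
}

If something similar happens inside Flink, it would be consistent with the theory above that the REST server's event loop was already shut down when a response was about to be sent.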
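Similarly, the client-side java.util.concurrent.TimeoutException in the quoted stack trace is simply what CompletableFuture.get(timeout) throws when nothing ever completes the future it is waiting on. A tiny illustrative sketch, not Flink code; the class name and the 3-second timeout are arbitrary:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the cancellation acknowledgement the CLI waits for from the JobManager.
        CompletableFuture<String> cancelAck = new CompletableFuture<>();

        // If the JobManager shuts down without answering, the future is never completed,
        // so the blocking get(...) ends in the same TimeoutException as in the stack trace.
        try {
            cancelAck.get(3, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("no response before the deadline: " + e);
        }
    }
}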
jobmanager.log