Would you mind sharing more information about why the task executor was killed? If it was killed by YARN, you might find the reason in the YARN NM/RM logs.
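For instance, a rough sketch of how that information might be pulled from the client machine, assuming the `yarn` CLI is on the PATH and log aggregation is enabled on the cluster (both are assumptions about your setup). The application ID is the one from the cancel command quoted below; the grep pattern is only a guess at useful keywords.

```shell
# Application ID taken from the cancel command quoted below.
APP_ID="application_1559388106022_9412"

if command -v yarn >/dev/null 2>&1; then
  # Final state and diagnostics as recorded by the ResourceManager.
  yarn application -status "$APP_ID"

  # Aggregated container logs (JobManager and TaskManagers); only available
  # if log aggregation is enabled. The keywords are an assumed starting point.
  yarn logs -applicationId "$APP_ID" | grep -n -i -E "SIGTERM|kill"
else
  echo "yarn CLI not found on PATH" >&2
fi
```

The NM/RM daemon logs themselves live on the cluster hosts, and their location depends on how Hadoop is configured there.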
Best,
Yangze Guo

On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
> Hi,
>
> Recently I have encountered a strange behavior of Flink on YARN: when I
> try to cancel a Flink job running in per-job mode on YARN using a command
> like
>
> "cancel -m yarn-cluster -yid application_1559388106022_9412
> ed7e2e0ab0a7316c1b65df6047bc6aae"
>
> the client happily finds and connects to the ResourceManager and then gets
> stuck at
> Found Web Interface 172.28.28.3:50099 of application
> 'application_1559388106022_9412'.
>
> And after one minute, an exception is thrown at the client side:
> Caused by: org.apache.flink.util.FlinkException: Could not cancel job
> ed7e2e0ab0a7316c1b65df6047bc6aae.
>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>     at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>     at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
>     at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>     ... 20 more
> Caused by: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>     ... 27 more
>
> Then I discovered that the YARN app had already terminated with FINISHED
> state and KILLED final status, like below.
>
> And after digging into the log of this finished YARN app, I found that
> the TaskManager had already received the SIGTERM signal and terminated
> gracefully.
> org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM.
> Shutting down as requested.
>
> Also, the log of the JobManager shows that it terminated with exit code 0.
> Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit
> code 0
>
> However, the JobManager did not return anything to the client before its
> shutdown, which is different from previous versions (like Flink 1.9).
>
> I wonder if this is a new bug in the flink-clients or flink-yarn module?
>
> Thank you : )
>
> Sincerely,
> Weike