Hi,

Recently I encountered some strange behavior of Flink on YARN. When I try to cancel a Flink job running in per-job mode on YARN with a command like

    cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae

the client finds and connects to the ResourceManager without problems, and then gets stuck at

    Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.

After one minute, an exception is thrown on the client side:

    Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
        at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
        at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
        at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
        ... 20 more
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
        ... 27 more

I then discovered that the YARN application had already terminated with state FINISHED and final status KILLED, as shown below.

[image: image.png]

After digging into the logs of this finished YARN application, I found that the TaskManager had already received the SIGTERM signal and terminated gracefully:

    org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

Also, the JobManager log shows that it terminated with exit code 0:
    Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0

However, the JobManager did not return anything to the client before shutting down, which is different from previous versions (like Flink 1.9).

Is this a new bug in the flink-clients or flink-yarn module?

Thank you : )

Sincerely,
Weike