Hi,

Recently I have encountered strange behavior with Flink on YARN: when I try
to cancel a Flink job running in per-job mode on YARN using a command like

"cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae"

the client finds and connects to the ResourceManager and then gets stuck at
Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.

After one minute, an exception is thrown on the client side:
Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
    at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
    at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    ... 20 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
    ... 27 more

Then I discovered that the YARN app had already terminated with FINISHED
state and KILLED final status, as shown below.
[image: image.png]
After digging into the log of this finished YARN app, I found that the
TaskManager had already received the SIGTERM signal and terminated
gracefully.
org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

Also, the JobManager log shows that it terminated with exit code 0.
Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0

However, the JobManager did not return anything to the client before its
shutdown, which is different from previous versions (like Flink 1.9).
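
To illustrate the symptom, here is a minimal sketch of what I assume is
happening on the client side. It is not Flink's actual code:
CancelTimeoutSketch and waitForAck are hypothetical names, and the only real
API used is CompletableFuture.get with a timeout, which is the call that
appears in the stack trace above. If the peer exits without ever completing
the future, the waiting side ends with a TimeoutException even though the
cancel itself took effect.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the client side of a cancel call; not Flink code.
class CancelTimeoutSketch {

    // Waits for an acknowledgement and reports how the wait ended.
    static String waitForAck(CompletableFuture<String> ack, long timeoutMillis) {
        try {
            return ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Nobody completed the future, e.g. the peer shut down first.
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Case 1: the peer acknowledges before exiting.
        System.out.println(waitForAck(CompletableFuture.completedFuture("ACK"), 100));
        // Case 2: the peer exits without ever completing the future.
        System.out.println(waitForAck(new CompletableFuture<>(), 100));
    }
}
```

Running main prints "ACK" for the first case and "timed out" for the second,
which matches what I see: the job is gone, but the client never gets a reply.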

I wonder whether this is a new bug in the flink-clients or flink-yarn module?

Thank you : )

Sincerely,
Weike
