Would you mind sharing more information about why the task executor
was killed? If it was killed by YARN, you may find the reason in the
YARN NM/RM logs.
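
As a rough aid for that search, here is a minimal Python sketch that scans
log lines for the two shutdown messages quoted further down in this thread.
The function name and the pattern list are my own illustration, not part of
Flink or YARN, and the patterns cover only those two messages, so they would
likely need extending for NM/RM-specific kill reasons.

```python
import re
import sys

# Patterns derived only from the log lines quoted in this thread (assumption:
# the real logs contain them on a single line, unlike the wrapped email text).
SHUTDOWN_PATTERNS = [
    re.compile(r"RECEIVED SIGNAL \d+: \w+"),
    re.compile(r"Terminating cluster entrypoint process \w+ with exit code -?\d+"),
]


def find_shutdown_lines(lines):
    """Return (line_number, line) pairs matching any shutdown pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SHUTDOWN_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log file path is supplied by the caller on the command line.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for number, line in find_shutdown_lines(log_file):
            print(f"{number}: {line}")
```

If log aggregation is enabled on the cluster, the container logs can usually
be dumped to a file first with
"yarn logs -applicationId application_1559388106022_9412" and then passed to
the script.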

Best,
Yangze Guo



On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
> Hi,
>
> Recently I encountered some strange behavior with Flink on YARN: when I
> try to cancel a Flink job running in per-job mode on YARN using a command
> like
>
> "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae"
>
> the client finds and connects to the ResourceManager and then gets stuck at
> Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.
>
> And after one minute, an exception is thrown at the client side:
> Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>     at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>     at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
>     at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>     ... 20 more
> Caused by: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>     ... 27 more
>
> Then I discovered that the YARN app has already terminated with FINISHED 
> state and KILLED final status, like below.
>
> After digging into the logs of this finished YARN app, I found that the
> TaskManager had already received a SIGTERM signal and terminated gracefully:
> org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>
> Also, the JobManager log shows that it terminated with exit code 0:
> Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0
>
> However, the JobManager did not return anything to the client before its 
> shutdown, which is different from previous versions (like Flink 1.9).
>
> I wonder if this is a new bug in the flink-clients or flink-yarn module?
>
> Thank you : )
>
> Sincerely,
> Weike
