Hi Yangze and all,

I have tried numerous times, and this behavior persists.

Below is the tail log of taskmanager.log:

2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl  - Free slot
TaskSlot(index:0, state:ACTIVE, resource profile:
ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.503gb
(1613968148 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.403gb
(1505922928 bytes), networkMemory=359.040mb (376480732 bytes)},
allocationId: d3acaeac3db62454742e800b5410adfd, jobId:
d0a674795be98bd2574d9ea3286801cb).
2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.taskexecutor.JobLeaderService  - Remove job
d0a674795be98bd2574d9ea3286801cb from job leader monitoring.
2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.taskexecutor.TaskExecutor  - Close JobManager
connection for job d0a674795be98bd2574d9ea3286801cb.
2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.taskexecutor.TaskExecutor  - Close JobManager
connection for job d0a674795be98bd2574d9ea3286801cb.
2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO
 org.apache.flink.runtime.taskexecutor.JobLeaderService  - Cannot reconnect
to job d0a674795be98bd2574d9ea3286801cb because it is not registered.
2020-03-13 12:06:19.744 [SIGTERM handler] INFO
 org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15:
SIGTERM. Shutting down as requested.
2020-03-13 12:06:19.744 [SIGTERM handler] INFO
 org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15:
SIGTERM. Shutting down as requested.
2020-03-13 12:06:19.745 [PermanentBlobCache shutdown hook] INFO
 org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB
cache
2020-03-13 12:06:19.749 [FileChannelManagerImpl-netty-shuffle shutdown
hook] INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl  -
FileChannelManager removed spill file directory
/data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-netty-shuffle-65cd4ebb-51f4-48a9-8e3c-43e431bca46d
2020-03-13 12:06:19.750 [TaskExecutorLocalStateStoresManager shutdown hook]
INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
Shutting down TaskExecutorLocalStateStoresManager.
2020-03-13 12:06:19.750 [TransientBlobCache shutdown hook] INFO
 org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB
cache
2020-03-13 12:06:19.751 [FileChannelManagerImpl-io shutdown hook] INFO
 org.apache.flink.runtime.io.disk.FileChannelManagerImpl  -
FileChannelManager removed spill file directory
/data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-io-67ad5c3a-aec6-42be-ab1f-0ce3841fc4bd
2020-03-13 12:06:19.752 [FileCache shutdown hook] INFO
 org.apache.flink.runtime.filecache.FileCache  - removed file cache
directory
/data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-dist-cache-65075ee3-e009-4978-a9d8-ec010e6f4b31

As the tail log of jobmanager.log is kind of lengthy, I have attached it in
this mail.

From what I have seen, the TaskManager and JobManager shut down by
themselves. However, I have noticed some Netty exceptions (judging from the
stack trace, they come from the REST handler) like:

ERROR
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution
 - Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated

Thus I suppose that these exceptions might be the actual cause of premature
termination of the REST server, and I am still looking into the real cause
of this.
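
To check whether the Netty exception really precedes the shutdown, the
timestamps of the two messages can be compared. Below is a minimal sketch
of how that could be done. The helper name first_occurrences is my own
(it is not part of Flink), and it assumes every log entry starts with a
timestamp in the same format as the excerpts above; continuation lines
such as stack traces inherit the timestamp of the entry before them.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpts above (assumed format).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def first_occurrences(lines, markers):
    """Return {marker: datetime of the first log entry that contains it}."""
    found = {}
    current = None  # timestamp of the most recent entry seen so far
    for line in lines:
        m = TS.match(line)
        if m:
            current = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if current is None:
            continue  # text before the first timestamped entry is ignored
        for marker in markers:
            if marker in line and marker not in found:
                found[marker] = current
    return found
```

Passing the lines of jobmanager.log together with the markers
"RECEIVED SIGNAL 15" and "event executor terminated" shows which of the two
was logged first. That only establishes the ordering, not the cause.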

Best,
Weike

On Fri, Mar 13, 2020 at 1:45 PM Yangze Guo <karma...@gmail.com> wrote:

> Would you mind sharing more information about why the task executor
> was killed? If it was killed by Yarn, you might find such info in the
> Yarn NM/RM logs.
>
> Best,
> Yangze Guo
>
>
> On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk>
> wrote:
> >
> > Hi,
> >
> > Recently I have encountered a strange behavior of Flink on YARN, which
> is that when I try to cancel a Flink job running in per-job mode on YARN
> using commands like
> >
> > "cancel -m yarn-cluster -yid application_1559388106022_9412
> ed7e2e0ab0a7316c1b65df6047bc6aae"
> >
> > the client happily found and connected to the ResourceManager and then
> gets stuck at
> > Found Web Interface 172.28.28.3:50099 of application
> 'application_1559388106022_9412'.
> >
> > And after one minute, an exception is thrown at the client side:
> > Caused by: org.apache.flink.util.FlinkException: Could not cancel job
> ed7e2e0ab0a7316c1b65df6047bc6aae.
> >     at
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
> >     at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
> >     at
> org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
> >     at
> org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
> >     at
> org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:422)
> >     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> >     ... 20 more
> > Caused by: java.util.concurrent.TimeoutException
> >     at
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
> >     at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
> >     at
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
> >     ... 27 more
> >
> > Then I discovered that the YARN app has already terminated with FINISHED
> state and KILLED final status, like below.
> >
> > And after digging into the log of this finished YARN app, I have found
> that TaskManager had already received the SIGTERM signal and terminated
> gracefully.
> > org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15:
> SIGTERM. Shutting down as requested.
> >
> > Also, the log of JobManager shows that it terminated with exit code 0.
> > Terminating cluster entrypoint process YarnJobClusterEntrypoint with
> exit code 0
> >
> > However, the JobManager did not return anything to the client before its
> shutdown, which is different from previous versions (like Flink 1.9).
> >
> > I wonder if this is a new bug in the flink-clients or flink-yarn module?
> >
> > Thank you : )
> >
> > Sincerely,
> > Weike
>

Attachment: jobmanager.log
Description: Binary data
