Hi Wenrui,

I think you could check the node manager log, which records the lifecycle of 
this task executor. The task executor may have been killed by the node manager 
for exceeding its memory limit.
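
As a rough illustration of that check, here is a minimal sketch that filters a 
node manager log for lines mentioning one container together with a kill hint. 
The function name, the hint phrases, and the idea of matching on the container 
ID are all assumptions for illustration; the exact wording of the YARN messages 
depends on the Hadoop version, so the hints may need adjusting.

```python
# Phrases assumed to appear when YARN kills a container for memory overuse.
# The exact wording depends on the Hadoop version, so adjust as needed.
DEFAULT_HINTS = (
    "beyond physical memory limits",
    "beyond virtual memory limits",
    "Killing container",
)


def find_container_kill_lines(log_lines, container_id, hints=DEFAULT_HINTS):
    """Return log lines that mention container_id together with a kill hint."""
    matches = []
    for line in log_lines:
        if container_id in line and any(hint in line for hint in hints):
            matches.append(line.rstrip("\n"))
    return matches
```

With the node manager log opened as a file object `f`, calling 
`find_container_kill_lines(f, container_id)` with the container ID of the lost 
task executor returns the candidate lines. An empty result does not rule out a 
kill; it only means none of the assumed phrases matched.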

Best,
Zhijiang
------------------------------------------------------------------
From:Wenrui Meng <wenruim...@gmail.com>
Send Time: April 20, 2019 (Saturday) 09:48
To:zhijiang <wangzhijiang...@aliyun.com>
Cc:Biao Liu <mmyy1...@gmail.com>; user <user@flink.apache.org>; tzulitai 
<tzuli...@apache.org>
Subject:Re: Netty channel closed at AKKA gated status

I've attached the last 10000 lines of the lost task manager's log. Can anyone 
help take a look? 

Thanks,
Wenrui
On Fri, Apr 19, 2019 at 6:32 PM Wenrui Meng <wenruim...@gmail.com> wrote:
I looked at a few similar instances. The lost task manager was indeed no longer 
active, since nothing was logged for it after the timestamp of the connection 
issue. I guess the task manager somehow died silently, without logging any 
exception or termination information. I double-checked the lost task manager's 
host: GC, CPU, memory, network, and disk I/O all look good, without any spike. 
Is there any other way the task manager could have been terminated? We run our 
jobs in a YARN cluster. 
On Mon, Apr 15, 2019 at 10:47 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
Hi Wenrui,

If you have confirmed that the target task executor is still alive, you might 
further check whether there is a network connection issue between the job 
master and that task executor.

Best,
Zhijiang
------------------------------------------------------------------
From:Biao Liu <mmyy1...@gmail.com>
Send Time: April 16, 2019 (Tuesday) 10:14
To:Wenrui Meng <wenruim...@gmail.com>
Cc:zhijiang <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>; 
tzulitai <tzuli...@apache.org>
Subject:Re: Netty channel closed at AKKA gated status

Hi Wenrui,
If a task manager is killed (kill -9), it has no chance to log anything. 
If the task manager exits because of a connection timeout, there would be 
something in the log file. So it was probably killed by another user or by the 
operating system. Please check the operating system log. BTW, I don't think 
"DEBUG log level" would help.
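
One case the operating system log can reveal is the Linux OOM killer. The 
sketch below scans kernel log lines for "Killed process" entries and extracts 
the PID. The function name and the default process name `java` are assumptions 
for illustration, and the surrounding message text varies between kernel 
versions, so only the core `Killed process <pid> (<name>)` part is matched. A 
manual kill -9 by another user would not show up here.

```python
import re

# Assumed message shape, e.g. "... Killed process 1234 (java) ...".
# Kernel versions differ in the surrounding text, so only this core is matched.
KILLED_PROCESS = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_lines, process_name="java"):
    """Return (pid, line) pairs for OOM-killer entries naming process_name."""
    kills = []
    for line in kernel_log_lines:
        match = KILLED_PROCESS.search(line)
        if match and match.group(2) == process_name:
            kills.append((int(match.group(1)), line.rstrip("\n")))
    return kills
```

Comparing the returned PIDs and their log timestamps with the time the task 
manager went silent would show whether the OOM killer is a plausible cause.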
On Tue, Apr 16, 2019 at 9:16 AM Wenrui Meng <wenruim...@gmail.com> wrote:
There is no exception or warning in the log of task manager 
`'athena592-phx2/10.80.118.166:44177'`. In addition, the cluster monitoring 
dashboard shows that the host was not shut down either. We probably need to 
turn on DEBUG logging to get more useful information. If the task manager gets 
killed, I assume there will be a termination message in the task manager log. 
If not, I don't know how to tell whether the task manager was killed or there 
was just a connection timeout.



On Sun, Apr 14, 2019 at 7:22 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
Hi Wenrui,

I think the akka gated issue and the inactive netty channel are both caused by 
a task manager exiting or being killed. You should double-check the status of 
this task manager `'athena592-phx2/10.80.118.166:44177'` and the reason it 
went down.

Best,
Zhijiang
------------------------------------------------------------------
From:Wenrui Meng <wenruim...@gmail.com>
Send Time: April 13, 2019 (Saturday) 01:01
To:user <user@flink.apache.org>
Cc:tzulitai <tzuli...@apache.org>
Subject:Netty channel closed at AKKA gated status

We encountered the netty channel inactive issue while AKKA had gated that task 
manager. I'm wondering whether the channel was closed because of the AKKA gated 
status, since all messages to the task manager are dropped at that moment, 
which might cause the netty channel exception. If so, should we have 
coordination between AKKA and Netty? The gated status is not intended to fail 
the system. Here is the stack trace of the exception:

2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - id (14/96) (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
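
The timestamps in the log above can be compared directly to see how close the 
events are. The sketch below computes the gaps between the last completed 
checkpoint, the AKKA gated warning, and the task failure. The helper name and 
variable names are illustrative assumptions; the timestamp values are copied 
from the log excerpt.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt (milliseconds after the dot).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(earlier, later):
    """Return the gap in seconds between two log timestamps."""
    start = datetime.strptime(earlier, TIMESTAMP_FORMAT)
    end = datetime.strptime(later, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Timestamps copied from the log excerpt in this message.
last_checkpoint = "2019-04-12 12:46:38.413"
akka_gated = "2019-04-12 12:49:14.175"
task_failed = "2019-04-12 12:49:14.230"

gap_gated_to_failed = seconds_between(akka_gated, task_failed)
gap_checkpoint_to_gated = seconds_between(last_checkpoint, akka_gated)
```

The gated warning and the task failure are about 55 ms apart, and both come 
roughly 156 s after the last completed checkpoint. That is consistent with the 
two errors being symptoms of one event on the remote side, though the timing 
alone does not prove it.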


