Thanks. We found the relevant NodeManager log and figured out that the lost task manager was killed by YARN due to the memory limit. @zhijiang <wangzhijiang...@aliyun.com> @Biao Liu <mmyy1...@gmail.com>, thanks for your help.
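For anyone who hits the same symptom, here is a rough sketch of how one might confirm this kind of kill and give the TaskManager containers more room. The log path, memory sizes, and job jar name below are placeholders rather than values from this thread, and the exact CLI flags depend on your Flink version.

    # Look for YARN's memory-kill message in the NodeManager log on the
    # lost task manager's host (the log location varies by distribution):
    grep -iE "beyond (physical|virtual) memory limits|Killing container" \
        /var/log/hadoop-yarn/yarn-*-nodemanager-*.log

    # If the container was killed for memory overuse, request larger
    # TaskManager containers, e.g. for a YARN session:
    ./bin/yarn-session.sh -tm 8192
    # or for a per-job submission:
    ./bin/flink run -m yarn-cluster -ytm 8192 your-job.jar

Depending on the Flink version, leaving more headroom for off-heap memory (for example via containerized.heap-cutoff-ratio in the older 1.x memory model) may also help; please check the documentation for your release.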
On Sun, Apr 21, 2019 at 11:45 PM zhijiang <wangzhijiang...@aliyun.com> wrote:

> Hi Wenrui,
>
> I think you could trace the log of the node manager, which contains the
> lifecycle of this task executor. Maybe this task executor was killed by
> the node manager because of memory overuse.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Wenrui Meng <wenruim...@gmail.com>
> Send Time: Saturday, April 20, 2019, 09:48
> To: zhijiang <wangzhijiang...@aliyun.com>
> Cc: Biao Liu <mmyy1...@gmail.com>; user <user@flink.apache.org>; tzulitai <tzuli...@apache.org>
> Subject: Re: Netty channel closed at AKKA gated status
>
> Attached the last 10000 lines of the lost task manager's log. Can anyone
> help take a look?
>
> Thanks,
> Wenrui
>
> On Fri, Apr 19, 2019 at 6:32 PM Wenrui Meng <wenruim...@gmail.com> wrote:
>
> Looked at a few instances of the same issue. The lost task manager was
> indeed no longer active, since no log was printed for that task manager
> after the timestamp of the connection issue. I guess that task manager
> somehow died silently without logging any exception or termination
> information. I double checked the lost task manager's host: GC, CPU,
> memory, network, and disk I/O all look good without any spike. Is there
> any other way the task manager could have been terminated? We run our
> jobs in a YARN cluster.
>
> On Mon, Apr 15, 2019 at 10:47 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
>
> Hi Wenrui,
>
> You might further check whether there is a network connection issue
> between the job master and the target task executor if you confirm the
> target task executor is still alive.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Biao Liu <mmyy1...@gmail.com>
> Send Time: Tuesday, April 16, 2019, 10:14
> To: Wenrui Meng <wenruim...@gmail.com>
> Cc: zhijiang <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>; tzulitai <tzuli...@apache.org>
> Subject: Re: Netty channel closed at AKKA gated status
>
> Hi Wenrui,
>
> If a task manager is killed (kill -9), it has no chance to log anything.
> If the task manager exited due to a connection timeout, there would be
> something in the log file. So it was probably killed by another user or
> by the operating system. Please check the operating system log. BTW, I
> don't think "DEBUG log level" would help.
>
> Wenrui Meng <wenruim...@gmail.com> wrote on Tue, Apr 16, 2019 at 9:16 AM:
>
> There is no exception or any warning in the task manager
> 'athena592-phx2/10.80.118.166:44177' log. In addition, the host was not
> shut down either, according to the cluster monitoring dashboard. It
> probably requires turning on DEBUG logging to get more useful
> information. If the task manager gets killed, I assume there would be a
> termination entry in the task manager log. If not, I don't know how to
> figure out whether it's due to the task manager getting killed or just a
> connection timeout.
>
> On Sun, Apr 14, 2019 at 7:22 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
>
> Hi Wenrui,
>
> I think the akka gated issue and the inactive netty channel are both
> caused by some task manager exiting or being killed. You should double
> check the status of and the reason for this task manager
> 'athena592-phx2/10.80.118.166:44177'.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Wenrui Meng <wenruim...@gmail.com>
> Send Time: Saturday, April 13, 2019, 01:01
> To: user <user@flink.apache.org>
> Cc: tzulitai <tzuli...@apache.org>
> Subject: Netty channel closed at AKKA gated status
>
> We encountered the netty channel inactive issue while AKKA had gated that
> task manager. I'm wondering whether the channel was closed because of the
> AKKA gated status, since all messages to the task manager are dropped at
> that moment, which might cause a netty channel exception. If so, should
> we have coordination between AKKA and Netty? The gated status is not
> intended to fail the system. Here is the stack trace for the exception:
>
> 2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO
>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
>   checkpoint 3758 (3788228399 bytes in 5967 ms).
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN
>   akka.remote.ReliableDeliverySupervisor
>   flink-akka.remote.default-remote-dispatcher-25 - Association with remote
>   system [akka.tcp://flink@athena592-phx2:44487] has failed, address is
>   now gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN
>   akka.remote.ReliableDeliverySupervisor
>   flink-akka.remote.default-remote-dispatcher-25 - Association with remote
>   system [akka.tcp://flink@athena592-phx2:44487] has failed, address is
>   now gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO
>   org.apache.flink.runtime.executiongraph.ExecutionGraph - id (14/96)
>   (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>   Connection unexpectedly closed by remote task manager
>   'athena592-phx2/10.80.118.166:44177'. This might indicate that the
>   remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     at java.lang.Thread.run(Thread.java:748)