There is no exception or warning in the log of task manager `'athena592-phx2/10.80.118.166:44177'`, and according to the cluster monitoring dashboard the host was not shut down either. I probably need to turn on DEBUG logging to get more useful information. If the task manager had been killed, I would expect to see termination messages in its log. Without them, I don't know how to tell whether the task manager was killed or the connection simply timed out.
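One rough way to tell the two cases apart without DEBUG logging is to check whether the task manager kept writing log entries after the moment the job manager reported the failure (12:49:14.230 in the quoted log). Entries after that time suggest the process was still alive, which points away from a kill. The sketch below is only an illustration: the function names are hypothetical, and it assumes the task manager log uses the same `yyyy-MM-dd HH:mm:ss.SSS` timestamp prefix as the job manager log quoted in this thread.

```python
from datetime import datetime
import re

# Timestamp prefix as seen in the quoted job manager log,
# e.g. "2019-04-12 12:49:14.175". ASSUMPTION: the task manager log
# uses the same layout.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
TS_FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none
    (for example a stack-trace continuation line)."""
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), TS_FMT) if m else None


def entries_after(lines, failure_time):
    """Return the timestamped log lines written strictly after failure_time."""
    result = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and ts > failure_time:
            result.append(line.rstrip("\n"))
    return result
```

To use it, open the task manager log file, pass the file object as `lines`, and pass `datetime.strptime("2019-04-12 12:49:14.230", TS_FMT)` as `failure_time`. An empty result is consistent with the process having stopped, but it does not prove it: an idle task manager may simply have logged nothing.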
On Sun, Apr 14, 2019 at 7:22 PM zhijiang <wangzhijiang...@aliyun.com> wrote:

> Hi Wenrui,
>
> I think the akka gated issue and the inactive netty channel are both caused
> by a task manager exiting or being killed. You should double check the status
> of this task manager `'athena592-phx2/10.80.118.166:44177'` and the reason
> it went away.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Wenrui Meng <wenruim...@gmail.com>
> Send Time: April 13, 2019 (Saturday) 01:01
> To: user <user@flink.apache.org>
> Cc: tzulitai <tzuli...@apache.org>
> Subject: Netty channel closed at AKKA gated status
>
> We encountered the netty channel inactive issue while AKKA had gated that
> task manager. I'm wondering whether the channel was closed because of the
> AKKA gated status, since all messages to the task manager are dropped at
> that moment, which might cause the netty channel exception. If so, should
> there be coordination between AKKA and Netty? The gated status is not
> intended to fail the system. Here is the stack trace for the exception:
>
> 2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - id (14/96) (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     at java.lang.Thread.run(Thread.java:748)