Hi Wenrui,

I think the Akka gated issue and the inactive Netty channel are both caused by a task manager exiting or being killed. You should double-check the status of this task manager, `'athena592-phx2/10.80.118.166:44177'`, and the reason for its exit.
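As a rough aid for that check, here is a minimal sketch of how one might pull the log entries that mention a given host and see how close together they are in time. It only assumes the timestamp format seen in the quoted logs; `events_for_host` and `SAMPLE` are hypothetical names used for illustration, and the sample lines are abridged (thread names replaced by `[...]`) from the logs in this thread.

```python
import re
from datetime import datetime

# Matches the leading timestamp used in the quoted logs, e.g. "2019-04-12 12:49:14.175".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def events_for_host(lines, host):
    """Return (timestamp, line) pairs for log lines mentioning `host`, oldest first."""
    hits = []
    for line in lines:
        m = TS.match(line)
        if m and host in line:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            hits.append((ts, line))
    return sorted(hits, key=lambda h: h[0])


# Abridged excerpts of the log lines quoted in this thread (not new data).
SAMPLE = [
    "2019-04-12 12:46:38.413 [...] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).",
    "2019-04-12 12:49:14.175 [...] WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated]",
    "2019-04-12 12:49:14.230 [...] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - id (14/96) switched from RUNNING to FAILED. Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'.",
]

if __name__ == "__main__":
    events = events_for_host(SAMPLE, "athena592-phx2")
    for ts, line in events:
        print(ts.isoformat(), "|", line[:100])
    if len(events) >= 2:
        gap = (events[-1][0] - events[0][0]).total_seconds()
        print(f"first to last event for this host: {gap:.3f} s")
```

If the Akka disassociation and the Netty channel closure show up within a fraction of a second of each other, that would be consistent with both being symptoms of the same task manager going away, rather than one causing the other.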
Best,
Zhijiang

------------------------------------------------------------------
From: Wenrui Meng <[email protected]>
Send Time: 2019-04-13 (Saturday) 01:01
To: user <[email protected]>
Cc: tzulitai <[email protected]>
Subject: Netty channel closed at AKKA gated status

We encountered the Netty channel inactive issue while Akka had gated that task manager. I'm wondering whether the channel was closed because of the Akka gated status, since all messages to the TaskManager are dropped at that moment, which might cause a Netty channel exception. If so, should there be coordination between Akka and Netty? The gated status is not intended to fail the system. Here is the stack trace of the exception:

2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - id (14/96) (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:748)
