For the network issue, this answer might help:
http://mail-archives.apache.org/mod_mbox/flink-user/201907.mbox/%3cdb49d6f2-1a6b-490a-8502-edd9562b0163.yungao...@aliyun.com%3E
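
To rule out basic reachability, a quick TCP probe run from the
flink-taskmanager-c host towards flink-taskmanager-b on port 6121 might look
like the sketch below. The function name can_connect and the 3-second timeout
are illustrative assumptions, not part of Flink; the host and port in the
usage comment are taken from the error message in this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (hypothetical usage, run from the flink-taskmanager-c host):
# can_connect("flink-taskmanager-b", 6121)
```

A successful connect only shows that the port is reachable at that moment. An
intermittent "Connection reset by peer" can still occur later, so it may be
worth running the probe repeatedly while the job is processing.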

Thanks,
Zhu Zhu

Karthick Thanigaimani <kargold...@yahoo.com> wrote on Fri, Aug 2, 2019 at 5:34 PM:

> Yes Zhu, the server is online and it didn't die. So it looks like a
> network issue, but we checked everything else, such as ports, SG, etc.,
> and it all looks fine.
>
>
> On Friday, 2 August, 2019, 06:08:37 pm GMT+10, Zhu Zhu <reed...@gmail.com>
> wrote:
>
>
> Hi Karthick,
>
> Could you check whether the `lost` TM 'flink-taskmanager-b/<IP>2:6121' is
> still alive when the error is reported?
> If it is still alive, this seems to be a network issue between
> 'flink-taskmanager-c:6121' and 'flink-taskmanager-b/<IP>2:6121'.
> Otherwise, you need to check why the TM exited, whether due to an internal
> failure (e.g. cancel timeout) or because it was killed by another service
> (e.g. K8S). I'm not familiar with K8S, so I'm not sure in which cases the
> cluster may kill the TM.
>
> Thanks,
> Zhu Zhu
>
> Karthick Thanigaimani <kargold...@yahoo.com> wrote on Fri, Aug 2, 2019 at 2:48 PM:
>
> Thanks Zhu. That's correct, but we don't see any errors in the log of that
> TM, the server logs, or the Kube logs. Is there any bug in Flink 1.4 that
> causes this, or any Flink setting that can avoid it?
>
>
> Regards
> Karthick
>
> On Fri, 2 Aug. 2019 at 16:04, Zhu Zhu <reed...@gmail.com> wrote:
>
> Hi Karthick,
>
> From the log, it seems the TM "flink-taskmanager-b/<IP>2:6121" was lost
> unexpectedly.
> You may need to check the log of that TM to see why it exited, which should
> reveal the root cause.
>
> Thanks,
> Zhu Zhu
>
> Karthick Thanigaimani <kargold...@yahoo.com.invalid> wrote on Fri, Aug 2, 2019 at 1:54 PM:
>
> Hi Team,
> We are facing frequent issues with the Flink job manager in one
> environment during processing.
>
> CHAIN Join(Remap EDGES id: TO) -> Map (Key Extractor) -> Combine
> (Deduplicate edges including bi-directional edges) (57/80)
> Timestamp: 2019-08-02, 4:13:25
> Location: flink-taskmanager-c:6121
>
> We have tried changing the EC2 instances to a bigger size and increasing
> the heap size, etc., but the problem remains. Below is the error message
> that we see. Could someone provide some guidance?
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'flink-taskmanager-b/<IP>2:6121'. This indicates that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     ... 6 more
>
>
> Regards
> Karthick
>
>
