For the network issue, this answer might help: http://mail-archives.apache.org/mod_mbox/flink-user/201907.mbox/%3cdb49d6f2-1a6b-490a-8502-edd9562b0163.yungao...@aliyun.com%3E
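If the TM really stays alive and only the data connection drops, it may also be worth giving the failure detection and connection setup a bit more slack so that short network hiccups don't fail the job. Below is only a rough flink-conf.yaml sketch; the key names are the 1.4-era Akka death-watch and netty client settings as I remember them, and the values are placeholders, so please verify both against the Flink 1.4 configuration documentation before changing anything:

    # flink-conf.yaml (illustrative values only; check against the Flink 1.4 docs)
    # heartbeat interval and acceptable pause for Akka's death watch,
    # which decides when a remote TM is declared lost
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 120 s
    # timeout for Akka ask calls (RPC between JM and TMs)
    akka.ask.timeout: 60 s
    # netty client connect timeout (in seconds) used for partition requests
    taskmanager.network.netty.client.connectTimeoutSec: 240

These settings only paper over transient connectivity problems; if the pod was actually restarted or OOM-killed by Kubernetes, the TM log and the pod's last termination state are still the place to look.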
Thanks,
Zhu Zhu

Karthick Thanigaimani <kargold...@yahoo.com> wrote on Friday, 2 Aug 2019 at 17:34:

> Yes Zhu, the server is online and it didn't die. So it looks like a
> network issue, but we checked all the other things like ports / security
> groups etc., which look fine.
>
> On Friday, 2 August, 2019, 06:08:37 pm GMT+10, Zhu Zhu <reed...@gmail.com> wrote:
>
> Hi Karthick,
>
> Could you check whether the `lost` TM 'flink-taskmanager-b/<IP>2:6121' is
> still alive when the error is reported?
> If it is still alive then, this seems to be a network issue between
> 'flink-taskmanager-c:6121' and 'flink-taskmanager-b/<IP>2:6121'.
> Otherwise, you need to check why the TM exited, whether due to an internal
> failure (e.g. cancel timeout) or because it was killed by another service
> (e.g. K8s). I'm not familiar with K8s, so I'm not sure in which cases the
> cluster may kill the TM.
>
> Thanks,
> Zhu Zhu
>
> Karthick Thanigaimani <kargold...@yahoo.com> wrote on Friday, 2 Aug 2019 at 14:48:
>
> Thanks Zhu. That's correct, but we don't see any errors in the log of that
> TM, in the server logs, or in the Kube logs. Is there any bug in Flink 1.4
> that causes this, or a setting in Flink that can avoid it?
>
> Regards
> Karthick
>
> On Fri, 2 Aug 2019 at 16:04, Zhu Zhu <reed...@gmail.com> wrote:
>
> Hi Karthick,
>
> From the log it seems the TM "flink-taskmanager-b/<IP>2:6121" was lost
> unexpectedly.
> You may need to check the log of that TM to see why it exited, which
> should be the root cause.
>
> Thanks,
> Zhu Zhu
>
> Karthick Thanigaimani <kargold...@yahoo.com.invalid> wrote on Friday, 2 Aug 2019 at 13:54:
>
> Hi Team,
>
> We are facing frequent issues with the Flink job manager in one
> environment when the processing happens:
>
> CHAIN Join(Remap EDGES id: TO) -> Map (Key Extractor) -> Combine
> (Deduplicate edges including bi-directional edges) (57/80)
> Timestamp: 2019-08-02, 4:13:25
> Location: flink-taskmanager-c:6121
>
> We have tried changing the EC2 instances to bigger sizes and increased
> the heap size etc., but still see the same problem. Below is the error
> message we see. Could someone provide some guidance?
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Lost connection to task manager 'flink-taskmanager-b/<IP>2:6121'. This
> indicates that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     ... 6 more
>
> Regards,
> Karthick