Hi,

Do you have an exception for the CoGroup failure?

Best, Fabian

2017-07-26 3:32 GMT+02:00 Charith Wickramarachchi <charith.dhanus...@gmail.com>:

> I did some more digging. It seems the CoGroup operation failed in one of
> the workers. But I do not face this issue when running other tasks.
>
> Thanks,
> Charith
>
> On Tue, Jul 25, 2017 at 2:06 PM, Charith Wickramarachchi <
> charith.dhanus...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm getting an exception when running a Gelly task using Pregel model. It
>> complains that the remote task manager might be lost. But task managers
>> seem to be active based on the flink dashboard. Also, other tasks run fine
>> without an issue.
>>
>> Following is the summary of exception trace. I have attached the full
>> trace as well. It will be great if you can provide any directions to
>> identify the issue.
>>
>> Flink version: flink-1.1.3
>> Java: 1.7
>>
>> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread'
>> terminated due to an exception: Connecting the channel failed: Connecting
>> to remote task manager + 'worker/127.0.1.1:44310' has failed. This might
>> indicate that the remote task manager has been lost.
>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
>> Caused by: java.io.IOException: Connecting the channel failed: Connecting
>> to remote task manager + 'worker/127.0.1.1:44310' has failed. This might
>> indicate that the remote task manager has been lost.
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
>>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
>>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:118)
>>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:394)
>>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:413)
>>     at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:87)
>>     at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:42)
>>     at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:973)
>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
>> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>> Connecting to remote task manager + 'worker/127.0.1.1:44310' has failed.
>> This might indicate that the remote task manager has been lost.
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:131)
>>
>> Thanks,
>> Charith
>>
>> --
>> Charith Dhanushka Wickramaarachchi
>>
>> Tel +1 213 447 4253
>> Blog http://charith.wickramaarachchi.org/
>> <http://charithwiki.blogspot.com/>
>> Twitter @charithwiki <https://twitter.com/charithwiki>
>>
>> This communication may contain privileged or other confidential information
>> and is intended exclusively for the addressee/s. If you are not the
>> intended recipient/s, or believe that you may have
>> received this communication in error, please reply to the sender indicating
>> that fact and delete the copy you received and in addition, you should
>> not print, copy, retransmit, disseminate, or otherwise use the
>> information contained in this communication. Internet communications
>> cannot be guaranteed to be timely, secure, error or virus-free. The
>> sender does not accept liability for any errors or omissions
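
A note on the address in the trace: 'worker/127.0.1.1:44310' lies in the 127.0.0.0/8 loopback range, so a task manager advertising it cannot be reached from another machine. Whether that is the cause here is an assumption and is not confirmed anywhere in this thread. A minimal sketch for checking what a worker's host name resolves to follows; the class `AddressCheck` and its helper are hypothetical (not part of Flink) and use only standard `java.net.InetAddress` calls.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical diagnostic helper; not part of Flink. */
final class AddressCheck {

    /** Returns true if the host name or IP literal resolves to a loopback address. */
    public static boolean isLoopback(String hostOrIp) {
        try {
            // For an IP literal such as "127.0.1.1" no DNS lookup is performed.
            return InetAddress.getByName(hostOrIp).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve " + hostOrIp, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // What this machine believes its own address is.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("Local host resolves to: " + local);
        System.out.println("Is loopback: " + local.isLoopbackAddress());

        // The address taken from the exception message in this thread.
        System.out.println("127.0.1.1 is loopback: " + isLoopback("127.0.1.1"));
    }
}
```

Run it on each worker. If the local host resolves to a loopback address, the host name mapping on that machine (for example an `/etc/hosts` entry pointing the host name at 127.0.1.1) would be the first thing to look at.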