I am currently using 0.9-SNAPSHOT. All the non-jar files are from an older 
build, but I recently manually updated the flink-dist.jar with commit 
d163a817fa2e330e86384d0bbcd104f051a6fb48.

Our setup consists of 10 workers and a master, all interconnected via a switch 
(100 Mbit, I think). The dataset is an NTriple file of about 8 GB, but 
intermediate datasets might be much larger. For smaller datasets, I have not 
been able to observe this problem yet.

Also, during the failure there are a lot of concurrent shuffles ongoing [1,2]. 
It might additionally be interesting that in the affected jobs either this 
exception occurs or another one that looks like a network disruption [3]. So it 
might well be that our setup suffers from occasional network errors, 
especially under high network load - but that’s just a guess.
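For context, the buffer re-ordering exception from my original mail below comes from a receiver-side check that every buffer on a channel arrives with the next expected sequence number, so a buffer lost or reordered on the wire would trip it. Here is a simplified sketch of such a check (an illustration only, not Flink's actual RemoteInputChannel code):

```java
// Simplified sketch of a receiver-side buffer sequence check
// (illustration only, not Flink's actual RemoteInputChannel implementation).
public class SequenceCheck {
    private int expectedSequenceNumber = 0;

    /** Accepts the next buffer only if its sequence number is the expected one. */
    public void onBuffer(int sequenceNumber) {
        if (sequenceNumber != expectedSequenceNumber) {
            // Corresponds to the "Buffer re-ordering" message in the trace below.
            throw new IllegalStateException(
                    "Buffer re-ordering: expected buffer with sequence number "
                            + expectedSequenceNumber + ", but received " + sequenceNumber);
        }
        expectedSequenceNumber++;
    }
}
```

With such a check, a single skipped buffer (e.g. sequence 17841 dropped, 17842 delivered) fails immediately rather than silently corrupting the record stream.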

Regarding reproducibility, all I can say right now is that this error has 
occurred twice. I will re-run the jobs, see if the errors can be observed 
again, and let you know.

Cheers,
Sebastian

[1] 
https://owncloud.hpi.de/public.php?service=files&t=89dcba7ce63ed053331f5099cad64704
[2] 
https://owncloud.hpi.de/public.php?service=files&t=1b16d66d6b9e41c68bf598309d57b7b1
[3] Stack trace: tenem18.hpi.uni-potsdam.de
Error: java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (GroupReduce at de.hpi.isg.sodap.rdfind.plan.AllAtOnceTraversalStrategy.enhanceFlinkPlan(AllAtOnceTraversalStrategy.scala:56))
 -> Filter (Filter at de.hpi.isg.sodap.rdfind.plan.AllAtOnceTraversalStrategy.enhanceFlinkPlan(AllAtOnceTraversalStrategy.scala:68))
 -> FlatMap (FlatMap at de.hpi.isg.sodap.rdfind.plan.TraversalStrategy$$anonfun$1.apply(TraversalStrategy.scala:46))
 -> Map (Map at de.hpi.isg.sodap.rdfind.plan.TraversalStrategy$class.splitAndCleanCindSets(TraversalStrategy.scala:58))', caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: null
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:471)
at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: null
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:607)
at org.apache.flink.runtime.operators.RegularPactTask.getInput(RegularPactTask.java:1145)
at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:466)
... 3 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:784)
Caused by: java.io.EOFException
at org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:290)
at org.apache.flink.types.StringValue.readString(StringValue.java:741)
at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:68)
at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:28)
at org.apache.flink.api.scala.typeutils.CaseClassSerializer.deserialize(CaseClassSerializer.scala:102)
at org.apache.flink.api.scala.typeutils.CaseClassSerializer.deserialize(CaseClassSerializer.scala:29)
at org.apache.flink.api.common.typeutils.base.GenericArraySerializer.deserialize(GenericArraySerializer.java:123)
at org.apache.flink.api.common.typeutils.base.GenericArraySerializer.deserialize(GenericArraySerializer.java:33)
at org.apache.flink.api.scala.typeutils.CaseClassSerializer.deserialize(CaseClassSerializer.scala:102)
at org.apache.flink.api.scala.typeutils.CaseClassSerializer.deserialize(CaseClassSerializer.scala:95)
at org.apache.flink.api.scala.typeutils.CaseClassSerializer.deserialize(CaseClassSerializer.scala:29)
at org.apache.flink.runtime.plugable.ReusingDeserializationDelegate.read(ReusingDeserializationDelegate.java:57)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:133)
at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:64)
at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:1020)
at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:781)
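The EOFException at the bottom of this trace suggests the deserializer ran out of bytes in the middle of a record, i.e. the incoming buffer stream was truncated or corrupted, which would fit the network-error theory. A minimal standalone demonstration of how reading past the end of a byte stream produces this exception (just an illustration, unrelated to Flink's code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedRecordDemo {
    public static void main(String[] args) throws IOException {
        // Pretend only 2 of the 4 bytes of a serialized int made it into the buffer.
        byte[] truncated = {0, 1};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated));
        try {
            in.readInt(); // needs 4 bytes, only 2 available -> EOFException
        } catch (EOFException e) {
            System.out.println("EOFException: record was cut short");
        }
    }
}
```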


-----Original Message-----
From: Ufuk Celebi [mailto:u...@apache.org] 
Sent: Wednesday, 3 June 2015 10:33
To: dev@flink.apache.org
Subject: Re: Buffer re-ordering problem

This is a critical bug.

- which version are you using? If snapshot, which commit?
- what is your setup? Number of machines, dataset, etc.?
- is it reproducible?

On Wednesday, June 3, 2015, Kruse, Sebastian <sebastian.kr...@hpi.de> wrote:

> Hi everyone,
>
> I had some jobs running overnight, and in two of them the following 
> exception occurred after about half an hour. Do you know why this happens?
>
> Thanks,
> Sebastian
>
> tenem16.hpi.uni-potsdam.de
> Error: java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (GroupReduce at de.hpi.isg.sodap.rdfind.plan.AllAtOnceTraversalStrategy.enhanceFlinkPlan(AllAtOnceTraversalStrategy.scala:56))
> -> Filter (Filter at de.hpi.isg.sodap.rdfind.plan.AllAtOnceTraversalStrategy.enhanceFlinkPlan(AllAtOnceTraversalStrategy.scala:68))
> -> FlatMap (FlatMap at de.hpi.isg.sodap.rdfind.plan.TraversalStrategy$$anonfun$1.apply(TraversalStrategy.scala:46))
> -> Map (Map at de.hpi.isg.sodap.rdfind.plan.TraversalStrategy$class.splitAndCleanCindSets(TraversalStrategy.scala:58))', caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Buffer re-ordering: expected buffer with sequence number 17841, but received 17842.
> at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:471)
> at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Buffer re-ordering: expected buffer with sequence number 17841, but received 17842.
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:607)
> at org.apache.flink.runtime.operators.RegularPactTask.getInput(RegularPactTask.java:1145)
> at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
> at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:466)
> ... 3 more
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Buffer re-ordering: expected buffer with sequence number 17841, but received 17842.
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:784)
> Caused by: org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel$BufferReorderingException: Buffer re-ordering: expected buffer with sequence number 17841, but received 17842.
> at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.onBuffer(RemoteInputChannel.java:253)
> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeBufferOrEvent(PartitionRequestClientHandler.java:279)
> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeMsg(PartitionRequestClientHandler.java:214)
> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelRead(PartitionRequestClientHandler.java:158)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
>
>
> ---
> Sebastian Kruse
> PhD student, Information Systems Group
> Hasso Plattner Institute at the University of Potsdam
> Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam, Tel +49 331 5509 240
> Amtsgericht Potsdam, HRB 12184
> Managing Director: Prof. Dr. Christoph Meinel
>
>
>
