Hi Flavio, can you privately share the source code of your Flink job with me?
I'm wondering whether the issue might be caused by a version mixup between different versions on the cluster (different JVM versions, or different files in the lib/ folder?). How are you deploying the Flink job?

Regards,
Robert

On Mon, May 30, 2016 at 11:33 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> I tried to reproduce the error on a subset of the data. Reducing the
> available memory and increasing GC pressure (by creating a lot of
> useless objects in one of the first UDFs) caused this error:
>
> Caused by: java.io.IOException: Thread 'SortMerger spilling thread'
> terminated due to an exception: / by zero
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by: java.lang.ArithmeticException: / by zero
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.getSegmentsForReaders(UnilateralSortMerger.java:1651)
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.mergeChannelList(UnilateralSortMerger.java:1565)
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1417)
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
>
> I hope this helps to narrow down the debugging area :)
>
> Best,
> Flavio
>
> On Fri, May 27, 2016 at 8:21 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> That is a pretty thing indeed :-) Will try to look into this in a few
>> days...
>>
>> Stephan
>>
>> On Fri, May 27, 2016 at 12:10 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Running the job with the log level set to DEBUG made the job run
>>> successfully... Is this meaningful..? Maybe slowing down the threads a
>>> little bit could help serialization?
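[Editor's note: the "/ by zero" quoted above originates from an integer division inside the spilling merger. As a minimal sketch of the failure mode only (a simplified stand-in for the fan-in logic, not Flink's actual code), dividing the available memory segments by a channel count that has dropped to zero raises exactly this ArithmeticException:]

```java
// Minimal sketch of the failure mode, NOT Flink's actual implementation:
// the spilling thread divides its memory segments among the channels it
// merges; if the channel count is ever zero, Java integer division throws
// the same "/ by zero" ArithmeticException seen in the stack trace.
public class SegmentDivisionSketch {
    static int segmentsPerChannel(int totalSegments, int numChannels) {
        // throws ArithmeticException("/ by zero") when numChannels == 0
        return totalSegments / numChannels;
    }

    public static void main(String[] args) {
        System.out.println(segmentsPerChannel(32, 4)); // prints 8
        try {
            segmentsPerChannel(32, 0);
        } catch (ArithmeticException e) {
            System.out.println("caught: " + e.getMessage()); // prints "caught: / by zero"
        }
    }
}
```

[This suggests that under memory pressure the merger may end up with zero merge channels or zero segments per reader; the real fix belongs in the Flink code at UnilateralSortMerger.java:1651.]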
>>>
>>> On Thu, May 26, 2016 at 12:34 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> I'm still not able to reproduce the error locally, only remotely :)
>>>> Any suggestions about how to reproduce it locally on a subset of the
>>>> data? This time I had:
>>>>
>>>> com.esotericsoftware.kryo.KryoException: Unable to find class: ^Z^A
>>>>     at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
>>>>     at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
>>>>     at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
>>>>     at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
>>>>     at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:228)
>>>>     at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:431)
>>>>     at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
>>>>     at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:124)
>>>>     at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:65)
>>>>     at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
>>>>     at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
>>>>     at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:96)
>>>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
>>>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Tue, May 24, 2016 at 5:47 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>>> Do you have any suggestion about how to reproduce the error on a
>>>>> subset of the data? I'm trying to change the following, but I can't
>>>>> find a configuration that causes the error :(
>>>>>
>>>>> private static ExecutionEnvironment getLocalExecutionEnv() {
>>>>>     org.apache.flink.configuration.Configuration c =
>>>>>         new org.apache.flink.configuration.Configuration();
>>>>>     c.setString(ConfigConstants.TASK_MANAGER_TMP_DIR_KEY, "/tmp");
>>>>>     c.setString(ConfigConstants.BLOB_STORAGE_DIRECTORY_KEY, "/tmp");
>>>>>     c.setFloat(ConfigConstants.TASK_MANAGER_MEMORY_FRACTION_KEY, 0.9f);
>>>>>     c.setLong(ConfigConstants.LOCAL_NUMBER_TASK_MANAGER, 4);
>>>>>     c.setLong(ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, 4);
>>>>>     c.setString(ConfigConstants.AKKA_ASK_TIMEOUT, "10000 s");
>>>>>     c.setLong(ConfigConstants.TASK_MANAGER_NETWORK_NUM_BUFFERS_KEY, 2048 * 12);
>>>>>     ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(c);
>>>>>     env.setParallelism(16);
>>>>>     env.registerTypeWithKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);
>>>>>     return env;
>>>>> }
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>> On Tue, May 24, 2016 at 11:13 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> The errors look really strange. Flavio, could you put together a test
>>>>>> program with example data and configuration that reproduces the
>>>>>> problem? Given that, we could try to debug it.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
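[Editor's note: the KryoException quoted in the thread ("Unable to find class: ^Z^A") is characteristic of reading a class name from a corrupted or misaligned byte stream; the "name" is really control characters (0x1A, 0x01) pulled from the middle of a record. The following is an illustration of that symptom only, using plain java.io rather than Kryo and a made-up record layout: a reader that consumes only part of a header then interprets unrelated payload bytes as a length-prefixed string and comes back with garbage.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration only (plain java.io, not Kryo): a misaligned read turns
// payload bytes into a string of control characters, much like the
// "^Z^A" class name reported in the KryoException above.
public class MisalignedReadSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(2);                  // hypothetical 4-byte record header
        out.writeUTF("com.example.Pojo"); // length-prefixed "class name"
        byte[] bytes = buf.toByteArray();

        // Aligned read: skip the full header, and the name comes out intact.
        DataInputStream ok = new DataInputStream(new ByteArrayInputStream(bytes));
        ok.readInt();
        System.out.println(ok.readUTF()); // prints "com.example.Pojo"

        // Misaligned read: consume only half the header; readUTF now sees a
        // bogus 2-byte length and returns two control characters as the "name".
        DataInputStream bad = new DataInputStream(new ByteArrayInputStream(bytes));
        bad.readShort();
        String garbled = bad.readUTF();
        for (char ch : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) ch); // prints "U+0000 U+0010 "
        }
        System.out.println();
    }
}
```

[If this is what is happening in the job, the corruption would be upstream of Kryo, e.g. in the spilling/spanning record deserializer visible in the stack trace, which would also fit the observation that DEBUG logging, and thus different timing, makes the failure disappear.]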