Hello,
I am running Spark 1.5.1 on EMR with Python 3.
I have a PySpark job that does some simple joins and reduceByKey
operations (a simplified sketch of the job shape follows the traceback
below). It works fine most of the time, but occasionally it fails with
the following error:
15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID 69, ip-172-31-8-142.ap-southeast-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_000003/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1723, in add_shuffle_key
OverflowError: cannot convert float infinity to integer

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
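
For context, the job is roughly of this shape (heavily simplified, with
made-up keys and values just to show the kind of operations involved):

    from pyspark import SparkContext

    sc = SparkContext(appName="example")

    # Two keyed RDDs, e.g. (id, value) pairs; the real job reads them from S3
    left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
    right = sc.parallelize([(1, 10), (2, 20), (2, 30)])

    # A simple join followed by a reduceByKey, which is where the shuffle
    # (and the failing add_shuffle_key code path) comes in
    joined = left.join(right)  # (key, (left_val, right_val))
    counts = joined.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)

    print(counts.collect())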
The line in question is:
https://github.com/apache/spark/blob/4f894dd6906311cb57add6757690069a18078783/python/pyspark/rdd.py#L1723
I'm having a hard time seeing how `batch` could ever be set to infinity.
The error is also hard to reproduce consistently.
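
For what it's worth, a tiny standalone Python snippet (not the Spark
code itself, just my guess at the failure mode, assuming `batch` is a
float that keeps getting scaled up) produces the exact same exception:

    import math

    batch = 1000.0

    # A Python float that keeps growing eventually overflows to inf
    # (floats top out around 1.8e308), and int() of infinity raises
    # the same OverflowError seen in the traceback above.
    while not math.isinf(batch):
        batch *= 1.5

    int(batch)  # OverflowError: cannot convert float infinity to integer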
Help?