Hi All,

I'm experiencing a java.lang.NegativeArraySizeException in a pyspark script; I've pasted the full traceback at the end of this email.
I have isolated the line of code in my script which "causes" the exception to occur. Although the exception occurs deterministically, it is very unclear why different variants of that line would trigger it. Unfortunately, I am only able to reproduce the bug in the context of a large data processing job, and the line of code which must change to reproduce the bug has little meaning out of context.

The bug occurs when I call "map" on an RDD with a function that references some state outside of the RDD (which is presumably pickled and distributed with the function). The output of the function is a tuple where the first element is an int and the second element is a list of floats (the same positive length every time, as verified by an 'assert' statement). A minimal sketch of this pattern appears after the traceback below.

Given that:
-It's unclear why changes in the line would cause an exception
-The exception comes from within pyspark code
-The exception has to do with negative array sizes (and I couldn't have created a negative-sized array anywhere in my Python code)

I suspect this is a bug in pyspark. Has anybody else observed or reported this bug?

best,
-Brad

Traceback (most recent call last):
  File "/home/bmiller1/pipeline/driver.py", line 214, in <module>
    main()
  File "/home/bmiller1/pipeline/driver.py", line 203, in main
    bl.write_results(iteration_out_dir)
  File "/home/bmiller1/pipeline/layer/svm_layer.py", line 137, in write_results
    fig, accuracy = _get_results(self.prediction_rdd)
  File "/home/bmiller1/pipeline/layer/svm_layer.py", line 56, in _get_results
    predictions = np.array(prediction_rdd.collect())
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.py", line 2026, in _jrdd
    broadcast_vars, self.ctx._javaAccumulator)
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
java.lang.NegativeArraySizeException
        at py4j.Base64.decode(Base64.java:292)
        at py4j.Protocol.getBytes(Protocol.java:167)
        at py4j.Protocol.getObject(Protocol.java:276)
        at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:66)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:701)
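
For concreteness, here is a minimal sketch of the map pattern described above. This is not my actual pipeline code; the names (sc, weights, score, prediction_rdd) are placeholders I made up for illustration, and the real job differs mainly in scale:

    from pyspark import SparkContext

    sc = SparkContext(appName="repro-sketch")

    # Driver-side state referenced by the mapped function; pyspark pickles
    # this along with the function and ships it to the executors.
    weights = [0.1, 0.2, 0.3]

    def score(record):
        label, features = record
        values = [w * f for w, f in zip(weights, features)]
        # Same positive length every time, as verified in my real job.
        assert len(values) == len(weights) and len(values) > 0
        return (label, values)  # (int, list of floats)

    prediction_rdd = sc.parallelize([(0, [1.0, 2.0, 3.0]),
                                     (1, [4.0, 5.0, 6.0])])
    predictions = prediction_rdd.map(score).collect()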