I am trying to zip one RDD with another RDD. I store my matrix in HDFS and load it as:

    Ab_rdd = sc.textFile('data/Ab.txt', 100)
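For context, each record of Ab_rdd comes back as one line of the file, as a string (the sample value in the comment is just illustrative):

    print Ab_rdd.first()  # e.g. u'1.0 2.0 3.0 ...' -- one row of the matrix as text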
If I then do:

    idx = sc.parallelize(range(m), 100)  # m is the number of records in Ab.txt
    print matrix_Ab.matrix.zip(idx).first()

I get the error below. If I instead store my matrix (Ab.txt) locally and use sc.parallelize to create the RDD, the error does not appear. Does anyone know what's going on? Thanks!

Traceback (most recent call last):
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/l2_exp.py", line 51, in <module>
    print test_obj.execute_l2(matrix_Ab,A,b,x_opt,f_opt)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/test_l2.py", line 35, in execute_l2
    ls.fit()
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/least_squares.py", line 23, in fit
    x = self.projection.execute(self.matrix_Ab, 'solve')
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 26, in execute
    PA = self.__project(matrix, lim)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 50, in __project
    print matrix.zip_with_index(self.sc).first()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 881, in first
    return self.take(1)[0]
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.collectPartitions.
: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)

PySpark worker failed with exception:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

14/11/15 00:36:17 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:118)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:148)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:81)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        [... same ten frames as in the first ClassCastException above ...]

14/11/15 00:36:17 ERROR PythonRDD: This may have been caused by a prior exception:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
        [... same ten frames as above ...]

14/11/15 00:36:17 INFO DAGScheduler: Failed to run first at /home/jiyan/randomized-matrix-algorithms/spark/src/projections.py:50
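In case it helps, here is a minimal standalone script that reproduces what I'm seeing, stripped of my matrix wrapper classes. The app name and m below are placeholders for my real setup, and 'data/Ab.txt' stands in for the HDFS file; only the textFile variant fails for me:

    from pyspark import SparkContext

    sc = SparkContext(appName='zip_eoferror_repro')  # placeholder app name

    m = 1000  # placeholder: number of records in Ab.txt

    idx = sc.parallelize(range(m), 100)

    # Variant 1: read the file locally and parallelize the rows -- this works.
    rows = [line.rstrip('\n') for line in open('data/Ab.txt')]
    local_rdd = sc.parallelize(rows, 100)
    print local_rdd.zip(idx).first()  # prints (first row, 0) as expected

    # Variant 2: load the same file from HDFS with textFile -- this one
    # raises the Py4JJavaError / EOFError shown above.
    hdfs_rdd = sc.textFile('data/Ab.txt', 100)
    print hdfs_rdd.zip(idx).first()

My real code goes through matrix_Ab.matrix and a zip_with_index helper (visible in the trace), but this stripped-down version fails the same way for me.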