I am trying to zip one RDD with another RDD. I store my matrix in HDFS and load it as:

    Ab_rdd = sc.textFile('data/Ab.txt', 100)
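For context, each record of Ab_rdd comes back as one line of the file, as a string (the sample value in the comment is just illustrative):

    print Ab_rdd.first()  # e.g. u'1.0 2.0 3.0 ...' -- one row of the matrix as text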
If I then do:

    idx = sc.parallelize(range(m), 100)  # m is the number of records in Ab.txt
    print matrix_Ab.matrix.zip(idx).first()

I get the error below. If I instead store my matrix (Ab.txt) locally and use sc.parallelize to create the RDD, the error does not appear. Does anyone know what's going on? Thanks!

Traceback (most recent call last):
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/l2_exp.py", line 51, in <module>
    print test_obj.execute_l2(matrix_Ab,A,b,x_opt,f_opt)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/test_l2.py", line 35, in execute_l2
    ls.fit()
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/least_squares.py", line 23, in fit
    x = self.projection.execute(self.matrix_Ab, 'solve')
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 26, in execute
    PA = self.__project(matrix, lim)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 50, in __project
    print matrix.zip_with_index(self.sc).first()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 881, in first
    return self.take(1)[0]
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.collectPartitions.
: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)

PySpark worker failed with exception:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

14/11/15 00:36:17 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:118)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:148)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:81)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        [... same ten frames as in the first ClassCastException above ...]

14/11/15 00:36:17 ERROR PythonRDD: This may have been caused by a prior exception:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
        [... same ten frames as above ...]

14/11/15 00:36:17 INFO DAGScheduler: Failed to run first at /home/jiyan/randomized-matrix-algorithms/spark/src/projections.py:50
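In case it helps, here is a minimal standalone script that reproduces what I'm seeing, stripped of my matrix wrapper classes. The app name and m below are placeholders for my real setup, and 'data/Ab.txt' stands in for the HDFS file; only the textFile variant fails for me:

    from pyspark import SparkContext

    sc = SparkContext(appName='zip_eoferror_repro')  # placeholder app name

    m = 1000  # placeholder: number of records in Ab.txt

    idx = sc.parallelize(range(m), 100)

    # Variant 1: read the file locally and parallelize the rows -- this works.
    rows = [line.rstrip('\n') for line in open('data/Ab.txt')]
    local_rdd = sc.parallelize(rows, 100)
    print local_rdd.zip(idx).first()  # prints (first row, 0) as expected

    # Variant 2: load the same file from HDFS with textFile -- this one
    # raises the Py4JJavaError / EOFError shown above.
    hdfs_rdd = sc.textFile('data/Ab.txt', 100)
    print hdfs_rdd.zip(idx).first()

My real code goes through matrix_Ab.matrix and a zip_with_index helper (visible in the trace), but this stripped-down version fails the same way for me.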