Hi Friends:

We noticed that the following happens in 'pyspark' when running in distributed Standalone Mode (MASTER=spark://vps00:7077),
but not in Local Mode (MASTER=local[n]).

See the traceback below, particularly the parts marked with asterisks (again, the problem only happens in Standalone Mode).
Any ideas? Thank you in advance! =:)

>>>
>>> rdd = sc.textFile('file:///etc/hosts')
>>> rdd.first()

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1129, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1111, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 818, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 1.0 (TID 7, vps03): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File *"/usr/lib/spark/python/pyspark/rdd.py", line 1106*, in takeUpToNumLeft   <--- *See around line 1106 of this file in the CDH5 Spark distribution.*
    while taken < left:
*ImportError: No module named iter*

>>> # But *iter()* exists as a built-in (not as a module) ...
>>> iter(range(10))
<listiterator object at 0x423ff10>
>>>
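
In case it's relevant: in Local Mode the Python closure runs in the driver's own interpreter, whereas in Standalone Mode it is unpickled and executed by whatever Python the worker on vps03 launches, so we also want to compare the two interpreters. A minimal sketch of how we could check that (the helper name 'python_info' is ours, just for illustration; it uses the same 'sc' as above):

>>> import sys
>>> def python_info(_):
...     # Report the interpreter each executor actually runs; imports are done
...     # inside the function so the closure is self-contained on the workers.
...     import sys, socket
...     return (socket.gethostname(), sys.version_info[:3], sys.executable)
...
>>> (sys.version_info[:3], sys.executable)                              # driver-side interpreter
>>> sc.parallelize(range(4), 4).map(python_info).distinct().collect()   # executor-side interpreters

If the executor-side interpreters differ from the driver's, that might at least explain why only the distributed runs fail.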

cluster$ rpm -qa | grep -i spark
[ ... ]
spark-python-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-core-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-worker-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-master-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch


Thank you!
Team Prismalytics
