Hi Friends:
We noticed that the following happens in 'pyspark' when running in
distributed Standalone Mode (MASTER=spark://vps00:7077), but not in
Local Mode (MASTER=local[n]).
See the transcript below, particularly the lines marked with asterisks
(again, the problem only happens in Standalone Mode).
Any ideas? Thank you in advance! =:)
>>>
>>> rdd = sc.textFile('file:///etc/hosts')
>>> rdd.first()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1129, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1111, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 818, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, vps03): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File *"/usr/lib/spark/python/pyspark/rdd.py", line 1106*, in takeUpToNumLeft  <--- *see around line 1106 of this file in the CDH5 Spark distribution*
    while taken < left:
*ImportError: No module named iter*
>>> # But *iter()* exists as a built-in (not as a module) ...
>>> iter(range(10))
<listiterator object at 0x423ff10>
>>>
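For what it's worth, the error message itself is internally consistent: *iter* is a
builtin function, not an importable module, so any code path that ends up treating
the name as a module will fail exactly this way. A quick local check (plain Python,
no Spark needed; the exact message wording varies by Python version):

```python
import importlib

# 'iter' is a builtin, not a module on sys.path, so asking the import
# machinery for it raises ImportError -- the same failure the worker hit.
try:
    importlib.import_module('iter')
except ImportError as e:
    print(e)
```

So the puzzle is why the worker side takes an import path for a builtin name at all,
while the driver (and Local Mode) never does.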
cluster$ rpm -qa | grep -i spark
[ ... ]
spark-python-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-core-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-worker-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-master-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
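One thing we plan to rule out next (just a guess on our part, not a confirmed
diagnosis): a Python version mismatch between the driver and the worker nodes,
since deserializing a closure under a different interpreter version can surface
as strange ImportErrors. A minimal sketch of the check; the pyspark call is
shown commented out and assumes the `sc` from the pyspark shell:

```python
import sys

def python_version(_):
    # When mapped over an RDD, this runs on an executor and reports
    # that worker's Python version; called locally, it reports the driver's.
    return '.'.join(str(v) for v in sys.version_info[:3])

# Hypothetical usage from the pyspark shell (assumes `sc` is available):
#   workers = set(sc.parallelize(range(8), 8).map(python_version).collect())
#   print('driver :', python_version(None))
#   print('workers:', workers)

print('driver Python:', python_version(None))
```

If the driver and any worker disagree, pointing PYSPARK_PYTHON at the same
interpreter on every node would be the first thing we try.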
Thank you!
Team Prismalytics