Not sure if emails with attachments are discarded, so I am re-sending this without the attachment.

Hello,
  I am trying to run a Spark job on an HDFS Hadoop cluster using YARN; the same job runs fine on the master node of the cluster. When I run the job, which contains an rdd.saveAsTextFile() call, I get the following error:

*SystemError: unknown opcode*

The full stack trace is at the bottom of this email.
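
For context, the job boils down to something like the sketch below. This is illustrative only (the app name, input data, and output path are my stand-ins, not the actual code), but it is consistent with the stack trace, which fails inside combineLocally, i.e. a combine/shuffle step followed by saveAsTextFile():

    from pyspark import SparkContext

    # Illustrative sketch only -- not the actual job. The app name,
    # input data, and HDFS output path are made-up placeholders.
    sc = SparkContext(appName="save-as-textfile-repro")

    # A shuffle (reduceByKey) followed by the saveAsTextFile() call;
    # this works on the master node but fails on the YARN cluster.
    counts = (sc.parallelize(range(1000))
                .map(lambda x: (x % 10, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///tmp/repro-output")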

 All the nodes in the cluster, including the master, run Python 2.7.9, and all of them have the variable SPARK_PYTHON set to the Anaconda Python path. When I launch the PySpark shell on these instances, it opens with the Anaconda Python.
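
As an extra sanity check, a small snippet like the one below (a sketch, assuming an existing SparkContext sc) can report which interpreter and Python version each executor actually uses, since a driver/worker Python mismatch is the usual explanation for "SystemError: unknown opcode":

    import sys

    def interpreter_info(_partition):
        # Runs on the executors: report the worker-side interpreter.
        import sys, platform
        yield (sys.executable, platform.python_version())

    # Distinct (executable, version) pairs seen across the executors.
    print(sorted(set(sc.parallelize(range(100), 10)
                       .mapPartitions(interpreter_info)
                       .collect())))

    # Driver-side interpreter, for comparison.
    print((sys.executable, sys.version.split()[0]))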

I installed Anaconda on all the slaves after reading about the Python version incompatibility issues mentioned in the following post:

http://glennklockwood.blogspot.com/2014/06/spark-on-supercomputers-few-notes.html
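
If I remember right, the post recommends pointing the PYSPARK_PYTHON environment variable at the Anaconda interpreter on every node. As a sketch (the Anaconda path here is hypothetical), the same effect can be had from the driver, before the SparkContext is created:

    import os

    # Hypothetical path -- adjust to wherever Anaconda lives; the
    # path must be valid on every node in the cluster.
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"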

Please let me know what the issue might be.

We are using Spark 1.3.


Stack trace:

15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:56 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 0.0 (TID 7, ip-10-64-10-221.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in
combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode

at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:311)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 0) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode
) [duplicate 1]
15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 21.0 in stage 0.0 (TID 8) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode
) [duplicate 2]


Thanks for your help,
Nikhil
