I am close to giving up on PySpark on YARN. It simply doesn't work for 
straightforward operations and it's quite difficult to understand why. 

I would love to be proven wrong, by the way. 

----
Eric Friedman

> On Aug 3, 2014, at 7:03 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> 
> wrote:
> 
> The logs provided in the image may not be enough to help, so I have copied the full logs here:
> 
> WARNING: Running python applications through ./bin/pyspark is deprecated as 
> of Spark 1.0.
> Use ./bin/spark-submit <python file>
> 
> 14/08/03 11:10:57 INFO SparkConf: Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 14/08/03 11:10:57 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
> 14/08/03 11:10:57 WARN SparkConf:
> SPARK_JAVA_OPTS was detected (set to '-Dspark.local.dir=/mnt/spark/').
> This is deprecated in Spark 1.0+.
> 
> Please instead use:
>  - ./spark-submit with conf/spark-defaults.conf to set defaults for an 
> application
>  - ./spark-submit with --driver-java-options to set -X options for a driver
>  - spark.executor.extraJavaOptions to set -X options for executors
>  - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master 
> or worker)
> 
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf:
> SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/jars/*').
> This is deprecated in Spark 1.0+.
> 
> Please instead use:
>  - ./spark-submit with --driver-class-path to augment the driver classpath
>  - spark.executor.extraClassPath to augment the executor classpath
> 
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
> 
> 14/08/03 11:10:57 INFO SecurityManager: Changing view acls to: hadoop
> 14/08/03 11:10:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
> 14/08/03 11:10:58 INFO Slf4jLogger: Slf4jLogger started
> 14/08/03 11:10:59 INFO Remoting: Starting remoting
> 14/08/03 11:10:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
> 14/08/03 11:10:59 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
> 14/08/03 11:10:59 INFO SparkEnv: Registering MapOutputTracker
> 14/08/03 11:10:59 INFO SparkEnv: Registering BlockManagerMaster
> 14/08/03 11:10:59 INFO DiskBlockManager: Created local directory at 
> /mnt/spark/spark-local-20140803111059-7fe1
> 14/08/03 11:10:59 INFO MemoryStore: MemoryStore started with capacity 297.0 
> MB.
> 14/08/03 11:10:59 INFO ConnectionManager: Bound socket to port 44258 with id = ConnectionManagerId(ip-172-31-28-16.us-west-2.compute.internal,44258)
> 14/08/03 11:10:59 INFO BlockManagerMaster: Trying to register BlockManager
> 14/08/03 11:10:59 INFO BlockManagerInfo: Registering block manager ip-172-31-28-16.us-west-2.compute.internal:44258 with 297.0 MB RAM
> 14/08/03 11:10:59 INFO BlockManagerMaster: Registered BlockManager
> 14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
> 14/08/03 11:10:59 INFO HttpBroadcast: Broadcast server started at 
> http://172.31.28.16:46213
> 14/08/03 11:10:59 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-b847aba0-05d5-4d0e-a717-686a61d83609
> 14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
> 14/08/03 11:11:05 INFO SparkUI: Started SparkUI at 
> http://ip-172-31-28-16.us-west-2.compute.internal:4040
> 14/08/03 11:11:06 INFO Utils: Copying 
> /home/hadoop/spark/bin/Learning_model_KMeans.py to /tmp/spark-affb5b07-e3f9-4f
> a93-70ffe464840c/Learning_model_KMeans.py
> 14/08/03 11:11:06 INFO SparkContext: Added file file:/home/hadoop/spark/bin/Learning_model_KMeans.py at http://172.31.28.16:57518/files/Learning_model_KMeans.py with timestamp 1407064266439
> 14/08/03 11:11:07 INFO MemoryStore: ensureFreeSpace(32886) called with 
> curMem=0, maxMem=311387750
> 14/08/03 11:11:07 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)
> 
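The SparkConf warnings above amount to moving the SPARK_JAVA_OPTS and SPARK_CLASSPATH settings into the application's own configuration. A minimal PySpark sketch of that migration, using the same values that appear in the log; driver-side class path and Java options generally still have to be passed to spark-submit or conf/spark-defaults.conf, since the driver JVM is already running by the time this code executes:

    # Sketch: replace the deprecated SPARK_JAVA_OPTS / SPARK_CLASSPATH environment
    # variables with explicit SparkConf settings. Values are taken from the log above.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("Learning_model_KMeans")
            .set("spark.local.dir", "/mnt/spark/")
            .set("spark.executor.extraClassPath", "/home/hadoop/spark/jars/*"))

    sc = SparkContext(conf=conf)
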
> Traceback (most recent call last):
>   File "/home/hadoop/spark/bin/Learning_model_KMeans.py", line 52, in <module>
>     clusters = KMeans.train(parsedData, cluster_no, maxIterations=10,runs=30, 
> initializationMode="k-means||")
>   File "/home/hadoop/spark/python/pyspark/mllib/clustering.py", line 85, in 
> train
>     dataBytes._jrdd, k, maxIterations, runs, initializationMode)
>   File 
> "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
> 537, in __call__
>   File "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o28.trainKMeansModel.
> : java.lang.RuntimeException: Error in configuring object
>         at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>         at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>         at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>         at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:50)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.rdd.ZippedRDD.getPartitions(ZippedRDD.scala:54)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
>         at org.apache.spark.rdd.RDD.count(RDD.scala:847)
>         at org.apache.spark.rdd.RDD.takeSample(RDD.scala:387)
>         at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:260)
>         at 
> org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>         at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>         at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:333)
>         at 
> org.apache.spark.mllib.api.python.PythonMLLibAPI.trainKMeansModel(PythonMLLibAPI.scala:331)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>         ... 57 more
> Caused by: java.lang.IllegalArgumentException: Compression codec 
> com.hadoop.compression.lzo.LzoCodec not found.
>         at 
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:96)
>         at 
> org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:134)
>         at 
> org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:38)
>         ... 62 more
> Caused by: java.lang.ClassNotFoundException: 
> com.hadoop.compression.lzo.LzoCodec
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:270)
>         at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>         at 
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
>         ... 64 more
> 
> 
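The underlying failure is the final ClassNotFoundException: the cluster's Hadoop configuration lists com.hadoop.compression.lzo.LzoCodec in io.compression.codecs, but the hadoop-lzo jar is not on Spark's class path, so TextInputFormat cannot be configured. A rough sketch of one possible workaround is to add that jar to the class path before the SparkContext is created; the jar location below is an assumption and varies between EMR AMI versions:

    # Sketch of a possible fix for the LzoCodec ClassNotFoundException: put the
    # hadoop-lzo jar on the Spark class path. LZO_JAR is a hypothetical path;
    # locate the actual hadoop-lzo jar on the cluster first.
    from pyspark import SparkConf, SparkContext

    LZO_JAR = "/home/hadoop/lib/hadoop-lzo.jar"  # hypothetical path

    conf = (SparkConf()
            .setAppName("Learning_model_KMeans")
            .set("spark.executor.extraClassPath", LZO_JAR)
            # The driver-side class path normally has to be set before the driver JVM
            # starts (spark-submit --driver-class-path or conf/spark-defaults.conf);
            # the property name is shown here for completeness.
            .set("spark.driver.extraClassPath", LZO_JAR))

    sc = SparkContext(conf=conf)

Copying the hadoop-lzo jar into the directory already on SPARK_CLASSPATH (/home/hadoop/spark/jars/), or removing the LZO codec from io.compression.codecs in core-site.xml, are alternative ways to address the same ClassNotFoundException.
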
>> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> 
>> wrote:
>> Hi,
>> 
>> I used to run Spark scripts on my local machine. Now I am porting my code to EMR and I am running into a lot of problems.
>> 
>> The main one right now is that a Spark script that runs fine on my local machine gives an error when run on an Amazon EMR cluster.
>> Here is the error:
>> 
>> <error4_ourcodeCluster.png>
>> 
>> 
>> 
>> What could be the possible reason?
>> Thanks in advance
>> -- 
>> Rahul K Bhojwani
>> about.me/rahul_bhojwani
> 
> 
> 
> -- 
> Rahul K Bhojwani
> about.me/rahul_bhojwani
