I am close to giving up on PySpark on YARN. It simply doesn't work for straightforward operations and it's quite difficult to understand why.
I would love to be proven wrong, by the way.

----
Eric Friedman

> On Aug 3, 2014, at 7:03 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
>
> The logs provided in the image may not be enough for help. Here I have copied the whole logs:
>
> WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0.
> Use ./bin/spark-submit <python file>
>
> 14/08/03 11:10:57 INFO SparkConf: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 14/08/03 11:10:57 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
> 14/08/03 11:10:57 WARN SparkConf:
> SPARK_JAVA_OPTS was detected (set to '-Dspark.local.dir=/mnt/spark/').
> This is deprecated in Spark 1.0+.
>
> Please instead use:
>  - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
>  - ./spark-submit with --driver-java-options to set -X options for a driver
>  - spark.executor.extraJavaOptions to set -X options for executors
>  - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
>
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf:
> SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/jars/*').
> This is deprecated in Spark 1.0+.
>
> Please instead use:
>  - ./spark-submit with --driver-class-path to augment the driver classpath
>  - spark.executor.extraClassPath to augment the executor classpath
>
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
> 14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
>
> 14/08/03 11:10:57 INFO SecurityManager: Changing view acls to: hadoop
> 14/08/03 11:10:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
> 14/08/03 11:10:58 INFO Slf4jLogger: Slf4jLogger started
> 14/08/03 11:10:59 INFO Remoting: Starting remoting
> 14/08/03 11:10:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
> 14/08/03 11:10:59 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
> 14/08/03 11:10:59 INFO SparkEnv: Registering MapOutputTracker
> 14/08/03 11:10:59 INFO SparkEnv: Registering BlockManagerMaster
> 14/08/03 11:10:59 INFO DiskBlockManager: Created local directory at /mnt/spark/spark-local-20140803111059-7fe1
> 14/08/03 11:10:59 INFO MemoryStore: MemoryStore started with capacity 297.0 MB.
> 14/08/03 11:10:59 INFO ConnectionManager: Bound socket to port 44258 with id = ConnectionManagerId(ip-172-31-28-16.us-west-2.compute.internal,44258)
> 14/08/03 11:10:59 INFO BlockManagerMaster: Trying to register BlockManager
> 14/08/03 11:10:59 INFO BlockManagerInfo: Registering block manager ip-172-31-28-16.us-west-2.compute.internal:44258 with 297.0 MB RAM
> 14/08/03 11:10:59 INFO BlockManagerMaster: Registered BlockManager
> 14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
> 14/08/03 11:10:59 INFO HttpBroadcast: Broadcast server started at http://172.31.28.16:46213
> 14/08/03 11:10:59 INFO HttpFileServer: HTTP File server directory is /tmp/spark-b847aba0-05d5-4d0e-a717-686a61d83609
> 14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
> 14/08/03 11:11:05 INFO SparkUI: Started SparkUI at http://ip-172-31-28-16.us-west-2.compute.internal:4040
> 14/08/03 11:11:06 INFO Utils: Copying /home/hadoop/spark/bin/Learning_model_KMeans.py to /tmp/spark-affb5b07-e3f9-4f a93-70ffe464840c/Learning_model_KMeans.py
> 14/08/03 11:11:06 INFO SparkContext: Added file file:/home/hadoop/spark/bin/Learning_model_KMeans.py at http://172.31.28.16:57518/files/Learning_model_KMeans.py with timestamp 1407064266439
> 14/08/03 11:11:07 INFO MemoryStore: ensureFreeSpace(32886) called with curMem=0, maxMem=311387750
> 14/08/03 11:11:07 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)
>
> Traceback (most recent call last):
>   File "/home/hadoop/spark/bin/Learning_model_KMeans.py", line 52, in <module>
>     clusters = KMeans.train(parsedData, cluster_no, maxIterations=10, runs=30, initializationMode="k-means||")
>   File "/home/hadoop/spark/python/pyspark/mllib/clustering.py", line 85, in train
>     dataBytes._jrdd, k, maxIterations, runs, initializationMode)
>   File "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
>   File "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o28.trainKMeansModel.
> : java.lang.RuntimeException: Error in configuring object
>   at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:50)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.ZippedRDD.getPartitions(ZippedRDD.scala:54)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:847)
>   at org.apache.spark.rdd.RDD.takeSample(RDD.scala:387)
>   at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:260)
>   at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
>   at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:333)
>   at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainKMeansModel(PythonMLLibAPI.scala:331)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>   ... 57 more
> Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
>   at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:96)
>   at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:134)
>   at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:38)
>   ... 62 more
> Caused by: java.lang.ClassNotFoundException: com.hadoop.compression.lzo.LzoCodec
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>   at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
>   ... 64 more
>
>
>> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
>> Hi,
>>
>> I used to run Spark scripts on my local machine. Now I am porting my code to EMR and I am facing a lot of problems.
>>
>> The main one right now is that a Spark script which runs properly on my local machine gives an error when run on the Amazon EMR cluster.
>> Here is the error:
>>
>> <error4_ourcodeCluster.png>
>>
>> What can be the possible reason?
>> Thanks in advance
>> --
>> Rahul K Bhojwani
>> about.me/rahul_bhojwani
>
>
> --
> Rahul K Bhojwani
> about.me/rahul_bhojwani
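
The last "Caused by" in the trace is the actual failure: the cluster's Hadoop configuration apparently lists com.hadoop.compression.lzo.LzoCodec among its compression codecs, but the hadoop-lzo jar is not on the classpath the Spark driver and executors start with, so configuring the TextInputFormat fails before KMeans can even count its input. As a rough sketch only (the jar path below is a placeholder, not something given in this thread), the codec jar could be added through the same properties the SparkConf warnings above already name, e.g. in conf/spark-defaults.conf:

    # Sketch of conf/spark-defaults.conf; /path/to/hadoop-lzo.jar is an assumed
    # placeholder -- point it at wherever hadoop-lzo actually lives on the cluster.
    spark.driver.extraClassPath      /path/to/hadoop-lzo.jar:/home/hadoop/spark/jars/*
    spark.executor.extraClassPath    /path/to/hadoop-lzo.jar:/home/hadoop/spark/jars/*

The driver side can also be set per run with ./bin/spark-submit --driver-class-path, as the warning text itself suggests.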