The logs provided in the image may not be enough to help, so I have copied the whole logs below; a couple of notes on the deprecation warnings and on the final ClassNotFoundException follow after the trace.
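First, for context, the failing section of Learning_model_KMeans.py looks roughly like this. The KMeans.train call is the real line 52 (it shows up in the traceback below); the input path, the parsing step, and the value of cluster_no are simplified placeholders:

    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="Learning_model_KMeans")

    # placeholder input path and parsing, just to show the shape of the data
    data = sc.textFile("hdfs:///user/hadoop/input.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    cluster_no = 5  # placeholder value of k
    clusters = KMeans.train(parsedData, cluster_no, maxIterations=10,
                            runs=30, initializationMode="k-means||")

And here are the logs: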
WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0. Use ./bin/spark-submit <python file>
14/08/03 11:10:57 INFO SparkConf: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/08/03 11:10:57 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
14/08/03 11:10:57 WARN SparkConf: SPARK_JAVA_OPTS was detected (set to '-Dspark.local.dir=/mnt/spark/'). This is deprecated in Spark 1.0+. Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Dspark.local.dir=/mnt/spark/' as a work-around.
14/08/03 11:10:57 WARN SparkConf: SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/jars/*'). This is deprecated in Spark 1.0+. Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
14/08/03 11:10:57 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
14/08/03 11:10:57 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/jars/*' as a work-around.
14/08/03 11:10:57 INFO SecurityManager: Changing view acls to: hadoop
14/08/03 11:10:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/08/03 11:10:58 INFO Slf4jLogger: Slf4jLogger started
14/08/03 11:10:59 INFO Remoting: Starting remoting
14/08/03 11:10:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
14/08/03 11:10:59 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@ip-172-31-28-16.us-west-2.compute.internal:60686]
14/08/03 11:10:59 INFO SparkEnv: Registering MapOutputTracker
14/08/03 11:10:59 INFO SparkEnv: Registering BlockManagerMaster
14/08/03 11:10:59 INFO DiskBlockManager: Created local directory at /mnt/spark/spark-local-20140803111059-7fe1
14/08/03 11:10:59 INFO MemoryStore: MemoryStore started with capacity 297.0 MB.
14/08/03 11:10:59 INFO ConnectionManager: Bound socket to port 44258 with id = ConnectionManagerId(ip-172-31-28-16.us-west-2.compute.internal,44258)
14/08/03 11:10:59 INFO BlockManagerMaster: Trying to register BlockManager
14/08/03 11:10:59 INFO BlockManagerInfo: Registering block manager ip-172-31-28-16.us-west-2.compute.internal:44258 with 297.0 MB RAM
14/08/03 11:10:59 INFO BlockManagerMaster: Registered BlockManager
14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
14/08/03 11:10:59 INFO HttpBroadcast: Broadcast server started at http://172.31.28.16:46213
14/08/03 11:10:59 INFO HttpFileServer: HTTP File server directory is /tmp/spark-b847aba0-05d5-4d0e-a717-686a61d83609
14/08/03 11:10:59 INFO HttpServer: Starting HTTP Server
14/08/03 11:11:05 INFO SparkUI: Started SparkUI at http://ip-172-31-28-16.us-west-2.compute.internal:4040
14/08/03 11:11:06 INFO Utils: Copying /home/hadoop/spark/bin/Learning_model_KMeans.py to /tmp/spark-affb5b07-e3f9-4f a93-70ffe464840c/Learning_model_KMeans.py
14/08/03 11:11:06 INFO SparkContext: Added file file:/home/hadoop/spark/bin/Learning_model_KMeans.py at http://172.31.28.16:57518/files/Learning_model_KMeans.py with timestamp 1407064266439
14/08/03 11:11:07 INFO MemoryStore: ensureFreeSpace(32886) called with curMem=0, maxMem=311387750
14/08/03 11:11:07 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)
Traceback (most recent call last):
  File "/home/hadoop/spark/bin/Learning_model_KMeans.py", line 52, in <module>
    clusters = KMeans.train(parsedData, cluster_no, maxIterations=10, runs=30, initializationMode="k-means||")
  File "/home/hadoop/spark/python/pyspark/mllib/clustering.py", line 85, in train
    dataBytes._jrdd, k, maxIterations, runs, initializationMode)
  File "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/home/hadoop/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.trainKMeansModel.
: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:50)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.ZippedRDD.getPartitions(ZippedRDD.scala:54)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
        at org.apache.spark.rdd.RDD.count(RDD.scala:847)
        at org.apache.spark.rdd.RDD.takeSample(RDD.scala:387)
        at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:260)
        at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
        at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
        at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:333)
        at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainKMeansModel(PythonMLLibAPI.scala:331)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 57 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:96)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:134)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:38)
        ... 62 more
Caused by: java.lang.ClassNotFoundException: com.hadoop.compression.lzo.LzoCodec
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
        ... 64 more

On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
> Hi,
>
> I used to run Spark scripts on my local machine. Now I am porting my code
> to EMR and I am facing lots of problems.
>
> The main one right now is that a Spark script which runs properly on my
> local machine gives an error when run on an Amazon EMR cluster.
> Here is the error:
>
> [image: Inline image 1]
>
> What can be the possible reason?
> Thanks in advance
> --
> Rahul K Bhojwani
> http://about.me/rahul_bhojwani

--
Rahul K Bhojwani
http://about.me/rahul_bhojwani
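PS: The last "Caused by" above looks like the actual problem: com.hadoop.compression.lzo.LzoCodec cannot be loaded when the job runs on EMR, presumably because EMR's Hadoop configuration lists LZO among its compression codecs while the hadoop-lzo jar is not on Spark's classpath. In case it helps, this is the kind of work-around I am considering (unverified; the jar path below is a guess, so the real hadoop-lzo jar on the node would need to be located first):

    from pyspark import SparkConf, SparkContext

    # Guessed location of the hadoop-lzo jar on the EMR node; find the
    # real one first, e.g. with: find / -name '*lzo*.jar' 2>/dev/null
    lzo_jar = "/home/hadoop/lib/hadoop-lzo.jar"

    # Put the jar on both the driver and the executor classpaths
    # (these are the conf keys the SparkConf warnings above mention)
    conf = (SparkConf()
            .set("spark.driver.extraClassPath", lzo_jar)
            .set("spark.executor.extraClassPath", lzo_jar))
    sc = SparkContext(conf=conf)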
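PPS: Regarding the SPARK_JAVA_OPTS / SPARK_CLASSPATH deprecation warnings near the top of the logs, this is my understanding of the replacement they suggest, as a sketch (the app name is a placeholder, and I have not confirmed that this silences the warnings):

    from pyspark import SparkConf, SparkContext

    # Explicit conf keys instead of SPARK_JAVA_OPTS and SPARK_CLASSPATH,
    # using the same values the warnings report
    conf = (SparkConf()
            .setAppName("Learning_model_KMeans")   # placeholder name
            .set("spark.local.dir", "/mnt/spark/")
            .set("spark.driver.extraClassPath", "/home/hadoop/spark/jars/*")
            .set("spark.executor.extraClassPath", "/home/hadoop/spark/jars/*"))
    sc = SparkContext(conf=conf)

The same keys could also go into conf/spark-defaults.conf when launching through ./bin/spark-submit instead of ./bin/pyspark.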