Hi Tobias,

For your simple example I just used sbt package, but for more complex jobs that have
external dependencies you should either:

- use sbt-assembly [1] or the Maven shade plugin [2] to build a "fat jar"
  (aka jar-with-dependencies), or
- provide a list of jars that includes your job jar along with every additional
  dependency, e.g.
  .setJars(Seq("sparkexample_2.10-0.1.jar", "dependency1.jar", "dependency2.jar", ..., "dependencyN.jar"))
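For illustration, a minimal sketch of the fat-jar route (the plugin version, jar
names, and paths below are placeholders, and the exact wiring depends on your
sbt-assembly version; see its README [1]):

// project/plugins.sbt -- add the sbt-assembly plugin (version is only an example)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt -- mark Spark as "provided" so it is not bundled into the fat jar;
// the Mesos executors already have Spark on their classpath
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"

// Driver code -- after running `sbt assembly`, ship the single assembly jar:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://<my_ip>:5050/")
  .setAppName("SparkExamplesMinimal")
  .setJars(Seq("target/scala-2.10/sparkexample-assembly-0.1.jar"))
val sc = new SparkContext(conf)

With a fat jar the setJars list stays a single entry no matter how many libraries
the job pulls in, which is usually easier to maintain than enumerating every
dependency jar by hand.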
-kr, Gerard.

[1] https://github.com/sbt/sbt-assembly
[2] http://maven.apache.org/plugins/maven-shade-plugin


On Wed, May 21, 2014 at 4:04 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Gerard,
>
> thanks very much for your investigation! After hours of trial and
> error, I am kind of happy to hear it is not just a broken setup on my
> side that's causing the error.
>
> Could you explain briefly how you created that simple jar file?
>
> Thanks,
> Tobias
>
> On Wed, May 21, 2014 at 9:47 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
> > Hi Tobias,
> >
> > I was curious about this issue and tried to run your example on my local
> > Mesos. I was able to reproduce your issue using your current config:
> >
> > [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
> > 1.0:4 failed 4 times (most recent failure: Exception failure:
> > java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2)
> > org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times
> > (most recent failure: Exception failure: java.lang.ClassNotFoundException:
> > spark.SparkExamplesMinimal$$anonfun$2)
> > at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> >
> > Creating a simple jar from the job and providing it through the
> > configuration seems to solve it:
> >
> > val conf = new SparkConf()
> >   .setMaster("mesos://<my_ip>:5050/")
> >   .setJars(Seq("/sparkexample/target/scala-2.10/sparkexample_2.10-0.1.jar"))
> >   .setAppName("SparkExamplesMinimal")
> >
> > Resulting in:
> >
> > 14/05/21 12:03:45 INFO scheduler.DAGScheduler: Completed ResultTask(1, 1)
> > 14/05/21 12:03:45 INFO scheduler.DAGScheduler: Stage 1 (count at
> > SparkExamplesMinimal.scala:50) finished in 1.120 s
> > 14/05/21 12:03:45 INFO spark.SparkContext: Job finished: count at
> > SparkExamplesMinimal.scala:50, took 1.177091435 s
> > count: 1000000
> >
> > Why the closure serialization does not work with Mesos is beyond my current
> > knowledge.
> > Would be great to hear from the experts (cross-posting to dev for that)
> >
> > -kr, Gerard.
> >
> > On Wed, May 21, 2014 at 11:51 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> >> Hi,
> >>
> >> I have set up a cluster with Mesos (backed by Zookeeper) with three
> >> master and three slave instances. I set up Spark (git HEAD) for use
> >> with Mesos according to this manual:
> >> http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html
> >>
> >> Using the spark-shell, I can connect to this cluster and do simple RDD
> >> operations, but the same code in a Scala class and executed via sbt
> >> run-main works only partially. (That is, count() works, count() after
> >> flatMap() does not.)
> >>
> >> Here is my code: https://gist.github.com/tgpfeiffer/7d20a4d59ee6e0088f91
> >> The file SparkExamplesScript.scala, when pasted into spark-shell,
> >> outputs the correct count() for the parallelized list comprehension,
> >> as well as for the flatMapped RDD.
> >>
> >> The file SparkExamplesMinimal.scala contains exactly the same code,
> >> and also the MASTER configuration and the Spark Executor are the same.
> >> However, while the count() for the parallelized list is displayed
> >> correctly, I receive the following error when asking for the count()
> >> of the flatMapped RDD:
> >>
> >> -----------------
> >>
> >> 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting Stage 1
> >> (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34), which
> >> has no missing parents
> >> 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting 8 missing
> >> tasks from Stage 1 (FlatMappedRDD[1] at flatMap at
> >> SparkExamplesMinimal.scala:34)
> >> 14/05/21 09:47:49 INFO scheduler.TaskSchedulerImpl: Adding task set
> >> 1.0 with 8 tasks
> >> 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Starting task 1.0:0
> >> as TID 8 on executor 20140520-102159-2154735808-5050-1108-1: mesos9-1
> >> (PROCESS_LOCAL)
> >> 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Serialized task 1.0:0
> >> as 1779147 bytes in 37 ms
> >> 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Lost TID 8 (task 1.0:0)
> >> 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Loss was due to
> >> java.lang.ClassNotFoundException
> >> java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2
> >> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> >> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> >> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> >> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> >> at java.lang.Class.forName0(Native Method)
> >> at java.lang.Class.forName(Class.java:270)
> >> at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
> >> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> >> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> >> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> >> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> >> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> >> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> >> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> >> at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >> at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >> at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >> at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
> >> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
> >> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> >> at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >> at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> -----------------
> >>
> >> Can anyone explain to me where this comes from or how I might further
> >> track the problem down?
> >>
> >> Thanks,
> >> Tobias
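To make the failure concrete, here is a minimal sketch of the kind of job described
in the thread (the linked gist is not reproduced above, so the class body, jar path,
and master URL below are illustrative assumptions, not the actual code):

// Hypothetical reconstruction of a SparkExamplesMinimal-style job (Spark 1.0-era API).
package spark

import org.apache.spark.{SparkConf, SparkContext}

object SparkExamplesMinimal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("mesos://<my_ip>:5050/")   // placeholder, as in the thread
      .setAppName("SparkExamplesMinimal")
      // Without this (or a fat jar on the executors' classpath), the job jar
      // containing the closure classes never reaches the Mesos executors.
      .setJars(Seq("target/scala-2.10/sparkexample_2.10-0.1.jar"))
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)
    // Succeeds even without setJars: no user-defined class is needed to run count().
    println("count: " + numbers.count())

    // The flatMap closure is compiled into an anonymous class such as
    // spark.SparkExamplesMinimal$$anonfun$2, which exists only in the job jar.
    val flatMapped = numbers.flatMap(n => Seq(n))
    // Without the jar, task deserialization on the executor fails with the
    // java.lang.ClassNotFoundException shown in the trace above.
    println("count: " + flatMapped.count())

    sc.stop()
  }
}

The plain count() has nothing user-defined to deserialize, while the count() after
flatMap() does, which matches the behaviour reported above and explains why shipping
the jar via setJars (or bundling everything into a fat jar) makes the error go away.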