Hey all, I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz prebuilt package and Mesos installed from the Mesosphere repositories. I have been running Spark standalone successfully for a while and am now trying to set up Mesos. Mesos is up and running, and the UI at port 5050 reports all slaves alive.

I then run the Spark shell with `spark-shell --master mesos://1.1.1.1:5050` (with 1.1.1.1 the master's IP address), which starts up fine, with output:
I0513 15:02:45.340287 28804 sched.cpp:448] Framework registered with 20150512-150459-2618695596-5050-3956-0009
15/05/13 15:02:45 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20150512-150459-2618695596-5050-3956-0009

and the framework shows up in the Mesos UI. But running anything (e.g. `val rdd = sc.textFile("path"); rdd.count`) fails with lost executors. In /var/log/mesos-slave.ERROR on the slave instances there are entries like:

E0513 14:57:01.198995 13077 slave.cpp:3112] Container 'eaf33d36-dde5-498a-9ef1-70138810a38c' for executor '20150512-145720-2618695596-5050-3082-S10' of framework '20150512-150459-2618695596-5050-3956-0009' failed to start: Failed to execute mesos-fetcher: Failed to chown work directory

From what I can find, the work directory is in /tmp/mesos, where I indeed see a directory structure with executor and framework IDs, with stdout and stderr files of size 0 at the leaves. Everything there is owned by root, but I assume the processes are also run as root, so any chowning in there should be possible.

I was thinking maybe it fails to fetch the Spark package for the executor? I uploaded spark-1.2.1-bin-hadoop2.4.tgz to HDFS, SPARK_EXECUTOR_URI is set in spark-env.sh, and in the Environment section of the web UI I see this picked up in the `spark.executor.uri` parameter. I checked that the URI is reachable from the slaves: an `hdfs dfs -stat $SPARK_EXECUTOR_URI` succeeds.

Any pointers?

Many thanks,
Sander

On Fri, May 1, 2015 at 8:35 AM Tim Chen <t...@mesosphere.io> wrote:

> Hi Stephen,
>
> It looks like the Mesos slave was most likely not able to launch some Mesos
> helper processes (the fetcher, probably?).
>
> How did you install Mesos? Did you build from source yourself?
>
> Please install Mesos through a package, or when building from source,
> actually run `make install` and run from the installed binary.
>
> Tim
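A quick way to check that theory on a slave, as a sketch only: the paths assume the Mesosphere package defaults (work dir `/tmp/mesos`, helpers under `/usr/libexec/mesos`), and `sander` is a placeholder for whichever account launched the driver. One common cause of this exact chown failure is that the framework user (the driver-side account) does not exist on the slave, so the fetcher cannot chown the sandbox to it.

```sh
# Run these on a slave that logged the error. Adjust paths if you set
# --work_dir or installed Mesos somewhere non-default.

# 1. Is the fetcher helper installed and executable?
ls -l /usr/libexec/mesos/mesos-fetcher

# 2. The fetcher chowns the sandbox to the framework user, i.e. the
#    account that launched spark-shell on the driver. If that account
#    is missing on the slave, the chown fails with exactly this error.
#    'sander' is a placeholder -- use your driver-side user.
id sander

# 3. Inspect the sandbox tree the slave actually created.
sudo ls -lR /tmp/mesos | head -50

# Possible mitigations: create the same account on every slave, or
# start the slave with --no-switch_user so tasks run as the slave's
# own user (check your Mesos version's docs for the exact semantics).
```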
> On Mon, Apr 27, 2015 at 11:11 AM, Stephen Carman <scar...@coldlight.com> wrote:
>
>> So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves; I
>> basically just got the prebuilt package from the Spark website.
>>
>> I placed those compiled Spark installs on each slave at /opt/spark.
>>
>> My Spark properties seem to be getting picked up on my side fine.
>>
>> The framework is registered in Mesos and shows up just fine. It doesn't
>> matter whether I turn the executor URI off or not, I always get the same
>> error:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at scala.Option.foreach(Option.scala:236)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>
>> These boxes are totally open to one another, so they shouldn't have any
>> firewall issues. Everything shows up in Mesos and Spark just fine, but
>> actually running anything totally blows up.
>>
>> There is nothing in stderr or stdout; it downloads the package and untars
>> it but doesn't seem to do much after that. Any insights?
>>
>> Steve
>>
>>
>> On Apr 24, 2015, at 5:50 PM, Yang Lei <genia...@gmail.com> wrote:
>>
>> SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST
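For Stephen's variant, a hedged sketch of the two ways a Mesos executor can locate Spark in the 1.2/1.3 line, and where to look when the sandbox stdout/stderr are empty. `SPARK_EXECUTOR_URI`, `/opt/spark`, and the slave log path are taken from this thread; `spark.home` is the driver-side property of that era (later releases renamed the knob, e.g. `spark.mesos.executor.home`, so check your version's docs), and `namenode:8020` is a placeholder.

```sh
# Option A: per-slave install at the same absolute path everywhere,
# with the driver pointing at it. In conf/spark-defaults.conf:
#
#   spark.home  /opt/spark

# Option B: slaves fetch and unpack a tarball themselves. In
# conf/spark-env.sh on the driver:
#
#   export SPARK_EXECUTOR_URI=hdfs://namenode:8020/spark/spark-1.3.1-bin-hadoop2.6.tgz

# If the URI is set it takes precedence, so don't half-configure both.
# Verify from a slave (not just the driver) that the URI is fetchable:
hdfs dfs -stat "$SPARK_EXECUTOR_URI"

# When the sandbox stdout/stderr are empty, the slave log usually has
# the real failure (path as reported earlier in this thread):
grep -i 'failed' /var/log/mesos-slave.ERROR | tail -n 20

# Non-empty sandbox stderr files are worth reading too:
sudo find /tmp/mesos -name stderr -size +0c -exec tail -n 20 {} +
```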